Crossmapping evaluation in Florenciella to take a decision regarding each gene

I calculated the crossmapping between the transcriptomes to get an idea of how common it is and how it is distributed.

We have closely related transcriptomes (RCC1587 and RCC1693), a somewhat related one (RCC1007), and an outgroup (Pelagomonas calceolata).

Crossmapping is calculated from the BAMs filtered at 95% identity and with at least 80% of the read aligned.
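To make the two thresholds concrete, here is a minimal sketch of how such a read filter could be expressed. It is not the filter actually used to produce the BAMs: the function name `passes_filter` is hypothetical, and I am assuming identity is computed as (alignment columns - NM) / alignment columns, with NM being the SAM edit-distance tag, and that "80% of the read" means the fraction of the full read length (clipped bases included) that is aligned.

```python
import re

# Splits a CIGAR string such as "80M20S" into (length, operation) pairs.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def passes_filter(cigar, nm, min_identity=0.95, min_aligned_frac=0.80):
    """Return True if an alignment meets both thresholds (assumed definitions).

    cigar: CIGAR string of the alignment.
    nm:    value of the NM tag (mismatches + inserted + deleted bases).
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

    # Full read length: aligned bases, insertions and clipped bases (S and H).
    read_len = sum(n for n, op in ops if op in "MI=XSH")
    # Read bases that take part in the alignment (no clipping).
    aligned_read_bases = sum(n for n, op in ops if op in "MI=X")
    # Alignment columns: matches/mismatches, insertions and deletions.
    aln_columns = sum(n for n, op in ops if op in "MID=X")

    if read_len == 0 or aln_columns == 0:
        return False

    identity = (aln_columns - nm) / aln_columns
    aligned_frac = aligned_read_bases / read_len
    return identity >= min_identity and aligned_frac >= min_aligned_frac
```

With these definitions, a fully aligned 100 bp read with 2 edits passes, while the same read with 10 edits, or a read with 30 of its 100 bases soft-clipped, does not.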

The comparison clearly shows that the crossmapping distribution follows the similarity matrix discussed here.

It would not make sense to remove all the genes with large crossmapping to the closest species, since these are indeed difficult to disentangle. A naive approach would be to divide the read count by 3 in this case, but we do not know how well the genus is represented.

In general, I would remove all the genes that crossmap with Pelagomonas and RCC1007, and accept that the remaining mappings are a composite of the genus.
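The rule above could be sketched roughly as follows. The input layout (a dict of per-gene crossmapped read counts keyed by source transcriptome), the function name `genes_to_remove`, the label `Pelagomonas_calceolata` and the `min_reads` cutoff are all assumptions made for illustration; no cutoff has been decided in this issue.

```python
# Transcriptomes considered distant enough that any crossmapping is noise.
DISTANT_SOURCES = frozenset({"RCC1007", "Pelagomonas_calceolata"})


def genes_to_remove(crossmap_counts, distant=DISTANT_SOURCES, min_reads=1):
    """Return the genes that crossmap with at least one distant transcriptome.

    crossmap_counts: dict mapping gene id -> dict mapping source
                     transcriptome -> number of crossmapped reads.
    min_reads:       hypothetical cutoff for calling a gene crossmapped.
    """
    return {
        gene
        for gene, by_source in crossmap_counts.items()
        if any(by_source.get(src, 0) >= min_reads for src in distant)
    }


# Toy example: only gene_b crossmaps with a distant transcriptome.
toy = {
    "gene_a": {"RCC1587": 120, "RCC1693": 95},
    "gene_b": {"RCC1587": 40, "RCC1007": 12},
    "gene_c": {},
}
removed = genes_to_remove(toy)
```

Genes that only crossmap with RCC1587 or RCC1693 are kept on purpose, which matches the idea of treating those mappings as a genus-level composite.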

The filter used in [the Pelagomonas study](https://www.nature.com/articles/s42003-022-03939-z.pdf), which removes genes present in > 90% of samples to avoid counting conflated mappings, removed 182 of the 353 genes with large crossmapping (about 52%). This shows that the filter alone is insufficient to remove noisy genes.
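For reference, a prevalence filter of this kind could look like the sketch below. It is only an illustration of the idea, not the code from the Pelagomonas study: I am assuming that "present" means a read count above zero, and the names `prevalent_genes` and `fraction_caught` are hypothetical.

```python
def prevalent_genes(counts, max_prevalence=0.90):
    """Return genes detected in more than max_prevalence of the samples.

    counts: dict mapping gene id -> list of per-sample read counts.
    A gene counts as present in a sample when its count is > 0 (assumption).
    """
    flagged = set()
    for gene, per_sample in counts.items():
        if not per_sample:
            continue
        prevalence = sum(1 for c in per_sample if c > 0) / len(per_sample)
        if prevalence > max_prevalence:
            flagged.add(gene)
    return flagged


def fraction_caught(flagged, crossmapping_genes):
    """Fraction of the crossmapping genes that the prevalence filter removes."""
    if not crossmapping_genes:
        return 0.0
    return len(flagged & crossmapping_genes) / len(crossmapping_genes)
```

Comparing the flagged set with the set of genes showing large crossmapping gives the kind of figure reported above, where 182 of 353 corresponds to roughly half of the problematic genes being caught.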

Most of the discarded genes have the eggNOG annotations we would expect:
