I calculated the crossmapping between the transcriptomes to assess how common it is and how it is distributed.
We have closely related transcriptomes (RCC1587 and RCC1693), a somewhat related one (RCC1007) and an outgroup (Pelagomonas calceolata).
The crossmapping is calculated from the BAMs after filtering for alignments with at least 95% identity and at least 80% of the read aligned.
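
A minimal sketch of what such a filter could look like, working directly on a CIGAR string and the `NM` (edit distance) tag of an alignment. The definitions here are assumptions, not necessarily the ones used to produce the BAMs: identity is taken as `(alignment columns - NM) / alignment columns`, and the aligned fraction as the aligned read bases over the full read length including clipped bases. The function names and default thresholds are only illustrative.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_stats(cigar, nm):
    """Return (identity, aligned_fraction) for one alignment.

    Assumed definitions (not taken from the original pipeline):
      identity         = (M + I + D - NM) / (M + I + D)
      aligned_fraction = (M + I) / (M + I + S + H)
    where M also includes the '=' and 'X' operations.
    """
    counts = {}
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] = counts.get(op, 0) + int(length)

    matched = counts.get("M", 0) + counts.get("=", 0) + counts.get("X", 0)
    inserted = counts.get("I", 0)
    deleted = counts.get("D", 0)
    clipped = counts.get("S", 0) + counts.get("H", 0)

    columns = matched + inserted + deleted   # alignment columns
    read_length = matched + inserted + clipped  # full read, clips included
    if columns == 0 or read_length == 0:
        return 0.0, 0.0

    identity = (columns - nm) / columns
    aligned_fraction = (matched + inserted) / read_length
    return identity, aligned_fraction


def passes_filter(cigar, nm, min_identity=0.95, min_aligned=0.80):
    """True if the alignment meets both thresholds."""
    identity, aligned_fraction = alignment_stats(cigar, nm)
    return identity >= min_identity and aligned_fraction >= min_aligned


# Toy alignments: (CIGAR, NM)
for cigar, nm in [("100M", 2), ("100M", 10), ("60M40S", 0)]:
    print(cigar, nm, passes_filter(cigar, nm))
```

In practice the CIGAR and `NM` values would be read from each BAM record; only the thresholding logic is shown here.
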
The comparison clearly shows that the distribution follows the similarity matrix discussed here.
It would not make sense to remove all the genes with large crossmapping to the closest species, since these are indeed difficult to disentangle. A naive approach would be to divide the read count by 3 in this case, but we do not know how well the genus is represented.
In general, I would remove all the genes that crossmap with Pelagomonas and RCC1007 and accept that the remaining mappings are a composite of the genus.
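
The removal rule described above can be sketched as follows. The input structure is hypothetical: a dictionary mapping each gene to the set of other transcriptomes it crossmaps with. The label `Pelagomonas_calceolata` and the function name are assumptions made for the example.

```python
# References considered too distant for crossmapping to be tolerated.
# The labels are illustrative; use whatever names the crossmapping table has.
DISTANT = frozenset({"RCC1007", "Pelagomonas_calceolata"})


def split_by_crossmapping(crossmap_hits, distant=DISTANT):
    """Split genes into (kept, removed).

    A gene is removed if it crossmaps with any distant reference.
    Genes that only crossmap within the close group (or not at all)
    are kept and treated as a composite signal of the genus.
    """
    kept, removed = [], []
    for gene, partners in crossmap_hits.items():
        if set(partners) & distant:
            removed.append(gene)
        else:
            kept.append(gene)
    return sorted(kept), sorted(removed)


# Toy example with made-up gene names.
hits = {
    "gene_a": {"RCC1587"},
    "gene_b": {"RCC1587", "RCC1693", "RCC1007"},
    "gene_c": set(),
    "gene_d": {"Pelagomonas_calceolata"},
}
kept, removed = split_by_crossmapping(hits)
print("kept:", kept)
print("removed:", removed)
```
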
Removing genes present in more than 90% of samples, the filter applied in [the Pelagomonas study](https://www.nature.com/articles/s42003-022-03939-z.pdf) to avoid counting conflated mappings, did remove 182 of the 353 genes with large crossmapping. This shows that the prevalence filter alone is insufficient to remove the noisy genes.
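
For reference, 182 of 353 is about 52%, so 171 genes with large crossmapping survive the prevalence filter. A small sketch of how the filter and this overlap could be computed, assuming a hypothetical `counts_by_sample` dictionary (sample name to gene counts) and assuming "present" means a count above zero:

```python
def prevalent_genes(counts_by_sample, threshold=0.90):
    """Genes with a non-zero count in more than `threshold` of the samples."""
    n_samples = len(counts_by_sample)
    if n_samples == 0:
        return set()
    all_genes = set()
    for counts in counts_by_sample.values():
        all_genes.update(counts)
    flagged = set()
    for gene in all_genes:
        present = sum(
            1 for counts in counts_by_sample.values() if counts.get(gene, 0) > 0
        )
        if present / n_samples > threshold:
            flagged.add(gene)
    return flagged


def fraction_caught(flagged, crossmapping_genes):
    """Fraction of crossmapping genes that the prevalence filter removes."""
    if not crossmapping_genes:
        return 0.0
    return len(flagged & crossmapping_genes) / len(crossmapping_genes)


# Toy data: 10 samples, gene_a present everywhere, gene_b in half of them.
samples = {
    f"s{i}": {"gene_a": 5, "gene_b": 3 if i < 5 else 0} for i in range(10)
}
flagged = prevalent_genes(samples)
print(flagged, fraction_caught(flagged, {"gene_a", "gene_b"}))

# The figures reported above: 182 of 353 genes caught, 171 left.
print(round(182 / 353, 3), 353 - 182)
```
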
Most of the discarded genes have the kind of eggNOG annotation we would expect:
Mapping against a single transcriptome will inflate the counts of genes with little evolutionary divergence, merging reads from closely related species of the same genus.
Some genes will be conserved enough to crossmap with even more distant taxa.
With some idea of the relationships between the genomes, we can set up an outgroup comparison, remove genes on that basis, and keep this layer of information to better understand odd expression distributions.
We still have to see whether all of this could be easily avoided with phylogenetic placement.