bonsai-team / matam

Mapping-Assisted Targeted-Assembly for Metagenomics
GNU Affero General Public License v3.0
19 stars 9 forks source link

How to interpret the count= in the final assembly #75

Closed fanhuan closed 5 years ago

fanhuan commented 5 years ago

Hi,

What I understand is that this count is the result of the abundance estimation and each read is weighted based on how many scaffolds it is contributing to. But I've seen many with count lower than 1. Should we get rid of them for quality control purpose?

Thanks ahead.

ppericard commented 5 years ago

Hi @fanhuan,

Short answer, you can probably filter out scaffolds with count < 1.

For the quantification step, we remap the rRNA reads onto the scaffolds with SortMeRNA, and we allow a read to align on up to 10 different scaffolds. Therefore, if a read is very specific to a single specie/strain/sequence it will only align on a single scaffold. On the contrary, if a read comes from a very conserved region, it could be aligned with a similar score on many different scaffolds.

Our quantification algorithm tries to take into account that variability in sequence conservation profile by weighting each read inversely proportionally to the number of good alignments (having scores greater of equal than 99% of the best alignment score for that read). If a read was aligned on a single scaffold, it will add 1 to the count for this scaffold. If a read was aligned on several (n) scaffolds, we add 1/n to the count of each scaffold. So for example, if a read was aligned on 10 scaffolds, we would add 0.1 (1/10) to the count of each of these scaffolds.

When comparing the quantification obtained with this method to simulated datasets with known abundances we showed that our quantification method gave results much closer to the truth than the quantification obtained with EMIRGE (Table 2 of the paper).

As for scaffolds with count lower than 1, it means that those scaffold are supported by only a few reads that are also mapped onto other scaffolds. The only risk in removing those is that if you have a very low abundance specie and that for some reason there is still some redundancy in the scaffolds, then you might drop a real sequence. But this seems unlikely, and you probably can filter out those scaffolds without loosing any relevant info. We might even add that filter to the pipeline so that those scaffolds are filtered out automatically.

Cheers Pierre

fanhuan commented 5 years ago

Thank you very much for the detailed explanation Pierre. I look forward to the new enhancement.