Request for info: --allreads and default parameters

josiah-liew commented 6 months ago

Thank you for this great tool! We have been doing viral phylogenetics and this has been great. I wanted to gain a deeper understanding of --allreads and the default parameters. We've had more consensus sequences (rightly) generated when using --allreads than when running with default parameters.

Would it be right to say that low-abundance consensi (<2% of total reads) are lost with the default setting of 1000 reads/batch? Said another way, do the default parameters expect a more even distribution of consensi, so that a distribution ranging from very high to very low abundance may result in consensi being missed?

Thank you again.

avierstr commented 6 months ago

Thanks for the appreciation :-)

The default settings are still OK for picking up low-abundance species if your dataset is big enough or the number of species is low. If you only have 2000 reads, then it is more difficult to pick up low-abundance species. By default it uses 10000 reads, which is enough for samples with only a few species in them. If you have a lot of species in your sample, it is better to increase that value or use the --allreads option.
In my test runs when I published the tool, it picked up species that made up 1.5% of the reads in the sample (with 7 species in the sample).

If 2% of the reads in the dataset belong to a certain species, that means about 20 reads in a batch of 1000. If the quality of those reads is sufficient (the similarity between those 20 is above the estimated "--similar_species_groups" value, around 93-96%), then it will pick the species up. (The read quality of Nanopore reads forms a Gaussian curve.) But if those 20 are lower-quality reads, it may not be able to form a species group while processing the batches, and you will miss that species.
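To make the arithmetic above concrete, here is a small sketch (not part of amplicon_sorter). It assumes each batch is a random draw from the dataset, so the number of reads from one species in a batch is roughly binomial. The function names and the `k` threshold are hypothetical; how many reads the tool actually needs to form a species group is not stated here.

```python
from math import comb

def expected_reads_per_batch(fraction, batch_size=1000):
    """Expected number of reads from one species in a single batch."""
    return fraction * batch_size

def prob_at_least(k, fraction, batch_size=1000):
    """P(X >= k) for X ~ Binomial(batch_size, fraction).

    Assumption: reads are sampled independently and at random, so the
    per-batch count of a species follows a binomial distribution.
    """
    p_fewer = sum(
        comb(batch_size, i) * fraction**i * (1 - fraction) ** (batch_size - i)
        for i in range(k)
    )
    return 1.0 - p_fewer

# 2% abundance in a batch of 1000 reads gives about 20 reads on average.
print(expected_reads_per_batch(0.02))
# Chance that a batch holds at least 10 reads of that species
# (10 is an illustrative threshold, not a value taken from the tool).
print(prob_at_least(10, 0.02))
```

This only counts reads; it says nothing about whether those reads are similar enough to each other to form a group, which is the quality issue described above.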

There are a few ways to increase the chance of picking them up:
- run amplicon_sorter on higher-quality reads (above Q10 or Q12 if you have enough reads)
- lower the value of --similar_species_groups (but then you increase the chance that closely related species are merged)
- run amplicon_sorter a few times on the same dataset with the option --random, to increase the chance that a few reads are found that can form a species group (this will take some more processing time)
- run amplicon_sorter with the option --all to compare all reads with each other (very time-consuming on larger datasets)
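For the first suggestion, a minimal sketch of pre-filtering a FASTQ file by mean read quality could look like the following. It is not part of amplicon_sorter, and dedicated read-filtering tools do this more robustly. It assumes plain (uncompressed) four-line FASTQ records with Phred+33 encoding, and averages the per-base error probabilities rather than the raw Q scores. The function names and file paths are hypothetical.

```python
import math

def mean_q(quality_string, offset=33):
    """Mean Phred quality of a read, averaged over error probabilities."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))

def filter_fastq(in_path, out_path, min_q=10.0):
    """Copy reads with mean quality >= min_q; return how many were kept.

    Assumption: every record is exactly four lines
    (header, sequence, '+', quality).
    """
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            header = src.readline()
            if not header:
                break  # end of file
            seq = src.readline()
            plus = src.readline()
            qual = src.readline()
            q = qual.rstrip("\n")
            if q and mean_q(q) >= min_q:
                dst.write(header + seq + plus + qual)
                kept += 1
    return kept
```

For example, `filter_fastq("reads.fastq", "reads_q12.fastq", min_q=12)` would write only the reads with a mean quality of at least Q12, and the filtered file can then be used as input to amplicon_sorter.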

Best regards, Andy

josiah-liew commented 6 months ago

Thank you, Andy! And thanks for the suggestions too. This definitely clarifies things, and I'll do a few test runs per your suggestions.

Cheers, Josiah

avierstr / amplicon_sorter

Request for info: --allreads and default parameters #18