avierstr / amplicon_sorter

Sorts amplicons from Nanopore sequencing data based on similarity

Random sampling #19

Open ettorefedele opened 1 month ago

ettorefedele commented 1 month ago

Thank you for this very useful tool! I have been using it for DNA metabarcoding analysis of several markers (e.g. COI, trnl, and 16S) with eDNA samples in multiplex reactions, and it has proved great. However, I do have a question concerning random sampling of reads. Let's say I have a fastq containing nearly 800.000 reads (1 sample). I would like to use the -all option to ensure that even rare barcodes are detected, while also using (e.g.) 8x random sampling to reduce the computation time, as described in the reference article. I am not entirely sure I understand how to code this properly. Thank you very much, and I apologise for the trivial question.

avierstr commented 1 month ago

Hi ettorefedele, Using the -all option on 800.000 reads is EXTREMELY time-consuming: it compares every read with every other read, so read 1 with 2, 3, ... 800.000, then read 2 with 3, 4, 5, ... 800.000, and so on. It is not advised to do that for such a huge dataset.
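As a rough illustration of why this gets so expensive, the number of unique read pairs in an all-vs-all comparison grows quadratically with the number of reads. The snippet below is only a back-of-the-envelope sketch, not code from amplicon_sorter; the function name is hypothetical.

```python
def all_vs_all_pairs(n_reads: int) -> int:
    """Number of unique unordered read pairs: n * (n - 1) / 2."""
    return n_reads * (n_reads - 1) // 2


# Read 1 vs 2..n, read 2 vs 3..n, and so on, for 800 000 reads.
print(f"{all_vs_all_pairs(800_000):,}")  # 319,999,600,000
```

That is roughly 3.2 x 10^11 pairwise comparisons for a single sample.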

By default, amplicon_sorter compares batches of 1000 reads with each other: it takes reads 1-1000 and compares them, then reads 1001-2000, and so on. With the random option it does the same thing, but not in sequential order. If you do that 8x, it increases the chance of picking up rare barcodes. To use it, you need the random option -ra and the maximum number of reads to use (-maxr), calculated as 8 x 800.000 = 6.400.000.

Command: python3 amplicon_sorter.py -i input.fastq -ra -maxr 6400000
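The arithmetic above can be sketched in a few lines of Python. This is a simplified model, not the tool's actual implementation: it assumes each batch of 1000 reads is compared all-vs-all within the batch and ignores any later processing steps. The helper names and the file name input.fastq are hypothetical placeholders; only the flags -i, -ra and -maxr are taken from the command above.

```python
import shlex

BATCH_SIZE = 1000  # default batch size mentioned above


def max_reads(n_reads: int, passes: int) -> int:
    """Value for -maxr: number of reads times the number of sampling passes."""
    return n_reads * passes


def batched_pairs(total_reads: int, batch_size: int = BATCH_SIZE) -> int:
    """Pairwise comparisons if reads are only compared within each batch (assumed model)."""
    full_batches, rest = divmod(total_reads, batch_size)
    per_batch = batch_size * (batch_size - 1) // 2
    return full_batches * per_batch + rest * (rest - 1) // 2


def build_command(fastq: str, n_reads: int, passes: int) -> list[str]:
    """Assemble the command line using only the flags shown in this thread."""
    return ["python3", "amplicon_sorter.py", "-i", fastq,
            "-ra", "-maxr", str(max_reads(n_reads, passes))]


maxr = max_reads(800_000, 8)
print(maxr)                         # 6400000
print(f"{batched_pairs(maxr):,}")   # 3,196,800,000
print(shlex.join(build_command("input.fastq", 800_000, 8)))
```

Under this simplified model, 8x random sampling needs about 3.2 x 10^9 comparisons, roughly a hundred times fewer than the all-vs-all comparison of 800.000 reads.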