[Open] ettorefedele opened this issue 1 month ago
Thank you for this very useful tool! I have been using it for DNA metabarcoding analysis of several markers (e.g. COI, trnL, and 16S) with eDNA samples in multiplex reactions, and it has proved great. However, I do have a question concerning random sampling of reads. Let's say I have a fastq containing nearly 800,000 reads (1 sample). I would like to use the -all option to ensure that even rare barcodes are detected, while also using (e.g.) 8x random sampling to reduce the computation time, as described in the reference article. I am not entirely sure I understood how to code this properly. Thank you very much, and I apologise for the trivial question.

Hi ettorefedele,

Using the -all option on 800,000 reads is EXTREMELY time-consuming: it compares every read with every other read, so read 1 with reads 2, 3, ..., 800,000; then read 2 with reads 3, 4, ..., 800,000; and so on. It is not advised for such a huge dataset.

By default, amplicon_sorter compares batches of 1000 reads with each other: it takes reads 1-1000 and compares them, then reads 1001-2000, and so on. With the random option it does the same thing, but the reads are picked in random rather than sequential order. If you do that 8x, it increases the chance of picking up rare barcodes. To use that, you need the random option -ra, and you calculate the maximum number of reads to use as 8 x 800,000 = 6,400,000. Command:

python3 amplicon_sorter.py -i input.fastq -ra -maxr 6400000
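To see why -all is discouraged for a dataset this size, here is a quick back-of-the-envelope sketch (plain Python, not part of amplicon_sorter) that counts the pairwise comparisons each strategy needs and derives the -maxr value, using the 800,000 reads, 1000-read batches, and 8 repeats mentioned above:

```python
# Sketch: pairwise-comparison counts for -all vs the default 1000-read
# batches, plus the -maxr value for 8x random sampling. The figures
# (800,000 reads, batch size 1000, 8 repeats) come from the thread above.

def pairwise(n: int) -> int:
    """Number of unique read pairs among n reads (n choose 2)."""
    return n * (n - 1) // 2

n_reads = 800_000
batch = 1_000
repeats = 8

all_vs_all = pairwise(n_reads)                  # read 1 vs 2..n, read 2 vs 3..n, ...
batched = (n_reads // batch) * pairwise(batch)  # 800 independent batches of 1000
maxr = repeats * n_reads                        # value to pass as -maxr

print(f"-all:    {all_vs_all:,} comparisons")   # 319,999,600,000
print(f"batched: {batched:,} comparisons")      # 399,600,000
print(f"ratio:   {all_vs_all / batched:.0f}x")  # 801x
print(f"-maxr:   {maxr}")                       # 6400000
```

Batching trades completeness for an (n-1)/(batch-1), here roughly 800-fold, reduction in comparisons, which is why repeated random batches are the practical way to still catch rare barcodes.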