Random sampling - Githubissues

avierstr / amplicon_sorter

Sorts amplicons from Nanopore sequencing data based on similarity

Hi ettorefedele, using the -all option on 800,000 reads is EXTREMELY time-consuming: it compares every read with every other read, so read 1 with reads 2, 3, ... 800,000, then read 2 with reads 3, 4, 5, ... 800,000, and so on. This is not advised for such a huge dataset.
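To put a number on that, here is a rough sketch that counts the unordered read pairs in an all-vs-all comparison. The function name is made up for illustration and this is not code from amplicon_sorter itself; it only does the arithmetic described above.

```python
def all_vs_all_pairs(n_reads: int) -> int:
    """Count unordered pairs when every read is compared with every other read.

    Read 1 is compared with n-1 reads, read 2 with n-2 reads, and so on,
    which sums to n * (n - 1) / 2.
    """
    return n_reads * (n_reads - 1) // 2


# 800,000 reads, as in the question above
print(f"{all_vs_all_pairs(800_000):,}")  # 319,999,600,000
```

That is roughly 320 billion pairwise comparisons, which is why the run time explodes on a dataset of this size.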

By default, amplicon_sorter compares reads in batches of 1000: it takes reads 1-1000 and compares them with each other, then reads 1001-2000, and so on. With the random option it does the same thing, but the reads are not taken in sequential order. If you sample 8 times as many reads as the file contains, each read is picked about 8 times on average, which increases the chance of picking up rare barcodes. To do that, use the random option -ra and set the maximum number of reads with -maxr: 8 x 800,000 = 6,400,000.

Command: python3 amplicon_sorter.py -i infut.fastq -ra -maxr 6400000
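The calculation can be sketched in a few lines of Python. The helper names here are hypothetical and not part of amplicon_sorter. The pair count assumes that every read in a batch of 1000 is compared with every other read in the same batch, which is a simplification of whatever the script does internally. Only the -i, -ra and -maxr options from the command above are used.

```python
import shlex


def max_reads_for_passes(n_reads: int, passes: int) -> int:
    """Value for -maxr so that each read is sampled about `passes` times on average."""
    return n_reads * passes


def batched_pairs(n_sampled: int, batch_size: int = 1000) -> int:
    """Pairwise comparisons if reads are only compared within their own batch (assumed model)."""
    full_batches, rest = divmod(n_sampled, batch_size)
    pairs_per_batch = batch_size * (batch_size - 1) // 2
    return full_batches * pairs_per_batch + rest * (rest - 1) // 2


def build_command(infile: str, maxr: int) -> str:
    """Assemble the command line with random sampling enabled."""
    args = ["python3", "amplicon_sorter.py", "-i", infile, "-ra", "-maxr", str(maxr)]
    return shlex.join(args)


maxr = max_reads_for_passes(800_000, 8)
print(maxr)                                # 6400000
print(f"{batched_pairs(maxr):,}")          # 3,196,800,000
print(build_command("infut.fastq", maxr))  # python3 amplicon_sorter.py -i infut.fastq -ra -maxr 6400000
```

Under these assumptions, sampling 6,400,000 reads in batches of 1000 needs about 3.2 billion comparisons, far fewer than comparing all 800,000 reads with each other.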

Random sampling #19