CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

reads limit of fastq file #587

Closed by yuw444 1 year ago

yuw444 commented 1 year ago

Hi,

I am using the whitelist subcommand and found the following in the log.

...
2023-03-03 15:31:38,989 INFO Parsed 99500000 reads
2023-03-03 15:31:39,570 INFO Parsed 99600000 reads
2023-03-03 15:31:40,167 INFO Parsed 99700000 reads
2023-03-03 15:31:40,748 INFO Parsed 99800000 reads
2023-03-03 15:31:41,337 INFO Parsed 99900000 reads
2023-03-03 15:31:41,907 INFO Parsed 100000000 reads
2023-03-03 15:31:41,907 INFO Starting - whitelist determination
2023-03-03 15:31:44,498 INFO Finished - whitelist determination
2023-03-03 15:31:44,499 INFO Starting - finding putative error cell barcodes
2023-03-03 15:31:44,499 INFO building bktree
2023-03-03 15:31:44,511 INFO done building bktree
2023-03-03 15:37:45,771 INFO Finished - finding putative error cell barcodes
2023-03-03 15:37:45,771 INFO Top 2052 cell barcodes passed the selected threshold
2023-03-03 15:37:45,771 INFO Writing out whitelist
2023-03-03 15:37:45,863 INFO Parsed 100000001 reads
2023-03-03 15:37:45,863 INFO 100000001 reads matched the barcode pattern
2023-03-03 15:37:45,863 INFO Found 2612577 unique cell barcodes
2023-03-03 15:37:45,863 INFO Found 39600555 total reads matching the selected cell barcodes
2023-03-03 15:37:45,863 INFO Found 759991 total reads which can be error corrected to the selected cell barcodes
# job finished in 951 seconds at Fri Mar  3 15:37:45 2023 -- 944.20 10.29  0.00  0.00

Does this mean umi-tools only parsed the first 100000001 reads, and the rest of them are untouched?

TomSmithCGAT commented 1 year ago

Yes, that’s correct. By default, only the first 100M reads are processed. That’s enough to estimate the true cell barcodes. You can increase the number of reads processed with --subset-reads if you wish.
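For anyone who wants the whitelist built from every read rather than the default subset, a minimal Python sketch along these lines may help. It counts the records in a FASTQ file and builds a umi_tools whitelist command line with --subset-reads set to that count. The helper names, file names, and barcode pattern are made up for illustration. Only --subset-reads comes from this thread; the other flags follow the usual umi_tools whitelist usage, so check them against `umi_tools whitelist --help` for your version.

```python
import gzip


def count_fastq_reads(path):
    """Count FASTQ records, assuming exactly four lines per record."""
    # Hypothetical helper: picks gzip or plain text based on the file suffix.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def whitelist_command(fastq, bc_pattern, subset_reads, out_path):
    """Build the argument list for a umi_tools whitelist run (not executed here)."""
    return [
        "umi_tools", "whitelist",
        "--stdin", fastq,
        f"--bc-pattern={bc_pattern}",
        # Raise the number of reads used for whitelist estimation
        # above the default of 100M mentioned above.
        f"--subset-reads={subset_reads}",
        "--stdout", out_path,
    ]
```

A call such as `whitelist_command("R1.fastq.gz", "CCCCCCCCCCCCCCCCNNNNNNNNNN", count_fastq_reads("R1.fastq.gz"), "whitelist.txt")` returns a list that can be passed to `subprocess.run`. The pattern shown is only a placeholder; use the one that matches your library. Counting the reads first costs a full pass over the file, so passing any number known to be at least as large as the read count is a cheaper alternative if your version treats an oversized value as "use all reads".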

yuw444 commented 1 year ago

Thanks so much for making this configurable.