CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

umi_tools extract error #552

Closed fpengstudy closed 2 years ago

fpengstudy commented 2 years ago

Hi! Thank you for creating such a useful tool.

I use umi_tools as my standard tool for extracting UMIs and cell barcodes. While running it, I encountered what looks like an error that makes the program stop halfway through, and I would like to know what causes it.

umi_tools extract --bc-pattern=CCCCCCCCCNNNNN --stdin SRR12395858_1.fastq.gz --stdout SRR12395858_1_extracted.fastq --read2-in SRR12395858_2.fastq.gz --read2-out=SRR12395858_2_extracted.fastq --error-correct-cell --whitelist=whitelist_9.txt
# UMI-tools version: 1.1.2
# output generated by extract --bc-pattern=CCCCCCCCCNNNNN --stdin SRR12395858_1.fastq.gz --stdout SRR12395858_1_extracted.fastq --read2-in SRR12395858_2.fastq.gz --read2-out=SRR12395858_2_extracted.fastq --error-correct-cell --whitelist=whitelist_9.txt
# job started at Mon Aug 15 20:04:11 2022 on login -- f8995c51-5890-4f7a-b4e6-44de8a373586
# pid: 29165, system: Linux 2.6.32-504.el6.x86_64 #1 SMP Wed Oct 15 04:27:16 UTC 2014 x86_64
# blacklist                               : None
# compresslevel                           : 6
# correct_umi_threshold                   : 0
# either_read                             : False
# either_read_resolve                     : discard
……………………
……………………
2022-08-15 20:18:33,717 INFO Parsed 56500000 reads
2022-08-15 20:18:35,225 INFO Parsed 56600000 reads
2022-08-15 20:18:36,736 INFO Parsed 56700000 reads
2022-08-15 20:18:38,253 INFO Parsed 56800000 reads
2022-08-15 20:18:39,764 INFO Parsed 56900000 reads
2022-08-15 20:18:41,283 INFO Parsed 57000000 reads
2022-08-15 20:18:42,386 INFO Input Reads: 57072999
2022-08-15 20:18:42,386 INFO Filtered cell barcode. Not correctable: 56018271
2022-08-15 20:18:42,386 INFO Reads output: 1054728
2022-08-15 20:18:42,386 INFO False cell barcode. Error-corrected: 731154
# job finished in 870 seconds at Mon Aug 15 20:18:42 2022 -- 858.92 13.21  0.00  0.00 -- f8995c51-5890-4f7a-b4e6-44de8a373586

Thank you !

fpengstudy commented 2 years ago

The error I mean is this line: 2022-08-15 20:18:42,386 INFO False cell barcode. Error-corrected: 731154

IanSudbery commented 2 years ago

Hi,

The program didn't stop - it completed normally. See # job finished in 870 seconds.

The line you highlight is not an error in the program, but rather a report that extract found 731,154 cell barcodes that were not in the whitelist but that it was able to error-correct to barcodes that were in the whitelist. However, it also found more than 56 million cell barcodes that were not in the whitelist and that it was not able to error-correct to a whitelisted barcode:

2022-08-15 20:18:42,386 INFO Filtered cell barcode. Not correctable: 56018271
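To make the three outcomes concrete (exact match, error-corrected, not correctable), here is a minimal Python sketch. It assumes correction means "exactly one whitelisted barcode within Hamming distance 1", which is an illustrative rule and not necessarily how umi_tools decides. The names `hamming`, `classify` and the toy whitelist are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def classify(barcode, whitelist, max_dist=1):
    """Return (status, barcode_to_use) for one observed cell barcode.

    Assumed rule (illustrative only): correct a barcode when exactly one
    whitelisted barcode lies within max_dist mismatches of it.
    """
    if barcode in whitelist:
        return "match", barcode
    near = [w for w in whitelist
            if len(w) == len(barcode) and hamming(w, barcode) <= max_dist]
    if len(near) == 1:
        return "corrected", near[0]
    return "filtered", None


# Toy whitelist of 9nt barcodes, invented for this example.
toy_whitelist = {"AAAAAAAAA", "CCCCCCCCC"}
exact = classify("AAAAAAAAA", toy_whitelist)
fixed = classify("AAAAAAAAT", toy_whitelist)
dropped = classify("ACGTACGTA", toy_whitelist)
```

In this toy run, `exact` is a match, `fixed` is corrected to `AAAAAAAAA`, and `dropped` is filtered. If the barcode pattern does not line up with the real barcode position, nearly every read ends up in the last category.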

This would suggest to me that there is something wrong with one of: the barcode pattern, the whitelist, or the data. You are left with only 1,054,728 reads out of 57,072,999, and 731,154 of those (about 69%) didn't have a barcode in the whitelist but could be corrected to one.

In particular, it's rather odd to have a 9nt cell barcode but only a 5nt UMI.
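The proportions can be checked directly from the four summary lines of the log. The sketch below is a small Python helper (the name `parse_counts` is hypothetical, not part of umi_tools) that pulls out the counts and computes the fractions.

```python
import re

# Summary lines copied from the log in this thread.
LOG = """\
2022-08-15 20:18:42,386 INFO Input Reads: 57072999
2022-08-15 20:18:42,386 INFO Filtered cell barcode. Not correctable: 56018271
2022-08-15 20:18:42,386 INFO Reads output: 1054728
2022-08-15 20:18:42,386 INFO False cell barcode. Error-corrected: 731154
"""


def parse_counts(log_text):
    """Map each 'INFO <label>: <number>' line to {label: number}."""
    counts = {}
    for line in log_text.splitlines():
        m = re.search(r"INFO (.+): (\d+)$", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts


counts = parse_counts(LOG)
total = counts["Input Reads"]
discarded = counts["Filtered cell barcode. Not correctable"]
kept = counts["Reads output"]
corrected = counts["False cell barcode. Error-corrected"]

frac_kept = kept / total                 # share of input reads written out
frac_corrected_of_kept = corrected / kept  # share of output that needed correction

print(f"kept {frac_kept:.1%} of reads; "
      f"{frac_corrected_of_kept:.1%} of those were error-corrected")
```

The counts are consistent with each other (input minus discarded equals output), so fewer than 2% of the reads survive extraction.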

fpengstudy commented 2 years ago

Thanks for your answer!

fpengstudy commented 2 years ago

Thank you for your reply! I am new to this field. I wanted to determine which cell each read came from, so I used this tool. My sample is from the CEL-Seq2 platform, and I am confused about the number of bases in the UMI and the barcode. Thank you very much for your reply! I will continue to study the relevant principles.

IanSudbery commented 2 years ago

According to the CEL-Seq2 paper, the CEL-Seq2 cell barcode is 6nt, and the UMI is 6nt.
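In a `--bc-pattern` string, `C` marks a cell barcode base and `N` marks a UMI base. The Python sketch below shows how a pattern splits the start of a read, comparing the pattern used in this thread with a 6nt + 6nt pattern. The UMI-first order in `ASSUMED_CELSEQ2_PATTERN` is an assumption to verify against the protocol. The function `split_by_pattern` and the example read are hypothetical, and this is not umi_tools code.

```python
def split_by_pattern(seq, pattern):
    """Split the start of a read into (cell_barcode, umi, remaining_read).

    Only 'C' (cell barcode base) and 'N' (UMI base) are handled in this
    sketch; any other pattern symbol raises an error.
    """
    if len(seq) < len(pattern):
        raise ValueError("read is shorter than the pattern")
    cell, umi = [], []
    for base, symbol in zip(seq, pattern):
        if symbol == "C":
            cell.append(base)
        elif symbol == "N":
            umi.append(base)
        else:
            raise ValueError(f"unsupported pattern symbol: {symbol!r}")
    return "".join(cell), "".join(umi), seq[len(pattern):]


# Pattern from the command in this thread: 9nt barcode, then 5nt UMI.
ORIGINAL_PATTERN = "C" * 9 + "N" * 5
# Assumption: 6nt UMI followed by 6nt cell barcode; check the order
# against the CEL-Seq2 protocol before using it.
ASSUMED_CELSEQ2_PATTERN = "N" * 6 + "C" * 6

read = "ACGTACGGTTAACCTTTT"  # made-up read 1 sequence
with_original = split_by_pattern(read, ORIGINAL_PATTERN)
with_assumed = split_by_pattern(read, ASSUMED_CELSEQ2_PATTERN)
```

With the original pattern the "barcode" is `ACGTACGGT`, a mix of UMI and barcode bases under the assumed layout, which would rarely match a whitelist. With the assumed pattern the barcode is `GGTTAA` and the UMI is `ACGTAC`.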