CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

error in the step of umi_tools extract #289

Closed: zhuxqdoctor closed this issue 5 years ago

zhuxqdoctor commented 5 years ago

Hi all. I am trying to run umi_tools on 10x Genomics (Chromium™ Single Cell 3’ v2) scRNA-seq data. My read 1 looks like this:

@C00135:251:CB8E5ANXX:8:1101:1199:2066 1:N:0:GCATCTCC
CCTTCCCAGTCCAGGATAATCGGCAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGTTTTGAAAAAAAATTTTTTTTTTTTTTGGGGGGGGGGGGGGG
+
BBBBBFFBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF////<//////<7<B///B7BB<</BFB//////BFFFF<</B//

Each read contains a 16 bp cell barcode and a 10 bp UMI, followed by poly-T. Based on this, I set the barcode pattern to --bc-pattern='(?P<cell_1>.{16})(?P<umi_1>.{10})T{3}.*'. This worked for the 'umi_tools whitelist' step, but 'umi_tools extract' fails with the error shown below. I also tried --bc-pattern='(?P<cell_1>.{16})(?P<umi_1>.{10})' and got the same error. Can anyone give me any suggestions? Thanks.

job started at Fri Oct 26 13:08:37 2018 on ctcf -- a7ca481d-f419-4324-b75e-07e8d9b45fed

pid: 173319, system: Linux 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64

blacklist : None

compresslevel : 6

error_correct_cell : False

extract_method : string

filter_cell_barcode : True

filter_cell_barcodes : False

log2stderr : False

loglevel : 1

pattern : (?P<cell_1>.{16})(?P<umi_1>.{10})

pattern2 : None

prime3 : None

quality_encoding : None

quality_filter_mask : None

quality_filter_threshold : None

random_seed : None

read2_in : SYL1_S1_L008_R2_001.fastq.gz

read2_out : SYL1_S1_L008_R2_001_extracted.fastq.gz

read2_stdout : False

reads_subset : None

reconcile : False

retain_umi : None

short_help : None

stderr : <open file '', mode 'w' at 0x7fc28478b1e0>

stdin : <gzip open file 'SYL1_S1_L008_R1_001.fastq.gz', mode 'rb' at 0x7fc264ab3c90 0x7fc264a13750>

stdlog : <open file '', mode 'w' at 0x7fc28478b150>

stdout : <gzip open file 'SYL1_S1_L008_R1_001_extracted.fastq.gz', mode 'wb' at 0x7fc264ab3d20 0x7fc28464be50>

timeit_file : None

timeit_header : None

timeit_name : all

whitelist : whitelist.txt

2018-10-26 13:08:37,092 ERROR barcode pattern(s) do not include any umi bases (marked with 'Ns') (?P<cell_1>.{16})(?P<umi_1>.{10}), None
Traceback (most recent call last):
  File "/usr/local/anaconda2/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/usr/local/anaconda2/lib/python2.7/site-packages/umi_tools/umi_tools.py", line 59, in main
    module.main(sys.argv)
  File "/usr/local/anaconda2/lib/python2.7/site-packages/umi_tools/extract.py", line 264, in main
    options.pattern, options.pattern2))
  File "/usr/local/anaconda2/lib/python2.7/site-packages/umi_tools/Utilities.py", line 1118, in error
    raise ValueError("UMI-tools failed with an error. Check the log file")
ValueError: UMI-tools failed with an error. Check the log file

IanSudbery commented 5 years ago

Hi. You need to set the --extract_method=regex option.
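
For context, here is a minimal sketch of why the option matters. The log in the question shows extract_method : string, which is the default. In that mode umi_tools reads the barcode pattern as a literal layout string in which N marks UMI bases and C marks cell barcode bases, so a regular expression with no N in it triggers the "do not include any umi bases" error. With the regex method, named groups define the fields instead. The snippet uses Python's standard re module as an illustrative stand-in, not UMI-tools' internal code. The group names cell_1 and umi_1 are assumed from the naming used in the UMI-tools documentation, and has_umi_bases is a hypothetical helper that only approximates the real check.

```python
import re

# First 30 bases of read 1 from the question: 16 bp cell barcode,
# 10 bp UMI, then the start of the poly-T stretch.
READ1_PREFIX = "CCTTCCCAGTCCAGGATAATCGGCAATTTT"

# Regex-method pattern; the group names cell_1 and umi_1 are an assumption
# based on the naming convention in the UMI-tools documentation.
REGEX_PATTERN = r"(?P<cell_1>.{16})(?P<umi_1>.{10})"

# The same layout written in string-method notation: C = cell barcode, N = UMI.
STRING_PATTERN = "C" * 16 + "N" * 10


def has_umi_bases(string_pattern):
    """Hypothetical stand-in for the string-method sanity check:
    the pattern must contain at least one 'N'."""
    return "N" in string_pattern


# Under the string method, the regex text itself is inspected for 'N'.
print(has_umi_bases(REGEX_PATTERN))
print(has_umi_bases(STRING_PATTERN))

# Under the regex method, the named groups pick out the two fields.
match = re.match(REGEX_PATTERN, READ1_PREFIX)
cell_barcode = match.group("cell_1")
umi = match.group("umi_1")
print(cell_barcode, umi)
```

Either notation describes the same 16 + 10 layout; the regex form is only needed when the pattern has to express something the fixed-position string cannot, such as the trailing T{3}.* anchor in the first pattern tried.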

helianthuszhu commented 5 years ago

Thank you so much. I have run it successfully using --extract_method=regex in the second step. Thanks again.
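
For readers who land here with the same error, the second step might look roughly like the command assembled below. This is a sketch, not the poster's exact invocation. The file names and whitelist path are taken from the log in the question, the option spellings follow the hyphenated forms used in the UMI-tools documentation (--extract-method, --bc-pattern, --read2-in and so on), and the cell_1 and umi_1 group names are assumed. The script only builds and prints the command; it does not run umi_tools.

```python
import shlex

# Assumed group names; the 16 + 10 layout comes from the question.
BC_PATTERN = r"(?P<cell_1>.{16})(?P<umi_1>.{10})"

# Argument list for the extract step. File names are copied from the
# log output in the question; adjust them for your own data.
extract_cmd = [
    "umi_tools", "extract",
    "--extract-method=regex",
    "--bc-pattern=" + BC_PATTERN,
    "--stdin", "SYL1_S1_L008_R1_001.fastq.gz",
    "--stdout", "SYL1_S1_L008_R1_001_extracted.fastq.gz",
    "--read2-in", "SYL1_S1_L008_R2_001.fastq.gz",
    "--read2-out", "SYL1_S1_L008_R2_001_extracted.fastq.gz",
    "--filter-cell-barcode",
    "--whitelist", "whitelist.txt",
]

# Quote each argument so the regex survives being pasted into a shell.
command_line = " ".join(shlex.quote(arg) for arg in extract_cmd)
print(command_line)
```

Building the command as a list and quoting it with shlex.quote keeps the parentheses and braces in the pattern from being interpreted by the shell.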

IanSudbery commented 5 years ago

By the way, if you are processing 10x data, you might like to try alevin (https://salmon.readthedocs.io/en/latest/alevin.html). It has a de-duplication algorithm inspired by UMI-tools, but which properly accounts for transcript ambiguity and runs much faster than the UMI-tools pipeline.