CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

--filtered-out doesn't work #471

Closed Freequilibrium closed 3 years ago

Freequilibrium commented 3 years ago

Dear UMI-tools group

I'm trying to use the --filtered-out parameter to save the FASTQ reads that do not match the regex. This is important feedback that helps me improve my regex UMI extraction.

However, when I specify --filtered-out, umi_tools extract no longer works.

UMI-tools works if I do not specify --filtered-out:

# UMI-tools version: 1.1.1
# output generated by extract --extract-method=regex -I 3-DNA-Input_S11_R1_QC.fastq.gz -S test2.gz --bc-pattern=^(?P<umi_1>.{0,3})GGACGAGCTGTACAAGTAAA{e<=2} --subset-reads=10000 -v 1
# job started at Tue May  4 18:05:26 2021 on N420-Bioinformatic -- 83dd61fb-1083-40ae-91e4-92c8fc5ef168
# pid: 425183, system: Linux 5.4.0-26-generic #30-Ubuntu SMP Mon Apr 20 16:58:30 UTC 2020 x86_64
# blacklist                               : None
# compresslevel                           : 6
# correct_umi_threshold                   : 0
# either_read                             : False
# either_read_resolve                     : discard
# error_correct_cell                      : False
# extract_method                          : regex
# filter_cell_barcode                     : None
# filter_cell_barcodes                    : False
# filter_umi                              : None
# filtered_out                            : None
# filtered_out2                           : None
# ignore_suffix                           : False
# log2stderr                              : False
# loglevel                                : 1
# pattern                                 : ^(?P<umi_1>.{0,3})GGACGAGCTGTACAAGTAAA{e<=2}
# pattern2                                : None
# prime3                                  : None
# quality_encoding                        : None
# quality_filter_mask                     : None
# quality_filter_threshold                : None
# random_seed                             : None
# read2_in                                : None
# read2_out                               : False
# read2_stdout                            : False
# reads_subset                            : 10000
# reconcile                               : False
# retain_umi                              : None
# short_help                              : None
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin                                   : <_io.TextIOWrapper name='3-DNA-Input_S11_R1_QC.fastq.gz' encoding='ascii'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
# stdout                                  : <_io.TextIOWrapper name='test2.gz' encoding='ascii'>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_correct_log                         : None
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# whitelist                               : None
2021-05-04 18:05:26,715 INFO Starting barcode extraction
2021-05-04 18:05:27,408 INFO Input Reads: 10001
2021-05-04 18:05:27,408 INFO regex matches read1: 9252
2021-05-04 18:05:27,408 INFO Reads output: 9252
2021-05-04 18:05:27,408 INFO regex does not match read1: 749
# job finished in 0 seconds at Tue May  4 18:05:27 2021 --  1.38  0.28  0.00  0.00 -- 83dd61fb-1083-40ae-91e4-92c8fc5ef168

UMI-tools fails when I specify --filtered-out:

# UMI-tools version: 1.1.1
# output generated by extract --extract-method=regex -I 3-DNA-Input_S11_R1_QC.fastq.gz -S test2.gz --bc-pattern=^(?P<umi_1>.{0,3})GGACGAGCTGTACAAGTAAA{e<=2} --subset-reads=10000 -v 1 --filtered-out=out
# job started at Tue May  4 18:05:53 2021 on N420-Bioinformatic -- cffd2f72-d998-4d8e-a57e-7a9c4ed981ea
# pid: 425194, system: Linux 5.4.0-26-generic #30-Ubuntu SMP Mon Apr 20 16:58:30 UTC 2020 x86_64
# blacklist                               : None
# compresslevel                           : 6
# correct_umi_threshold                   : 0
# either_read                             : False
# either_read_resolve                     : discard
# error_correct_cell                      : False
# extract_method                          : regex
# filter_cell_barcode                     : None
# filter_cell_barcodes                    : False
# filter_umi                              : None
# filtered_out                            : out
# filtered_out2                           : None
# ignore_suffix                           : False
# log2stderr                              : False
# loglevel                                : 1
# pattern                                 : ^(?P<umi_1>.{0,3})GGACGAGCTGTACAAGTAAA{e<=2}
# pattern2                                : None
# prime3                                  : None
# quality_encoding                        : None
# quality_filter_mask                     : None
# quality_filter_threshold                : None
# random_seed                             : None
# read2_in                                : None
# read2_out                               : False
# read2_stdout                            : False
# reads_subset                            : 10000
# reconcile                               : False
# retain_umi                              : None
# short_help                              : None
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin                                   : <_io.TextIOWrapper name='3-DNA-Input_S11_R1_QC.fastq.gz' encoding='ascii'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
# stdout                                  : <_io.TextIOWrapper name='test2.gz' encoding='ascii'>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_correct_log                         : None
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# whitelist                               : None
2021-05-04 18:05:53,816 INFO Starting barcode extraction
Traceback (most recent call last):
  File "/home/imb-n420-cll/miniconda3/envs/RNASeq/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/imb-n420-cll/miniconda3/envs/RNASeq/lib/python3.8/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/home/imb-n420-cll/miniconda3/envs/RNASeq/lib/python3.8/site-packages/umi_tools/extract.py", line 446, in main
    filtered_out.write(str(read1) + "\n")
UnboundLocalError: local variable 'read1' referenced before assignment

How do I solve this problem? Thank you.
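
The error class in the traceback above can be reproduced in isolation. The following is a minimal sketch of the general pattern only, not the actual code in extract.py (which is not shown in this thread): the function names `write_filtered` and `write_filtered_fixed` and their arguments are hypothetical. A local name bound in only one branch is read in another branch before it has ever been assigned.

```python
def write_filtered(reads, matches, sink):
    """Hypothetical illustration: 'read1' is bound only when a read matches."""
    for candidate in reads:
        if matches(candidate):
            read1 = candidate  # the only place read1 is assigned
        else:
            # If no earlier read matched, read1 has never been bound here.
            sink.append(str(read1) + "\n")


def write_filtered_fixed(reads, matches, sink):
    """Same idea, but the branch only uses a name that is always bound."""
    for candidate in reads:
        if not matches(candidate):
            sink.append(str(candidate) + "\n")


if __name__ == "__main__":
    try:
        write_filtered(["AAA"], lambda read: False, [])
    except UnboundLocalError as err:
        print(type(err).__name__)  # prints: UnboundLocalError
```

The exact wording of the error message differs between Python versions, so the sketch prints only the exception type.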

IanSudbery commented 3 years ago

This has been fixed in the latest version on master. Are you able to install from there?

Freequilibrium commented 3 years ago

I installed umi_tools from bioconda, which provides version 1.1.1 (and the home page of this GitHub repository also shows 1.1.1 as the latest version).

Or which method should I use?

Thank You

IanSudbery commented 3 years ago

This fix isn't yet in a released version. To install it, activate your conda environment and then download the code of the latest version:

wget https://github.com/CGATOxford/UMI-tools/archive/refs/heads/master.zip 

Unzip it

unzip  master.zip 

Go into the folder that is created and run

python setup.py install

TomSmithCGAT commented 3 years ago

I've just released 1.1.2 which includes this bug fix. That should get into bioconda in the next few days. I'll check and add manually if not.

If you need it before then, follow @IanSudbery's instructions...
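
To confirm which version ends up in the active environment, something like the following could be run. This is a rough sketch under stated assumptions: it assumes the installed distribution is registered under the name `umi_tools`, that version strings are purely numeric and dot-separated (as with 1.1.1 and 1.1.2), and the helper names `parse_version`, `has_fix`, and `installed_umi_tools_version` are made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '1.1.2' into (1, 1, 2). Assumes numeric, dot-separated parts."""
    return tuple(int(part) for part in text.split("."))


def has_fix(installed, fixed_in="1.1.2"):
    """True if the installed version is at least the release with the fix."""
    return parse_version(installed) >= parse_version(fixed_in)


def installed_umi_tools_version():
    """Return the installed version string, or None if it is not installed.

    The distribution name 'umi_tools' is an assumption.
    """
    try:
        return version("umi_tools")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_umi_tools_version()
    if found is None:
        print("umi_tools is not installed in this environment")
    else:
        print(found, "includes the fix" if has_fix(found) else "predates the fix")
```

Tuples are compared rather than raw strings so that, for example, 1.10.0 is correctly treated as newer than 1.2.0.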

Freequilibrium commented 3 years ago

Thank You Very Much !