CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Error when running umi_tools whitelist TypeError: can only concatenate str (not "NoneType") to str #494

Closed singlecellfan closed 6 months ago

singlecellfan commented 2 years ago

Hi all,

I am trying to run umi_tools whitelist on a FASTQ file from a BD Rhapsody experiment. I have supplied a regex with the appropriate pattern for BD Rhapsody data. However, I still get the following error: "TypeError: can only concatenate str (not "NoneType") to str"

Does anybody know what to do here? It looks like a Python problem. Nevertheless, I am hoping that someone has encountered the error before and knows how to fix it.

Thanks in advance!

IanSudbery commented 2 years ago

Hi,

Is it possible to post the full traceback so we can see where this is coming from?

singlecellfan commented 2 years ago

Hi, this is the traceback I am getting:

2021-11-16 19:00:09,293 INFO Starting barcode extraction
2021-11-16 19:00:09,294 INFO Parsed 0 reads
Traceback (most recent call last):
  File "/home/s051n/.conda/envs/environment/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/s051n/.conda/envs/environment/lib/python3.8/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/home/s051n/.conda/envs/environment/lib/python3.8/site-packages/umi_tools/whitelist.py", line 388, in main
    barcode_values = ReadExtractor.getBarcodes(read1)
  File "/home/s051n/.conda/envs/environment/lib/python3.8/site-packages/umi_tools/extract_methods.py", line 294, in _getBarcodesRegex
    new_seq, new_quals) = ExtractBarcodes(
  File "/home/s051n/.conda/envs/environment/lib/python3.8/site-packages/umi_tools/extract_methods.py", line 134, in ExtractBarcodes
    cell_barcode += groupdict[k]
TypeError: can only concatenate str (not "NoneType") to str
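
For context, one general way this kind of TypeError can arise in Python is when a named group in a regex does not take part in a match: groupdict() then returns None for that group, and concatenating it onto a string fails. The sketch below only illustrates that behaviour. The pattern, the function name join_cell_barcode and the sequences are hypothetical, it uses the standard library re module (which spells named groups (?P<name>...)), and it is not confirmed to be what happened in this run.

```python
import re

# Hypothetical pattern: cell_2 is optional, so it may not take part in a match.
pattern = re.compile(r"(?P<cell_1>.{3})(?P<cell_2>X{2})?(?P<umi_1>.{4})")


def join_cell_barcode(sequence):
    """Concatenate all cell_* groups, mirroring the failing `+=` in the traceback."""
    match = pattern.match(sequence)
    if match is None:
        return None
    groupdict = match.groupdict()
    cell_barcode = ""
    for key in sorted(groupdict):
        if key.startswith("cell_"):
            # groupdict[key] is None when the group did not participate.
            cell_barcode += groupdict[key]
    return cell_barcode


# Both cell groups participate, so concatenation works.
print(join_cell_barcode("ACGXXTTTT"))

# cell_2 is skipped here, so groupdict["cell_2"] is None and += raises.
try:
    join_cell_barcode("ACGTTTTT")
except TypeError as err:
    print("TypeError:", err)
```

If something like this were the cause, printing match.groupdict() for a single read would show which group came back as None.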

IanSudbery commented 2 years ago

Hmmm.... That shouldn't be happening. Can you send me the top of the log output (with the command line and UMI-tools' understanding of the options set) and perhaps a handful of example reads?

singlecellfan commented 2 years ago

Hi, so apparently a mistake crept in while copying the regex term. umi_tools whitelist now runs but does not write anything to the output file. See the error below.

This is the command I am running:
umi_tools whitelist --stdin /path to file/C4_5_3_WTA_S1_R1_001.fastq.gz --extract-method=regex -p "(?<cell_1>.{9}) (?<discard_1>.{12})(?<cell_2>.{9})(?<discard_2>.{13})(?<cell_3>.{9})(?<umi_1>.{8})T+" --method=umis --knee-method=distance --log2stderr > whitelist.txt

I have also run this command:
umi_tools whitelist --stdin /omics/odcf/analysis/OE0246_projects/hh/Schayan/WTA_5prime_3prime/C4_5_3_WTA_S1_R1_001.fastq.gz --bc-pattern="(?<cell_1>.{9}) (?<discard_1>.{12})(?<cell_2>.{9})(?<discard_2>.{13})(?<cell_3>.{9})(?<umi_1>.{8})T+" --extract-method=regex --method=umis --knee-method=distance --log2stderr > whitelist.txt

Output log:

UMI-tools version: 1.1.2
output generated by whitelist --stdin /omics/odcf/analysis/OE0246_projects/hh/Schayan/WTA_5prime_3prime/C4_5_3_WTA_S1_R1_001.fastq.gz --extract-method=regex -p (?<cell_1>.{9}) (?<discard_1>.{12})(?<cell_2>.{9})(?<discard_2>.{13})(?<cell_3>.{9})(?<umi_1>.{8})T+ --method=umis --knee-method=distance --log2stderr
job started at Wed Nov 17 12:06:13 2021 on odcf-cn34u18s04 -- b4b5c1d7-f606-4730-8499-4278ba65e1d0
pid: 16779, system: Linux 3.10.0-1160.25.1.el7.x86_64 #1 SMP Wed Apr 28 21:49:45 UTC 2021 x86_64
allow_threshold_error                   : False
blacklist_tsv                           : None
cell_number                             : False
compresslevel                           : 6
ed_above_threshold                      : None
error_correct_threshold                 : 1
expect_cells                            : False
extract_method                          : regex
filtered_out                            : None
filtered_out2                           : None
ignore_suffix                           : False
knee_method                             : distance
log2stderr                              : True
loglevel                                : 1
method                                  : umis
pattern                                 : (?<cell_1>.{9}) (?<discard_1>.{12})(?<cell_2>.{9})(?<discard_2>.{13})(?<cell_3>.{9})(?<umi_1>.{8})T+
pattern2                                : None
plot_prefix                             : None
prime3                                  : None
random_seed                             : None
read2_in                                : None
short_help                              : None
stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
stdin                                   : <_io.TextIOWrapper name='/omics/odcf/analysis/OE0246_projects/hh/Schayan/WTA_5prime_3prime/C4_5_3_WTA_S1_R1_001.fastq.gz' encoding='ascii'>
stdlog                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
stdout                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
subset_reads                            : 100000000
timeit_file                             : None
timeit_header                           : None
timeit_name                             : all
tmpdir                                  : None
whitelist_tsv                           : None
2021-11-17 12:06:13,560 INFO Starting barcode extraction
2021-11-17 12:06:13,561 INFO Parsed 0 reads
2021-11-17 12:06:14,030 INFO Parsed 100000 reads
2021-11-17 12:06:14,485 INFO Parsed 200000 reads
2021-11-17 12:06:14,934 INFO Parsed 300000 reads
.....

2021-11-17 12:36:09,749 INFO Parsed 377500000 reads
2021-11-17 12:36:10,254 INFO Parsed 377600000 reads
2021-11-17 12:36:10,763 INFO Parsed 377700000 reads
2021-11-17 12:36:11,049 INFO Starting - whitelist determination
Traceback (most recent call last):
  File "/home/s051n/.conda/envs/environment/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/s051n/.conda/envs/environment/lib/python3.8/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/home/s051n/.conda/envs/environment/lib/python3.8/site-packages/umi_tools/whitelist.py", line 454, in main
    cell_whitelist, true_to_false_map = whitelist_methods.getCellWhitelist(
  File "/home/s051n/.conda/envs/environment/lib/python3.8/site-packages/umi_tools/whitelist_methods.py", line 472, in getCellWhitelist
    cell_whitelist = getKneeEstimateDistance(
  File "/home/s051n/.conda/envs/environment/lib/python3.8/site-packages/umi_tools/whitelist_methods.py", line 322, in getKneeEstimateDistance
    distToLine, idxOfBestPoint = getKneeDistance(values)
  File "/home/s051n/.conda/envs/environment/lib/python3.8/site-packages/umi_tools/whitelist_methods.py", line 282, in getKneeDistance
    firstPoint = allCoord[0]
IndexError: index 0 is out of bounds for axis 0 with size 0

Any help is much appreciated! Thanks.

IanSudbery commented 2 years ago

Looks to me like it's failing to find a knee. @TomSmithCGAT?

IanSudbery commented 2 years ago

I just looked at this again, and the only way I can see that this might happen is if no barcodes have been found in any of the reads.
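
One way to check this is to count how many reads the pattern matches at all. The sketch below is a rough check under stated assumptions: it uses the standard library re module, so the named groups are rewritten as (?P<name>...); the helper names first_sequences and count_matching_reads and the synthetic read are hypothetical; and it is not UMI-tools' own extraction code. Note that the pattern as shown in the log has a space after the first group. Whether that space is really in the command or is an artifact of how the issue was rendered is unclear, but in a regex compiled without verbose mode a space only matches a literal space, which a read sequence would not contain.

```python
import gzip
import re
from itertools import islice

# Pattern as it appears in the log (note the space), in stdlib `re` syntax.
PATTERN_WITH_SPACE = (
    r"(?P<cell_1>.{9}) (?P<discard_1>.{12})(?P<cell_2>.{9})"
    r"(?P<discard_2>.{13})(?P<cell_3>.{9})(?P<umi_1>.{8})T+"
)
# The same pattern with the space removed.
PATTERN_NO_SPACE = PATTERN_WITH_SPACE.replace(" ", "")


def first_sequences(fastq_gz_path, n=1000):
    """Return the sequence lines of the first n records of a gzipped FASTQ."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        # In a 4-line FASTQ record the sequence is the second line.
        return [line.strip() for line in islice(handle, 1, 4 * n, 4)]


def count_matching_reads(sequences, pattern):
    """Count sequences that the pattern matches from the start of the read."""
    compiled = re.compile(pattern)
    return sum(1 for seq in sequences if compiled.match(seq))


# Synthetic read with the 9 + 12 + 9 + 13 + 9 + 8 + poly-T layout (made-up bases).
example_read = "A" * 9 + "C" * 12 + "G" * 9 + "C" * 13 + "A" * 9 + "G" * 8 + "TTTT"

print(count_matching_reads([example_read], PATTERN_WITH_SPACE))
print(count_matching_reads([example_read], PATTERN_NO_SPACE))
```

On real data, something like count_matching_reads(first_sequences("reads.fastq.gz"), PATTERN_NO_SPACE) (with a placeholder file name) would show whether any of the first thousand reads match; a count of zero would be consistent with the empty array in the traceback.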

singlecellfan commented 2 years ago

Hi, thanks for getting back to me. Could it be due to our data/read structure or the barcode tags? The input was BD Rhapsody fastq files. I thought with the regex term the difference in the barcode pattern should have been covered. Maybe I need to have a closer look at the barcode pattern and adapt the regex term.

GlancerZ commented 2 years ago

Hi, thanks. Could it be because our data/read structure is from BD? Perhaps the regex pattern needs to be adjusted.

Do you have any experience dealing with BD data now?

TomSmithCGAT commented 6 months ago

Closing due to inactivity.