CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Whitelist: Can't proceed beyond "finding putative error cell barcodes" #496

Closed kasayadior closed 2 years ago

kasayadior commented 2 years ago

Hi, there is an issue generating the whitelist. Log head:

UMI-tools version: 1.0.1
output generated by whitelist --stdin /directory/R1.fastq.gz --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNNNN --method=reads --set-cell-number=10000 --extract-method=string --stdout /directory/umitool_out.txt --log /directory/umitool_log.txt
job started at Thu Nov 18 17:42:02 2021 on server

Log tail:

2021-11-18 17:59:24,546 INFO Parsed 99900000 reads
2021-11-18 17:59:25,451 INFO Parsed 100000000 reads
2021-11-18 17:59:25,451 INFO Starting - whitelist determination
2021-11-18 17:59:28,695 INFO Finished - whitelist determination
2021-11-18 17:59:28,695 INFO Starting - finding putative error cell barcodes

And the log is not updating any further.


Traceback (most recent call last):
  File "/home/.conda/envs/pyenv/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/.conda/envs/pyenv/lib/python3.6/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/home/.conda/envs/pyenv/lib/python3.6/site-packages/umi_tools/whitelist.py", line 443, in main
    options.plot_prefix)
  File "/home/.conda/envs/pyenv/lib/python3.6/site-packages/umi_tools/whitelist_methods.py", line 464, in getCellWhitelist
    error_correct_threshold)
  File "/home/.conda/envs/pyenv/lib/python3.6/site-packages/umi_tools/whitelist_methods.py", line 422, in getErrorCorrectMapping
    if barcode_in_bytes in whitelist:  # don't check if whitelisted
KeyboardInterrupt

I tried a few more times, but the same thing happened.

Any help or advice would be greatly appreciated.

IanSudbery commented 2 years ago

I see that you eventually terminated this with Ctrl-C. How long did it run before it reached this point?

IanSudbery commented 2 years ago

You probably just have a large data set with many cell barcodes (CBs), since your CB is long, and so it is taking a long time to find all of the CBs that could be errors of your above-knee CBs.
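To illustrate why this step can dominate the run time, here is a toy sketch of a naive search for putative error barcodes. It is not the UMI-tools implementation (which may well use a more efficient data structure); the function names and the tiny example data are made up for illustration. The point is only that the work grows with the number of observed non-whitelisted barcodes times the size of the whitelist.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_error_barcodes(all_barcodes, whitelist, threshold=1):
    """Toy model: map each whitelisted barcode to the observed barcodes
    that lie within `threshold` mismatches of it.

    Returns (mapping, number_of_comparisons). A threshold of 0 is modelled
    here as "skip the search entirely" (an assumption of this sketch).
    """
    if threshold == 0:
        return {}, 0

    mapping = {}
    comparisons = 0
    for barcode in all_barcodes:
        if barcode in whitelist:  # whitelisted barcodes need no correction
            continue
        for accepted in whitelist:
            comparisons += 1
            if hamming(barcode, accepted) <= threshold:
                mapping.setdefault(accepted, []).append(barcode)
    return mapping, comparisons


# Hypothetical 4-nt barcodes, far shorter than a real cell barcode.
whitelist = {"AAAA", "CCCC"}
observed = ["AAAA", "AAAT", "CCCG", "GGGG", "CCCC"]

mapping, comparisons = find_error_barcodes(observed, whitelist, threshold=1)
print(mapping, comparisons)
```

In this toy model, three non-whitelisted barcodes checked against two whitelisted ones cost six comparisons. With a whitelist of 10,000 barcodes and a very large number of distinct observed barcodes, the same product becomes very large.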

I have two suggestions:

  1. Disable correction of CBs that are mutations of whitelisted CBs by setting --error-correct-threshold=0.
  2. Use a subset of your reads for whitelist generation by setting --subset-reads=10000000

kasayadior commented 2 years ago

Thank you for the reply.

Before I interrupted the program, it took ~15 minutes to reach "INFO Starting - whitelist determination", and then it stayed at "finding putative error cell barcodes" for another 30 minutes.

I updated the script with the parameters you suggested and it worked! It takes 907s to run one data set.

IanSudbery commented 2 years ago

I'm closing this issue. Please reopen if you are still having problems.