CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

umi_tools extract error when providing with a white list #500

Closed robinycfang closed 8 months ago

robinycfang commented 2 years ago

Hi,

I have paired-end bulk sequencing data with UMIs (3- or 4-mers followed by a T), and I have a list of the true UMI sequences. When I tried to extract UMIs from the reads against the whitelist, I got the following error. When I removed the whitelist parameter the error went away, but then the extraction wouldn't be accurate. Any help would be really appreciated, thanks!

umi_tools extract --extract-method=regex --bc-pattern="(?P<umi_1>^[ACGT]{3}[ACG])(?P<discard_1>T)|(?P<umi_2>^[ACGT]{3})(?P<discard_2>T)" --bc-pattern2="(?P<umi_1>^[ACGT]{3}[ACG])(?P<discard_1>T)|(?P<umi_2>^[ACGT]{3})(?P<discard_2>T)" --whitelist=umi_list.txt -I sample_1.fq.gz --read2-in=sample_2.fq.gz --stdout=processed.1.fastq.gz --read2-out=processed.2.fastq.gz --log=processed.log

Error with umi-tools:

Traceback (most recent call last):
  File "/centos7/umi_tools/1.1.1/bin/umi_tools", line 10, in <module>
    sys.exit(main())
  File "/centos7/umi_tools/1.1.1/lib/python3.7/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/centos7/umi_tools/1.1.1/lib/python3.7/site-packages/umi_tools/extract.py", line 369, in main
    options.pattern, options.pattern2))
TypeError: 'str' object is not callable

The same with 1.1.2:

Traceback (most recent call last):
  File "/miniconda3/bin/umi_tools", line 8, in <module>
    sys.exit(main())
  File "/miniconda3/lib/python3.9/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/miniconda3/lib/python3.9/site-packages/umi_tools/extract.py", line 367, in main
    U.error("barcode regex(es) do not include any cell groups "
TypeError: 'str' object is not callable

IanSudbery commented 2 years ago

There are two things going on here. First, --whitelist expects a list of cell barcodes, not UMIs to be filtered, and this causes an error because your barcode pattern does not contain any cell barcode groups. Second, the code that catches this error unfortunately has an error of its own! (This has now been fixed on the master branch.)
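To make the two problems concrete, here is a minimal sketch using only Python's standard `re` module. It checks whether a barcode regex defines any group whose name starts with `cell_`, and it reproduces the same kind of `TypeError` by calling a string where a `%` format operator was presumably intended. The helper names (`group_names`, `has_cell_group`, `broken_message`, `fixed_message`) and the message text are hypothetical, and whether this matches the actual UMI-tools source is an assumption based only on the traceback above.

```python
import re

# The barcode regex from the original command (same pattern for both reads).
BC_PATTERN = (
    r"(?P<umi_1>^[ACGT]{3}[ACG])(?P<discard_1>T)"
    r"|(?P<umi_2>^[ACGT]{3})(?P<discard_2>T)"
)


def group_names(pattern):
    """Return the sorted named groups defined in a regex pattern."""
    return sorted(re.compile(pattern).groupindex)


def has_cell_group(pattern):
    """True if any named group looks like a cell barcode group (cell_*)."""
    return any(name.startswith("cell_") for name in group_names(pattern))


def broken_message(pattern1, pattern2):
    """Hypothetical bug: the string is called instead of formatted with %."""
    template = "barcode regex(es) do not include any cell groups: %s, %s"
    return template(pattern1, pattern2)  # raises TypeError


def fixed_message(pattern1, pattern2):
    """Same message with the % operator in place."""
    template = "barcode regex(es) do not include any cell groups: %s, %s"
    return template % (pattern1, pattern2)


if __name__ == "__main__":
    print(group_names(BC_PATTERN))
    print(has_cell_group(BC_PATTERN))
    try:
        broken_message(BC_PATTERN, BC_PATTERN)
    except TypeError as err:
        print("TypeError:", err)
```

The pattern only defines `umi_*` and `discard_*` groups, so a cell barcode whitelist has nothing to be matched against.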

If you wish to use a predetermined list of UMIs, then you should use the options --filter-umi --filter-umi-whitelist=umi_list.txt instead of --whitelist.

mortunco commented 2 years ago

Hi Folks,

I have a similar problem, but when I try @IanSudbery's method, it seems that --filter-umi-whitelist does not exist as an option.

(scStarrseq) [tmorova@linuxsrv006 use-alevin]$ umi_tools extract -I NL2_CKDL210021281-1a-SI_GA_A2_HMHFJDSX2_S3_L004_R1_001.fastq.gz --read2-in=NL2_CKDL210021281-1a-SI_GA_A2_HMHFJDSX2_S3_L004_R2_001.fastq.gz --stdout=umitools/processed.1.fastq.gz --read2-out=umitools/processed.2.fastq.gz --log2stderr --filter-umi --filter-umi-whitelist=umitools/10x-whitelist.txt --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNNNN

extract - Extract UMI from fastq

Usage:

   Single-end:
      umi_tools extract [OPTIONS] -p PATTERN [-I IN_FASTQ[.gz]] [-S OUT_FASTQ[.gz]]

   Paired end:
      umi_tools extract [OPTIONS] -p PATTERN [-I IN_FASTQ[.gz]] [-S OUT_FASTQ[.gz]] --read2-in=IN2_FASTQ[.gz] --read2-out=OUT2_FASTQ[.gz]

   note: If -I/-S are ommited standard in and standard out are used
         for input and output.  To generate a valid BAM file on
         standard out, please redirect log with --log=LOGFILE or
         --log2stderr. Input/Output will be (de)compressed if a
         filename provided to -S/-I/--read2-in/read2-out ends in .gz

For full UMI-tools documentation, see https://umi-tools.readthedocs.io/en/latest/

extract: error: no such option: --filter-umi-whitelist

Here is my umi_tools version:

(scStarrseq) [tmorova@linuxsrv006 use-alevin]$ umi_tools  --version
UMI-tools version: 1.1.1

Thank you for the help,

Best regards,

Tunc.

IanSudbery commented 2 years ago

Hi, sorry, my bad, the option is --umi-whitelist not --filter-umi-whitelist. For some reason these options have been hidden from the help. I'm not sure why, but it probably means this function has not been thoroughly tested and should be regarded as experimental.
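For readers who want to see what filtering against a predetermined UMI list amounts to, here is a rough sketch in plain Python. It is not the UMI-tools implementation: the functions `extract_umi` and `filter_by_umi_whitelist` are hypothetical, the regex is the one from the original post, and exact matching against the whitelist (no error correction) is an assumption.

```python
import re

# Regex from the original post: a 4-mer or 3-mer UMI followed by a discarded T.
UMI_REGEX = re.compile(
    r"(?P<umi_1>^[ACGT]{3}[ACG])(?P<discard_1>T)"
    r"|(?P<umi_2>^[ACGT]{3})(?P<discard_2>T)"
)


def extract_umi(seq):
    """Return (umi, remaining_sequence), or None if the pattern does not match."""
    match = UMI_REGEX.match(seq)
    if match is None:
        return None
    # Only one of the two alternatives can match; the other group is None.
    umi = match.group("umi_1") or match.group("umi_2")
    return umi, seq[match.end():]


def filter_by_umi_whitelist(seqs, whitelist):
    """Keep only reads whose extracted UMI is exactly in the whitelist."""
    allowed = set(whitelist)
    kept = []
    for seq in seqs:
        result = extract_umi(seq)
        if result is not None and result[0] in allowed:
            kept.append(result)
    return kept


if __name__ == "__main__":
    reads = ["ACGATGGGG", "ACGTGGGG", "TTTTCCCC", "GGGGGGG"]
    print(filter_by_umi_whitelist(reads, ["ACGA", "ACG"]))
```

Reads that do not match the pattern and reads whose UMI is not in the list are both dropped in this sketch.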

@TomSmithCGAT do you remember why these options are hidden?

TomSmithCGAT commented 2 years ago

Yes, exactly that. I added it for a project where I was working with a library prep kit that included 96 pre-determined UMIs. I can't remember the kit name now. While it should be working absolutely fine, it's not been thoroughly tested.

chrarnold commented 2 years ago

Hi guys, following up on this with a quick question that you may be able to address: I am running extract with the --whitelist option and a list of 300 whitelisted cell barcodes (only one column, no error correction). When I compare the number of lines before and after running extract, they are unchanged. I expected this option to retain only reads whose cell barcode is in the whitelist file. Is there an easy explanation, did I misunderstand something, or should I check further?

TomSmithCGAT commented 2 years ago

@chrarnold, that's correct. Only reads with whitelisted cells should be retained. Without error correction, a read is kept only if its cell barcode matches a whitelist entry exactly. Given sequencing errors, not every read will match, so I would expect some reads to be filtered.

Could you please post an example read and the umi_tools command used?

Could you also do a quick sanity check: provide a random whitelist and confirm that all reads are filtered out?
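A rough way to reason about this sanity check outside of umi_tools is sketched below. It assumes the cell barcode is the first 16 bases of the read (as in the 10x-style pattern posted earlier in this thread) and that matching is exact. The functions `cell_barcode`, `count_retained` and `random_whitelist` are hypothetical helpers, not part of UMI-tools.

```python
import random


def cell_barcode(seq, cb_len=16):
    """Assumed layout: the cell barcode is the first cb_len bases of the read."""
    return seq[:cb_len]


def count_retained(read_seqs, whitelist, cb_len=16):
    """Count reads whose cell barcode exactly matches a whitelist entry."""
    allowed = set(whitelist)
    return sum(1 for seq in read_seqs if cell_barcode(seq, cb_len) in allowed)


def random_whitelist(n, cb_len=16, seed=0):
    """Generate n distinct random barcodes to use as a decoy whitelist."""
    rng = random.Random(seed)
    barcodes = set()
    while len(barcodes) < n:
        barcodes.add("".join(rng.choice("ACGT") for _ in range(cb_len)))
    return sorted(barcodes)


if __name__ == "__main__":
    reads = [
        "A" * 16 + "C" * 12 + "GATTACA",
        "A" * 15 + "C" + "G" * 12 + "GATTACA",  # one mismatch in the barcode
        "T" * 16 + "C" * 12 + "GATTACA",
    ]
    print(count_retained(reads, ["A" * 16]))
    print(count_retained(reads, random_whitelist(300)))
```

If a decoy whitelist leaves the read count unchanged in the real data, that suggests the whitelist is not being applied at all, rather than every barcode happening to match. Note that when counting lines in FASTQ files, each read occupies four lines.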

TomSmithCGAT commented 8 months ago

I'm closing this due to inactivity.