CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

cell bases required in pattern for whitelist; unclear documentation #508

Closed cmatKhan closed 2 years ago

cmatKhan commented 2 years ago

I am not using this for single cell -- testing to see if it will work for a different application. My barcode does not have any 'cell bases'. When I run the command below:

> umi_tools whitelist \
    --method=reads \
    --extract-method=string \
    -I run_5399/PhiX_S1_R1_001.fastq.gz \
    --bc-pattern=NNNNNXXXXXXXXXXXXXXXXX \
    --read2-in=run_5399/PhiX_S1_R2_001.fastq.gz \
    --bc-pattern2=NNNNNNNNXXXX \
    -S whitelist.tsv

I get the following error:

2022-01-26 10:21:22,973 ERROR barcode pattern(s) do not include any cell bases (marked with 'Cs') NNNNNXXXXXXXXXXXXXXXXX, NNNNNNNNXXXX
Traceback (most recent call last):
  File "/home/oguzkhan/.local/bin/umi_tools", line 8, in <module>
    sys.exit(main())
  File "/home/oguzkhan/.local/lib/python3.8/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/home/oguzkhan/.local/lib/python3.8/site-packages/umi_tools/whitelist.py", line 346, in main
    U.error("barcode pattern(s) do not include any cell bases "
  File "/home/oguzkhan/.local/lib/python3.8/site-packages/umi_tools/Utilities.py", line 1396, in error
    raise ValueError("UMI-tools failed with an error. Check the log file")
ValueError: UMI-tools failed with an error. Check the log file

I see this in the documentation, which seems related:


Barcode extraction
--bc-pattern

    Pattern for barcode(s) on read 1. See --extract-method

--bc-pattern2

    Pattern for barcode(s) on read 2. See --extract-method

--extract-method

        There are two methods enabled to extract the UMI barcode (+/- cell barcode). For both methods, the patterns should be provided using the --bc-pattern and --bc-pattern2 options.

        string

            This should be used where the barcodes are always in the same place in the read.
                N = UMI position (required)
                C = cell barcode position (optional) # this in particular is what confuses me about the error message
                X = sample position (optional)

but I am not following what I am supposed to set. Sorry if I am just being thick, but this part of the documentation isn't clear to me.

IanSudbery commented 2 years ago

So, whitelist is pretty much only used in single-cell sequencing. The idea of whitelist is to find allowable cell barcodes. If there are no cell barcodes, then they can't be whitelisted.
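To make the failure concrete, here is a minimal sketch of the kind of check that rejects the patterns above. It is not the UMI-tools source; `has_cell_bases` is a hypothetical helper, and the only assumption is the documented convention that cell barcode positions are marked with `C` in string-method patterns.

```python
def has_cell_bases(*patterns):
    """Return True if any string-method barcode pattern marks a cell base ('C').

    Hypothetical helper for illustration only; not part of UMI-tools.
    """
    return any("C" in pattern for pattern in patterns if pattern)


# The two patterns from the failing command contain only N (UMI) and
# X (sample) positions, so there is nothing for whitelist to work on.
patterns = ("NNNNNXXXXXXXXXXXXXXXXX", "NNNNNNNNXXXX")
if not has_cell_bases(*patterns):
    print("no cell bases in:", ", ".join(patterns))
```

A pattern such as `CCCCNNNN` would pass this check, because it marks four cell barcode positions.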

We document the specification of the barcodes here: https://umi-tools.readthedocs.io/en/latest/regex.html

But perhaps we are not clear enough that whitelist is for whitelisting cell barcodes, not UMIs, and that you don't need whitelist unless you are doing something with cell barcodes.
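As a rough illustration of what the N, C and X codes mean for the string method, the sketch below splits a read according to a pattern. It is a simplified model based only on the documented codes, not the UMI-tools implementation; `split_read` and the example read are made up, and treating bases past the end of the pattern as retained sequence is an assumption of this sketch.

```python
def split_read(sequence, pattern):
    """Split a read into (umi, cell_barcode, remaining_sequence).

    Hypothetical illustration of the string-method codes:
    N = UMI base, C = cell barcode base, X = base left in the read.
    """
    if len(sequence) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")

    umi, cell, kept = [], [], []
    for base, code in zip(sequence, pattern):
        if code == "N":
            umi.append(base)
        elif code == "C":
            cell.append(base)
        elif code == "X":
            kept.append(base)
        else:
            raise ValueError(f"unexpected pattern character: {code!r}")

    # Assumption: bases beyond the pattern stay with the read.
    kept.append(sequence[len(pattern):])
    return "".join(umi), "".join(cell), "".join(kept)


# A pattern with no C positions yields an empty cell barcode, which is why
# there is nothing to whitelist in that case.
print(split_read("ACGTTGCAAT", "NNNXX"))
```

With a pattern like `CCNNX`, the same read would instead yield a two-base cell barcode, and that barcode is what whitelist operates on.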

cmatKhan commented 2 years ago

I see -- I was trying to use whitelist incorrectly. This is what confused me:

C = cell barcode position (optional)

But, it also says in the description:

Extract cell barcodes and identify the most likely true cell barcodes

which is clear. Thank you for your help.