CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Error-correcting cell barcodes #529

Closed TomSmithCGAT closed 1 year ago

TomSmithCGAT commented 2 years ago

Another related question: I can now use my customized whitelist. My barcodes are specifically designed so that any two of them differ by at least 2 bp, so I want to allow a 1 bp mismatch against the barcodes in the whitelist. How can I do this?

Originally posted by @wangjiawen2013 in https://github.com/CGATOxford/UMI-tools/issues/525#issuecomment-1099806090

TomSmithCGAT commented 2 years ago

extract can only 'correct' cell barcode errors if they are listed in the whitelist file and you pass the --error-correct-cell option. You need to either generate the whitelist with umi_tools whitelist, which by default identifies cell barcodes 1 substitution away from a whitelisted barcode as errors to be corrected, or build the file manually, listing every erroneous barcode that should be corrected.

From docs for extract: https://umi-tools.readthedocs.io/en/latest/reference/extract.html#whitelist

--whitelist

Whitelist of accepted cell barcodes. The whitelist should be in the following format (tab-separated):

    AAAAAA    AGAAAA
    AAAATC
    AAACAT
    AAACTA    AAACTN,GAACTA
    AAATAC
    AAATCA    GAATCA
    AAATGT    AAAGGT,CAATGT

Column 1 contains the whitelisted cell barcodes, and column 2 is the comma-separated list of other cell barcodes that should be corrected to the barcode in column 1.
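
For the manual route, the whitelist in the format above could be generated with a small script. The following is only a rough sketch and not part of UMI-tools: the function names, the output path and the choice to include N in the alphabet are assumptions made for illustration. It lists every 1-substitution variant of each barcode in column 2, and drops any variant that is itself a whitelisted barcode or that is 1 bp away from more than one barcode (this can happen when two barcodes differ by exactly 2 bp).

```python
from collections import defaultdict


def one_substitution_variants(barcode, alphabet="ACGTN"):
    """Yield every sequence that differs from `barcode` at exactly one position."""
    for i, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]


def build_whitelist_rows(barcodes, alphabet="ACGTN"):
    """Map each barcode to its sorted, unambiguous 1-substitution variants."""
    accepted = set(barcodes)

    # Record which whitelisted barcode(s) each variant could have come from.
    owners = defaultdict(set)
    for barcode in barcodes:
        for variant in one_substitution_variants(barcode, alphabet):
            owners[variant].add(barcode)

    rows = {barcode: [] for barcode in barcodes}
    for variant, parents in owners.items():
        # Skip variants that are real barcodes or that match several barcodes.
        if variant in accepted or len(parents) != 1:
            continue
        rows[next(iter(parents))].append(variant)
    return {barcode: sorted(variants) for barcode, variants in rows.items()}


def write_whitelist(rows, path):
    """Write tab-separated rows: barcode, then comma-separated variants (if any)."""
    with open(path, "w") as handle:
        for barcode, variants in rows.items():
            if variants:
                handle.write(barcode + "\t" + ",".join(variants) + "\n")
            else:
                handle.write(barcode + "\n")


if __name__ == "__main__":
    # Hypothetical barcodes and output path, for illustration only.
    my_barcodes = ["AAAAAA", "AAAATC", "AAACAT"]
    write_whitelist(build_whitelist_rows(my_barcodes), "whitelist.tsv")
```

The resulting file would then be passed to extract via --whitelist together with --error-correct-cell. Whether ambiguous variants should be dropped, as done here, or assigned by some other rule is a design choice that depends on the barcode set.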