CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

provide an explicit list of cell barcodes to whitelist? #642

Open bbimber opened 4 months ago

bbimber commented 4 months ago

Hello -

In some workflows that could use umi-tools, we already have an explicit whitelist of the corrected cell barcodes. There is still a need to identify the non-error-corrected cell barcodes.

As I understand umi-tools, one can run the whitelist command and either give it a cell number, or let the tool infer the cell #. Is there any way to provide a list of allowable cell barcodes, and to let umi-tools generate the whitelist TSV to map cellbarcode to error-corrected barcodes?

TomSmithCGAT commented 4 months ago

Hi @bbimber. Just to clarify, you have a whitelist of cell barcodes and would like UMI-tools to automatically identify the acceptable cell barcodes which should be error-corrected to these cell barcodes. Is that correct?

If so, I'm afraid this isn't currently supported by UMI-tools.

When running umi_tools extract with a whitelist, the error barcodes need to be supplied in the format indicated here: the cell barcode in column 1 and the barcodes to correct to it as a comma-separated list in column 2.
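As a rough illustration of that two-column layout, here is a minimal Python sketch that turns such rows into an "error barcode -> corrected barcode" lookup. The function names are hypothetical and this is not UMI-tools' own parser; it only assumes tab-separated rows with the comma-separated error barcodes in column 2, as described above.

```python
def parse_whitelist_line(line):
    """Split one whitelist row into (cell_barcode, [error_barcodes])."""
    fields = line.rstrip("\n").split("\t")
    cell_barcode = fields[0]
    # Column 2 may be empty or absent when there is nothing to correct.
    errors = fields[1].split(",") if len(fields) > 1 and fields[1] else []
    return cell_barcode, errors


def build_correction_map(lines):
    """Map every error barcode to the whitelisted barcode it corrects to."""
    correction = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        cell_barcode, errors = parse_whitelist_line(line)
        for error_barcode in errors:
            correction[error_barcode] = cell_barcode
    return correction


# Toy rows for illustration only (real cell barcodes are longer).
example_rows = ["AAAA\tAAAT,AACA\n", "CCCC\t\n"]
correction_map = build_correction_map(example_rows)
```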

It should be relatively trivial to work out for yourself which error barcodes you wish to correct if you already have a list of whitelisted cell barcodes. However, one issue is that if you specify all possible error corrections without checking whether each barcode is actually observed, column 2 of the whitelist will get excessively long.
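To give a sense of how long column 2 gets, here is a sketch that enumerates every single-substitution variant of each whitelisted barcode. It assumes an A/C/G/T alphabet and a Hamming-distance-1 error model; both are illustrative assumptions, the function names are hypothetical, and none of this is taken from the UMI-tools code.

```python
def hamming1_neighbours(barcode, alphabet="ACGT"):
    """Return all sequences exactly one substitution away from barcode."""
    neighbours = set()
    for i, base in enumerate(barcode):
        for sub in alphabet:
            if sub != base:
                neighbours.add(barcode[:i] + sub + barcode[i + 1:])
    return neighbours


def exhaustive_whitelist_rows(barcodes):
    """Build 'barcode<TAB>err1,err2,...' rows without looking at the data."""
    whitelist = set(barcodes)
    rows = []
    for cell_barcode in barcodes:
        # Never list another whitelisted barcode as an error of this one.
        # Candidates one mismatch away from two whitelisted barcodes are
        # NOT resolved here; a real workflow would need to handle them.
        errors = sorted(hamming1_neighbours(cell_barcode) - whitelist)
        rows.append(cell_barcode + "\t" + ",".join(errors))
    return rows


# A barcode of length L has 3 * L single-substitution neighbours.
n_neighbours_16bp = len(hamming1_neighbours("A" * 16))
```

Under these assumptions each 16 bp barcode contributes 48 candidate error barcodes, so a whitelist of several thousand cells yields hundreds of thousands of column 2 entries, most of which never occur in the reads.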

You could run umi_tools whitelist to generate the error mappings in the whitelist and then subset the whitelist using your pre-defined whitelist, but that seems a bit hacky and could run into issues where your pre-defined whitelist barcode wasn't in the umi-tools output.
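That subsetting workaround could look roughly like the sketch below. It assumes the whitelist rows have already been read into a list of tab-separated strings with the cell barcode in column 1; the function name is hypothetical. It also reports the pre-defined barcodes that never appeared in the generated whitelist, which is the failure case mentioned above.

```python
def subset_whitelist(rows, predefined):
    """Keep rows whose column-1 barcode is in the pre-defined list.

    Returns (kept_rows, missing), where missing lists the pre-defined
    barcodes that were absent from the generated whitelist.
    """
    predefined = set(predefined)
    kept = []
    seen = set()
    for row in rows:
        cell_barcode = row.split("\t", 1)[0].strip()
        if cell_barcode in predefined:
            kept.append(row)
            seen.add(cell_barcode)
    missing = sorted(predefined - seen)
    return kept, missing


# Toy data for illustration only.
generated_rows = ["AAAA\tAAAT", "CCCC\tCCCA,CCCG", "GGGG\t"]
kept_rows, missing_barcodes = subset_whitelist(generated_rows, ["AAAA", "GGGG", "TTTT"])
```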

Hmm.. answering your question, I see the issue now!

TomSmithCGAT commented 4 months ago

@IanSudbery, any objections to an option being added to whitelist to accept a pre-defined whitelist and then derive a sensible whitelist + error-corrections from the fastq?

It should be a simple addition of a new knee_method, perhaps with that parameter renamed. Other than some sanity checking for the presence of the pre-defined whitelisted CBs in the observed CBs, I can't see any other gotchas. Thoughts?

https://github.com/CGATOxford/UMI-tools/blob/9ce3a70a8b35ff9a066d73716680136be71cc70d/umi_tools/whitelist_methods.py#L469-L474

There is an option to define an error correction from just the whitelist CB sequences when reading in the whitelist in extract, but that's going to run into issues creating an excessively broad set of possible error corrections, since there is no checking that the error CBs are actually present in the data. I imagine the excessively broad whitelist might impact on runtime.

https://github.com/CGATOxford/UMI-tools/blob/9ce3a70a8b35ff9a066d73716680136be71cc70d/umi_tools/whitelist_methods.py#L501-L504

IanSudbery commented 4 months ago

I've no objection, other than to add that I'm not really all that au-fait with whitelist and its methods, so I won't really be able to help with support.

There is an option already to read a supplied whitelist into whitelist. What does this do?

bbimber commented 4 months ago

@TomSmithCGAT: yes, your description is pretty accurate. I considered the options you were suggesting, including making the TSV whitelist format myself. Like you said, the utility of having umi-tools generate the error-corrected barcodes is that it would be empirical, based on the data.

IanSudbery commented 4 months ago

I think it would only be empirical in that a list of all possible barcodes that could be corrected would be filtered by those actually present.

I don't think it would make any difference to the results. Where it might have a benefit is that the lists would be smaller, and therefore the extract process might be quicker/less memory consuming.
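A minimal sketch of that filtering step, assuming a Hamming-distance-1 error model and a precomputed table of observed barcode counts (for example, tallied from the fastq beforehand). The function name and the error model are illustrative assumptions, not the behaviour of any existing UMI-tools option.

```python
from collections import Counter


def observed_corrections(whitelist, observed_counts, alphabet="ACGT"):
    """For each whitelisted barcode, list only the one-mismatch
    neighbours that actually occur in the observed barcodes."""
    whitelist = set(whitelist)
    result = {}
    for cell_barcode in whitelist:
        errors = []
        for i, base in enumerate(cell_barcode):
            for sub in alphabet:
                if sub == base:
                    continue
                candidate = cell_barcode[:i] + sub + cell_barcode[i + 1:]
                # Keep the candidate only if it was seen in the data and
                # is not itself a whitelisted barcode.
                if candidate in observed_counts and candidate not in whitelist:
                    errors.append(candidate)
        result[cell_barcode] = sorted(errors)
    return result


# Toy observed barcodes for illustration only.
observed = Counter(["AAAA", "AAAT", "AAAT", "GGGG", "CCCA"])
corrections = observed_corrections(["AAAA", "CCCC"], observed)
```

The corrections applied to the reads would be the same as with the exhaustive list; only the size of the lookup changes.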