CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Error running whitelist #467

Closed prmunn closed 1 year ago

prmunn commented 3 years ago

When I run whitelist using the following command:

umi_tools whitelist --knee-method=density --method=reads --plot-prefix Mix1_predictBC --allow-threshold-error --extract-method string --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNNCCCCCCCCCC --error-correct-threshold=2 --ed-above-threshold=correct -L Mix1_predictedBCwhitelist.log -I Mix1_I2_I1_padUMI_R2.fastq.gz -S Mix1_predictedBCwhitelist.txt

I get the following error:

/programs/UMI-tools/lib64/python3.6/site-packages/umi_tools/whitelist_methods.py:202: UserWarning: Attempted to set non-positive left xlim on a log-scaled axis. Invalid limit will be ignored.
  fig3.set_xlim(0, len(counts)*1.25)
Traceback (most recent call last):
  File "/programs/UMI-tools/bin/umi_tools", line 8, in <module>
    sys.exit(main())
  File "/programs/UMI-tools/lib64/python3.6/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/programs/UMI-tools/lib64/python3.6/site-packages/umi_tools/whitelist.py", line 455, in main
    resolution_method=options.ed_above_threshold)
  File "/programs/UMI-tools/lib64/python3.6/site-packages/umi_tools/whitelist_methods.py", line 543, in errorDetectAboveThreshold
    cell_whitelist = list(cell_whitelist)
TypeError: 'NoneType' object is not iterable

The predicted whitelist file also has a size of zero. However, when I run the same command on a test dataset consisting of 100,000 records from the original dataset, it runs without error and produces my predicted whitelist file.

I've attached the log file from the original run that failed (Mix1_predictedBCwhitelist.log). Please help.

TomSmithCGAT commented 3 years ago

Hmmm, OK this one's on me.

In the lines linked below, umi_tools whitelist obtains the initial whitelist, then (optionally) error-corrects cell barcodes to it. It then checks whether a whitelist was created and, if so, writes it out with counts per barcode. If no whitelist exists, it returns a warning/error explaining that no local minima could be found in the density plot of barcode counts, which is how the knee is identified with --knee-method=density.

https://github.com/CGATOxford/UMI-tools/blob/289b9cc87f35bd06249ef6ae680e590524bc83f3/umi_tools/whitelist.py#L437-L491

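The issue arises when --knee-method=density is combined with --ed-above-threshold=correct and no knee is identified, so no whitelist is generated. The error-correction step then receives None instead of a whitelist, which is the TypeError in your traceback. The warning/error should obviously occur before the error correction.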

In short, the identification of the knee hasn't worked and this isn't caught at the right point.
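To illustrate the ordering problem, here is a minimal sketch, not the actual UMI-tools code. The function names `error_correct` and `run_whitelist` are hypothetical stand-ins; the only detail taken from the traceback is that the error-correction step begins by calling `list()` on the whitelist. Checking for a missing whitelist before that call turns the opaque TypeError into a clear message.

```python
def error_correct(cell_whitelist):
    # Hypothetical stand-in for the error-correction step. Like the line
    # shown in the traceback, it starts by calling list() on the whitelist,
    # so passing None raises TypeError.
    return list(cell_whitelist)


def run_whitelist(cell_whitelist, correct=True):
    # Hypothetical driver: check for a missing whitelist BEFORE any step
    # that iterates over it, so the failure is reported clearly.
    if cell_whitelist is None:
        raise ValueError(
            "No knee found: no local minima in the barcode count density")
    if correct:
        cell_whitelist = error_correct(cell_whitelist)
    return sorted(cell_whitelist)


if __name__ == "__main__":
    print(run_whitelist({"ACGT", "AACC"}))
    try:
        run_whitelist(None)
    except ValueError as err:
        print("caught:", err)
```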

To remedy this, I would suggest using --knee-method=distance, which is the default and should be more robust. You can inspect the plot afterwards to check you're happy with it.
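For intuition only, the general idea behind a distance-based knee can be sketched as follows. This is a simplified illustration and an assumption on my part, not UMI-tools' implementation: it ranks barcodes by count, builds the normalised cumulative-count curve, and picks the rank where the curve sits furthest above the diagonal. The function name `knee_by_distance` and the toy counts are made up for the example, and the input is assumed to be non-empty.

```python
def knee_by_distance(counts):
    # Rank barcodes from most to least abundant.
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    n = len(ordered)

    best_rank, best_gap = 0, float("-inf")
    running = 0
    for rank, count in enumerate(ordered, start=1):
        running += count
        # Vertical gap between the normalised cumulative curve and the
        # diagonal y = x; this is proportional to the perpendicular distance.
        gap = running / total - rank / n
        if gap > best_gap:
            best_rank, best_gap = rank, gap
    # Number of top-ranked barcodes to keep.
    return best_rank


if __name__ == "__main__":
    toy_counts = [1000, 950, 900, 5, 4, 3, 2, 1]  # made-up data
    print(knee_by_distance(toy_counts))
```

Unlike a search for a local minimum in a density, a maximum distance always exists for non-empty input, which is one way to see why this style of method cannot fail to return a threshold.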

If you do want to stick with --knee-method=density, you can leave off --error-correct-threshold=2 --ed-above-threshold=correct to get around the above error, and include --allow-threshold-error so that the knee plots are generated. You can then inspect the plots and manually set the knee threshold in a subsequent run with --set-cell-number.

I'd favour taking the --knee-method=distance approach.
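The options above can be laid out side by side as argument lists. This is a sketch that only assembles and prints the commands; every flag and file name comes from this thread, except `CELL_NUMBER`, which is a hypothetical placeholder for whatever value you read off the knee plots. Which other options to carry into the --set-cell-number run is an assumption here.

```python
import shlex

# Options shared by every variant, taken from the original command.
shared = [
    "umi_tools", "whitelist",
    "--method=reads",
    "--extract-method", "string",
    "--bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNNCCCCCCCCCC",
    "--plot-prefix", "Mix1_predictBC",
    "-L", "Mix1_predictedBCwhitelist.log",
    "-I", "Mix1_I2_I1_padUMI_R2.fastq.gz",
    "-S", "Mix1_predictedBCwhitelist.txt",
]
error_correction = ["--error-correct-threshold=2", "--ed-above-threshold=correct"]

# Favoured option: the original command with the knee method swapped.
option_distance = (shared + ["--knee-method=distance", "--allow-threshold-error"]
                   + error_correction)

# Alternative, step 1: density method without error correction, plots kept.
option_density_plots = shared + ["--knee-method=density", "--allow-threshold-error"]

# Alternative, step 2: set the threshold by hand after inspecting the plots.
CELL_NUMBER = 5000  # hypothetical placeholder, not a recommendation
option_set_cell_number = shared + ["--set-cell-number=%d" % CELL_NUMBER]

for argv in (option_distance, option_density_plots, option_set_cell_number):
    print(" ".join(shlex.quote(arg) for arg in argv))
```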

In the meantime, I'll update whitelist so this error is caught properly.

One final comment: that barcode pattern looks very long. Do you really have a 26bp cell barcode?
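The barcode length can be checked by counting characters in the --bc-pattern string from the original command. With --extract-method string, each C marks a cell barcode base and each N a UMI base. The helper name `pattern_runs` is made up for this sketch.

```python
from itertools import groupby

# Pattern copied from the original command in this thread.
bc_pattern = "CCCCCCCCCCCCCCCCNNNNNNNNNNCCCCCCCCCC"


def pattern_runs(pattern):
    # Collapse the pattern into (character, run length) pairs,
    # e.g. "CCNNC" -> [("C", 2), ("N", 2), ("C", 1)].
    return [(char, len(list(group))) for char, group in groupby(pattern)]


cell_bases = bc_pattern.count("C")  # cell barcode positions
umi_bases = bc_pattern.count("N")   # UMI positions

if __name__ == "__main__":
    print("runs:", pattern_runs(bc_pattern))
    print("cell barcode bases:", cell_bases)
    print("UMI bases:", umi_bases)
```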

prmunn commented 3 years ago

Thanks for responding so quickly. Using --knee-method=distance appears to have worked. I'll try out your other suggestion with --knee-method=density and manually setting the cell number.