epi2me-labs / wf-single-cell

Use PCR duplicates for error correction? #109

Closed itslittman closed 4 months ago

itslittman commented 5 months ago

Is your feature related to a problem?

Some reads are inevitably thrown out due to lack of barcode, basecalling errors/below Qscore threshold, etc.

Describe the solution you'd like

If you have multiple reads with the same UMI, and some reads have low-quality bases in the cell barcode sequence, could you use the higher-quality barcode sequences from the PCR duplicates to correct the other barcode and retain the read? And could this likewise be used to correct internal nucleotide sequences?

Describe alternatives you've considered

-

Additional context

No response

nrhorner commented 5 months ago

Hi @itslittman

The cell barcodes are assigned first, and these are then used to partition the reads, along with gene name, to reduce the search space for UMI correction and to reduce UMI collisions. I guess there could be a rescue step after UMI assignment where the reads rejected because no valid barcode was found could be fished out by UMI/gene ID. It's an interesting idea.
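To make the rescue idea concrete, here is a minimal sketch in Python. It is not code from the workflow: the function name `rescue_barcodes` and the tuple layouts are hypothetical, and it assumes that rejected reads already carry a gene assignment and a UMI that can be matched exactly. A read is only rescued when its UMI/gene pair points to exactly one known barcode.

```python
from collections import defaultdict


def rescue_barcodes(assigned, rejected):
    """Assign barcodes to rejected reads via an unambiguous UMI/gene match.

    assigned: iterable of (read_id, barcode, umi, gene) for reads with a valid barcode.
    rejected: iterable of (read_id, umi, gene) for reads without a valid barcode.
    Returns a dict mapping rescued read_id -> barcode.
    (Hypothetical helper; the input layout is an assumption for illustration.)
    """
    # Index every barcode observed for each (UMI, gene) pair.
    index = defaultdict(set)
    for _read_id, barcode, umi, gene in assigned:
        index[(umi, gene)].add(barcode)

    rescued = {}
    for read_id, umi, gene in rejected:
        candidates = index.get((umi, gene), set())
        # Rescue only when the match is unambiguous; a UMI/gene pair seen in
        # several cells is a collision and cannot identify a single barcode.
        if len(candidates) == 1:
            rescued[read_id] = next(iter(candidates))
    return rescued
```

Exact UMI matching is the simplest choice here; tolerating UMI errors would need an edit-distance search, which raises the collision risk that the partitioning is meant to reduce.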

You have a second question about correcting internal nucleotides. This could be much more easily done by generating consensus sequences for reads with the same barcode/UMI/gene and might be something that will be added to the workflow.
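As a rough illustration of the consensus idea, the sketch below groups reads by barcode/UMI/gene and takes a per-column majority vote. It is not the workflow's implementation: the function names and the `min_reads` parameter are hypothetical, and it assumes the sequences within a group have already been aligned to the same length. Real long reads contain indels, so in practice a multiple alignment or partial-order alignment step would be needed first.

```python
from collections import Counter, defaultdict


def column_consensus(seqs):
    """Majority vote per column over equal-length, pre-aligned sequences."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*seqs) yields one tuple of bases per alignment column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def consensus_by_molecule(records, min_reads=3):
    """Build one consensus per (barcode, UMI, gene) group.

    records: iterable of (barcode, umi, gene, aligned_sequence).
    Groups with fewer than min_reads reads are skipped, since a majority
    vote over one or two reads cannot correct anything.
    (Hypothetical helper; the input layout is an assumption for illustration.)
    """
    groups = defaultdict(list)
    for barcode, umi, gene, seq in records:
        groups[(barcode, umi, gene)].append(seq)
    return {
        key: column_consensus(seqs)
        for key, seqs in groups.items()
        if len(seqs) >= min_reads
    }
```

With three reads per molecule, a single-read error at any position is outvoted by the other two reads, which is why a minimum group size of three is a natural default for this kind of vote.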

nrhorner commented 4 months ago

Closing due to lack of response

HenriettaHolze commented 1 month ago

Hi @nrhorner, are you planning to implement consensus sequence generation for reads with the same cell barcode/UMI/gene?
This feature could drastically improve my data, as >60% of my UMIs of interest have >= 3 reads and I'm interested in SNV calling.
I'm currently trying to reproduce the implementation from sicelore https://github.com/ucagenomix/sicelore/tree/793db90c3d16fef31d8ad3f34792c595beff938a?tab=readme-ov-file#6-generate-consensus-sequences .
Please let me know if you have other suggestions for error-correcting UMI-tagged transcripts in a way that is compatible with the epi2me workflow.
Thanks!