Daniel-Liu-c0deb0t / UMICollapse

Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
MIT License

Does this tool compute a consensus sequence? #28

Closed xiechangxiao closed 3 months ago

xiechangxiao commented 3 months ago

Thank you for providing such a good tool. I am analyzing high-depth sequencing data, and I want to correct sequencing errors while removing PCR duplicates. Does this tool compute a consensus sequence when it removes duplicates? And how do you choose the representative sequences?

Daniel-Liu-c0deb0t commented 3 months ago

Thanks for your question. This tool does not compute a consensus. It picks the alignment with the highest MAPQ in SAM/BAM mode and the read with the highest average base quality in FASTQ mode.

From the readme:

--merge: method for identifying which UMI to keep out of every two UMIs. Either any, avgqual, or mapqual. Default: mapqual for SAM/BAM mode, avgqual for FASTQ mode.

In my experience, consensus for short read sequences isn't really necessary if you are mapping the reads. Most of the variations between reads are going to be at the ends (adapters, etc.) and aligners will soft clip that.
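To make the selection rule above concrete, here is a minimal Python sketch of choosing one representative from a group of duplicate reads. It is not UMICollapse's actual code (the tool itself describes `--merge` as a pairwise choice between two UMIs); the `pick_representative` function, the dict-based read records, and the Phred+33 offset are assumptions made for illustration only.

```python
from statistics import mean


def avg_qual(qual: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (assumes Phred+33 encoding)."""
    return mean(ord(c) - offset for c in qual)


def pick_representative(reads, mode):
    """Pick one read from a group of duplicates.

    reads: list of dicts with hypothetical keys 'name', 'mapq', 'qual'.
    mode:  'any', 'avgqual', or 'mapqual' (names taken from the readme).
    """
    if not reads:
        raise ValueError("empty read group")
    if mode == "any":
        # No ranking at all: keep whichever read comes first.
        return reads[0]
    if mode == "mapqual":
        # SAM/BAM default: keep the alignment with the highest MAPQ.
        return max(reads, key=lambda r: r["mapq"])
    if mode == "avgqual":
        # FASTQ default: keep the read with the highest mean base quality.
        return max(reads, key=lambda r: avg_qual(r["qual"]))
    raise ValueError(f"unknown merge mode: {mode}")


group = [
    {"name": "a", "mapq": 10, "qual": "IIII"},
    {"name": "b", "mapq": 60, "qual": "!!!!"},
]
print(pick_representative(group, "mapqual")["name"])
print(pick_representative(group, "avgqual")["name"])
```

Either way, the output is one of the original reads, unchanged; no bases are corrected.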

xiechangxiao commented 3 months ago

Thanks for your reply! But for low-VAF somatic variants (0.1% or less), a consensus of short reads is useful, because errors introduced during sample preparation, library preparation, and sequencing are randomly distributed. Therefore, the accuracy of low-frequency somatic mutation detection in cfDNA can be improved by using a consensus sequence. There are some tools for calculating consensus sequences, e.g. fgbio, but they are very slow. Do you have any plans to add this feature? @Daniel-Liu-c0deb0t
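As a rough illustration of why a consensus suppresses randomly distributed errors, here is a minimal Python sketch of a per-position, quality-weighted vote over reads that share a UMI. It is a toy under stated assumptions, not how fgbio or any other tool computes its consensus: the `consensus` function is hypothetical, reads are assumed to be equal-length and already aligned column by column, and qualities are assumed to be Phred+33.

```python
from collections import defaultdict


def consensus(reads, offset=33):
    """Quality-weighted majority vote per position.

    reads: list of (sequence, quality_string) tuples from one UMI family.
    Assumes all reads have the same length and need no realignment.
    """
    if not reads:
        raise ValueError("empty UMI family")
    length = len(reads[0][0])
    if any(len(s) != length or len(q) != length for s, q in reads):
        raise ValueError("reads must all have the same length")

    out = []
    for i in range(length):
        # Sum the Phred scores supporting each base at this position.
        weights = defaultdict(int)
        for seq, qual in reads:
            weights[seq[i]] += ord(qual[i]) - offset
        # Highest total wins; ties break alphabetically for determinism.
        out.append(max(sorted(weights), key=weights.__getitem__))
    return "".join(out)


family = [
    ("ACGT", "IIII"),
    ("ACGT", "IIII"),
    ("ACTT", "IIII"),  # one read carries an isolated error at position 3
]
print(consensus(family))
```

An error present in only one read of the family is outvoted, whereas a true variant appears in every read of the family and survives.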

Daniel-Liu-c0deb0t commented 3 months ago

Ah I see, yeah, it makes sense to do error correction. Unfortunately, I don't have plans to add this feature right now. I'm surprised this isn't a solved problem, given the massive increase in UMI processing tools over the past few years.

nh13 commented 2 months ago

@xiechangxiao feel free to bring this discussion over to fgbio.

We would welcome example datasets where fgbio tools perform poorly either in run time or analytically. We may even re-write them to achieve the performance that is needed (see how fqtk replaced the DemuxFastq tool). We also have ideas (we’ve prototyped publicly) to use MSA or POA to improve the consensus that is generated. And finally, we would welcome financial support for taking the time to implement these improvements, as that’s our business.