Closed xiechangxiao closed 3 months ago
Thanks for your question. This tool does not compute a consensus. It keeps the alignment with the highest MAPQ in SAM/BAM mode and the read with the highest average base quality in FASTQ mode.
From the readme:
--merge: method for identifying which UMI to keep out of every two UMIs. Either any, avgqual, or mapqual. Default: mapqual for SAM/BAM mode, avgqual for FASTQ mode.
In my experience, consensus for short read sequences isn't really necessary if you are mapping the reads. Most of the variations between reads are going to be at the ends (adapters, etc.) and aligners will soft clip that.
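To make the selection rule from the readme concrete, here is a minimal Python sketch of the three `--merge` options. It is an illustration only, not the tool's actual code: the `Read` class, `keep_one`, and `pick_representative` are hypothetical names, and breaking ties in favor of the first read is an assumption.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List


@dataclass
class Read:
    """Hypothetical stand-in for a read or alignment record."""
    name: str
    mapq: int          # mapping quality (only meaningful in SAM/BAM mode)
    quals: List[int]   # per-base Phred quality scores


def avg_qual(read: Read) -> float:
    """Mean base quality of a read; 0.0 for an empty read."""
    return sum(read.quals) / len(read.quals) if read.quals else 0.0


def keep_one(a: Read, b: Read, merge: str = "mapqual") -> Read:
    """Decide which of two duplicate reads to keep.

    "any" keeps the first read, "mapqual" keeps the higher MAPQ,
    "avgqual" keeps the higher average base quality.
    Ties go to the first read (an assumption of this sketch).
    """
    if merge == "any":
        return a
    if merge == "mapqual":
        return a if a.mapq >= b.mapq else b
    if merge == "avgqual":
        return a if avg_qual(a) >= avg_qual(b) else b
    raise ValueError(f"unknown merge method: {merge}")


def pick_representative(reads: List[Read], merge: str = "mapqual") -> Read:
    """Apply the pairwise rule across a whole group of duplicates."""
    if not reads:
        raise ValueError("empty duplicate group")
    return reduce(lambda kept, nxt: keep_one(kept, nxt, merge), reads)
```

Note that whichever option is used, one existing read is kept unchanged. No bases are corrected, which is why this is not a consensus.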
Thanks for your reply! But for low-VAF somatic variants (0.1% or less), a consensus of short reads is useful because errors introduced during sample preparation, library preparation, and sequencing are randomly distributed. Building a consensus sequence can therefore improve the accuracy of low-frequency somatic mutation detection in cfDNA. There are some tools for calculating consensus sequences, e.g. fgbio, but they are very slow. Do you have any plans to add this feature? @Daniel-Liu-c0deb0t
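For readers unfamiliar with the idea, the following Python sketch shows why a consensus suppresses randomly distributed errors: reads sharing a UMI are compared column by column, and a base is kept only when a majority agrees. This is a simplified illustration, not how fgbio or any other tool works. The function names and the `min_frac` threshold are hypothetical, and the sketch assumes that reads in a family are already aligned to the same start position and have equal length. It also ignores base qualities, which real consensus callers typically take into account.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def consensus(seqs: List[str], min_frac: float = 0.5) -> str:
    """Per-position majority vote over equal-length sequences.

    A position becomes "N" unless the most common base is seen in
    strictly more than `min_frac` of the reads.
    """
    if not seqs:
        raise ValueError("no sequences given")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("sequences must have equal length")
    out = []
    for column in zip(*seqs):  # one tuple of bases per position
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) > min_frac else "N")
    return "".join(out)


def consensus_by_umi(reads: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Group (umi, sequence) pairs by UMI and call one consensus per family."""
    families: Dict[str, List[str]] = defaultdict(list)
    for umi, seq in reads:
        families[umi].append(seq)
    return {umi: consensus(seqs) for umi, seqs in families.items()}


reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACTTAC"),  # isolated error at position 3 is outvoted
    ("CCTA", "GGGTTT"),
]
print(consensus_by_umi(reads))  # {'AAGT': 'ACGTAC', 'CCTA': 'GGGTTT'}
```

An error present in only one read of a family is outvoted, whereas a true variant appears in every read derived from the same original molecule and survives.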
Ah I see, yeah it makes sense to do error correction. I don't have plans to add this feature right now unfortunately. I'm surprised this isn't a solved problem given that there has been a massive increase in UMI processing tools in the past few years.
@xiechangxiao feel free to bring this discussion over to fgbio.
We would welcome example datasets where fgbio tools perform poorly either in run time or analytically. We may even re-write them to achieve the performance that is needed (see how fqtk replaced the DemuxFastq tool). We also have ideas (we’ve prototyped publicly) to use MSA or POA to improve the consensus that is generated. And finally, we would welcome financial support for taking the time to implement these improvements, as that’s our business.
Thank you for providing such a good tool. I am analyzing high-depth sequencing data, and I want to correct sequencing errors while removing PCR duplicates. Does this tool compute a consensus sequence when it removes duplicates? And how does it choose the representative sequence?