HKU-BAL / Clair3

Clair3 - Symphonizing pileup and full-alignment for high-performance long-read variant calling
246 stars 27 forks source link

Try to remove the default depth filter in the pileup. Also fix the version #292

Closed dpryan79 closed 7 months ago

dpryan79 commented 7 months ago

By default, htslib has a maximum heap size sufficient to store a few thousand reads, resulting in clair3 only using this many reads for calling and reporting. In most use-cases that's not an issue, but in high-coverage amplicon-seq this is undesired, since you're more likely to get wildly incorrect fold-changes and other coverage metrics. I ran into the same issue in a tool I developed using htslib and there the fix turned out to be a simple 1-line change. It'd be good if someone confirmed that this small change fixes that issue :)

I also noticed that clair3 is reporting the wrong version in the headers it produces. I've updated that to 1.0.7, which would presumably be the next release. Ideally this would get updated automatically and though given that this is a mix of shell scripts and python that may be more hassle than it's worth.

aquaskyline commented 7 months ago

Thank you Devon for the PR. In your amplicon sequencing test case, how high the coverage can go? Using bam_mplp_set_maxcnt is the way to go. We can set it to a reasonably high value so most of the amplicon sequencing cases can be covered. But setting it to INT_MAX might be too high and makes the htslib vulnerable to malformed BAM input.

dpryan79 commented 7 months ago

Hi @aquaskyline ! I have up to ~1,000,000x data (yes, that is absurdly high, I didn't generate it). In practice samtools coverage has identical C code, so I wouldn't be too concerned with malformed BAM input causing issues (the samtools folks would have run into this already).

Cf. https://github.com/samtools/samtools/blob/d92b49e81b89144096dde8b64a96f262fe86edae/coverage.c#L561

aquaskyline commented 7 months ago

Understood. I will test out bam_mplp_set_maxcnt(mplp, 2^20);, which should be enough for your case. If no problem we will have it in v1.0.7. And thanks for spotting the version mistake in the header!

dpryan79 commented 7 months ago

Sounds great, thanks for the extremely fast reply!