Closed dpryan79 closed 7 months ago
Thank you Devon for the PR. In your amplicon sequencing test case, how high the coverage can go? Using bam_mplp_set_maxcnt
is the way to go. We can set it to a reasonably high value so most of the amplicon sequencing cases can be covered. But setting it to INT_MAX might be too high and makes the htslib vulnerable to malformed BAM input.
Hi @aquaskyline ! I have up to ~1,000,000x data (yes, that is absurdly high, I didn't generate it). In practice samtools coverage
has identical C code, so I wouldn't be too concerned with malformed BAM input causing issues (the samtools folks would have run into this already).
Cf. https://github.com/samtools/samtools/blob/d92b49e81b89144096dde8b64a96f262fe86edae/coverage.c#L561
Understood. I will test out bam_mplp_set_maxcnt(mplp, 2^20);
, which should be enough for your case. If no problem we will have it in v1.0.7. And thanks for spotting the version mistake in the header!
Sounds great, thanks for the extremely fast reply!
By default, htslib has a maximum heap size sufficient to store a few thousand reads, resulting in clair3 only using this many reads for calling and reporting. In most use-cases that's not an issue, but in high-coverage amplicon-seq this is undesired, since you're more likely to get wildly incorrect fold-changes and other coverage metrics. I ran into the same issue in a tool I developed using htslib and there the fix turned out to be a simple 1-line change. It'd be good if someone confirmed that this small change fixes that issue :)
I also noticed that clair3 is reporting the wrong version in the headers it produces. I've updated that to 1.0.7, which would presumably be the next release. Ideally this would get updated automatically and though given that this is a mix of shell scripts and python that may be more hassle than it's worth.