Open drtconway opened 2 months ago
Many thanks Tom. I have some ideas how to fix it but can't tell which one does less harm to other variant candidates. Is it possible that you send me a minibam containing only the reads covering chr12:132836482
? I gonna look into the details.
I'll have to consult - as I indicated, it's a clinical sample.
The smallest expansion would be to track RefCall
entries, either by adding to the tuple in the set, or perhaps keeping a separate set. Then, if the pileup variant is an indel, it "overrides" the refcall. As noted, indels are kind of special in the sense that they are recorded at a ref position immediately before the actual variant. If the variant from the full alignment is not a RefCall
then it is probably ok to drop the variant from the pileup.
If you're keen, I would suggest refactoring the code to do a "classic" merge - iterating over the two VCFs at the same time. That way the cases where things collide will appear as a natural clause in the inner loop, and you don't need memory proportional to the size of the full alignment VCF.
Unfortunately, I can't give you a BAM with the reads.
Hi!
Thanks for making a good tool!
We have been doing testing and benchmarking with (clinical) trio data, and have found a corner case where the behaviour of the merge between the pileup calls and the full alignment call results in a real variant being discarded.
The pileup VCF file contains the row
And the full alignment has the row
The VCF produced by
MergeVcf
outputs theRefCall
, which causes the insertion to be dropped.It's a bit of a corner case, since the position of insertion variants in a VCF is the base before the inserted sequence, and that is ref, but in this case, we do want to report the insertion!
I don't fully understand how the positions of the
RefCall
variants are determined, so I can't tell if this is a freak event, or if there is a systematic process that means these collisions are likely to occur elsewhere. If you could explain how theseRefCall
events are generated, that would help us understand how the caller works.Looking at the code of
MergeVcf
(e.g. https://github.com/HKU-BAL/Clair3/blob/b975475a24eae2dc564acc5d4788d99599d83feb/preprocess/MergeVcf.py#L191 and https://github.com/HKU-BAL/Clair3/blob/b975475a24eae2dc564acc5d4788d99599d83feb/preprocess/MergeVcf.py#L228), I think I understand how the merge code is dropping the insertion event. I can think of a few possible ways to fix the code, but I haven't thought through all the semantics in detail, to work out what the best way to fix the code would be.Again, thanks for making a good tool, and I hope this report helps you make it even better.
Tom.