bigdatagenomics / bdg-formats

Open source formats for scalable genomic processing systems using Avro. Apache 2 licensed.

Add quality field to variant #146

Closed by fnothaft 7 years ago

fnothaft commented 7 years ago

Needed for https://github.com/bigdatagenomics/avocado/issues/253

heuermh commented 7 years ago

We discussed this earlier and decided it didn't make sense to store a quality score per variant because of our split-allelic model. Has that argument changed?

fnothaft commented 7 years ago

Yeah, I think that may've been a miscommunication. I don't think there's a meaningful way to generate a "correct" quality score for multiallelic sites that we split, but there's a lot of benchmarking tooling that relies on the variant quality being there, and the score is meaningful in the common case (a biallelic variant).

fnothaft commented 7 years ago

I may have also just changed my opinion over time. That said, I need the field. :-p

heuermh commented 7 years ago

> I don't think there's a meaningful way to generate a "correct" quality score for multiallelic sites that we split...

Right, that was the reason for dropping several fields that are reserved keys in the VCF specification from our formats. I don't have a problem bringing Variant.quality back, but what should we do in the splitting case?

fnothaft commented 7 years ago

Yeah, I'm not 100% sure there, but since we set the splitFromMultiallelic field to true, I'm currently leaning towards preserving the value and letting splitFromMultiallelic == true be advisory that the value might be wrong.
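
A minimal sketch of the approach proposed in the last comment, assuming a simplified stand-in for the record. Only `quality` and `splitFromMultiallelic` are named in the thread. The `Variant` dataclass, its other fields, and the helpers `split_site` and `quality_if_reliable` are hypothetical illustrations, not the bdg-formats schema or API. The site-level quality is copied unchanged onto each split record, and the flag marks it as possibly wrong.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    # Hypothetical stand-in for the Avro Variant record. Only quality and
    # split_from_multiallelic (splitFromMultiallelic) come from the thread.
    contig: str
    start: int
    reference_allele: str
    alternate_allele: str
    quality: Optional[float] = None
    split_from_multiallelic: bool = False


def split_site(contig: str, start: int, ref: str, alts: List[str],
               quality: Optional[float]) -> List[Variant]:
    """Emit one Variant per alternate allele, preserving the site quality."""
    # A site with more than one alternate allele is multiallelic, so every
    # record produced from it carries the advisory flag.
    was_split = len(alts) > 1
    return [Variant(contig, start, ref, alt, quality, was_split) for alt in alts]


def quality_if_reliable(variant: Variant) -> Optional[float]:
    """One possible consumer policy (an assumption, not from the thread):
    ignore the quality when the advisory flag says it may be wrong."""
    return None if variant.split_from_multiallelic else variant.quality
```

Under this sketch, a biallelic site keeps a usable quality for benchmarking tools, while a consumer that cares about correctness at split sites can check the flag before trusting the value. Other policies, such as using the value anyway and only logging a warning, fit the same "advisory" reading.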