ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

LinearAlignment.mappingQuality Scale is Undefined #229

Closed adamnovak closed 9 years ago

adamnovak commented 9 years ago

Per @jeromekelleher's comments in #198, the mappingQuality field of LinearAlignment, which I also included in my new GraphAlignment, does not document how its integer value corresponds to "how likely the read maps to this position as opposed to other locations".

Presumably this is meant to be a Phred-scaled value, -10 log10(P(alignment is wrong)). However, it's a little unclear what "wrong" means in this context, and how values like 0 or 255 (which I believe sometimes have special behavior for some tools that work with SAM/BAM MAPQ) ought to be handled.

How, if at all, should we elaborate on the SAM specification here?
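
To make the question concrete, here is a minimal sketch of the Phred-scale reading assumed above, written in Python. The function names and the cap of 254 are hypothetical and appear in neither schema. The only special case handled is 255, which the SAM specification defines as "mapping quality not available". A value of 0 is passed through as a plain Phred value (error probability 1.0), since any special meaning some tools attach to it is exactly what remains undocumented.

```python
import math

# SAM specification: a MAPQ of 255 means the mapping quality is not available.
MAPQ_UNAVAILABLE = 255


def mapq_to_error_probability(mapq):
    """Return P(mapping position is wrong) for a Phred-scaled MAPQ.

    Returns None when the value is the "not available" sentinel.
    """
    if mapq == MAPQ_UNAVAILABLE:
        return None
    if mapq < 0:
        raise ValueError("MAPQ must be non-negative")
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_wrong, cap=254):
    """Convert an error probability to a rounded, capped Phred-scaled MAPQ.

    The cap (an assumption of this sketch) keeps results below the
    255 sentinel, including the p_wrong == 0 case where the log is undefined.
    """
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if p_wrong == 0.0:
        return cap
    return min(cap, int(round(-10 * math.log10(p_wrong))))
```

Under this reading a mappingQuality of 20 corresponds to a 1 in 100 chance that the reported position is wrong, and 30 to 1 in 1000.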

delagoya commented 9 years ago

Please add a documentation PR to this issue if you want to move it forward. Thanks!

ekg commented 9 years ago

@adamnovak I'm not sure the spec should encode this for exactly the reason you note:

However, it's a little unclear what "wrong" means in this context, and how values like 0 or 255 (which I believe sometimes have special behavior for some tools that work with SAM/BAM MAPQ) ought to be handled.

Things of this nature should be defined in metadata. There should be a very simple standard for saying "this field on this object means X" in a header. We can suggest standard usage of particular fields and then get out of the way.

For example, in the header of my graph alignment results, I could specifically state that the mapping quality was assigned by the algorithm defined in this function at this git commit.
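
A rough sketch of what such a header entry might look like, in Python. Everything here is hypothetical: the `FieldDefinition` class, its attribute names, and the `build_header` helper are illustrations of the idea, not part of the GA4GH schemas, and the `source` value is a placeholder rather than a real function or commit.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FieldDefinition:
    """One "this field on this object means X" statement (hypothetical)."""
    object_type: str   # e.g. "GraphAlignment"
    field_name: str    # e.g. "mappingQuality"
    description: str   # human-readable meaning of the value
    scale: str         # how to interpret the number, e.g. "phred"
    source: str        # where the value was computed (placeholder below)


def build_header(definitions):
    """Index the definitions by "ObjectType.fieldName" for easy lookup."""
    return {
        f"{d.object_type}.{d.field_name}": asdict(d) for d in definitions
    }


header = build_header([
    FieldDefinition(
        object_type="GraphAlignment",
        field_name="mappingQuality",
        description="Confidence that the read maps to this position",
        scale="phred",
        source="<function name>@<git commit>",  # placeholder, not a real reference
    ),
])

# The header is plain data, so it can travel alongside the results as JSON.
print(json.dumps(header, indent=2))
```

A consumer would look up `header["GraphAlignment.mappingQuality"]` before interpreting the integer, rather than relying on a meaning fixed in the schema.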

I actually think that VCF does well in this regard. I've wished many times that all the fields, including the quality, were defined in the header and only used in particular ways by convention. The positional, required fields have caused the greatest pain when writing methods to work with data in VCF, except when using "dumb" tools on the unix command line. (The only thing worse is the genotype likelihoods, which are embedded in a format that requires a special algorithm to unpack and doesn't even work for mixed ploidy.)

delagoya commented 9 years ago

@adamnovak would you like to keep this issue open? Is there an associated PR?

delagoya commented 9 years ago

I am still not seeing a PR for this issue, and there have been no comments in 28 days. Closing.