broadinstitute / gamgee

A C++14 library for NGS data formats
http://broadinstitute.github.io/gamgee/
MIT License
40 stars 13 forks source link

VariantHeader equality operator improvements #299

Open jmthibault79 opened 9 years ago

jmthibault79 commented 9 years ago

We saw some strange results while reviewing #293, so we should improve these operators and write more comprehensive tests for them.

One example: calling variant_individual_field_type() using a shared field name apparently succeeded.

We should also compare additional field details, like number and description.

MauricioCarneiro commented 9 years ago

in HTSLIB the shared and individual field names are encoded on the same header dictionary, so yes, requesting an individual field with a shared field will always "work".

droazen commented 9 years ago

Same dictionary, yes, but info, format, and filters are stored in separate array entries within each position in the dictionary, and the types can be different.

MauricioCarneiro commented 9 years ago

field index of shared and individual fields are the same. For example a DP in shared and individual will have the same index in each array (shared/indiv)

droazen commented 9 years ago

They have the same index, but within each entry there is a subarray for each type (info, format, filter), with separate type information

droazen commented 9 years ago

Eg., bcf_hdr_id2type() does this:

(hdr)->id[BCF_DT_ID][int_id].val->info[type]>>4 & 0xf

int_id is the index, but the info array (where type information is stored) is indexed by the BCFHT* type

droazen commented 9 years ago

Therefore, calling bcf_hdr_id2type(m_header.get(), BCF_HL_INFO, index); (which VariantHeader::shared_field_type() does) is not the same as calling bcf_hdr_id2type(m_header.get(), BCF_HL_FMT, index); (which VariantHeader::individual_field_type() does) -- they will access different elements in the info array within (hdr)->id[BCF_DT_ID][int_id].val

MauricioCarneiro commented 9 years ago

So what is the proposed change here?

droazen commented 9 years ago

This should have a bug label -- it is a bug, just not well-described

MauricioCarneiro commented 9 years ago

I don't quite understand why it's a bug. Just seems like the requirements weren't well defined and as implemented it accepts more than it should.

droazen commented 9 years ago

It's a bug -- (hdr)->id[BCF_DT_ID][int_id].val->info[BCF_HL_INFO] and (hdr)->id[BCF_DT_ID][int_id].val->info[BCF_HL_FMT] are independent values, and so the calls bcf_hdr_id2type(m_header.get(), BCF_HL_INFO, index) and bcf_hdr_id2type(m_header.get(), BCF_HL_FMT, index) should not be interchangeable, but they were found to be in this case.