broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.64k stars 579 forks source link

Bad sequence dictionaries in gCNV test data #6957

Open ldgauthier opened 3 years ago

ldgauthier commented 3 years ago

Now that there's more rigorous sequence dictionary validation a bunch of dictionaries don't jive with the reference, especially files of the form src/test/resources/org/broadinstitute/hellbender/tools/copynumber/gcnv-postprocess/shard_0-calls/interval_list.tsv

samuelklee commented 3 years ago

Aren't those consistent with the count files in src/test/resources/org/broadinstitute/hellbender/tools/copynumber/gcnv-sim-data/?

See some possibly related discussion in https://github.com/broadinstitute/gatk/issues/6924

However, it's possible that test data needs resyncing more generally, including the model/call files used in the WDL tests.

ldgauthier commented 3 years ago

Those dictionaries in the counts are the same, so they agree with the calls and the model shards, but they don't agree with the reference. I have seen no reference where chromosome 1 is smaller than chromosome 2.

samuelklee commented 3 years ago

These are count files containing counts simulated according to the gCNV model, and are used as the primary inputs to generate the rest of the test files downstream. The dictionary is fictional and immaterial, so long as it is carried forward consistently from the count files to downstream outputs (which the tools do automatically). You could pretend that we started with simulated BAMs that were all aligned w.r.t. this fictional reference/dictionary. (It's just more work to simulate at the BAM level, which is why we start with the counts; simulating BAMs would also not increase test coverage in any meaningful way.)

There are already a number of checks to guarantee dictionary consistency, but as you can see from that issue, there are a few scenarios where dictionaries from intervals might not be available via the engine. I don't think there should be a need to provide or check against external references/dictionaries at downstream steps; see point 4 in https://github.com/broadinstitute/gatk/issues/6924#issuecomment-719576249 about reverting the code change to allow external dictionaries in PostprocessGermlineCNVCalls.

If you like, you can switch the dictionaries in the test files to something real/canonical, but I would still revert the code change. Again, I'm not sure if there was some additional context that I'm missing---e.g., is there some big analysis going on where they need to override sequence dictionaries and/or checks due to mismatched dictionaries in the BAMs? If so, could this be addressed with e.g. UpdateVcfSequenceDictionary or some other solution?

samuelklee commented 3 years ago

Just to provide additional context, all of the machinery in the AbstractRecordCollection classes for reading/writing CNV input/output TSVs was meant to make passing metadata (dictionaries, sample names, etc.) from tool to tool as automatic and consistent as possible. This avoids having to re-provide sample names, dictionaries, etc. at each tool/step---which often led to dictionary inconsistencies, contig/sample ordering bugs, etc. in older versions of the pipelines---at the cost of 1) redundantly carrying along this metadata in input/output TSVs, and 2) requiring consistent dictionaries in all initial BAM inputs. I think these are small costs to pay.

ldgauthier commented 3 years ago

This is very useful test data, but the fact that the sequence dictionary doesn't match any known reference precludes it from being used with some other tools. If it's easier to construct a fake reference that goes with these, then we could do that.

samuelklee commented 3 years ago

If that's the case, it might indeed be worth switching dictionaries/contigs to something real. You can probably shift the interval starts/ends as well, if you need them to overlap in some way with other test data, but I'd keep the split by contig and the bin sizes the same.

samuelklee commented 3 years ago

Hmm, now that I look at the code, I confused myself at some point about whether PostprocessGermlineCNVCalls allows overriding of the dictionary---ignore all the rigmarole about point 4 on the other issue, I think there actually isn't any code change we need to revert. (Guess this is what happens when looking at code/issues across the appropriate combination of paternity leave, weekends, and Friday evening, as well as when referencing out-of-date tutorials...)