broadinstitute / seqr-loading-pipelines

hail-based pipelines for annotating variant callsets and exporting them to elasticsearch
MIT License

Refactor dataset CONFIG dict into models #772

Open jklugherz opened 4 months ago

jklugherz commented 4 months ago

Throughout the v03 pipeline code we rely heavily on the CONFIG dictionary defined in v03_pipeline/lib/reference_data/config.py, which holds external dataset configurations.

Refactor this dict into something more object-oriented (such as enums or Python dataclasses) and replace all references to the dict, including the many mock config dicts in tests (like this), to make the code cleaner and more maintainable. One issue with the dict is that we have no way to enforce requirements such as "if one key is present, another must also be present."
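As a rough illustration of the kind of requirement a model could enforce, here is a minimal sketch using a frozen dataclass with a `__post_init__` check. The class name `DatasetConfig` and the fields `enum_field` and `enum_values` are hypothetical and are not taken from the existing CONFIG dict; they only stand in for any pair of keys that must appear together.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical stand-in for one entry of the CONFIG dict."""

    name: str
    path: str
    # Hypothetical co-dependent pair: both must be set, or neither.
    enum_field: Optional[str] = None
    enum_values: Optional[Tuple[str, ...]] = None

    def __post_init__(self) -> None:
        # Reject configs where only one half of the pair is provided.
        if (self.enum_field is None) != (self.enum_values is None):
            raise ValueError(
                f'{self.name}: enum_field and enum_values must be set together',
            )


# Valid: neither optional field is set.
plain = DatasetConfig(name='example_dataset', path='/tmp/example.ht')

# Valid: both optional fields are set.
with_enum = DatasetConfig(
    name='example_dataset',
    path='/tmp/example.ht',
    enum_field='consequence',
    enum_values=('missense', 'synonymous'),
)
```

With a plain dict, the same mistake would only surface when some downstream code looks up the missing key; here it fails at construction time, which should also make the mock configs in tests harder to get subtly wrong.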

This is also a good time to add ReferenceGenome as an attribute of the dataset (suggested in a code review here).
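One possible shape for this, sketched under assumptions: the `ReferenceGenome` enum below is a local stand-in for the pipeline's own definition, and `ReferenceDataset`, `find_dataset`, the dataset name, and the paths are all made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class ReferenceGenome(Enum):
    # Local stand-in; the real pipeline would import its own enum.
    GRCh37 = 'GRCh37'
    GRCh38 = 'GRCh38'


@dataclass(frozen=True)
class ReferenceDataset:
    """Hypothetical model: one dataset for one genome build."""

    name: str
    reference_genome: ReferenceGenome
    path: str


def find_dataset(
    datasets: Iterable[ReferenceDataset],
    name: str,
    reference_genome: ReferenceGenome,
) -> ReferenceDataset:
    """Return the dataset matching both name and genome build."""
    for dataset in datasets:
        if dataset.name == name and dataset.reference_genome == reference_genome:
            return dataset
    raise KeyError(f'{name} is not configured for {reference_genome.value}')


# Made-up entries; the paths are placeholders, not real locations.
DATASETS = (
    ReferenceDataset('example_dataset', ReferenceGenome.GRCh37, '/data/GRCh37/example.ht'),
    ReferenceDataset('example_dataset', ReferenceGenome.GRCh38, '/data/GRCh38/example.ht'),
)

grch38 = find_dataset(DATASETS, 'example_dataset', ReferenceGenome.GRCh38)
```

Carrying the genome build on the dataset itself means callers ask for a (name, build) pair and get a typed object back, rather than indexing a nested dict by string keys.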