exomiser / Exomiser

A Tool to Annotate and Prioritize Exome Variants
https://exomiser.readthedocs.io
GNU Affero General Public License v3.0

Create long-term archival format for variant data #503

Open julesjacobsen opened 11 months ago

julesjacobsen commented 11 months ago

The MVStore file format is not guaranteed to remain stable from one minor release to the next, e.g. 2.1.x -> 2.2.x changes the format version from 2 -> 3, rendering existing data unreadable once the H2 version is updated.

For the H2 database there is an Upgrade utility which will export the data and import it into the new version. For MVStore files, however, there is no supported migration path (https://github.com/h2database/h2database/pull/3834#issuecomment-1639924153).

Consequently, we'll need to slightly rethink our new (v14+) variant database build strategy, which previously merged a bunch of pre-parsed MVStore files for gnomAD, UK10K, ESP, ALFA and dbNSFP to create the final variants.mv.db release file. Instead, it might be better to store them as gzip-compressed protobuf, which will be a lot quicker and easier to import into a new MVStore than the original files (especially gnomAD v3) and will also handle schema evolution.
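To make the archival idea concrete, here is a rough sketch of a gzip-compressed stream of length-prefixed records, using only the JDK. It is not Exomiser's actual format: the class name `GzipRecordArchive` and the 4-byte length prefix are assumptions for illustration, and each `byte[]` stands in for one serialized protobuf message. With the protobuf Java library, the equivalent framing would normally come from `writeDelimitedTo` and `parseDelimitedFrom`, which use a varint length prefix instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: a gzip stream of length-prefixed binary records.
// Each record stands in for one serialized protobuf message.
public class GzipRecordArchive {

    // Writes every record as [4-byte big-endian length][payload] into a gzip stream.
    public static byte[] write(List<byte[]> records) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(buffer))) {
            for (byte[] record : records) {
                out.writeInt(record.length);
                out.write(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip trailer is only complete once the stream has been closed above.
        return buffer.toByteArray();
    }

    // Reads records back in the order they were written.
    public static List<byte[]> read(byte[] archive) {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in =
                 new DataInputStream(new GZIPInputStream(new ByteArrayInputStream(archive)))) {
            while (true) {
                int length;
                try {
                    length = in.readInt();
                } catch (EOFException endOfArchive) {
                    // No further length prefix could be read, so stop.
                    // A production reader should distinguish a clean end from a truncated prefix.
                    break;
                }
                byte[] record = new byte[length];
                in.readFully(record); // throws EOFException if the payload is truncated
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

Because the reader only needs the framing plus the message schema, an archive like this does not depend on the MVStore format version, so a new MVStore can be rebuilt from it after an H2 upgrade.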

Why not just re-index the original VCF or create a new VCF? Well, it's yet another transformation to go through, it takes a lot longer to parse the info from the file, and the file sizes are a lot larger than the protobuf.