Request for pgen files to include a version number in the header

cmnbroad commented 9 months ago

We (Broad Institute/All of Us) are planning to start generating some large pgen datasets using a pgen writer that we've implemented, and are a bit concerned that the pgen file format doesn't seem to include an embedded version number. Our pgen writer code is based on an alpha version of plink2, and we're concerned that if the file format definition ever needs to change, there is no way for plink2 or other consumers to detect that a given file is from a past or future version of the format, and is therefore incompatible with the executing code.

Is there any way to handle this currently, and if not, can a version number be added?

chrchang commented 9 months ago

In the current design, the storage-mode (3rd) and header-format byte(s) (usually just the 12th byte, but allowed by the spec to be multibyte) serve this function. (Note that variant-major .bed files are treated as an old pgen version under this scheme.)
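For anyone following along, here is a minimal Python sketch of what checking those bytes could look like. It is not pgenlib code: `peek_pgen_header` is a hypothetical helper, and it assumes the mode-0x10 layout of a 2-byte magic number (0x6c 0x1b), the storage-mode byte, then little-endian uint32 variant and sample counts, which puts the header-format byte at position 12.

```python
import struct

PGEN_MAGIC = b"\x6c\x1b"  # magic number shared by variant-major .bed and .pgen


def peek_pgen_header(buf: bytes) -> dict:
    """Hypothetical helper: report the bytes a reader can use as a version check."""
    if len(buf) < 3 or buf[:2] != PGEN_MAGIC:
        raise ValueError("bad magic number: not a .bed/.pgen file")
    info = {"storage_mode": buf[2]}  # 3rd byte; 0x01 is a variant-major .bed
    if buf[2] == 0x10:
        if len(buf) < 12:
            raise ValueError("truncated mode-0x10 header")
        # Assumed layout: two little-endian uint32 counts, then the
        # (first) header-format byte as the 12th byte of the file.
        variant_ct, sample_ct = struct.unpack_from("<II", buf, 3)
        info.update(variant_ct=variant_ct, sample_ct=sample_ct,
                    header_format=buf[11])
    return info
```

A reader built this way can refuse any storage mode or header-format value it does not recognize, which is the role a version number would otherwise play.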

chrchang commented 9 months ago

One explicit question raised by the current design is, what happens when the multiallelic-dosage storage format is finalized? I have tried to set things up so that an older pgen reader will be able to read multiallelic possibly-phased hardcalls from a future pgen generated by a writer that has also chosen to store dosages. This is similar to how one can write a working pgen reader that basically ignores all dosage-related parts of the specification, or all dosage- and all phase-related parts, or all dosage- and all phase- and all-multiallelic-variant-related parts. So there is one future backward-compatible update that is essentially hardcoded into the current spec.

But the general case is handled by defining new storage-modes, which may or may not have nice backward-compatibility properties. E.g. the existing pgenlib code includes comments expressing an intent to make 0x11 identical to 0x10 except for a new phase-set data track; if something like that ends up happening, that amounts to a mostly-backward-compatible version update, and it won't be difficult to patch an existing reader to accept the new version while ignoring the new data track.

chrchang commented 9 months ago

I am open to making a near-future spec change defining storage-mode 0x11 as "0x10 with a version number", with guarantees on what won't change with each type of version number update.

cmnbroad commented 9 months ago

It does seem like your last suggestion (0x11 as "0x10 with a version number") would be a step in the right direction, though in the best case I would even go one step further, and put a real spec version number right after the magic number in the header.

Also, as a related issue, I would have liked to have some way for my pgen writer to tag the pgen files I generate as being sourced by my writer (as opposed to by plink directly), in order to distinguish them in the future in case there is any issue. In my case, it's mostly pgenlib code doing the actual writing anyway, but I have a C++ layer in front of that, and a Java JNI layer in front of that. So I settled for stamping the VCF header in the accompanying .pvar, similar to what plink2 does, but with my code's version number in it, e.g., ##source="Broad Institute PGEN/PVAR writer version=0.2.8b-SNAPSHOT". So there is at least some provenance.
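To illustrate the stamping approach, here is a rough Python sketch rather than the actual Broad writer code. `add_source_line` is a hypothetical helper that assumes the .pvar header is held as a list of lines and inserts the stamp after the existing `##` meta lines, just before the column header.

```python
def add_source_line(pvar_lines, writer_name, version):
    """Hypothetical helper: return a copy of the lines with a ##source stamp."""
    stamp = f'##source="{writer_name} version={version}"'
    for i, line in enumerate(pvar_lines):
        if not line.startswith("##"):
            # First non-meta line (the #CHROM header): insert just before it.
            return pvar_lines[:i] + [stamp] + pvar_lines[i:]
    return pvar_lines + [stamp]  # file held only meta lines


header = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT"]
stamped = add_source_line(header, "Broad Institute PGEN/PVAR writer",
                          "0.2.8b-SNAPSHOT")
```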

chrchang commented 7 months ago

Planning to make the following addition to the specification soon (and release the corresponding plink2 / pgenlibr / Python pgenlib forward-compatibility updates); let me know if you see any problems.

0x11: Mode 0x10 with ignorable extensions. This adds a few bytes to the end of the header, possibly a few bytes to the end of the .pgen file, and can in principle introduce references to other files.

The body of the header (outside this third byte) is as in mode 0x10. The following is appended:

1. A flag varint describing which header extensions are present. As of this writing, the following header extension is defined:
   0x2: PGEN writer identifier
   The 0x1 flag can be safely used by developers for their own purposes.
2. A flag varint describing which footer extensions are present. As of this writing, no footer extensions are defined. The 0x1 flag can be safely used by developers for their own purposes: if footer extension(s) are added to future versions of this specification, they will be assigned higher bits.
3. If at least one footer extension is present, uint64 byte offset of footer start.
4. For each present header extension, a varint indicating its byte length. E.g. if the header-extension flag varint is 0x7, this part will have three varints, corresponding to flag bits 0x1, 0x2, and 0x4 in that order.
5. Bodies of header extensions, in order.

The body of the footer corresponds to (4) followed by (5) above.

The PGEN writer identifier is a UTF-8 string, with no terminator.
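As a rough illustration of the appended section, here is a Python sketch that serializes and parses it. This is not the pgenlib implementation: all function names are hypothetical, varints are assumed to be little-endian base-128 (low 7 bits first, high bit set on every byte except the last), the footer offset is assumed to be little-endian, and parsing starts at the first appended byte rather than locating the end of the mode-0x10 header.

```python
import struct


def encode_varint(value: int) -> bytes:
    """Assumed varint encoding: 7 bits per byte, least-significant group first."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(buf: bytes, pos: int):
    """Return (value, position of the byte after the varint)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def set_bits(flags: int):
    """Yield each set flag bit (0x1, 0x2, 0x4, ...) in ascending order."""
    bit = 1
    while bit <= flags:
        if flags & bit:
            yield bit
        bit <<= 1


def build_header_extensions(header_exts: dict) -> bytes:
    """header_exts maps a single flag bit to its body; no footer is written."""
    bits = sorted(header_exts)
    out = encode_varint(sum(bits)) + encode_varint(0)  # header flags, footer flags
    out += b"".join(encode_varint(len(header_exts[b])) for b in bits)  # lengths
    return out + b"".join(header_exts[b] for b in bits)  # bodies, in order


def parse_header_extensions(buf: bytes, pos: int = 0) -> dict:
    header_flags, pos = decode_varint(buf, pos)
    footer_flags, pos = decode_varint(buf, pos)
    footer_offset = None
    if footer_flags:  # the offset is present only with a footer extension
        (footer_offset,) = struct.unpack_from("<Q", buf, pos)
        pos += 8
    lengths = []
    for bit in set_bits(header_flags):
        length, pos = decode_varint(buf, pos)
        lengths.append((bit, length))
    bodies = {}
    for bit, length in lengths:
        bodies[bit] = buf[pos:pos + length]
        pos += length
    return {"footer_flags": footer_flags, "footer_offset": footer_offset,
            "header_exts": bodies, "end": pos}


blob = build_header_extensions({0x2: "my-writer 0.1".encode("utf-8")})
parsed = parse_header_extensions(blob)
writer_id = parsed["header_exts"][0x2].decode("utf-8")  # no terminator to strip
```

Because every extension body is preceded by its byte length, a reader that does not understand a given flag bit can skip that body, which is what makes the extensions ignorable.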

0x21: Mode 0x20 with ignorable extensions. Header file has mode byte 0x31, extensions work the same way as for mode 0x11.

cmnbroad commented 6 months ago

Seems reasonable enough - thanks for the updates. Will the 0x2 PGEN writer identifier be exposed via the API, e.g., plink2::SpgwInitPhase1, or STPgenWriterStruct or something similar, so that we can pass it through to pgenlib from our writer code?

chrchang commented 6 months ago

Yes, this will be exposed soon in the C/C++ API.

chrchang commented 6 months ago

Specification has been updated.

Sample plink2::SpgwInitPhase1Ex() call (can be invoked by "--make-pgen writer-ver"): https://github.com/chrchang/plink-ng/commit/dfabcf669d96f4c531a851d43d66f8e50e5fa554#diff-2d7f1e57f1cc7e7dc7a6a9f6f98414c7d839533d0460962aad574e87f0160e82R6474

Sample reading logic (can be invoked by --pgen-info): https://github.com/chrchang/plink-ng/commit/dfabcf669d96f4c531a851d43d66f8e50e5fa554#diff-e3af826ba80b15835975fa856c1081bdd94a6501db99e134ab3f61b4245d43d0R1101 https://github.com/chrchang/plink-ng/commit/dfabcf669d96f4c531a851d43d66f8e50e5fa554#diff-e3af826ba80b15835975fa856c1081bdd94a6501db99e134ab3f61b4245d43d0R1137

chrchang / plink-ng

Request for pgen files to include a version number in the header #258