Open cmnbroad opened 9 months ago
In the current design, the storage-mode (3rd) and header-format byte(s) (usually just the 12th byte, but allowed by the spec to be multibyte) serve this function. (Note that variant-major .bed files are treated as an old pgen version under this scheme.)
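To make the role of the storage-mode byte concrete, here is a minimal Python sketch of how a reader might extract it before deciding how to parse the rest of the file. It assumes the file begins with the two-byte magic number 0x6c 0x1b shared by .bed and .pgen files, and that the storage mode is the third byte as described above. The function name is hypothetical and is not part of pgenlib.

```python
PGEN_MAGIC = b"\x6c\x1b"  # assumed magic number shared by .bed and .pgen files


def read_storage_mode(header: bytes) -> int:
    """Return the storage-mode (3rd) byte of a .bed/.pgen header.

    Hypothetical helper for illustration only; not a pgenlib function.
    """
    if len(header) < 3:
        raise ValueError("header too short to contain a storage-mode byte")
    if header[:2] != PGEN_MAGIC:
        raise ValueError("magic number mismatch: not a .bed/.pgen file")
    return header[2]
```

A reader built this way can reject any storage mode it does not recognize, which is the only form of version detection the current design offers.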
One explicit question raised by the current design is, what happens when the multiallelic-dosage storage format is finalized? I have tried to set things up so that an older pgen reader will be able to read multiallelic possibly-phased hardcalls from a future pgen generated by a writer that has also chosen to store dosages. This is similar to how one can write a working pgen reader that basically ignores all dosage-related parts of the specification, or all dosage- and all phase-related parts, or all dosage- and all phase- and all-multiallelic-variant-related parts. So there is one future backward-compatible update that is essentially hardcoded into the current spec.
But the general case is handled by defining new storage-modes, which may or may not have nice backward-compatibility properties. E.g. the existing pgenlib code includes comments expressing an intent to make 0x11 identical to 0x10 except for a new phase-set data track; if something like that ends up happening, that amounts to a mostly-backward-compatible version update, and it won't be difficult to patch an existing reader to accept the new version while ignoring the new data track.
I am open to making a near-future spec change defining storage-mode 0x11 as "0x10 with a version number", with guarantees on what won't change with each type of version number update.
It does seem like your last suggestion (0x11 as "0x10 with a version number") would be a step in the right direction, though in the best case I would even go one step further, and put a real spec version number right after the magic number in the header.
Also, as a related issue, I would have liked to have some way for my pgen writer to tag the pgen files I generate as being produced by my writer (as opposed to by plink directly), in order to distinguish them in the future in case there is any issue. In my case, it's mostly pgenlib code doing the actual writing anyway, but I have a C++ layer in front of that, and a Java JNI layer in front of that. So I settled for stamping the VCF header in the accompanying .pvar, similar to what plink2 does, but with my code's version number in it, i.e., ##source="Broad Institute PGEN/PVAR writer version=0.2.8b-SNAPSHOT". So there is at least some provenance.
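A rough Python sketch of this kind of .pvar stamping follows. It assumes the .pvar header is held as a list of lines and that the column-header line starts with #CHROM, with ## meta lines placed before it. The function name and arguments are hypothetical; this is not the actual Broad writer or plink2 code.

```python
from typing import List


def stamp_pvar_header(lines: List[str], writer: str, version: str) -> List[str]:
    """Return a copy of the header with a ##source line before #CHROM.

    Hypothetical helper; assumes a #CHROM column-header line is present.
    """
    source = f'##source="{writer} version={version}"'
    for i, line in enumerate(lines):
        if line.startswith("#CHROM"):
            # Meta lines (##...) must precede the column-header line.
            return lines[:i] + [source] + lines[i:]
    raise ValueError("no #CHROM header line found")
```

This records provenance only in the .pvar, so it is lost if the .pgen is separated from its companion file.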
Planning to make the following addition to the specification soon (and release the corresponding plink2 / pgenlibr / Python pgenlib forward-compatibility updates); let me know if you see any problems.
0x11: Mode 0x10 with ignorable extensions. This adds a few bytes to the end of the header, possibly a few bytes to the end of the .pgen file, and can in principle introduce references to other files.
The body of the header (outside this third byte) is as in mode 0x10. The following is appended:
The body of the footer corresponds to (4) followed by (5) above.
The PGEN writer identifier is a UTF-8 string, with no terminator.
0x21: Mode 0x20 with ignorable extensions. Header file has mode byte 0x31, extensions work the same way as for mode 0x11.
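As a sketch of what "ignorable extensions" could mean for an existing reader, the Python fragment below maps each extension-carrying mode to the base mode whose header body it shares, so the reader can reuse its existing parsing path. Only the mode pairings (0x11 with 0x10, 0x21 with 0x20) come from the text above; the names are hypothetical, and the layout of the appended extension bytes is deliberately not modeled here.

```python
from typing import Tuple

# Extension-carrying modes mapped to the base mode they extend,
# per the proposed addition (0x11 extends 0x10, 0x21 extends 0x20).
BASE_MODE = {0x11: 0x10, 0x21: 0x20}


def normalize_mode(mode: int) -> Tuple[int, bool]:
    """Return (base_mode, has_extensions) for a storage-mode byte.

    Hypothetical helper; a reader that ignores extensions can parse
    the header body as base_mode and skip the appended bytes.
    """
    if mode in BASE_MODE:
        return BASE_MODE[mode], True
    return mode, False
```

A reader that does not understand the extensions would still need to know how many appended header bytes to skip, which depends on the appended fields defined in the specification.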
Seems reasonable enough - thanks for the updates. Will the 0x2 PGEN writer identifier be exposed via the API, e.g., plink2::SpgwInitPhase1, or STPgenWriterStruct, or something similar, so that we can pass it through to pgenlib from our writer code?
Yes, this will be exposed soon in the C/C++ API.
Specification has been updated.
Sample plink2::SpgwInitPhase1Ex() call (can be invoked by "--make-pgen writer-ver"): https://github.com/chrchang/plink-ng/commit/dfabcf669d96f4c531a851d43d66f8e50e5fa554#diff-2d7f1e57f1cc7e7dc7a6a9f6f98414c7d839533d0460962aad574e87f0160e82R6474
Sample reading logic (can be invoked by --pgen-info): https://github.com/chrchang/plink-ng/commit/dfabcf669d96f4c531a851d43d66f8e50e5fa554#diff-e3af826ba80b15835975fa856c1081bdd94a6501db99e134ab3f61b4245d43d0R1101 https://github.com/chrchang/plink-ng/commit/dfabcf669d96f4c531a851d43d66f8e50e5fa554#diff-e3af826ba80b15835975fa856c1081bdd94a6501db99e134ab3f61b4245d43d0R1137
We (Broad Institute/All of Us) are planning to start generating some large pgen datasets using a pgen writer that we've implemented, and are a bit concerned that the pgen file format doesn't seem to include an embedded version number. Our pgen writer code is based on an alpha version of plink2, and we're concerned that if the file format definition ever needs to change, there is no way for plink2 or other consumers to detect that a given file is from a past or future version of the format, and is therefore incompatible with the executing code.
Is there any way to handle this currently, and if not, can a version number be added?