Open dthaler opened 2 years ago
Discussed in steering committee meeting. Several anecdotes of file sizes getting larger were shared, in some cases causing inconvenience to programmers and users. We discussed several possible solutions (URI prefixes, extension tags, etc.). However, we did not deem fixing it to be a high priority.
I've recently been working on converting GEDCOM-X to GEDCOM 7, and in the process have found that the URIs of EXID.TYPEs account for over 25% of the file size. Thinking about how this could be simpler, I propose we fix this for 7.1.
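For anyone who wants to check a similar figure on their own files, here is a minimal sketch of how the share of bytes spent on EXID.TYPE payloads could be measured. The function name `exid_type_share` is hypothetical, and the line parsing is deliberately simplified: it assumes one structure per line, single-space delimiters, and UTF-8 text, and it only counts the TYPE payload itself.

```python
def exid_type_share(gedcom_text: str) -> float:
    """Return the fraction of the text's UTF-8 bytes used by EXID.TYPE payloads.

    Simplified parsing (an assumption of this sketch): each line is
    "LEVEL [XREF] TAG [PAYLOAD]" separated by single spaces.
    """
    total = len(gedcom_text.encode("utf-8"))
    if total == 0:
        return 0.0

    uri_bytes = 0
    exid_level = None  # level of the EXID structure we are currently inside, if any

    for line in gedcom_text.splitlines():
        parts = line.split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        level, tag = int(parts[0]), parts[1]
        payload = parts[2] if len(parts) > 2 else ""

        # A line at the same or a shallower level closes the open EXID.
        if exid_level is not None and level <= exid_level:
            exid_level = None

        if tag == "EXID":
            exid_level = level
        elif tag == "TYPE" and exid_level is not None and level == exid_level + 1:
            uri_bytes += len(payload.encode("utf-8"))

    return uri_bytes / total


# Hypothetical sample data; the URI is a placeholder, not a registered type.
sample = (
    "0 @I1@ INDI\n"
    "1 EXID 1234\n"
    "2 TYPE https://example.com/id\n"
    "1 NAME John /Doe/\n"
    "2 TYPE BIRTH\n"
    "0 TRLR\n"
)
print(f"{exid_type_share(sample):.1%} of the bytes are EXID.TYPE payloads")
```

Note that the TYPE under NAME is not counted, since only a TYPE directly beneath an EXID is treated as a type URI.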
While prefix notation is familiar outside of GEDCOM, I worry about things like collisions (e.g. if someone defines the prefix `https`), empty prefixes (will all applications treat `2 TYPE ex:` the same way?), processing order (can the second `PREFIX` structure use a prefix defined in the first?) and scoping (where does prefix expansion happen; for example, does it apply inside the payload of TAG, FILE, or NOTE?). I'm sure we can fix all of these, but feel option 1 is simpler to define and less error-prone than option 2.
Why not have the EXID reference a Source? Then additional information about the source (and perhaps a method for referencing the EXID via an API) could be detailed in the Source definition... or in a Citation for that Source.
This would be MUCH better than repeating a URI after every EXID. It would only require updating a single record if the URI suffers linkrot.
Let's take the example in the migration guide:
This is 11 bytes (including a newline). The migration guide says to convert to
which is 57 bytes, over 5x larger than using RIN. The same size increase applies to, say, FamilySearch Source Description IDs compared to using an extension tag for it.
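The same kind of comparison can be reproduced for any identifier and type URI. The sketch below is only illustrative: the identifier `A1B2C3` and the URI `https://example.com/ids/person` are made-up placeholders rather than the migration guide's example, so the byte counts differ from the 11 and 57 quoted above. The helper name `gedcom_bytes` is also hypothetical, and LF line endings are assumed.

```python
def gedcom_bytes(lines: list[str], newline: str = "\n") -> int:
    """Count the UTF-8 bytes of the given GEDCOM lines, each with its line ending."""
    return sum(len((line + newline).encode("utf-8")) for line in lines)


# GEDCOM 5.5.1 style: a dedicated tag carries the identifier.
old_style = ["1 RIN A1B2C3"]

# GEDCOM 7 style: EXID plus a TYPE substructure holding a URI.
# The URI here is a placeholder chosen for illustration only.
new_style = [
    "1 EXID A1B2C3",
    "2 TYPE https://example.com/ids/person",
]

old_size = gedcom_bytes(old_style)
new_size = gedcom_bytes(new_style)
print(f"old: {old_size} bytes, new: {new_size} bytes, ratio: {new_size / old_size:.1f}x")
```

Because the TYPE line is repeated for every EXID, the ratio grows with the length of the URI and shrinks with the length of the identifier.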
GEDCOM file size can affect many things, including disk usage, import/export time, and the bandwidth needed to transfer files, and size limits may be hit, such as the maximum attachment size supported by some email systems. As such, FamilySearch GEDCOM 7 files may be much less performant or usable than GEDCOM 5.5.1 files when external IDs are ubiquitous in a file.
The GEDZIP format can alleviate some of this, at the expense of even longer import/export time, and the loss of human readability of attachments.
Should we do anything about this in a future version?
In the meantime, applications that make heavy use of external IDs may either use extension tags instead of EXIDs, or even just stick with GEDCOM 5.5.1 if they only use RIN/RFN/AFN.