FamilySearch / GEDCOM

EXIDs cause up to a 5x size increase #177

Open dthaler opened 2 years ago

dthaler commented 2 years ago

Let's take the example in the migration guide:

1 RIN 9876

This is 11 bytes (including a newline). The migration guide says to convert to

1 EXID 9876
2 TYPE https://gedcom.io/terms/v7/RIN#MySystem

which is 57 bytes, over 5x larger than using RIN. The same size increase applies to, say, FamilySearch Source Description IDs compared to using an extension tag for it.
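
The arithmetic can be checked with a short sketch. It assumes LF line endings and ASCII-only payloads, and the variable names are illustrative. Note that 57 is the size of the two EXID lines without their newlines; counting a newline per line, as the 11-byte RIN figure does, gives 59. Either way the ratio is above 5.

```python
# Byte sizes of the two encodings from the migration-guide example.
# Assumption: LF line endings, UTF-8 encoding.
old = "1 RIN 9876\n"
new = "1 EXID 9876\n2 TYPE https://gedcom.io/terms/v7/RIN#MySystem\n"

old_size = len(old.encode("utf-8"))
new_size = len(new.encode("utf-8"))
# Same structure with the newlines left out of the count.
new_size_no_newlines = len(new.replace("\n", "").encode("utf-8"))

ratio = new_size / old_size

print(old_size, new_size, new_size_no_newlines)
print(f"{ratio:.2f}x")
```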

GEDCOM file size can affect many things, including disk usage, import/export time, and the bandwidth needed to transfer files. Files may also hit size limits, such as the maximum attachment size supported by some email systems. As such, FamilySearch GEDCOM 7 files may be much less performant or usable than GEDCOM 5.5.1 files when external IDs are ubiquitous in a file.

The GEDZIP format can alleviate some of this, at the expense of even longer import/export time, and the loss of human readability of attachments.

Should we do anything about this in a future version?

In the meantime, applications that make heavy use of external IDs may either use extension tags instead of EXIDs, or even just stick with GEDCOM 5.5.1 if they only use RIN/RFN/AFN.

tychonievich commented 2 years ago

Discussed in steering committee meeting. Several anecdotes of file sizes getting larger were shared, in some cases causing inconvenience to programmers and users. We discussed several possible solutions (URI prefixes, extension tags, etc.). However, we did not deem fixing it to be a high priority.

tychonievich commented 1 year ago

I've recently been working on converting GEDCOM-X to GEDCOM 7, and in the process have found that over 25% of the file size is the URIs of EXID.TYPEs. Thinking about how this could be simpler, I propose we fix this for 7.1.

Option 1: use tags
We could introduce a new datatype, `g7:type-TagURI`

```abnf
TagURI = extTag / URI
```

where `URI` is defined in [STD 66](https://www.rfc-editor.org/info/std66) as an absolute URI (notably always including a colon, and thus never mistaken for a tag). We limit the `extTag` to ones with exactly one tag definition in the schema structure, and define them to expand to the corresponding URI. We then use this datatype as the value type of `g7:EXID-TYPE`.
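
A minimal sketch of how a reader might resolve such a payload, assuming the schema's tag definitions have already been collected into a dictionary. The helper `expand_tag_uri`, the tag `_MYSYS`, and the `example.com` URI are hypothetical and not part of any specification.

```python
from typing import Dict, List


def expand_tag_uri(value: str, schema_tags: Dict[str, List[str]]) -> str:
    """Resolve a TagURI payload to a URI (hypothetical helper).

    schema_tags maps each extension tag to every URI defined for it
    in the header's schema structure.
    """
    # An absolute URI always contains a colon; an extension tag never does.
    if ":" in value:
        return value
    uris = schema_tags.get(value, [])
    # Only tags with exactly one definition are allowed as shorthand.
    if len(uris) != 1:
        raise ValueError(f"tag {value!r} has {len(uris)} definitions, expected 1")
    return uris[0]


# Assumed schema: one unambiguous tag, one tag defined twice.
schema = {
    "_MYSYS": ["https://example.com/ids/mysystem"],
    "_DUP": ["https://example.com/a", "https://example.com/b"],
}

print(expand_tag_uri("_MYSYS", schema))
print(expand_tag_uri("https://gedcom.io/terms/v7/RIN", schema))
```

With this approach `2 TYPE _MYSYS` would stand in for the full URI, while existing files that spell out the URI keep working unchanged.
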
Option 2: prefix notation
We could add prefix notation, such as is done in [turtle](https://www.w3.org/TR/turtle/#prefixed-name) or [SPARQL](https://www.w3.org/TR/sparql11-query/#prefNames). In particular, we'd add a new structure `HEAD.SCHMA.PREFIX` with a payload mirroring that of `g7:TAG`: roughly `1*[a-z0-9] URI`, where the first token is the prefix and the rest is what it expands to.

While prefix notation is familiar outside of GEDCOM, I worry about things like collisions (e.g. if someone defines the prefix `https`), empty prefixes (will all applications treat `2 TYPE ex:` the same way?), processing order (can the second `PREFIX` structure use a prefix defined in the first?) and scoping (where does prefix expansion happen; for example, does it apply inside the payload of `TAG`, `FILE`, or `NOTE`?). I'm sure we can fix all of these, but feel option 1 is simpler to define and less error-prone than option 2.
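
To make the collision concern concrete, here is a rough sketch of a naive prefix expander. The payload grammar follows the `1*[a-z0-9] URI` shape described above; the function names and the `example.com` / `example.org` URIs are assumptions for illustration only.

```python
import re
from typing import Dict, Tuple

# Assumed payload shape: lowercase alphanumeric prefix, one space, then a URI.
PREFIX_PAYLOAD = re.compile(r"^([a-z0-9]+) (\S+)$")


def parse_prefix(payload: str) -> Tuple[str, str]:
    """Split a hypothetical PREFIX payload into (prefix, expansion)."""
    match = PREFIX_PAYLOAD.match(payload)
    if match is None:
        raise ValueError(f"malformed PREFIX payload: {payload!r}")
    return match.group(1), match.group(2)


def expand(value: str, prefixes: Dict[str, str]) -> str:
    """Naively expand 'prefix:rest' when the prefix is defined."""
    prefix, sep, rest = value.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + rest
    return value


prefixes = dict([parse_prefix("ex https://example.com/ids/")])
print(expand("ex:mysystem", prefixes))

# Collision: once "https" is defined as a prefix, an ordinary absolute
# URI is silently rewritten by the same rule.
prefixes.update([parse_prefix("https https://example.org/other/")])
print(expand("https://gedcom.io/terms/v7/RIN", prefixes))
```

The second call no longer returns the URI it was given, which is the kind of ambiguity a real definition would have to rule out, for example by forbidding prefixes that match registered URI schemes.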

emyoulation commented 1 year ago

Why not have the EXID reference a Source? Then additional information about the source (and perhaps a method for referencing the EXID via an API) could be detailed in the Source definition... or in a Citation for that Source.

This would be MUCH better than repeating a URI after every EXID. It would only require updating a single record if the URI suffers linkrot.