FamilySearch / GEDCOM

Apache License 2.0

Does using URIs make GEDCOM less archival? #69

Open tychonievich opened 2 years ago

tychonievich commented 2 years ago

(originally posted by @emyoulation on #41; copied verbatim to its own issue)

I just ran into something that might need to be an archival consideration for posterity.

When trying to transfer some old XML files to a modern package, I discovered that some DTD and schema URIs inside the file were bad: they had suffered linkrot or had been clobbered. As a result, the reader was not able to parse some of the chunks. (In this case, the chunks were metadata for the Picasa face recognition feature for photos, retired in 2016 and deprecated in 2019.)

Should some thought be given to specifying how to handle this so that our files maintain long-term integrity?

Maybe have a fallback specification for linkrot-affected external references, and some sort of override for clobbered external references?

Superficially, this includes things in the specification like the "http://www.apache.org/licenses/LICENSE-2.0" external reference. In 50 years, will the Apache domain exist? Will we still be using nameservers that can resolve it?

Perhaps there should be an internally redundant Archival GEDCOM format that archives this specification with the data file?

Reference v7.0.2 page 6 of 96: section "URIs and Prefix Notation"

What happens if the following domain is hacked and the schema is clobbered with mangled structure that RickRolls XML reader applications to a malicious schema? http://www.w3.org/2001/XMLSchema#

Originally posted by @emyoulation in https://github.com/FamilySearch/GEDCOM/issues/41#issuecomment-940066677

tychonievich commented 2 years ago

DTDs are used for discovery: that is, a parser is supposed to visit the URL they contain to obtain additional parsing information. v7.0 contains no discovery content. We have discussed adding it in a future release as an extension to the HEAD.SCHMA, but doing so is not currently on our short list of features to add.

v7.0 does include URIs inside datasets. However, they are used as identifiers, not as links: whether there is a useful page served when you make an HTTP request to a given URI or not, the identifier continues to serve as an identifier. v5.5.1 and earlier achieved a similar goal with an identifier registration system which fell into disuse. URIs have the advantage that they provide high (but not complete) confidence of uniqueness without the need for a specialized registration authority.
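To illustrate the identifier-versus-link distinction, here is a minimal sketch, assuming extension tags are declared on `HEAD.SCHMA.TAG` lines whose payload is a tag followed by a URI. The function name `read_schema_tags` and the sample data are hypothetical, and the line parsing is deliberately simplified. The point is that the URI is only ever compared as a string: nothing here makes a network request, so the comparison keeps working even if the domain disappears.

```python
# Hypothetical sample dataset; the _SKYPEID declaration is illustrative only.
SAMPLE = "\n".join([
    "0 HEAD",
    "1 GEDC",
    "2 VERS 7.0",
    "1 SCHMA",
    "2 TAG _SKYPEID http://xmlns.com/foaf/0.1/skypeID",
    "0 @I1@ INDI",
    "1 _SKYPEID example.person",
    "0 TRLR",
])


def read_schema_tags(gedcom_text):
    """Map each extension tag to its URI, reading only HEAD.SCHMA.TAG lines.

    Simplified parsing: assumes one structure per line and single spaces
    between level, tag, and payload.
    """
    tags = {}
    in_head = False
    in_schma = False
    for line in gedcom_text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue
        level, tag = parts[0], parts[1]
        payload = parts[2] if len(parts) > 2 else ""
        if level == "0":
            in_head = (tag == "HEAD")
            in_schma = False
        elif level == "1":
            in_schma = in_head and tag == "SCHMA"
        elif level == "2" and in_schma and tag == "TAG":
            pieces = payload.split(" ", 1)
            if len(pieces) == 2:
                tags[pieces[0]] = pieces[1]
    return tags


def same_concept(uri_a, uri_b):
    """Treat URIs as opaque identifiers: equality is plain string equality."""
    return uri_a == uri_b


tags = read_schema_tags(SAMPLE)
# No HTTP request is made; the URI identifies the concept whether or not
# anything is served at that address.
matches = same_concept(tags["_SKYPEID"], "http://xmlns.com/foaf/0.1/skypeID")
```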

The specification does reference many external documents. These are generally large, widely-used documents owned by other standards bodies. Referencing them instead of including them helps acknowledge the parties who developed them; delegates the responsibility for correcting errata to those parties; provides a cognitive shortcut for developers who are familiar with them; and helps us focus on the genealogical content we are primarily responsible for.

You are probably correct that a time will come when accessing all of these documents is challenging. Gathering archival copies of file format definitions is more complicated than it sounds, and is something archivists do engage in (the Library of Congress hosts one of the largest public collections of third-party file formats that I know of). It is not a process I personally feel qualified to attempt.

dthaler commented 2 years ago

I believe the Wayback Machine solves the obsolete embedded links problem by showing the original text while modifying the embedded link to point to a backup location.

Similarly, IETF RFCs are immutable and may have obsolete links, but RFC 7990 explains how the RFC Editor can solve this for new RFCs. The immutable format for new RFCs is XML. Automated tooling generates presentation formats like HTML, PDF/A (a great format for archival), etc. The tooling can change over time, so the presentation format might include a link to errata, or rewrite links the way the Wayback Machine does, etc.

So while we can't change the text of older GEDCOM versions, we could create presentation formats like those above that augment it with clickable links, margin notes, or whatever, as long as it's clear which part is the immutable format and which is the presentation.

For example, https://www.rfc-editor.org/info/rfc7990 discusses the metadata, such as how to cite the doc from other docs, and links to the actual document; https://www.rfc-editor.org/rfc/rfc7990.html contains the immutable content plus clickable inline links, plus a non-immutable header in gray.
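The separation between immutable text and a regenerated presentation could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names are hypothetical, the URL regex is simplistic, and the `https://web.archive.org/web/<timestamp>/<url>` layout reflects how Wayback Machine snapshot links are commonly written rather than anything the spec or the RFC tooling prescribes. The source string is never modified; the annotated rendering is produced fresh each time and can change as tooling changes.

```python
import re

# Simplistic URL matcher, good enough for a sketch.
URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")

# Assumed snapshot URL layout; the timestamp used below is hypothetical.
WAYBACK_PREFIX = "https://web.archive.org/web"


def archived_url(url, timestamp):
    """Build a snapshot-style URL for the given original URL."""
    return f"{WAYBACK_PREFIX}/{timestamp}/{url}"


def render_with_fallbacks(immutable_text, timestamp):
    """Return a presentation copy that adds an archive link after each URL.

    The original text is left untouched; only the returned rendering
    carries the extra, non-immutable annotations.
    """
    def add_note(match):
        raw = match.group(0)
        # Keep sentence punctuation out of the URL itself.
        url = raw.rstrip(".,;:)")
        trailing = raw[len(url):]
        note = f" [archived copy: {archived_url(url, timestamp)}]"
        return f"{url}{note}{trailing}"

    return URL_PATTERN.sub(add_note, immutable_text)


source = "See http://www.apache.org/licenses/LICENSE-2.0."
rendered = render_with_fallbacks(source, "20211012000000")
```

Keeping the annotation in square brackets after the original URL, instead of replacing it, is one way to make it obvious which part is the immutable text and which is presentation.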

dthaler commented 2 years ago

@tychonievich I would still like the steering committee to discuss this

tychonievich commented 2 years ago

Discussed 2021-10-12. While we do not see our current external links as needing to change at present, we may want to revisit the archival linkability of our own spec.