extension tags in YAML format

dthaler commented 10 months ago

Currently https://gedcom.io/terms/format has:

Key extension tags

Type seq of extTag

Required by *

Allowed by types calendar, enumeration, month, structure

* Required instead of allowed if no standard tag is provided

A list, with the most-preferred tag first, of extension tags known to be used by applications for this concept.

Standard structures may have an extension tags entry to list fully compatible extensions that predated the standard and can be converted to the standard tag without any other modification. For example, 7.0's UID structure is fully compatible with the common 5.5.1 extension identified by tag _UID.

However, extTag is ambiguous. That is, two separate applications might use the same extTag with very different meanings, even under the same superstructure. As such, simply listing the extTag under extension tags can cause tools that consume the YAML to do the wrong thing with GEDCOM files. A URI on the other hand would be unambiguous. So would the combination of HEAD.SOUR payload plus extTag.

I claim that the new subsumes key can be used to more accurately represent the intent of the extension tags key, in an unambiguous way. Now that we have subsumes, I believe extension tags provides no real value and I would propose replacing standard tag and extension tags with just tag (which could be a standard tag or an extTag) and the existing subsumes.

To resolve ambiguity of different applications using the same extTag, a URI is required for use with subsumes, even for undocumented extension tags. A proposal to construct such a URI is:

1. If the tag is a documented extension tag, use the URI provided in the SCHMA.
2. Else, if the HEAD.SOUR payload is itself a URI, as suggested by https://gedcom.io/specifications/FamilySearchGEDCOMv7.html#HEAD-SOUR, construct the extension URI as: HEAD.SOUR payload / extTag
3. Else, construct the extension URI as, say: https://gedcom.io/terms/ext/ HEAD.SOUR payload / extTag
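The three rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the `schema` mapping (documented extension tag to URI), the http(s) check used to decide whether a HEAD.SOUR payload "is a URI", and the percent-encoding of the payload in the fallback case are not part of the proposal.

```python
from urllib.parse import quote, urlparse

# Fallback prefix suggested ("say") in the third rule of the proposal.
FALLBACK_PREFIX = "https://gedcom.io/terms/ext/"


def looks_like_uri(payload: str) -> bool:
    """Assumption: a payload with an http(s) scheme and a host counts as a URI."""
    parts = urlparse(payload)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def extension_uri(ext_tag: str, head_sour: str, schema: dict) -> str:
    """Derive a URI for ext_tag following the three proposed rules.

    schema is a hypothetical mapping of documented extension tags to URIs,
    as read from the file's SCHMA.
    """
    # Rule 1: documented extension tag, so use the URI from the SCHMA.
    if ext_tag in schema:
        return schema[ext_tag]
    # Rule 2: the HEAD.SOUR payload is itself a URI.
    if looks_like_uri(head_sour):
        return head_sour.rstrip("/") + "/" + ext_tag
    # Rule 3: otherwise build the URI under the fallback prefix.
    # Assumption: percent-encode the payload so it is safe as a path segment.
    return FALLBACK_PREFIX + quote(head_sour, safe="") + "/" + ext_tag
```

With this sketch, two files that both use an undocumented `_TODO` but carry different HEAD.SOUR payloads (say, the made-up values `AppA` and `AppB`) would yield two different URIs, which is the disambiguation that subsumes would rely on.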

dthaler commented 10 months ago

@tychonievich comments?

tychonievich commented 10 months ago

I don't understand what problem you are trying to solve nor what problem you are finding with the current system. Maybe you are trying to create YAML files to help parse undocumented extensions? But I don't see how this proposed change actually helps solve the hard problem there, i.e. that the same extTag is used to mean different things by different applications.

I added extension tags to the YAML with the thought that it would serve as a hint when picking tags for URI-identified structures during serialization, hopefully (a) increasing the human readability of the resulting files and (b) increasing the chance that incomplete implementations (ones that don't parse the schema) might treat the data correctly. I'm fairly confident that those are not the use cases you are referring to in this issue. I also realize that those purposes are not mentioned in the format definition and probably should be.
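As a rough illustration of the serialization hint described above: a serializer holding a URI-identified structure could walk that structure's `extension tags` list in preference order and use the first tag not already bound to a different URI in the output file. The function name `pick_tag`, the `in_use` mapping, and the numeric-suffix fallback are assumptions made for this sketch, not part of the format definition.

```python
def pick_tag(uri: str, extension_tags: list, in_use: dict) -> str:
    """Pick an extension tag for uri and record the choice in in_use (tag -> URI)."""
    # Prefer the hinted tags, most-preferred first.
    for tag in extension_tags:
        if in_use.get(tag, uri) == uri:
            in_use[tag] = uri
            return tag
    # Every hinted tag is already bound to another URI in this file.
    # Assumption: fall back to the most-preferred hint plus a numeric suffix,
    # or to a made-up "_EXT" base when there are no hints at all.
    base = extension_tags[0] if extension_tags else "_EXT"
    n = 1
    while in_use.get(f"{base}{n}", uri) != uri:
        n += 1
    tag = f"{base}{n}"
    in_use[tag] = uri
    return tag
```

The tag chosen this way is only a presentation choice; the mapping from tag to URI would still be stated in the output file's SCHMA.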

dthaler commented 10 months ago

> I don't understand what problem you are trying to solve nor what problem you are finding with the current system. Maybe you are trying to create YAML files to help parse undocumented extensions?

Yes, that's one.

> But I don't see how this proposed change actually helps solve the hard problem there, i.e. that the same extTag is used to mean different things by different applications.

It solves it by using subsumes with application-specific URIs that can be derived from existing GEDCOM files without changes.

> I added extension tags to the YAML with the thought that it would serve as a hint when picking tags for URI-identified structures during serialization,

I don't follow how extension tags can be used for anything useful during serialization. If there's only one, then it provides no value over just using tag as in my suggestion. If there's more than one, I don't follow how one would choose what to use, other than always choosing the first.

> hopefully (a) increasing the human readability of the resulting files and (b) increasing the chance that incomplete implementations (ones that don't parse the schema) might treat the data correctly. I'm fairly confident that those are not the use cases you are referring to in this issue. I also realize that those purposes are not mentioned in the format definition and probably should be.

I can neither agree nor disagree, since I don't yet understand the use you suggest.

tychonievich commented 10 months ago

After some reflection and playing with some examples,

1. I agree that having a single tag and using subsumes could cover most use cases I had in mind when adding extension tags; for example _AKA and _AKAN could be given separate URIs and marked as subsuming one another instead of being stored in a single YAML.
2. A unified tag can't provide backup tags when merging files that use extension structures that conflict, like _TODO (see GEDCOM-L's list for how RootsMagic and WebTrees use that tag for different structures in the same context). That said, I'm not sure how valuable that use case is; picking a random tag or appending something to the expected tag might be enough.
3. When I added extension tags I had in mind possibly supporting some of the illegal standard tags for extension structures that various applications have added, like The Master Genealogist's ENMPL. But I never explored that further, and haven't thought through the pros and cons of allowing that in any detail, in part because I decided not to pursue YAML files for undocumented extensions. However, as you now are proposing those YAML files I'm thinking about that again and note that a separate key would be needed for this use case. Whether it should be extension tags or a new illegal standard tags or the like I don't have a strong opinion about.
4. A unified tag might accidentally imply that applications can assume that an undocumented extension tag translates to the given URI. I worry it might encourage applications to omit the schema because they think the tag is enough.
5. Unifying tag is conflating two semantics. A standard tag must identify a unique URI in its context, always means that URI in that context, and should be used in serialization. An entry in extension tags is just a presentation hint, can be changed to avoid tag name collisions, and is insufficient to determine URI by itself.

Currently we document those differences in the key of the YAML entry: standard tag has one meaning, extension tags the other. If we merge them we'd need to document those differences in terms of the form of the value of that key: if it starts with _ it has one meaning, if it doesn't it has a different meaning. I'd rather keep the semantic complexity in the keys, simplifying code and matching file structure to semantic intent, but it's a matter of choice.
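To make that trade-off concrete, here is a small sketch of how a consumer of the YAML might read either layout. The `TagInfo` structure, the function names, and the dictionary shapes are hypothetical; only the key names `standard tag`, `extension tags`, and the proposed `tag` come from this discussion.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TagInfo:
    # Authoritative tag: identifies the URI within its context.
    standard: Optional[str] = None
    # Presentation hints only: not enough to determine the URI.
    hints: List[str] = field(default_factory=list)


def read_current(entry: dict) -> TagInfo:
    """Current layout: the key name carries the semantics."""
    return TagInfo(
        standard=entry.get("standard tag"),
        hints=list(entry.get("extension tags", [])),
    )


def read_unified(entry: dict) -> TagInfo:
    """Unified layout: one `tag` key; the leading underscore carries the semantics."""
    tag = entry["tag"]
    if tag.startswith("_"):
        return TagInfo(hints=[tag])
    return TagInfo(standard=tag)
```

Both readers recover the same distinction; the difference is whether it is encoded in the key or inferred from the form of the value.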
Am I right in thinking that the proposed change does not enable any new functionality, only change how some situations are presented?

FamilySearch / GEDCOM.io

extension tags in YAML format #114