FamilySearch / GEDCOM

Apache License 2.0

Extension registry #204

Closed tychonievich closed 9 months ago

tychonievich commented 2 years ago

Now that we have a documented YAML format (https://gedcom.io/terms/format) I think it's time to revisit the idea of an extension registry.

Proposal: we allow community submissions of YAML files for extensions to a common repository where they may be easily located.

In addition to general interoperability gains, this might assist in an extension-to-standard workflow (#17) and in defining additional calendars (#38, #116) and events (#117).

Various questions I think we should answer before creating such a registry:

  1. Should it be part of this repo, the gedcom.io repo, or a different repo?

    Note that git hooks and GitHub Actions mean we can make this decision independently of whether and where the registry has a web presence.

  2. How should submissions to the registry be managed? Options include

    1. Anyone can submit a pull request to the registry; a team (the steering committee or a different team) decides which to merge
    2. Tool authors first verify their ownership of a URI namespace; they can then submit definitions in that namespace only
    3. An open web form allows anyone to submit definitions; if they comply with the formatting rules they are accepted automatically
  3. How should files be named?

    Existing YAML files all have the same prefix so their filenames are easy to define. But that won't be true for extensions. Some options include

    1. \<tag>.yaml
    2. \<HEAD.SOUR-tag>.yaml
    3. incremental number: first submitted is 1.yaml, next is 2.yaml, and so on
    4. submitter chosen
    5. registry maintainer chosen
    6. file name based on URI (replacing /, ?, and # with other characters)
    7. directory tree based on URI (skipping scheme and replacing ? and # with other characters)
    8. just one file with a list of all registry documents inside it
  4. What derivative files should be produced?

    1. Convert YAML to JSON, GEDC, XML
    2. TSV files like substructures.tsv and the others extracted from the standard
    3. Lists of all known URIs to use a given tag
    4. Lists of all known extensions to be produced by a given product
  5. Should the standardized concepts be included in the registry with the extension concepts?

  6. Should we create URIs in the extension tag registry namespace for extensions registered without a creator-defined URI?
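
To make naming options 6 and 7 from question 3 concrete, here is a minimal Python sketch of the two URI-based mappings. The function names and the `_` replacement character are assumptions for illustration only; nothing here is a decided registry convention. It replaces only the characters named above, so a scheme colon survives in option 6 and would still need handling on file systems that reject it.

```python
from urllib.parse import urlsplit


def uri_to_filename(uri: str, replacement: str = "_") -> str:
    """Option 6 (sketch): flatten the whole URI into a single file name."""
    name = uri
    for ch in "/?#":
        name = name.replace(ch, replacement)
    return name + ".yaml"


def uri_to_path(uri: str, replacement: str = "_") -> str:
    """Option 7 (sketch): drop the scheme and keep '/' as a directory separator."""
    parts = urlsplit(uri)
    rest = parts.netloc + parts.path
    if parts.query:
        rest += "?" + parts.query
    if parts.fragment:
        rest += "#" + parts.fragment
    for ch in "?#":
        rest = rest.replace(ch, replacement)
    return rest + ".yaml"


print(uri_to_filename("https://example.com/XYZ"))
print(uri_to_path("https://example.com/ext?x=1#frag"))
```

Either mapping can send two different URIs to the same name (for example, one containing `?` and another containing `_` in the same position), so a real registry would need an escaping rule or a collision check.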

dthaler commented 2 years ago

My initial thoughts...

  1. Should it be part of this repo, the gedcom.io repo, or a different repo?

No strong opinion here, but if some extensions might also warrant an addition to https://github.com/FamilySearch/GEDCOM/blob/main/version-detection/version-detection.md then having them be in the same repo would make PRs easier to review.

  2. How should submissions to the registry be managed? Options include

    1. Anyone can submit a pull request to the registry; a team (the steering committee or a different team) decides which to merge

Yes

    2. Tool authors first verify their ownership of a URI namespace; they can then submit definitions in that namespace only

I would like to see a process for 3rd party submissions, especially if there are known things used by popular apps that no longer have an active owner. As one comparable, a URI scheme can be registered by a third-party (see process in https://www.rfc-editor.org/rfc/rfc7595) and there is a process to later claim ownership.

    3. An open web form allows anyone to submit definitions; if they comply with the formatting rules they are accepted automatically

A web form sounds like more work, to create, maintain, update, so I'd just start with github PRs for now.

  3. How should files be named? Existing YAML files all have the same prefix so their filenames are easy to define. But that won't be true for extensions. Some options include

    1. \<tag>.yaml
    2. \<HEAD.SOUR-tag>.yaml

Above sounds good to me.

    3. incremental number: first submitted is 1.yaml, next is 2.yaml, and so on
    4. submitter chosen
    5. registry maintainer chosen
    6. file name based on URI (replacing /, ?, and # with other characters)

I'd like to allow legacy (5.5.1) extensions in the registry, and those won't have URIs per se.

    7. directory tree based on URI (skipping scheme and replacing ? and # with other characters)
    8. just one file with a list of all registry documents inside it
  4. What derivative files should be produced?

    1. Convert YAML to JSON, GEDC, XML
    2. TSV files like substructures.tsv and the others extracted from the standard
    3. Lists of all known URIs to use a given tag
    4. Lists of all known extensions to be produced by a given product

  5. Should the standardized concepts be included in the registry with the extension concepts?

Yes, I'd put them in the same registry.

  6. Should we create URIs in the extension tag registry namespace for extensions registered without a creator-defined URI?

Offhand, I might say no, but it's ok if we find a good argument to do so.

Norwegian-Sardines commented 2 years ago

I realize the question I'm asking is not really part of this issue thread but it has been bothering me since v7.0 was introduced.

Question: How would the Extension Registry actually work when used by a specific application?

For example: The genealogy application receives a v7.0 GEDCOM from the wild, i.e. a friend sends me a GEDCOM and wants me to help them research part of a branch. My program imports the GEDCOM and comes across an "Extension Tag" that it does not understand or know how to use within its database. What happens?

How is this Extension Registry valuable to the import process?

tychonievich commented 2 years ago

How is this Extension Registry valuable to the import process?

I think of it primarily as an aid for developers. If a developer notices that a lot of files are being submitted with undocumented extension tag _XYZ or documented extension https://example.com/XYZ, the developer can check the registry to see what's known about that extension, including tips on where their code and/or database schema will have to change (via the superstructures field), how to parse the payload, descriptions of the extension's purpose and meaning, and suggested user interface text.

I expect it may also be used in some automated validators and compatibility checkers. I could imagine creating a tool that uses it directly to populate a dynamic user interface and automatically handle any extension in the registry, but I doubt that will be very common. Some tools also list tags they failed to import to the user, who could presumably use the registry to figure out what they all meant and thus exactly what data was lost, but I don't expect most users will do that.
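
As a rough illustration of the developer and validator uses described above, the following Python sketch builds a tag-to-URI index from registry entries and uses it to explain tags an importer skipped. The entries, their keys (`tags`, `superstructures`, `payload`, `description`), and the function names are all hypothetical simplifications; a real tool would load the registry's YAML files and follow the documented field names instead.

```python
from collections import defaultdict

# Hypothetical registry entries; keys are simplified stand-ins, not the
# registry's actual YAML schema.
REGISTRY = [
    {
        "uri": "https://example.com/XYZ",
        "tags": ["_XYZ"],
        "superstructures": ["https://example.com/SomeRecord"],
        "payload": "string",
        "description": "Example extension documented by its creator.",
    },
    {
        "uri": "https://example.org/other/XYZ",
        "tags": ["_XYZ"],
        "superstructures": [],
        "payload": None,
        "description": "A different product's use of the same tag.",
    },
]


def build_tag_index(entries):
    """Map each extension tag to every registered URI known to use it."""
    index = defaultdict(list)
    for entry in entries:
        for tag in entry["tags"]:
            index[tag].append(entry["uri"])
    return dict(index)


def explain_unimported_tags(tags, index):
    """Return one human-readable line per tag that an importer skipped."""
    lines = []
    for tag in tags:
        uris = index.get(tag)
        if uris:
            lines.append(f"{tag}: registered as {', '.join(uris)}")
        else:
            lines.append(f"{tag}: not found in the registry")
    return lines


index = build_tag_index(REGISTRY)
for line in explain_unimported_tags(["_XYZ", "_ABC"], index):
    print(line)
```

Because the same tag can be registered under more than one URI, the index maps each tag to a list rather than a single definition; deciding which definition applies to a given file is left to the tool.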

Norwegian-Sardines commented 2 years ago

Thanks. Based on what I've seen of the amount of effort various genealogy programs put into updating a database, interface, and additional data, I suspect that the incorporation of new "Extensions" will be slow!

Personally, I would have hoped that a new record type would have been added to the GEDCOM payload, working like a "Data Dictionary", similar to what was introduced in GEDCOM v5.3 but as a separate structure. This could help the import reconcile the extension more quickly and/or give the user a specific message with a definition of the extension rather than a simple error.

tychonievich commented 2 years ago

Discussed in steering committee

dthaler commented 9 months ago

https://github.com/FamilySearch/GEDCOM-registries/pull/11 updated the GEDCOM-registries repository to copy the extracted files there automatically.

dthaler commented 9 months ago

This is now done