FamilySearch / GEDCOM

Apache License 2.0

Machine readable schema - request #418

Open augean opened 5 months ago

augean commented 5 months ago

XML and JSON schemas are usually machine-readable. I am requesting the same for GEDCOM 7.

Please see the attached files for an example of a machine-readable schema that I created for GEDCOM 5.1.1. This allows me to create new GEDCOM files easily, and is very easy for tools to interact with.

Please could we have a machine-readable GEDCOM 7 schema (please use the attached files as an example): a single file (or even multiple files) that allows us to easily parse the GEDCOM?
ged.5.1.1.txt PrimSection.txt structure

thanks

tychonievich commented 5 months ago

7.0 has a machine-readable schema.

The line syntax is defined in the spec using ABNF, which is automatically extracted as grammar.abnf. At least one of the public gedcom parsers uses this grammar to parse lines.

The structure hierarchy is defined in the spec using a machine-readable variant of the metasyntax created for 5.0, which is automatically extracted as grammar.gedstruct. The structure hierarchy is also converted to a different machine-readable form as a set of YAML files hosted in several places including the URI of each structure type (e.g. https://gedcom.io/terms/v7/ABBR) and in a separate repository of both standard and extension structures (GEDCOM-registries). Multiple public gedcom parsers and development aids use one or both of these to parse and validate structure hierarchies.

These machine-parseable formats are not perfect (for example, we lack a machine-parseable way of marking something as deprecated) and we'd welcome suggestions on how to improve them. I did not look at your attached files closely enough to know if you have features the standard currently lacks.
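
As an illustration of the line syntax mentioned above, here is a minimal Python sketch of parsing one GEDCOM 7 line into level, optional cross-reference identifier, tag, and optional payload. The pattern fragments are a hand-written approximation of the line grammar, not the contents of grammar.abnf, and the payload is simplified to "anything up to the end of the line". The names `LINE` and `parse_line` are hypothetical; check any real parser against grammar.abnf itself.

```python
import re

# Hand-written fragments that approximate the line-level rules.
# Assumption: these mirror the spec's intent but are NOT extracted from grammar.abnf.
TAGCHAR = r"[A-Z0-9_]"
LEVEL = r"(?:0|[1-9][0-9]*)"              # no leading zeros
XREF = rf"@{TAGCHAR}+@"                   # cross-reference identifier, e.g. @I1@
TAG = rf"(?:[A-Z]{TAGCHAR}*|_{TAGCHAR}+)"  # standard tag or _EXTENSION tag

LINE = re.compile(
    rf"(?P<level>{LEVEL}) (?:(?P<xref>{XREF}) )?(?P<tag>{TAG})(?: (?P<value>[^\r\n]*))?"
)


def parse_line(text):
    """Return a dict describing one line, or None if it does not match."""
    m = LINE.fullmatch(text.rstrip("\r\n"))
    if m is None:
        return None
    return {
        "level": int(m["level"]),
        "xref": m["xref"],    # None when the line has no identifier
        "tag": m["tag"],
        "value": m["value"],  # None when the line has no payload
    }
```

For example, `parse_line("1 NAME John /Smith/")` yields level 1, tag `NAME`, and the payload as a string, while a line with a leading-zero level such as `"01 NAME x"` is rejected.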

dthaler commented 5 months ago

Discussion in GEDCOM Steering Committee 1/18/2024: We have a machine-readable schema. We have machine-readable positive test cases in the GEDCOM.io repository. We currently don't have machine-readable negative test cases, such as appear in PrimSection.txt. Would others find that useful to have somewhere?

dthaler commented 5 months ago

Closing since original question has been answered, and follow-up discussion can be done in https://github.com/FamilySearch/GEDCOM/discussions/422

augean commented 5 months ago

1) All the comments are stripped out of the machine-readable schema. The comments are VERY important to keep in; the schema is very difficult to use without them. (Please see ged.5.1.1.txt, where I maintained the comments in the machine-readable form.)

2) There is no machine-readable file with regular expressions and comments defining the primitive types. Please see my PrimSection.txt, where I have the primitive types, along with descriptions and regular expressions (and examples !!)

3) The spec is fragmented across too many different files, making it very complex to parse. (Please see attached, where I just used 2 files.)

Citing the above 3 reasons, I think the schema is not fully machine-readable:
- very important information like comments is left out of the machine-readable version
- the regular expressions, which are critical, are left out of any machine-readable version
- the spec is fragmented across too many files

Please review the attached ged.5.1.1.txt and PrimSection.txt, which show how the above issues could be fixed and allow us to have a fully machine-readable GEDCOM 7 spec.

augean commented 5 months ago

Also, please advise: is it possible to reopen the issue? I don't want to make a nuisance of myself, but I think the underlying issues are not resolved (see above). At present, issues are closed without any input from me, the person who originally logged the issue, and GitHub doesn't allow me to reopen. Thanks !!!

tychonievich commented 5 months ago
    • Why are comments important for machine-readability? Is the machine reading them? How? What's the use-case that makes this important?
    • The YAML files have the specifications included in machine-readable form.
    • The entire specification itself, in both markdown and HTML formats, is also machine-readable, with the character-level and structure-level metasyntaxes inside markdown fenced code blocks with languages abnf and gedstruct and HTML pre elements with class="sourceCode abnf" and class="sourceCode gedstruct", respectively.
    • The character-level grammar, including that of the datatypes we define, is in grammar.abnf. Several datatypes are not readily regex-ready (you yourself define a non-regex metasyntax "swapex"); we chose the industry-standard context-free grammar notation ABNF instead.
    • A few 7.0 additions (Media Type and Language) are defined in external specifications which we do not replicate to avoid the possibility of going out of sync with those standards. We also assume that any application that cares what format these have is also consulting those external standards anyway to understand their meaning.
    • All machine-readable parts are in two files: grammar.gedstruct and grammar.abnf.
    • If you want machine-readable copies of the human-targeted text and structure information in one file, you can get that by running cat extracted_files/tags/* > all.yaml.
    • If you want machine-readable copies of the entire spec in one file, you can get that by running cat specification/gedcom-*md > specification.md; character-level syntax is delimited by blocks that start "```abnf" and end "```" and structure-level metasyntax is delimited by blocks that start "```gedstruct" and end "```"

We closed the issue because everything you asked for (machine-readability) is already provided. I still believe that's the case, but you've asked for more things (regular expressions and comments) so I'll re-open it for now to see if further conversation prompts identifying an issue that we should resolve.
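
The fenced-block convention described in the list above can be sketched in Python as follows. This is a minimal illustration, assuming the opening fence is exactly three backticks followed by the language name and the closing fence is three backticks on their own line; the function name `extract_fenced` is hypothetical and is not part of the GEDCOM repository's tooling.

```python
FENCE = "`" * 3  # three backticks, built this way to keep this example readable


def extract_fenced(markdown, lang):
    """Return the bodies of all fenced code blocks tagged with `lang`."""
    blocks = []
    current = None  # list of lines while inside a matching block, else None
    for line in markdown.splitlines():
        if current is None:
            if line.strip() == FENCE + lang:
                current = []
        elif line.strip() == FENCE:
            blocks.append("\n".join(current))
            current = None
        else:
            current.append(line)
    return blocks
```

Called as `extract_fenced(text, "abnf")` or `extract_fenced(text, "gedstruct")` on the contents of the concatenated specification.md, this would return the character-level and structure-level blocks respectively, under the stated assumptions about fence formatting.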

augean commented 5 months ago

Thanks for the feedback, I will take a further look. But comments are very important, as they are used in genealogy tools, which are built off machine-readable schemas. I just think that we should maintain the comments in the machine-readable version.

for example: in the Augean tool, I use comments extensively when editing GEDCOM

[Image: comments]

augean commented 5 months ago

The YAML files work fine, thanks. I was able to parse all the YAML files, so please ignore my comment about too many files.

so, two issues would be

dthaler commented 5 months ago

Discussion 1/25/2024: We believe there are three separate issues worth discussing/pursuing here:

  1. The discussion above should be linked, e.g., on https://gedcom.io/tools/ since we're not sure the grammar.gedstruct or grammar.abnf files are easily discoverable
  2. Machine readable data type information should be discoverable for data types that aren't GEDCOM specific (e.g., IANA lang tags and media types). There may be machine readable mechanisms available on IANA that we can point to.
  3. It would be helpful to add user-facing descriptions to the YAML files; today they have none because none is in the GEDCOM spec and the YAML files are all derived from extracted information.

We don't think we need "regular expressions" per se because they can be derived from ABNF and because there are multiple different regex syntaxes used by various tools and libraries, so even if we picked one style, others would have to convert them anyway.

Please let us know if we are missing anything or if you have other feedback.

augean commented 5 months ago

User descriptions in the YAML files will help a lot - thanks !!! Regular expressions in each YAML file would be the icing on the cake, but are not essential. Listing the files that are supposed to be machine-readable will help, as I was originally confused by this - thanks !!!