need better examples and common web space for SCHMA.TAG definitions

FamilySearch / GEDCOM

Apache License 2.0

need better examples and common web space for SCHMA.TAG definitions #175

Open albertemmerich opened 2 years ago

albertemmerich commented 2 years ago

Gedcom-L is discussing SCHMA.TAG and has not reached a conclusion on how to use it. We understand that every extension tag needs its own description, and that this description must be linked by the SCHMA.TAG payload. The specification contains examples for SCHMA.TAG, but if you try to follow them as links, you get a 404 error. We are not sure what the definitions of extension tags should look like or where to find the citations of the spec, and we hope for some readable examples.

Moreover: If every application is creating its own definitions for extension tags, merging their files will be a horror. So it would help a lot if there were a common place on the web where extension tag definitions can be put and used by as many applications as possible. Only then can we be sure that two extension tags have the same structure, as they have the same link. We do not see an automatic tool to compare two definitions of extension tags - created independently by different applications - and decide whether they describe the same tag or not. So importing extension tags will result in a mess of renamed tags if we cannot use the same definitions for the same purposes...

Especially in cases where extension tag solutions are recommended until later GEDCOM specs have an implicit solution, the recommendations should be documented by SCHMA.TAG definitions.

Albert (Emmerich)

Norwegian-Sardines commented 2 years ago

"Moreover: If every application is creating its own definitions for extension tags, merging their files will be a horror. "

Yes I agree. The specification seems to imply that an extension is unique to the importing GEDCOM, but it does not take into account that the data in any given software program will most likely be created from more than one imported GEDCOM file, or that the program will have its own set of extensions that may (very likely) reuse the same extension tag with different meanings, locations, subtags and values/enumerations expected for each tag.

It is also possible for similar definitions of extensions to use different tag names and a merge program would have no way to understand written definitions particularly if those definitions were written in different languages or call specific data values by different terms based on language or regional variations or the underlying understanding of a specific usage.

tychonievich commented 2 years ago

I think there are four potentially-separable issues being discussed here.

1. FOAF is a bad example. That's true both because (a) it experienced link-rot since 7.0.0 was published and (b) it was never a GEDCOM extension at all, just a Semantic Web ontology. I agree that it would be nice to replace it with another example, preferably an actual 7.0 extension, but so far as I know, no 7.0 extensions have been published yet. To this end, I added an example extension to my personal repositories: https://tychonievich.github.io/gx-g7-names. Would something like this be a better example than FOAF?

2. It would be nice to have a single central repository of extensions as not all extension authors can be relied upon to maintain good documentation in the long term. I agree, such a repository would be nice. I think the exid-types.json example shows that we can usefully have a repository of material that is not in the specification itself. That said, the right medium for such a repository when it comes to extensions is not immediately obvious to me.

3. A misunderstanding about Tags vs URIs. Everything you say about disagreements and collisions between extension authors is true of 5.5.1 and earlier extensions, which are permitted in 7.0 for backwards compatibility as undocumented extensions; but 7.0's documented extensions are resistant to such collisions because it is URIs, not tags, that define their meaning. Merging files from programs that used conflicting extension tags can be handled by changing the extension tags without changing the URIs.

   I should note that the URIs in no way remove the possibility of a central repository. I could imagine, for example, a repository that consists of a three-column table: extension URI, link to extension author's documentation (if any, which might be the URI itself or another link), and link to repository's version of the documentation.

4. We have discussed cases where extensions are recommended until a later version of the spec. _LANG is in the spec itself; two other proposed extension recommendations were not implemented (#97 and #98). So far as I know those are the only such examples. All three examples (the one used and the two rejected) are relocated standard structures with example SCHMA entries provided. @albertemmerich, are you suggesting these recommended extensions need additional clarity?
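
The Tags vs URIs point above (renaming colliding tags while keeping their URIs) can be illustrated with a minimal sketch. It assumes each file's SCHMA has already been read into a tag-to-URI mapping; the function name `merge_schemas`, the numeric-suffix renaming scheme, and the example.com/example.org URIs are hypothetical and not taken from the spec.

```python
def merge_schemas(schema_a, schema_b):
    """Merge two tag -> URI maps coming from different files.

    Returns (merged, renames_b): the combined schema, and a map from each
    tag of file B to the tag that should replace it in the merged file.
    """
    merged = dict(schema_a)
    uri_to_tag = {uri: tag for tag, uri in schema_a.items()}
    renames_b = {}
    for tag, uri in schema_b.items():
        if uri in uri_to_tag:
            # Same URI means same meaning: reuse the tag already chosen.
            renames_b[tag] = uri_to_tag[uri]
            continue
        new_tag, suffix = tag, 1
        while new_tag in merged:
            # Same tag but a different URI: a collision, so pick a fresh tag.
            suffix += 1
            new_tag = f"{tag}{suffix}"
        merged[new_tag] = uri
        uri_to_tag[uri] = new_tag
        renames_b[tag] = new_tag
    return merged, renames_b


# Hypothetical schemas: both files use _IMPF, but with different meanings.
file_a = {"_IMPF": "https://example.com/ext/import-file"}
file_b = {
    "_IMPF": "https://example.org/ext/immigrant-profile",
    "_IF": "https://example.com/ext/import-file",
}
merged, renames_b = merge_schemas(file_a, file_b)
```

In this sketch the tag `_IMPF` from file B is renamed because its URI differs, while `_IF` is folded into file A's `_IMPF` because the URIs match. It does nothing for the case raised later in the thread, where two different URIs describe the same concept.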

albertemmerich commented 2 years ago

Re Luther's no. 3: GEDCOM 7 again has a problem with matching extension tags. Yes, they are defined by the URI. But it will happen that the same extension tag is defined by two authors using two different URIs. So it does not make sense to rename one of these tags, but to check whether the URIs define the same tag. If so, only use one of them. This problem will occur very often if we do not offer the common web space. We have examples in German applications: for 5.5.1 we published an ADDENDUM defining a lot of extension tags. Many of them are still valid in 7.0, as GEDCOM 7.0 does not offer a structured solution. Examples: _RUFNAME, _LOC, ... As the Gedcom-L ADDENDUM is not prepared for linking to a single extension tag (it is not fragmented), it cannot be used as SCHMA.TAG payload. So each author takes the definition of a tag, puts it on their own web space and links SCHMA.TAG to it. Here you are: the same extension tag by structure, name, payload requirements and substructures, but a different URI per application :-(

So if Gedcom-L creates a version 7 of its ADDENDUM for the tags, will all other programs willing to use these tags really find the source and use it, too? I think not: it might happen that they define the same tag again and build a URI of their own.

Gedcom-L authors feel the problem was not solved by GEDCOM 7; maybe it is even harder now, as it will be very difficult to decide whether two URIs describe the same tag...

to no 4.: we need a valid example for a tag which is not defined in GEDCOM 7. As _LOC is very complicated, and has a long definition with a lot of substructures, we should start with _RUFNAME.

albertemmerich commented 2 years ago

Let me give an example of a URI (URL) that was created before GEDCOM 7.0 was launched but is used with GEDCOM 7.0 in the same way. The extension tag is _IMPF. This tag is used to link the record to the IMPorted File, documenting the XREF of the record within that file, the author, and the date of import. For _IMPF we have a record <<_IMPF_RECORD>> and a calling <<_IMPF_STRUCTURE>>. You will find GEDCOM 7 files with this:

1 SCHMA
2 TAG _IMPF https://www.gen-do.de/Addendum/impf.html

If you follow that link, you find the definition written in German. If any application in the English speaking area wants to use this tag, do you think they will use the same URI (URL), or do you think they will have an own definition translated to English?
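
For readers who have not handled SCHMA.TAG in code yet, here is a minimal sketch of reading such header lines into a tag-to-URI table. The function name `read_schema` and the sample file are illustrative assumptions; the parser only handles space-separated lines of the form level, tag, payload, and is not a full GEDCOM 7 reader.

```python
def read_schema(lines):
    """Collect tag -> URI pairs from the HEAD.SCHMA.TAG lines of a file."""
    schema = {}
    in_head = False
    in_schma = False
    for raw in lines:
        parts = raw.strip().split(" ")
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or malformed lines
        level, tag = int(parts[0]), parts[1]
        if level == 0:
            in_head = tag == "HEAD"
            in_schma = False
        elif level == 1:
            in_schma = in_head and tag == "SCHMA"
        elif level == 2 and in_schma and tag == "TAG" and len(parts) >= 4:
            # The payload is "<extension tag> <URI>".
            schema[parts[2]] = " ".join(parts[3:])
    return schema


# Hypothetical sample file using the SCHMA entry quoted above.
sample = """0 HEAD
1 GEDC
2 VERS 7.0
1 SCHMA
2 TAG _IMPF https://www.gen-do.de/Addendum/impf.html
0 @I1@ INDI
1 NAME John /Doe/
0 TRLR""".splitlines()

schema = read_schema(sample)
```

The resulting table tells an importer which URI stands behind `_IMPF` in this particular file; whether the importer can do anything useful with that URI is the open question of this thread.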

Norwegian-Sardines commented 2 years ago

My first question would be:

If I were updating my application and decided to create an extension tag for “immigrant profile” and call it _IMPF, how the heck would I know to even look at the German site to see that the tag was already in use? Even if I were creating an extension for my application for “Import File” and wanted to call the tag _IMPFL or even _IMPF, I would still have no clue to even look for the tag elsewhere, or that I would want several additional subtags for _TYPE, _IMPDT and _IMPID!

dthaler commented 2 years ago

My take on the latest questions:

it will happen, that the same extension tag is defined by two authors using two different URIs. So it does not make sense to rename one of these tags, but to check whether the URIs define the same tag. If so, only use one of them. This problem will occur very often

Probably all true except for the "it does not make sense to rename one of these tags". Any program can rename them at any time on a per-file basis, and according to the current spec that makes sense as long as the URI remains constant. I agree with Albert that very often apps would just use one of them, even if the alternative does make sense. So when importing a file from another application, one really must be prepared to either rename extension tags (preserving the URI), or drop them.

If you follow that link, you find the definition written in German. If any application in the English speaking area wants to use this tag, do you think they will use the same URI (URL), or do you think they will have an own definition translated to English?

I think I would use the same URI if I meant the same thing. However if I couldn't understand the definition because it's written in a language I can't understand or find a good translator for, then I'd just define my own and not care about matching it. In that sense I would agree with this:

How the heck would I know to even look at the German site

My main takeaway from the spec is that when defining documented extensions according to the FamilySearch GEDCOM 7 spec, one should not assume that the documented extension tag must be unique or that one can look up a meaning for a given tag per se. Only the URI must be unique, and it currently does not require the ability to even find any documentation (whether in a language you know or not). It need not be a URL, and it need not have any way to find any actual documentation. Of course it is a lot more useful if it does, but it's not illegal if it doesn't.

dthaler commented 2 years ago

I also think that what some people are really looking for is a way to register extensions (and especially documentation for them) on some central site like gedcom.io for others to use to look up what has been defined and used by others.

Norwegian-Sardines commented 2 years ago

You are thinking in terms of the user looking at a GEDCOM and determining what a tag means. BUT what about the developer of the import routine? They would most likely either kick the tag(s) to an error report if they don’t understand it and wash their hands of the whole thing, or, if their program thinks it knows the tag(s), plop the data into some structure that probably does not make any sense to the user. Neither solution works for the end user.

And then there are the extension tags that are not really extensions but misunderstanding of GEDCOM and creating an extension that could be covered under a standard tag. I’ve seen this with many current GEDCOM imports, particularly with sourcing, but also creating new fact/events when a subtag already exists!

dthaler commented 2 years ago

You are thinking in terms of the user looking at a GEDCOM and determining what a tag means. BUT what about the developer of the import routine? They would most likely either kick the tag(s) to an error report if they don’t understand it and wash their hands of the whole thing, or, if their program thinks it knows the tag(s), plop the data into some structure that probably does not make any sense to the user. Neither solution works for the end user.

If "you" means me, I'm thinking about it as a developer of an application that uses GEDCOM as its native format. I would use the SCHMA.TAG to get the URI and use that to look up the correct entry in my tag table, in my import routine. I haven't done that yet in my app, but it's on my list to do as I support more GEDCOM 7 features, and I understand how to do it. (And this thread may motivate me to do so real soon now.)
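
The lookup described here (SCHMA.TAG gives the URI, the URI selects the entry in the application's tag table) might look like the following minimal sketch. The table `KNOWN_URIS`, the function `import_structures`, the internal field name `rufname`, and all example URIs are hypothetical; unrecognised structures are kept rather than discarded.

```python
# Hypothetical table of the importing application: URI -> internal field.
KNOWN_URIS = {"https://example.com/ext/rufname": "rufname"}


def import_structures(structures, schema, known_uris=KNOWN_URIS):
    """Split (tag, payload) pairs into recognised fields and preserved leftovers."""
    fields, preserved = {}, []
    for tag, payload in structures:
        uri = schema.get(tag)  # None for undocumented extension tags
        field = known_uris.get(uri)
        if field is not None:
            fields[field] = payload
        else:
            preserved.append((tag, uri, payload))
    return fields, preserved


# Two files that use different tags for the same URI.
schema_a = {"_RUFNAME": "https://example.com/ext/rufname"}
schema_b = {
    "_EXTTAG00078": "https://example.com/ext/rufname",
    "_XYZ": "https://example.net/unknown",
}

fields_a, kept_a = import_structures([("_RUFNAME", "Friedrich")], schema_a)
fields_b, kept_b = import_structures(
    [("_EXTTAG00078", "Friedrich"), ("_XYZ", "42")], schema_b
)
```

Both files end up in the same internal field even though the tags differ, because only the URI is compared. A URI the application has never seen still lands in the preserved list, which is the case the following replies argue about.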

Norwegian-Sardines commented 2 years ago

If "you" means me, I'm thinking about it as a developer of an application that uses GEDCOM as its native format. I would use the SCHMA.TAG to get the URI and use that to look up the correct entry in my tag table, in my import routine.<

But the work of looking it up would be done after you wrote the code for the import and some user of yours two months or a year later gets a GEDCOM with an extension tag that you had not coded for. That user would either lose data or have data dropped into the wrong bucket!

dthaler commented 2 years ago

If "you" means me, I'm thinking about it as a developer of an application that uses GEDCOM as its native format. I would use the SCHMA.TAG to get the URI and use that to look up the correct entry in my tag table, in my import routine.<

But the work of looking it up would be done after you wrote the code for the import and some user of yours two months or a year later gets a GEDCOM with an extension tag that you had not coded for. That user would either lose data or have data dropped into the wrong bucket!

It sounds like you're assuming one has to write explicit code for each extension tag. Not all apps do that. Some (like mine, but I'm aware of others) preserve extension tags without knowing the meaning, e.g., just displaying the payloads to users, or preserving them when exporting. In neither case is data lost or dropped into the wrong bucket per se. I think worst case is just that the data becomes out of date if it's not modified when something else changes that it is supposed to be consistent with.

But that just leads to a recommendation (not a requirement) that extension tags not have values that have to be updated based on something else in the database. One might have rules like "if the payload of a given structure changes, remove all unknown extension substructures under it" which can help in some cases.
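
A rule of that kind could be sketched as follows. This is only an illustration under assumptions: structures are modelled as plain dictionaries with `tag`, `payload` and `children` keys, "unknown extension" is approximated as "tag starts with an underscore and is not in the application's set of known tags", and `_XCOORD` is a made-up extension tag.

```python
def set_payload(node, new_payload, known_tags=frozenset()):
    """Update a payload; if it changed, drop unknown extension substructures."""
    if node["payload"] == new_payload:
        return  # nothing changed, so nothing can have become stale
    node["payload"] = new_payload
    node["children"] = [
        child
        for child in node["children"]
        if not (child["tag"].startswith("_") and child["tag"] not in known_tags)
    ]


# A place structure with one standard and one unknown extension substructure.
place = {
    "tag": "PLAC",
    "payload": "Oslo, Norway",
    "children": [
        {"tag": "LANG", "payload": "nb", "children": []},
        {"tag": "_XCOORD", "payload": "59.91,10.75", "children": []},
    ],
}
set_payload(place, "Bergen, Norway")
```

After the change only the standard `LANG` substructure remains, since the application cannot tell whether `_XCOORD` still fits the new payload. If the payload is left untouched, the unknown substructure is preserved as is.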

Norwegian-Sardines commented 2 years ago

It sounds like you're assuming one has to write explicit code for each extension tag. Not all apps do that. Some (like mine, but I'm aware of others) preserve extension tags without knowing the meaning, e.g., just displaying the payloads to users, or preserving them when exporting.<

Most of the mainstream programs (if not all) convert the incoming GEDCOM into a database that in most cases will not have a place to put all GEDCOM data. This occurs today, and I don’t see them normalizing a database that supports all GEDCOM tags.

While your program may be a “Display as import”, these mainstream programs and a large majority of others are not. The software I’m associated with both translates GEDCOM tags to actual words as well as translates the tags to multiple languages. For Example: BIRT = Birth in English, Fødsel in Norwegian, Geburt in German, γέννηση in Greek. Etc.

Standard Enumeration would also be translated as well!

Norwegian-Sardines commented 2 years ago

But actually I was responding to this statement, and the “you” was a reference to the programmer of most programs that interpret, internationalize, populate a database field, or otherwise display the tag code (ie, _IMFL) as a real word or concept!

A misunderstanding about Tags vs URIs. Everything you say about disagreements and collisions between extension authors is true of 5.5.1 and earlier extensions, which are permitted in 7.0 for backwards compatibility as undocumented extensions; but 7.0's documented extensions are resistant to such collisions because it is URIs, not tags, that define their meaning. Merging files from programs that used conflicting extension tags can be handled by changing the extension tags without changing the URIs.<

URIs are still a problem with a merged file if you changed the tag name when there are collisions or just general use because “you” (the program) would never know what the incoming tag means, where to store it in the database that most mainstream application use, or how to internationalize the display name.

I fail to understand how the following URI examples can tell an importing program what the tag means.

2 TAG _GX_TITLE http://gedcomx.org/Title
2 TAG _GX_PRIMARY http://gedcomx.org/Primary
2 TAG _GX_SECONDARY http://gedcomx.org/Secondary

ghost commented 2 years ago

The name of the tag as used in the GEDCOM has no intrinsic significance, and so could be renamed with no consequences. For instance, a simple suffix to differentiate similar names, such as custom_1 and custom_2 where "custom" was a clashing tag name.

The alternative using namespace prefixes (e.g. my:tag versus your:tag) would be bullet-proof but was universally voted against (I was the only advocate) because namespaces are too complicated for many developers.

Tony

albertemmerich commented 2 years ago

Tony, the extension tag may be renamed. That is no problem, if the URI works. The problem is the mapping of the URIs to data fields within the application. I have to map the URI to the internal data fields to show the content to the user. If I have a mapping of one URI to my internal _IMPF data field, and I get another URI from another GEDCOM file, I have to check whether it can be mapped to the same data field. Only after mapping can I present the data with the correct structure in my user interface.

The more different URIs we have for the same element, the more checks I have to do. I do not see a way to check this automatically. This may result in several different _RUFNAME data fields in the program as long as I cannot identify the different URIs for it.

In the 5.5.1 world we have an unofficial list of names for extension tags, and with it the information about which programs are using these tags. That could be handled. GEDCOM 7 is destroying this system, as we do not know how every application will build its own URI for any tag of this 5.5.1 list, and maybe applications will start to rename the tags according to different URI systems. Maybe the better solution is to stay with the 5.5.1 solution and use undocumented tags in GEDCOM 7, as the URI will not help. This is what I am doing so far in my application with all extension tags that I know several programs are using. So far I am ignoring any SCHMA.TAG at import, as it does not help me.

Sorry - I (and with me my German colleagues) do not understand how an application can merge GEDCOM files from different other applications using different URI systems... We would need a central register which maps URIs to the extension tags our applications support, or which maps the URIs of one application to the URIs of the next application. Within my application the automatic mapping is only to my own URI, as I defined it.

Therefore we are asking for one URI web space, to avoid an amount of new application specific tag URIs which cannot be handled any more.
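
To make the idea of such a register concrete, here is a minimal sketch under assumptions: the table `CANONICAL` stands in for a shared register of equivalent URIs, `FIELDS` for one application's own mapping, and every URI shown is invented. Nothing like this exists in the spec; someone would have to maintain the equivalences by hand.

```python
# Hypothetical shared register: application-specific URI -> agreed URI.
CANONICAL = {
    "https://example.com/ext/rufname": "https://example.org/registry/_RUFNAME",
    "https://example.net/tags/callname": "https://example.org/registry/_RUFNAME",
}

# Hypothetical mapping of one application: agreed URI -> internal data field.
FIELDS = {"https://example.org/registry/_RUFNAME": "rufname"}


def resolve(uris):
    """Map incoming URIs to internal fields; list the ones needing manual review."""
    mapped, unresolved = {}, []
    for uri in uris:
        canonical = CANONICAL.get(uri, uri)
        field = FIELDS.get(canonical)
        if field is None:
            unresolved.append(uri)
        else:
            mapped[uri] = field
    return mapped, unresolved


mapped, unresolved = resolve([
    "https://example.com/ext/rufname",
    "https://example.net/tags/callname",
    "https://example.info/something-else",
])
```

With the register in place, both application-specific URIs land in the same data field, and only the URI missing from the register is left for a human to check.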

Albert

ghost commented 2 years ago

I am not 100% clear on your argument, Albert. Is it simply that you want to display GEDCOM data to the end-user, and so want a mapping of URI -> recognisable tag name (not annotated, or totally renamed)?

If this argument was presented by two different vendors using the same tag then there would be no solution.

The URI scheme was added to prevent clashes in 5.5.1 that would be totally ambiguous, and so unworkable.

I would argue that 95% of end-users wouldn't understand GEDCOM, even if it fell on their foot, but I acknowledge that you must have some application requirement for it. I apologise because I don't use any public software, other than my own.

Tony

Norwegian-Sardines commented 2 years ago

In version 5.3 of the GEDCOM specification the following was used:

USER_TAG_SCHEMA:=
n <>
+m LABL
+m DEFN
+m ISA

With an example of:

n SCHEMA
+1 INDI
+2 BIRT
+3 _HOSP
+4 LABL
+4 DEFN
+4 ISA
+3 _NURSE
+4 LABL
+4 DEFN
+4 ISA

An incoming GEDCOM “from the wild” would have the following information so that any program could better understand and either fully display the new tag or report an unknown value to the user.

0 HEAD
1 SCHEMA
2 INDI
3 BIRT
4 _HOSP
5 LABL Hospital
5 DEFN The hospital name a person was born in.
5 ISA Text

NOTE: I’m not necessarily advocating for this specific solution but it gives a better and more useful definition of the incoming, “from the wild”, extension tag for a load program where they either have a defined database structure or programs that use GEDCOM directly. The extension tag could have been “_XYZ2” with the same LABL of “Hospital” and an import program would still have a fighting chance of displaying the data with an understandable label and/or provide the user importing into their program a useable error message describing the unknown data.
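
As an illustration of how an import routine could use such an in-file definition, here is a minimal sketch. It assumes the layout of the example above (LABL, DEFN and ISA directly under the extension tag inside a SCHEMA block); the function names `read_labels` and `label_for` are made up for this sketch.

```python
def read_labels(lines):
    """Map each extension tag in a SCHEMA block to its LABL/DEFN/ISA values."""
    info = {}
    stack = []  # tags of the current line and its ancestors, indexed by level
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        level, tag = int(parts[0]), parts[1]
        payload = parts[2] if len(parts) > 2 else ""
        del stack[level:]
        stack.append(tag)
        if (
            tag in ("LABL", "DEFN", "ISA")
            and len(stack) >= 2
            and stack[-2].startswith("_")
            and "SCHEMA" in stack[:-1]
        ):
            info.setdefault(stack[-2], {})[tag] = payload
    return info


def label_for(tag, info):
    """Return a display label for a tag, falling back to the raw tag."""
    return info.get(tag, {}).get("LABL", tag)


sample = """0 HEAD
1 SCHEMA
2 INDI
3 BIRT
4 _HOSP
5 LABL Hospital
5 DEFN The hospital name a person was born in.
5 ISA Text""".splitlines()

info = read_labels(sample)
```

Here `label_for("_HOSP", info)` yields the label from the file, while a tag with no definition falls back to the tag itself, so the user at least sees something that can be reported.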

albertemmerich commented 2 years ago

Tony wrote: "I am not 100% clear on your argument, Albert. Is it simply that you want to display GEDCOM data to the end-user, and so want a mapping of URI -> recognisable tag name (not annotated, or totally renamed)?"

Let me try again:

What we all need is to understand what information the extension tag is carrying. Example Rufname: if I have a data field for "Rufname" in my application, and I am using an extension tag _RUFNAME to export it, then at import I need to know whether a URI is defining this Rufname, so that I may rename the incoming extension tag to _RUFNAME. If my application cannot map the incoming tag by interpreting the URI coming with it, this mapping does not work.

A lot of applications are using _RUFNAME in the same way, and with 5.5.1 we import it by mapping our Rufname data field to the tag name _RUFNAME. In GEDCOM 7.0 we will have the situation that the same data are coded using a URI defined by the application (specific to that application), and it is no longer the tag name that defines the tag, but the URI. Then we do not have an agreed mapping URI <=> Rufname between the applications, as we had with _RUFNAME <=> Rufname. The tag may be renamed to _EXTTAG00078 or so during any import.

What we see as the problem with GEDCOM 7 is that applications are supposed to create URIs to define extTags in SCHMA.TAG and cannot use a system of existing URIs (maybe on a central web space). This will result in each application creating its own URI describing the data (e.g. for Rufname) in its own words, in its own language. How will we be able to interpret this mess of incoming URIs, especially if tags are renamed arbitrarily during import into programs and the name of the tag does not help any more? You must rely on the URI, but there is no idea of what an application-interpretable URI should look like...

ghost commented 2 years ago

I think I understand, Albert. I apologise for being a little slow.

The idea was that the URI uniquely defines the tag semantics, and so the mapping you outlined below should always work (i.e. no one else should mess with a proprietary URI).

However, I think you are saying that several vendors may have a _RUFNAME extension, but with their own URIs, and you are expecting to catch them all. This doesn't sound like a good idea, and is one of the reasons I was hoping that RUFNAME would be accepted as a standard tag -- there would then be no ambiguity when looking at the URI.

Because the _RUFNAME extension is so common, we need a single URI, defined either by FS or by your group over there, Albert ... IMHO of course.

Tony

On 19/07/2022 18:23, Albert Emmerich wrote:

If my application cannot map the incoming tag by interpreting the URI coming with it, this mapping does not work.

tychonievich commented 2 years ago

Discussed in steering committee meeting.

1. In many cases, we can continue to use undocumented extension tags for existing extensions; per the spec, these have a meaning defined by their tag name. This will not always work, though, as some tags have collisions (e.g. _PLAC, _TODO) and would need a URI to be disambiguated. Because of these collisions, and for future collision avoidance, creating URIs for existing extensions may be wise.

2. We would like to create a repository of extensions (documented and not) like we have for EXID.TYPE; the details of what this should look like are not yet obvious. We expect we could benefit from what GEDCOM-L has done at https://wiki.genealogy.net/GEDCOM/_Nutzerdef-Tag. This repository could help prevent multiple URIs being defined for the same purpose.

Some of the other topics discussed above may be better pulled out into separate issues.

Norwegian-Sardines commented 2 years ago

Guys, the talk of URIs is just not answering the question I’m asking: what in a URI tells me anything about the meaning of the tag, for a program that is importing an exTag “from the wild” generated by another program? Tony has said he understands that many programs that base their exTag on GEDCOM-L and call their exTag _RUFNAME may have their own URI, so a standard URI should be created to support all of them. BUT in the wild my program can’t magically know what the URI means, and the data will either be dropped or mislabeled. The URI does not disambiguate anything; all it does is say, “dear importing program, I’ve sent you a piece of data that you don’t know anything about, figure it out on your own!”

Would someone please answer this basic question?

I’ve worked with Data Dictionaries and data interchange software in the past. The inclusion of a Data Dictionary, passed from the GEDCOM-creating program to the GEDCOM-receiving program as part of the gedZip file, would be a great advantage!

What if my program generated a GEDCOM that had an exTag of _xyzRUF but it still meant Rufname? How would any receiving program figure out what to do with the information based on your magical URI?

Norwegian-Sardines commented 2 years ago

Albert said in his initial entry:

We do not see an automatic tool to compare two definitions of extension tags - created independently by different applications - and decide whether they describe the same tag or not. So importing extension tags will result in a mess of renamed tags if we cannot use same definitions for same purposes...<

And I still don’t see an answer! How or where does the URI help with this question?

albertemmerich commented 2 years ago

steering committee stated yesterday: "2. We would like to create a repository of extensions (documented and not) like we have for EXID.TYPE; the details of what this should look like are not yet obvious."

Let's look at URI which were defined together with GEDCOM 7 spec, like https://gedcom.io/terms/v7/PLAC Those URIs define the meaning of the tag, the super- and substructures, and explain some special situations. If we had a repository with URIs like this defining extension tags, too, applications are able to map the URI to their internal structure. For backward compatibility reasons we should assign a "recommended name" to the extension tags in 7.0, so the 5.5.1 way via defining tags by name (see https://wiki.genealogy.net/GEDCOM/_Nutzerdef-Tag) will be a starting point and the URI definitions could link the so far established extension tag names like _RUFNAME with URIs clearly defining its purpose and structures. This documentation put in a repository which is referenced in the official GEDCOM 7.0 documents will help a lot to avoid multiple URIs for same purpose, will help the applications to identify the exTags / URIs they support, and will be a starting point for discussions whether to add an extension tag as standard tag in next standard versions.

The first step should be to define the internal structure of the URI definitions and to decide whether we make them computer-readable for automatic interpretation by applications, or stay with the well-structured form used for GEDCOM 7 standard tags.

The repository could take definitions for 5.5.1 extensions, too. So the URI content should show for which versions of the standard the extension tag is defined.

For well-documented extension tags, such as those defined and used by the Gedcom-L group, it is possible to transform existing definitions (see https://genealogy.net/GEDCOM/), so this can be a rather quick starting point for the repository.

albertemmerich commented 2 years ago

The following is a draft of a URI definition for the extension tag _RUFNAME:

%YAML 1.2

type: structure

uri: #####/v7/_RUFNAME

extension tag, recommended tag name: _RUFNAME

descriptions:

Rufname (German)

the underlined given name within multiple given names of a person in an official (German) document

payload: http://www.w3.org/2001/XMLSchema#string

the payload must be one of the given names in payload of NAME, TRAN or GIVN

substructures: []

superstructures:

"https://gedcom.io/terms/v7/INDI-NAME": "{0:1}"

"https://gedcom.io/terms/v7/NAME-TRAN": "{0:1}"

...

It could be published in an official repository ##### (tbd).
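The payload rule in the draft (the Rufname must be one of the given names found in NAME, TRAN or GIVN) is the kind of constraint an importing application could check. Below is a rough sketch of such a check. The function names `given_names` and `rufname_is_valid` are invented for this example, and the name parsing is deliberately naive: everything outside the `/surname/` slashes is treated as a given name, so a suffix such as "Jr." would wrongly count as one.

```python
def given_names(name_payload):
    """Return the words of a NAME payload that lie outside the /surname/ part."""
    before, _, rest = name_payload.partition("/")
    _, _, after = rest.partition("/")
    return (before + " " + after).split()


def rufname_is_valid(rufname, name_payload, givn_payload=None):
    """Check the draft rule: the Rufname must be one of the given names."""
    candidates = set(given_names(name_payload))
    if givn_payload:
        # Tolerate comma-separated GIVN payloads as well as space-separated ones.
        candidates.update(givn_payload.replace(",", " ").split())
    return rufname in candidates


print(rufname_is_valid("Peter", "Johann Peter /Müller/"))
print(rufname_is_valid("Hans", "Johann Peter /Müller/"))
```

A definition that is only human-readable leaves this check to each programmer's reading of the text, which is one argument for deciding early whether the definitions should be computer-readable.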

ghost commented 2 years ago

I don't like this -- not that this is going to influence people.

The whole object of using URIs was to uniquely identify the tags, and avoid the clashes where raw extension tags were previously used. Hence, by that rationale, if _RUFNAME (for instance) has a common definition and interpretation among a group of providers then there should be a single accepted URI used by those providers. Trying to reverse-engineer URIs to old-style tag-names is completely against the design goal of GEDCOM 7.

As for repositories, documented functionality, and (in the future) enhanced meta-data for extension tags, not all tags are defined to be interpreted by other software. This may be obvious but I just want to mention it. My product has about 5 or 6 extension tags that are used simply to persist data that my program supports, and that is of no use to other software.

Tony

(Sent by email in reply to Albert Emmerich's comment of 20/07/2022 10:02, which appears in full above.)

Norwegian-Sardines commented 2 years ago

ACProctor said:

My product has about 5 or 6 extension tags that are used simply to persist data that my program supports, and that is of no use to other software.

Then why send it to me in a GEDCOM?

I’ve always been against non-standard tags (and their inclusion in the specification since the early 1980s) being sent in a GEDCOM, and more importantly ones that have no genealogical meaning (I’m thinking, for example, of tags defining photo zoom, face recognition or printing instructions). Programs should always send only standard GEDCOM, as defined in the specific release of the Standard. The use of “extensions” in the transfer of data between two dissimilar programs should be eliminated, because in general the data will most likely be dropped or misused by the receiving program, which will not understand its use!

I get that programs need to have additional data points in their application database to support concepts and enhancements they implement, but sending that data to an unknown program and thinking for a minute that it will be understood is ridiculous!

I’d rather see GEDCOM be a living, breathing concept that thru knowledgeable genealogists gets periodic updates that fix missing data-points.

If Rufname is an important concept that is unique in itself, then it needs to be added to the NAME subtag set. However, if Rufname can be normalized to a common term such as ‘Call Name’ with a TYPE tag of ‘Rufname’, where the new ‘Call Name’ subtag can also have a TYPE of ‘Preferred’, ‘Nick’ or ‘Other’ (for example), then I think this would be a better solution, and easier to expand.

ghost commented 2 years ago

They have meaning in my genealogical product, and so are important if the file gets loaded back into my product after sharing with other users of it.

So, I'm not "sending to you" and you can simply ignore it. I do not acknowledge any problem here.

Tony

(Sent by email in reply to Norwegian-Sardines's comment of 11/10/2022 14:26, which appears in full above.)

Norwegian-Sardines commented 2 years ago

ACProctor said:

So, I'm not "sending to you" and you can simply ignore it. I do not acknowledge any problem here.

If your GEDCOM comes to me for merging into another application then it will need to parse out GEDCOM rather than reading all of the GEDCOM into a record! So while you don’t acknowledge any problem, it is a problem for the receiving program.

If extensions to GEDCOM are allowed, then all extensions need to be parsed by the receiving program and resolved for display and/or reassignment.

I would hope that programs that generate extensions would have the option of not generating those extensions.

Just like I would hope that programs would stop using extensions where perfectly good standard tags exist. _MARNM comes to mind. This should generate a second NAME tag with a TYPE of married (a v5.5.1 reference).
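As a rough sketch of that idea, the function below rewrites a `2 _MARNM` line under a `1 NAME` into a second `1 NAME` structure with `2 TYPE married`. It rests on assumptions that the exporting program would have to confirm: that the _MARNM payload is just the married surname, that the given names are whatever precedes the first slash of the NAME payload, and that the lowercase v5.5.1 value `married` is wanted. The function name `marnm_to_name` is invented for this example.

```python
def marnm_to_name(lines):
    """Replace each NAME._MARNM line by an extra NAME structure of TYPE married."""
    out = []
    pending = []      # extra NAME lines, emitted once the current NAME ends
    given = ""
    in_name = False
    for line in lines:
        parts = line.split(" ", 2)
        level, tag = int(parts[0]), parts[1]
        payload = parts[2] if len(parts) > 2 else ""
        if in_name and level <= 1:
            # The current NAME structure is complete: emit the extra names.
            out.extend(pending)
            pending = []
            in_name = False
        if level == 1 and tag == "NAME":
            in_name = True
            given = payload.split("/")[0].strip()
        if in_name and level == 2 and tag == "_MARNM":
            pending += [f"1 NAME {given} /{payload}/", "2 TYPE married"]
            continue  # drop the extension line itself
        out.append(line)
    out.extend(pending)
    return out


RECORD = [
    "0 @I1@ INDI",
    "1 NAME Mary /Jones/",
    "2 GIVN Mary",
    "2 _MARNM Smith",
    "1 SEX F",
    "0 TRLR",
]

for line in marnm_to_name(RECORD):
    print(line)
```

Doing this on export keeps the married name readable by any program that understands NAME and TYPE, with no extension involved.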

ghost commented 2 years ago

As I said, this particular case of extension tags can be ignored by other products, as was the case in older versions. They are single records with plain-text payloads, which makes it a trivial operation.

Tony

(Sent by email in reply to Norwegian-Sardines's comment of 11/10/2022 15:32, which appears in full above.)

Norwegian-Sardines commented 2 years ago

As such it would also be a trivial operation to not send the extensions in the first place.

I’m reminded of a guiding principle of database design and program development: never generate data that is invalid or undefined. One of the primary reasons “Better GEDCOM” and its successor FHISO were created, and much needed, was to standardize the sharing and transfer of genealogical data between applications. By including undefined and/or non-genealogical data in a GEDCOM, the value of the GEDCOM is reduced.

However, since you will probably continue sending these tags, we will have to agree to disagree! I’m out on this specific topic!

albertemmerich commented 2 years ago

Any application that does not use or interpret extension tags has no problem deleting all extension tags at import. That takes only very little parsing, as these tags start with an underscore and so can be identified easily. However, I cannot accept that all other applications should be limited to the standard tags, and so could not exchange more data than the standard tags define so far. GEDCOM 7.0 was a small step compared to the data actually exchanged so far: there are a lot of extension tags in the wild, and many are shared by several applications. Yes, it is a goal to standardize these data exchanges more and more. But on the other hand, I doubt many applications would follow a new standard covering all (or most) of the known extension tags, because the modification of the standard would be huge. So for a long time we will need extension tags. And yes, my application surely will export them.

_RUFNAME may be solved by a new version of the NAME tag and its subtags. That is on the way with the activities around NAME. What about the _LOC tag, and the location records? This is a bigger step. But it is needed, too. I know three versions in the wild that define location records. My application prefers the most far-reaching version, that of GEDCOM-L, but can read and export the other versions, too. We need to define a standardized version for this, too. This was not possible in the short time of developing GEDCOM 7.0; maybe we will see it in GEDCOM 8.0? But there will still be other structures which remain open...
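To illustrate how small that parsing step is, here is a minimal sketch that drops every structure whose tag starts with an underscore, together with all of its substructures. It assumes well-formed lines of the form "level [xref] tag [payload]", and the function name `strip_extensions` is made up for this example.

```python
def strip_extensions(lines):
    """Drop every structure whose tag starts with '_', plus its substructures."""
    kept = []
    skip_below = None  # level of the extension structure currently being dropped
    for line in lines:
        parts = line.strip().split(" ", 3)
        if len(parts) < 2:
            continue  # ignore blank or malformed lines
        level = int(parts[0])
        # Records carry a cross-reference id between the level and the tag.
        has_xref = parts[1].startswith("@") and len(parts) > 2
        tag = parts[2] if has_xref else parts[1]
        if skip_below is not None:
            if level > skip_below:
                continue  # still inside the dropped structure
            skip_below = None
        if tag.startswith("_"):
            skip_below = level
            continue
        kept.append(line)
    return kept


SAMPLE = [
    "0 HEAD",
    "0 @I1@ INDI",
    "1 NAME Johann Peter /Müller/",
    "2 _RUFNAME Peter",
    "1 _UID abc",
    "2 _SUB x",
    "1 SEX M",
    "0 @L1@ _LOC",
    "1 NAME Berlin",
    "0 TRLR",
]

for line in strip_extensions(SAMPLE):
    print(line)
```

Note that whole extension records such as `_LOC` disappear as well, including the standard tags nested inside them, so this is only suitable for an application that has decided to ignore extensions entirely.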

For my application I know which data structures I cover. And in most cases I know how other programs export them, if they cover them, too. Then it is no big step to ignore the unknown/uncovered tags - and to work to understand them, if there are indications that they might be useful for my application. But that is far from having a standard for all of this stuff.

So, I am in favor of Tony's position. And I am in favor of defining more and more standard structures in next GEDCOM versions for data which today are exchanged by extension tags...

Norwegian-Sardines commented 2 years ago

What about _LOC tag, and the location records? This is a bigger step.

Yes, my suggestion would be to develop a tag and record similar to NOTE and SNOTE. Call them PLAC and SPLAC! And see if it can be added in at v7.1. I would like this option as well!

I’m in favor of defining more standard structures as well, but sending a structure (for example) _MARNM when a perfectly good standard structure is also available should be a no-no.

I’ve seen cases where an extension such as _HUSB is used in a same-sex marriage, so that the code can determine the sex of the partner without looking up the sex in the INDI record. If I export to this program and they expect a _HUSB, but I send them a WIFE, they would identify that man as a female. Or they send me a GEDCOM with a _HUSB and I drop the connection in my FAM record. That would cause issues: bidirectional links are broken, and a family record loses a partner.

The above is just an example of where extensions that relate to functionality go off kilter! I don’t need a solution!

jl5000 commented 1 year ago

@Norwegian-Sardines I completely agree with everything you say. I share your concern. This has the potential to get very messy very quickly.