FamilySearch / GEDCOM

Apache License 2.0
164 stars 21 forks source link

grammar.gedstruct has infinite loops #424

Open augean opened 8 months ago

augean commented 8 months ago

in https://github.com/FamilySearch/GEDCOM/blob/main/extracted-files/grammar.gedstruct
we have types, that reference types going round in a loop , for example, look at the pattern we see a loop starting at line 569

hence, the schema has become an infinite loop, and we can't build from it GEDCOM 5.5.1 carefully avoided this type of error.

` line=35 " +3 <> {0:1}" in HEADER line=52 " +1 <> {0:1}" in HEADER line=569 " +1 <> {0:M}" in NOTE_STRUCTURE line=614 " +2 <> {0:1}" in SOURCE_CITATION line=623 " +1 <> {0:M}" in SOURCE_CITATION line=624 " +1 <> {0:M}" in SOURCE_CITATION line=569 " +1 <> {0:M}" in NOTE_STRUCTURE line=614 " +2 <> {0:1}" in SOURCE_CITATION line=623 " +1 <> {0:M}" in SOURCE_CITATION line=624 " +1 <> {0:M}" in SOURCE_CITATION line=569 " +1 <> {0:M}" in NOTE_STRUCTURE line=614 " +2 <> {0:1}" in SOURCE_CITATION line=623 " +1 <> {0:M}" in SOURCE_CITATION line=624 " +1 <> {0:M}" in SOURCE_CITATION line=569 " +1 <> {0:M}" in NOTE_STRUCTURE line=614 " +2 <> {0:1}" in SOURCE_CITATION line=623 " +1 <> {0:M}" in SOURCE_CITATION line=624 " +1 <> {0:M}" in SOURCE_CITATION line=569 " +1 <> {0:M}" in NOTE_STRUCTURE line=614 " +2 <> {0:1}" in SOURCE_CITATION line=623 " +1 <> {0:M}" in SOURCE_CITATION line=624 " +1 <> {0:M}" in SOURCE_CITATION line=569 " +1 <> {0:M}" in NOTE_STRUCTURE line=614 " +2 <> {0:1}" in SOURCE_CITATION line=623 " +1 <> {0:M}" in SOURCE_CITATION line=624 " +1 <> {0:M}" in SOURCE_CITATION line=569 " +1 <> {0:M}" in NOTE_STRUCTURE line=614 " +2 <> {0:1}" in SOURCE_CITATION line=623 " +1 <> {0:M}" in SOURCE_CITATION line=624 " +1 <> {0:M}" in SOURCE_CITATION line=569 " +1 <> {0:M}" in NOTE_STRUCTURE line=614 " +2 <> {0:1}" in SOURCE_CITATION line=623 " +1 <> {0:M}" in SOURCE_CITATION line=624 " +1 <> {0:M}" in SOURCE_CITATION line=569 " +1 <> {0:M}" in NOTE_STRUCTURE java.lang.RuntimeException: Recursive

`

Norwegian-Sardines commented 8 months ago

GEDCOM 5.5.1 carefully avoided this type of error.

I disagree, v5.5.1 has this issue as well.

augean commented 8 months ago

Actually, now you have jogged my memory,
When I created the machine readable ged.5.1.1.txt, I think I had to remove something to prevent infinite loops At least we agree, that this is still an issue., even if not a regression.

Norwegian-Sardines commented 8 months ago

The problem is that the term “source” is used incorrectly! The design and people say, the provider of a fact, object or note/statement is the “source” of the information, we should be using a citation referencing the provider. Therefore, an object, note, fact are cited, but a citation would never have an object, note or fact associated with it (no provider), so no circular reference! GEDCOM needs a citation record and deprecate the “Source_Citation” and the Source_Record!

tychonievich commented 8 months ago

The spec already identifies this case explicitly

A NOTE_STRUCTURE can contain a SOURCE_CITATION, which in turn can contain a NOTE_STRUCTURE, allowing potentially unbounded nesting of structures. Because each dataset is finite, this nesting is also guaranteed to be finite.

I do not see an issue here. This is as designed. Can you clarify what is wrong with it as it stands?

augean commented 8 months ago

To fix this you would need a new "NOTE_STRUCTURE_FOR_SOURCES"

"A NOTE_STRUCTURE can contain a SOURCE_CITATION, which in turn can contain a NOTE_STRUCTURE_FOR_SOURCES"

NOTE_STRUCTURE_FOR_SOURCES doesn't contain any SOURCE_CITATION's (preventing looping)

It would be very helpful if this was part of the spec.

tychonievich commented 8 months ago

The option of deep nesting is a feature that has been supported since NOTE.SOUR was added in version 5.5 (1996) and has valid semantic meaning. To remove it would be a breaking change, and thus could not happen before version 8.0 at the earliest.

That said, while I've never gotten to arbitrary depth I have used the SOUR.NOTE.SOUR.NOTE several times to express things that really benefit from this structure, and I'd be sad to see this removed from the spec. I'm also not seeing why we need to remove it, unless it is to support the specific operation of enumerating all possible paths.

A cleaned up example of how it can be used to convey valid semantic information is

1 BIRT
2 SOUR @S1@
3 NOTE The applicability of that source to this event has been challenged
4 SOUR @S2@
5 NOTE Several blogs have weighted in on the validity of this source's arguments in this case
6 SOUR @S3@
6 SOUR @S4@

...where S1 is the source record, S2 is an article questioning it, S3 and S4 are blog posts about if S2 applies to S1 or not. And there's no need for this to stop at any given depth: I might have notes about if S3 actually applies to my note on S2, and sources for that note, and so on.

Norwegian-Sardines commented 8 months ago

First let me say that I'm not advocating the removal of these relationship per se!

I have actually seen this in a GEDCOM I received!

An image O1 was added and some one wanted to give credit to the provider of the image, so they created a "Source of the Image" record S2. (An aside: This is why I indicated that images, notes, and facts should not have "sources" but should have a citation record that has no lower links).

S2 receives a NOTE N3 describing some additional information about the source.

At a later date a person looking at N3 decides that the NOTE needs a link to the image O1 because the NOTE refers to something in the image, and when displaying the NOTE N3 they want to see the image O1 along side the note.

And thus we have a loop.

Again I'm not advocating changing the current relationships, an object should have notes and notes should have objects, but something in the program implementing these relationship needs to stop O1 from having a link to N3 when N3 already has a link to O1! This is a simple check, but not so simple in the first example!

augean commented 8 months ago

thanks for the analysis, of this issue it seems like GEDCOM can have an infinite number of paths,
I will think it over, and how this could be implemented