FamilySearch / GEDCOM

Implement reusable CITAtions by using pointers to new CITA structure #348

Open reteP-riS opened 11 months ago

reteP-riS commented 11 months ago

Currently the definition is:

SOURCE_CITATION :=
n SOUR @<XREF:SOUR>@ {1:1} g7:SOUR
+1 PAGE <Text> {0:1} g7:PAGE
+1 DATA {0:1} g7:SOUR-DATA
+2 <<DATE_VALUE>> {0:1}
+2 TEXT <Text> {0:M} g7:TEXT
+3 MIME <MediaType> {0:1} g7:MIME
+3 LANG <Language> {0:1} g7:LANG
+1 EVEN <Enum> {0:1} g7:SOUR-EVEN
+2 PHRASE <Text> {0:1} g7:PHRASE
+2 ROLE <Enum> {0:1} g7:ROLE
+3 PHRASE <Text> {0:1} g7:PHRASE
+1 QUAY <Enum> {0:1} g7:QUAY
+1 <<MULTIMEDIA_LINK>> {0:M}
+1 <<NOTE_STRUCTURE>> {0:M}

When citing a marriage record from an old church book I prefer to record the complete text - sometimes even translate it - and put it into +2 TEXT <Text> {0:M} g7:TEXT. Because marriage records often mention 6 or even more individuals (bride, groom, bride's parents, groom's parents, witnesses) with information about birth date, birth place, age, religion, occupation, place of residence, etc., I have to copy all the details from one SOURCE_CITATION to the other. This may lead to 20 or more copies of identical information. If I later find out that I misread some of the old handwriting, I have to go through all these copies and correct them, which is not only a pain but also error-prone.

I'd like to use reusable citations by using pointers to a new GEDCOM level 0 structure e.g. CITA that would then point to the SOUR record which points to the REPO record.

Instead of

1 OCCU butcher
2 DATE ABT 1850
2 PLAC New York City, New York, USA
2 SOUR @S123@
3 PAGE 456
3 DATA
4 TEXT ... butcher ...
5 CONT ... New York City ...

use something like

1 OCCU butcher
2 DATE ABT 1850
2 PLAC New York City, New York, USA
2 CITA @C789@

and

0 @C789@ CITA
1 SOUR @S123@
2 PAGE 456
2 DATA
3 TEXT ... butcher ...
4 CONT ... New York City ...
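
To illustrate how a reader of such a file might follow the proposed pointer, here is a minimal Python sketch. It assumes the CITA tag and the @C789@ identifier from the example above; the wrapping `0 @I1@ INDI` line and the names `parse` and `resolve_citations` are hypothetical and only serve the illustration.

```python
import re

# One GEDCOM line: level, optional @XREF@, tag, optional payload.
LINE = re.compile(r"^(\d+) (?:(@[^@ ]+@) )?(\S+)(?: (.*))?$")

def parse(text):
    """Parse GEDCOM text into a list of level-0 record trees (nested dicts)."""
    records, stack = [], []
    for raw in text.strip().splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # this sketch simply skips blank or malformed lines
        level = int(m.group(1))
        node = {"xref": m.group(2), "tag": m.group(3),
                "value": m.group(4) or "", "children": []}
        del stack[level:]  # keep only the ancestors of this line
        if level == 0:
            records.append(node)
        else:
            stack[level - 1]["children"].append(node)
        stack.append(node)
    return records

def resolve_citations(records):
    """Yield (fact_tag, citation_record) for every CITA pointer under a fact."""
    index = {r["xref"]: r for r in records if r["xref"]}
    for rec in records:
        for fact in rec["children"]:
            for sub in fact["children"]:
                if sub["tag"] == "CITA":
                    yield fact["tag"], index[sub["value"]]

# Hypothetical sample; the INDI wrapper is added only to make it a full record.
SAMPLE = """
0 @I1@ INDI
1 OCCU butcher
2 DATE ABT 1850
2 CITA @C789@
0 @C789@ CITA
1 SOUR @S123@
2 PAGE 456
"""

for fact_tag, citation in resolve_citations(parse(SAMPLE)):
    print(fact_tag, "->", citation["xref"], citation["children"][0]["value"])
```

Because every fact stores only the pointer, correcting the transcription in the one CITA record would correct it for all facts that cite it.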

Hope this makes sense.

hartenthaler commented 11 months ago

It makes sense! Good idea.

However, it is not in line with the GEDCOM standards. Edit: oh, I missed the context. This is the right place for this idea. Sorry.

Norwegian-Sardines commented 11 months ago

A complete redesign and “Normalization” of GEDCOM has been discussed. This can’t happen until GEDCOM v8 or later!

reteP-riS commented 11 months ago

However, it is not in line with the GEDCOM standards.

I am suggesting to add it to the GEDCOM standard.

webtrees-pesz commented 11 months ago

Instead of a new record type I would prefer a more general solution.

In GEDCOM 5.5.1, reference was made to a future concept that makes substructures of records addressable.

Quote from GEDCOM 5.5.1: The pointer represents the association between two objects that usually reside in different records. Objects within a logical record can be associated. If this need exists, the pointer record composition contains an exclamation point (!) that separates the parent record's cross-reference ID from the specific substructure's cross-reference ID, which is at some subordinate level to the logical record at level zero. The cross-reference ID of the substructure subordinate to a zero level record, for inter-record associations is always composed of the Record ID number and the Substructure ID number, such as @I132!1@. Including the Record ID number in the pointer that associates objects within a record will allow the GEDCOM processors to build the index only at the record level and then search sequentially for the appropriate substructure cross-reference ID. The parent record ID is assumed when the cross-reference ID begins with an exclamation point (!) signifying an intra-record association.

Applied to Sir Peter's case presented here, the structure would look as follows:

0 @I132@ INDI
….
1 OCCU butcher
2 DATE ABT 1850
2 PLAC New York City, New York, USA
2 @I132!1@ SOUR @S123@
3 PAGE 456
3 DATA
4 TEXT ... butcher ...
5 CONT ... New York City ...

0 @456@ INDI
….
1 OCCU butcher
2 DATE ABT 1850
2 PLAC New York City, New York, USA
2 SOUR @I132!1@ 
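
The lookup described in the 5.5.1 quote (index only at record level, then search the record sequentially for the substructure ID) could be sketched in Python roughly as follows. The function names are hypothetical, and the sample uses @I456@ for the second record purely for illustration.

```python
import re

# One GEDCOM line: level, optional @XREF@, tag, optional payload.
LINE = re.compile(r"^(\d+) (?:(@[^@ ]+@) )?(\S+)(?: (.*))?$")

def index_records(text):
    """Map each level-0 xref (e.g. '@I132@') to that record's parsed lines."""
    records, current = {}, None
    for raw in text.strip().splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # skip elided or malformed lines
        level, xref = int(m.group(1)), m.group(2)
        if level == 0:
            current = records.setdefault(xref, [])
        if current is not None:
            current.append((level, xref, m.group(3), m.group(4) or ""))
    return records

def resolve(pointer, records):
    """Return the lines of the substructure addressed by '@REC!SUB@'."""
    record_id = "@" + pointer.strip("@").split("!")[0] + "@"
    lines = records[record_id]          # record-level index lookup
    for i, (level, xref, _tag, _value) in enumerate(lines):
        if xref == pointer and level > 0:   # sequential search in the record
            end = i + 1
            while end < len(lines) and lines[end][0] > level:
                end += 1                    # include all deeper lines
            return lines[i:end]
    raise KeyError(pointer)

SAMPLE = """
0 @I132@ INDI
1 OCCU butcher
2 @I132!1@ SOUR @S123@
3 PAGE 456
0 @I456@ INDI
1 OCCU butcher
2 SOUR @I132!1@
"""

for line in resolve("@I132!1@", index_records(SAMPLE)):
    print(line)
```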

Norwegian-Sardines commented 11 months ago

Instead of a new record type I would prefer a more general solution.

In GEDCOM 5.5.1, reference was made to a future concept that makes substructures of records addressable.

Quote from GEDCOM 5.5.1: The pointer represents the association between two objects that usually reside in different records. Objects within a logical record can be associated. If this need exists, the pointer record composition contains an exclamation point (!) that separates the parent record's cross-reference ID from the specific substructure's cross-reference ID, which is at some subordinate level to the logical record at level zero. The cross-reference ID of the substructure subordinate to a zero level record, for inter-record associations is always composed of the Record ID number and the Substructure ID number, such as @i132!1@. Including the Record ID number in the pointer that associates objects within a record will allow the GEDCOM processors to build the index only at the record level and then search sequentially for the appropriate substructure cross-reference ID. The parent record ID is assumed when the cross-reference ID begins with an exclamation point (!) signifying an intra-record association.

This is and will always be a bad design. Pointers within and to record instances were discussed before as well. “Normalization” should be part of the design!

tychonievich commented 11 months ago

Maybe I'm missing something; why not include this text from the source in the SOURCE_RECORD.TEXT instead of the SOURCE_CITATION.TEXT?

fisharebest commented 11 months ago

Maybe I'm missing something; why not include this text from the source in the SOURCE_RECORD.TEXT instead of the SOURCE_CITATION.TEXT?

For something like a parish register, each entry would be a separate citation.

e.g.

2 SOUR @...@
3 PAGE Page: 17, Line 4
3 DATA
4 TEXT Peter, son of Martin and Mary Smith, farmer of this parish, was baptized on 28th February 1753

Each of these citations would be the source of many facts (OCCU of father, BAPM of child, MARR of parent) - and would only consist of the text from that single entry - not the entire register.

reteP-riS commented 11 months ago

A single parish register (the source!) could include many hundreds of events from more than two decades. In situations where a family (or clan) lived in that parish for decades, their events may show up on almost every page, so that dozens of citations from the same source need to be documented.

I cannot think of any effective and efficient solution other than reusable citations.

webtrees-pesz commented 11 months ago

This is and will always be a bad design. Pointers within and to record instances were discussed before as well. “Normalization” should be part of the design!

Can you give me a link to this discussion?

tychonievich commented 11 months ago

This is and will always be a bad design. Pointers within and to record instances were discussed before as well. “Normalization” should be part of the design!

Can you give me a link to this discussion?

https://github.com/FamilySearch/GEDCOM/discussions/328

Note there's not unanimity of opinion here; I like pointers to substructures because I work with object and graph databases, but they are not appreciated by those who've weighed in who work with relational databases. That said, making a design change that will be difficult for some implementations to handle is unlikely in the near future at least.

albertemmerich commented 6 months ago

reteP-riS proposed, and fisharebest commented:

Each of these citations would be the source of many facts (OCCU of father, BAPM of child, MARR of parent) - and would only consist of the text from that single entry - not the entire register.

I support this idea, and have already implemented it in the internal data structure of my application. However, there is a substructure which should not be integrated into the citation record: ROLE. Different individuals mentioned in the source entry will have the same citation text, but different roles. Another such substructure is NOTE, which may be specific to one role.

So my solution is like this:

2 SOUR @S1@
3 _CIT @C1@
3 ROLE FATH
3 NOTE note specific for the father

with

0 @C1@ _CIT
1 PAGE Page: 17, Line 4
1 DATA
2 TEXT Peter, son of Martin and Mary Smith, farmer of this parish, was baptized on 28th February 1753
1 SNOTE @N1@

0 @N1@ SNOTE note for all roles mentioned in the entry

This structure lets the user edit the citation record once and have the change apply automatically to every individual where it is used.

For GEDCOM 7.0 export I have to convert this structure, replacing every _CIT @C1@ pointer with the full text transcribed from the source entry. This increases the GEDCOM file size considerably. Even worse, once the data has been edited in other applications, users in most cases cannot edit a citation text consistently in every place where it is used, so the original structure is destroyed.
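
The export step described here could look roughly like the following Python sketch. It assumes the _CIT extension tag from this comment, with citation records written as `0 @C1@ _CIT`; the function name `expand_citations` is hypothetical, and ROLE handling is deliberately left out.

```python
def is_cit_record(rest):
    """True for the part after the level of a line such as '0 @C1@ _CIT'."""
    parts = rest.split()
    return len(parts) > 1 and parts[1] == "_CIT"

def expand_citations(lines):
    """Inline every '_CIT @X@' pointer and drop the level-0 _CIT records."""
    # Pass 1: collect the sub-lines of each level-0 _CIT record.
    bodies, current = {}, None
    for line in lines:
        level, rest = line.split(" ", 1)
        if level == "0":
            current = rest.split()[0] if is_cit_record(rest) else None
            if current:
                bodies[current] = []
        elif current:
            bodies[current].append((int(level), rest))
    # Pass 2: copy the lines, replacing pointers with re-levelled bodies.
    out, skipping = [], False
    for line in lines:
        level, rest = line.split(" ", 1)
        if level == "0":
            skipping = is_cit_record(rest)
        if skipping:
            continue  # the shared record itself is not exported
        tag, _, value = rest.partition(" ")
        if tag == "_CIT" and value in bodies:
            shift = int(level) - 1  # body level 1 lands at the pointer's level
            out.extend(f"{lvl + shift} {text}" for lvl, text in bodies[value])
        else:
            out.append(line)
    return out

# Hypothetical sample based on the structure shown in this comment.
SAMPLE = [
    "0 @I1@ INDI",
    "1 BAPM",
    "2 SOUR @S1@",
    "3 _CIT @C1@",
    "3 NOTE note specific for the father",
    "0 @C1@ _CIT",
    "1 PAGE Page: 17, Line 4",
    "1 DATA",
    "2 TEXT Peter, son of Martin and Mary Smith",
]

print("\n".join(expand_citations(SAMPLE)))
```

Every pointer is replaced by a full copy of the citation body, which is exactly why the exported file grows and why the copies can later drift apart.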

I would be totally happy with the structure proposed by reteP-riS, putting the SOUR line within the citation record. I have not done so only because most applications will not import the _CIT part so far, so in case another application tries to interpret my internal code I want to preserve at least the SOUR @S1@...

I hope the SOURCES team will help a structure like this to come in future GEDCOM versions!

Norwegian-Sardines commented 6 months ago

My belief is that if you include a so-called “citation record” in a future version of GEDCOM, this record should actually be the citation, not parts of the citation. Because each type of citation requires different components and varies based on which style is used, a companion “Template Record” would need to be created that outlines what the citation should contain, and where the data is found in the GEDCOM. At some point the Source_Record and the Citation_Structure would go away, with the template controlling and defining the fields being used to store the components of the citation in the Citation_Record for each citation type and style.

reteP-riS commented 6 months ago

@albertemmerich

However, there is a substructure which should not be integrated into the citation record: ROLE. Different individuals mentioned in the source entry will have the same citation text, but different roles.

I admit you make a point by indicating that the ROLE tag might require a specific treatment. I missed that because I never used it.

In GEDCOM 5.5.1 the SOURCE_CITATION can appear in PERSONAL_NAME_PIECES, in a FAM_RECORD, in an INDIVIDUAL_RECORD, in a MULTIMEDIA_RECORD, in a NOTE_RECORD, in an ASSOCIATION_STRUCTURE and in all individual's and family's EVENTs and FACTs, and I wonder how the ROLE subtag should be used for a MULTIMEDIA_RECORD, a NOTE_RECORD, a FAMILY_EVENT or a FACT. These use cases don't really make sense to me.

A closer look at the ROLE tag reveals that - at least in GEDCOM 5.5.1 - it is a subtag of the EVENT subtag itself. So we are talking about SOUR:EVEN:ROLE which makes ROLE in fact a subsubtag of the SOUR tag. And it should be related to an EVENt which at least to me confirms that the above use cases for MULTIMEDIA_RECORD, NOTE_RECORD and FACT don't make much sense. And for a FAMILY_EVENT like MARR or DIV I wouldn't know what to put into the SOUR:EVEN:ROLE tag because you have 2 individuals with 2 different roles - husband and wife - but only one ROLE tag is permitted. Just use SPOU for spouses (plural)? I'm afraid this whole ROLE tag thing has not been thought through entirely when it was introduced.

With regard to your example

2 SOUR @S1@
3 _CIT @C1@
3 ROLE FATH
3 NOTE note specific for the father

it seems to be incorrect for two reasons.

  1. As mentioned before it needs to be SOUR:EVEN:ROLE, not simply SOUR:ROLE.
  2. 2 SOUR @S1@ 3 _CIT @C1@ permits an n:m relation between sources and citations while it should be a 1:m relation. In other words: while a single source can hold multiple citations, a single citation MUST be tied to a single source, not to multiple sources. There must be clear one-way connections, i.e. pointers CITA > SOUR > REPO.
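
The 1:m rule in the second point can be checked mechanically. The following Python sketch assumes the proposed level-0 CITA record with a level-1 SOUR pointer; the function name and the sample identifiers are hypothetical.

```python
def check_citation_links(lines):
    """Return a list of problems with CITA > SOUR links in GEDCOM lines."""
    sources, cita_sours, current = set(), {}, None
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "0":
            current = None
            if len(parts) >= 3 and parts[2] == "SOUR":
                sources.add(parts[1])          # a source record
            elif len(parts) >= 3 and parts[2] == "CITA":
                current = parts[1]             # a citation record
                cita_sours[current] = []
        elif current and parts[0] == "1" and parts[1] == "SOUR":
            cita_sours[current].append(parts[2] if len(parts) > 2 else "")
    problems = []
    for cita, sours in cita_sours.items():
        if len(sours) != 1:
            problems.append(f"{cita}: expected exactly one SOUR, found {len(sours)}")
        elif sours[0] not in sources:
            problems.append(f"{cita}: SOUR {sours[0]} is not a known source record")
    return problems

# Hypothetical sample: @C1@ is fine, @C2@ points at two sources.
SAMPLE = [
    "0 @S123@ SOUR",
    "1 TITL Parish register",
    "0 @C1@ CITA",
    "1 SOUR @S123@",
    "2 PAGE 456",
    "0 @C2@ CITA",
    "1 SOUR @S123@",
    "1 SOUR @S999@",
]

print(check_citation_links(SAMPLE))
```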

With regard to backwards compatibility I think GEDCOM 7 should introduce reusable level 0 CITAtions while still permitting the GEDCOM 5.5.1 conventions for a SOURCE_CITATION, but it needs to be an EITHER ... OR, not both in the same record. But both solutions could of course exist in the same GEDCOM file. And yes, the ROLE tag thing should be resolved, too, but that's a different story and certainly requires more thought. Maybe permitting multiple ROLE tags would help. Or getting rid of EVEN:ROLE completely.

albertemmerich commented 6 months ago

Good point. However, role is very important for me, and I use it very often. When citing a marriage, there is no ROLE line in the family record; however, the bride has a ROLE, the bridegroom has a ROLE, and the witnesses have their ROLE. In their records _CIT comes with a _CIT.ROLE substructure. My example was a short version. As EVEN is the same in all these records (it is MARR), I put it into the citation record. However, ROLE differs and therefore cannot be in the citation record. The proposed structure is a breaking change anyway, and we have to wait for a major new version of the GEDCOM standard. So I do not see a big problem in rearranging the ROLE position within the citations.

Another point: I am not a friend of "either ... or" solutions. We have too many applications in the wild which support only one of the possible solutions, but not the alternative. Users see data loss at data transfer. We can avoid this by clearly defining the structure to be used, and not opening the way for different representations of the same data.

albertemmerich commented 1 month ago

I am attaching a file with my proposal using citation records (CITA). (It is a .txt file because I am not allowed to upload .ged files.) sample_citation_records_8.txt In my application I have already implemented this solution so that users can edit a citation in one step for all events, individuals and families where it is used. When exporting to GEDCOM 7.0, the full source citation replaces every reference to a CITA record, and the ROLE is moved to DATA.EVEN.ROLE. When importing GEDCOM 5.5.1 / 7.0 files, the CITA structure is built automatically for internal use. Data coming back from other applications often has damaged citations, because they were modified differently at various places in the file. This results in different CITA records which have to be merged again... Therefore it would be much better to have CITA records in the GEDCOM spec. I would like to see alternatives discussed by the citation group using this example.
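
The import step mentioned here (building shared citation records from inline citations) might be sketched in Python as follows. This is not the actual implementation; the function name, the input layout and the generated @C…@ identifiers are assumptions, and ROLE lines are kept per use as discussed earlier in the thread.

```python
def build_citation_records(inline_citations):
    """Group identical inline citations into shared records.

    inline_citations: list of (owner, source_xref, sublines), where sublines
    are GEDCOM lines relative to the SOUR line, e.g. "1 PAGE 17".
    Returns (records, uses): records maps a generated id to
    (source_xref, shared_lines); uses lists (owner, citation_id, role_lines).
    """
    records, ids, uses = {}, {}, []
    for owner, source, sublines in inline_citations:
        shared, per_use, role_level = [], [], None
        for line in sublines:
            level_text, tag = line.split(" ", 2)[:2]
            level = int(level_text)
            if role_level is not None and level > role_level:
                per_use.append(line)   # e.g. a PHRASE below ROLE
                continue
            role_level = level if tag == "ROLE" else None
            (per_use if tag == "ROLE" else shared).append(line)
        key = (source, tuple(shared))  # exact-text match only
        if key not in ids:
            ids[key] = f"@C{len(ids) + 1}@"
            records[ids[key]] = key
        uses.append((owner, ids[key], per_use))
    return records, uses

# Hypothetical sample: the same marriage entry cited for two individuals.
SAMPLE = [
    ("@I1@", "@S1@", ("1 PAGE 17", "1 EVEN MARR", "2 ROLE HUSB",
                      "1 DATA", "2 TEXT entry text")),
    ("@I2@", "@S1@", ("1 PAGE 17", "1 EVEN MARR", "2 ROLE WITN",
                      "1 DATA", "2 TEXT entry text")),
]

records, uses = build_citation_records(SAMPLE)
print(records)
print(uses)
```

Because the key is an exact text match, two copies that another application edited differently end up as two separate records, which is the merge problem described above.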

dthaler commented 1 month ago

Regarding "Would like to see alternatives discussed by citation group using this example."

This issue is also tracked by https://github.com/dthaler/gedcom-citations/issues/11