dthaler / gedcom-citations

GEDCOM extensions for citations

Source vs Citation - Many people don't know the difference! #16

Open Norwegian-Sardines opened 1 month ago

Norwegian-Sardines commented 1 month ago

One of the areas that really makes me scream is that Software Companies and many Family Historians don't know (or at least perpetuate an incorrect perception of) the difference between a Source and a Citation. The original GEDCOM specification makes a very clear distinction between a "Source" and a "Citation" because they are different entities.

In essence this is a call for a "Citation_Record" to be added to GEDCOM to emphasize the difference.

The Issue

Some family historians and software companies use the term "Source" in an inappropriate way as it pertains to footnotes (also known as "citations"). They loosely refer to a source in a colloquial way, as in, "The citation represents the source of your assertion", and therefore they create a Source_Record that contains all of the information to be included in the citation. In most cases the Source_Record they create has in effect become a Citation_Record, blurring the difference between the two distinct terms, Citation and Source. We will get into the reason they do this a little later in this document.

Let's define what a Source and a Citation are and why GEDCOM needs to keep them as separate entities going forward.

Source: This is the actual artifact, book, document, website, etc. from which we obtained our information.

Citation: This is a statement in which we identify what supplied the information for an assertion. (colloquially, the source)

When we add a Source_Record (identify the source) we are talking about the whole book, the whole newspaper, the whole website, the whole document, and potentially the whole census for a particular year (my interpretation!). The "citation" (in GEDCOM Citation_Structure) is associated with a particular Event/Attribute (a "fact") because it identifies exactly what supplied the data the "fact" is asserting, first by identifying the source, then by specifying "where-in-the-source" the data was found.

The current GEDCOM design structure was developed in a time before the World Wide Web ("Internet") was used to source genealogy data. Most data was found in books, newspapers, census books and church records, which were physical documents you could hold in your hand. These sources fit a specific data model still used in most citation creation designs found in the contemporary citation style guides such as Chicago Style and Evidence Explained.

These style guides identify the following data points for sources such as books, newspapers, documents, and church records (in no particular order):

1) Author/Creator/Agency
2) Title
3) Publisher
4) Published/Creation Date

All of these data-points are currently part of the Source_Record.

Newer source types, mainly found on the internet, have additional source-related data points that need to be captured, but a website is still a source (not yet a citation) with authors/creators, titles, and publishers.

Why do applications create Source_Records that have more information than just the Source information?

Simply because, as one software company put it, "Source information only has to be maintained in a single location (the Source Record)." They have turned the Source_Record into a "shared citation record", which degrades the value of having a hierarchy in which one specific record is pointed to by all citations that came from a single source (the book, website, census, etc.).

I propose that GEDCOM create a set of new records that support a shared citation scheme, plus a key or "Template_Record" that indicates what data is found in the "Citation_Record".

dthaler commented 1 month ago

When we add a Source_Record (identify the source) we are talking about the whole book, the whole newspaper, the whole website, the whole document, and potentially the whole census for a particular year (my interpretation!). The "citation" (in GEDCOM Citation_Structure) is associated with a particular Event/Attribute (a "fact") because it identifies exactly what supplied the data the "fact" is asserting, first by identifying the source, then by specifying "where-in-the-source" the data was found.

I'm not sure I entirely agree with that statement. In my view one of the problems that GEDCOM has (and always had) was the assumption that there are only two levels (citation and source). In reality, there might be an entry on a page in a chapter in a book/volume in a series. Or many other examples of more than 2 levels of nesting. Which level is "the source" is problematic in that for the same source different users might have very different interpretations and hence source records in their files, causing weird mishmashes when merging. The containment ("collection") mechanism in my proposal can deal with this by allowing a hierarchy of source records. Once you do that, then a "shared citation" record is no different in my view from a leaf in that hierarchy, which is what my proposal permits.

(I'll also add that having a hierarchy of source records with containment is something my FamilySearch-certified app has supported as a GEDCOM extension for well over a decade.)
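As a rough illustration of the containment idea only: the sketch below assumes a hypothetical model in which every source record may name one containing record. The class name, the `parent` field, the record ids and the sample titles are all invented for this example and are not the tags or structures of the actual proposal.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SourceNode:
    """Hypothetical source record that may be contained in another record."""
    title: str
    parent: Optional[str] = None  # id of the containing record, if any


def containment_chain(records: Dict[str, SourceNode], leaf_id: str) -> List[str]:
    """Walk from a leaf record up to the outermost container, collecting titles."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = leaf_id
    while current is not None:
        if current in seen:
            raise ValueError(f"containment loop at {current}")
        seen.add(current)
        node = records[current]
        chain.append(node.title)
        current = node.parent
    return chain


# Made-up records: an entry on a page in a chapter in a volume in a series.
records = {
    "S1": SourceNode("Example Series"),
    "S2": SourceNode("Volume 3", parent="S1"),
    "S3": SourceNode("Chapter 2", parent="S2"),
    "S4": SourceNode("Page 45", parent="S3"),
    "S5": SourceNode("Entry 7", parent="S4"),
}

print(", in ".join(containment_chain(records, "S5")))
```

In a model like this a "shared citation" is simply the leaf ("S5"), and any number of facts can point at it.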

Norwegian-Sardines commented 1 month ago

I'm not sure I entirely agree with that statement. In my view one of the problems that GEDCOM has (and always had) was the assumption that there are only two levels (citation and source). In reality, there might be an entry on a page in a chapter in a book/volume in a series. Or many other examples of more than 2 levels of nesting. Which level is "the source" is problematic in that for the same source different users might have very different interpretations and hence source records in their files, causing weird mishmashes when merging. The containment ("collection") mechanism in my proposal can deal with this by allowing a hierarchy of source records.

Your assumption about "levels" of Source and Citation is where you and so many other people get it totally wrong. These are not "levels"; they do two totally and distinctly different things. GEDCOM has wrongly designed them using 3 levels, REPO->SOUR->DETAIL, because it assumed that a single book was a source that was found at a library and we saw an assertion on page x. If we saw a different assertion on page x+5, all we needed to do was create a new DETAIL; the SOUR and REPO would not change.

As stated before: Source: This is the actual artifact, book, document, website, etc. from which we obtained our information. Citation: This is a statement in which we identify what supplied the information for an assertion. (colloquially, the source)

Outside of GEDCOM there is no such thing as a Source_Record when it comes to a citation. In a Citation (the style does not matter) we have specific data points that make up a "well formed" citation. All of the data points in a citation come from somewhere in "The Source". But GEDCOM could be perfectly fine without the "Source_Record" and, for that matter, without the "Repository_Record" as well. WHY??? Because all of the information is contained in a well-formed Citation.

It does not matter what people think "The Source" looks like, it could be the whole book, one book/volume in a series, a single page, a whole website, a page on that website. Identifying "The Source" is not important, but identifying the necessary parts needed to create a well-formed citation from "The Source" is important!

mother10 commented 3 weeks ago

If all this is so important, and so little known, then why is there no really good example, or examples, anywhere in the FAQs? That's what they are meant for, right? I am still struggling to figure out how to do SOURces and REPOs the correct way, but the only example I find here is a test file. I am certain that when good, real-life examples are easy to find, people will understand much better how the specs have to be read and interpreted. As I said, I read and reread and I am still unsure how to do this properly. Huge GEDCOMs from the internet do not necessarily have the correct structures implemented for this, so they do not help. On our forum the advice differs, so that does not help either.

Norwegian-Sardines commented 3 weeks ago

Tineke,

Before you can understand my position on using GEDCOM citations as they currently exist, we must first understand where the GEDCOM standard sits.

The current GEDCOM design for citations was developed in the early 1990s, at a time before most of the information people use in their citations was found on the internet. Instead we had to get out of our chairs at home and travel to a library or other repository of information. We had little or no access to the actual documents needed to gather information, so instead we used published books, newspapers, microfilm, church books and personal interviews. People very often call these "artifact sources". This was also a time before a more modern set of rules was established for creating citations, including internet-style citations; the Mills citation guide "Evidence Explained" ("EE") was first published in 2007, although other similar guides are a little older.

Many people in humanities studies used the CMOS (Chicago Manual of Style) guide to develop "well formed citations" for the artifact sources that existed at that time.

For example, CMOS suggests the following "full note" formats.

Book: Author first name last name, Book Title: Subtitle (Place of publication: Publisher, Year), Page number(s).

Newspaper: Author first name Last name, “Article Title,” Newspaper Name, Month Day, Year, Where Found.

Interview: Interviewee first name Last name, “Article Title,” interview by Interviewer first name Last name, Journal Name Volume, no. Issue (Month or Season Year): Page number(s).
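To make the Book pattern above concrete, here is a minimal sketch that fills it in from separate fields. The `BookNote` class, its field names and the sample values are placeholders invented for this illustration; they are not part of CMOS or GEDCOM.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BookNote:
    """Fields needed by the Book full-note pattern (names are illustrative)."""
    author: str          # first name then last name
    title: str
    place: str           # place of publication
    publisher: str
    year: int
    pages: str
    subtitle: Optional[str] = None


def book_full_note(b: BookNote) -> str:
    """Assemble: Author, Title: Subtitle (Place: Publisher, Year), Pages."""
    title = f"{b.title}: {b.subtitle}" if b.subtitle else b.title
    return f"{b.author}, {title} ({b.place}: {b.publisher}, {b.year}), {b.pages}."


# Entirely made-up book, used only to show the shape of the output.
note = BookNote(
    author="Jane Doe",
    title="An Example Family History",
    subtitle="Records and Notes",
    place="Exampleville",
    publisher="Example Press",
    year=1990,
    pages="45",
)
print(book_full_note(note))
```

The point of the sketch is that each element has to be stored as its own field before a style can be applied to it automatically.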

Based on these sample formats (and other knowledge) we know that, in general, we need the following core elements for a "well formed full note" (aka, a citation):

Most of these elements can be found in one of the three parts of the GEDCOM design, Citation_Detail, Source_Record, Repository_Record.

This is how we would document (cite) Artifact Sources in the "old days" before the internet had everything. Artifact Sources are more or less the same (Published Books, Newspapers, Census Documents, Church Books, Grave Markers, Derivatives, Lists and Aids), but where we find them has changed! With the introduction of "EE", other full note styles like Chicago, APA and MLA are not used in Genealogy. I've read that we know of over 7,000 styles in use, and of course if you are not publishing your work, just documenting your "where found", you are welcome to create your own style or use some of the many "simple styles" that others have invented.

It is this last point that is rarely talked about: don't get too hung up on creating a citation. You should have a citation for every fact you assert, but the format of that citation does not have to take a form from any textbook style. Rather, it should have enough information in it so that you and your readers can find the artifact source again, and the exact place in the "source" where the information can be read, at some later date. If you want to get more into the weeds about creating a full note, then additional information, stored in a usable format, must be added to your database. This is where we sometimes get very deep into the weeds about how to store URLs for online artifacts, page, volume, edition, date reviewed, and author name order (first/last vs. last/first). Publisher information should have unique data elements, rather than the current design of one block of information.

I HOPE THIS HELPS, and does not confuse you more!!!!

Norwegian-Sardines commented 3 weeks ago

I'm not sure I really answered your question.

Let us look at the v5.5.1 GEDCOM document to see its suggested use for the GEDCOM elements in a simple example.

Fact_Detail

1 BIRT
2 DATE 02 OCT 1822
2 PLAC Weston, Madison, Connecticut
2 SOUR @6@
3 PAGE Sec. 2, p. 45
3 EVEN BIRT
4 ROLE CHIL

Source_Record

0 @6@ SOUR
1 DATA
2 EVEN BIRT, DEAT, MARR
3 DATE FROM Jan 1820 TO DEC 1825
3 PLAC Madison, Connecticut
2 AGNC Madison County Court, State of Connecticut
1 TITL Madison County Birth, Death, and Marriage Records
1 ABBR VITAL RECORDS
1 REPO @7@
2 CALN 13B-1234.01
3 MEDI Microfilm

Repository_Record

0 @7@ REPO
1 NAME Family History Library
1 ADDR 35 N West Temple Street
2 CONT Salt Lake City, Utah
2 CONT UT 84150

From this we can create a full note citation that could read:

Connecticut Madison County Court, Birth Death and Marriage Records, 13B-1234.01, [from] Jan 1820 [to] DEC 1825; Family History Library, Salt Lake City

Obviously the data stored is all there, but not in the form that I used. This is a failing of GEDCOM: it does not have specific fields dedicated to the exact information you need for a specific style of full note citation. Yet you can use the information to create a simple citation of "what original artifact we used" and "where we saw it", which for most people is enough to cite your source!
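A minimal sketch of that "simple citation" idea, assuming the three fragments above sit in one file: it follows the SOUR and REPO pointers and joins the agency, title, "where in the source" and repository name. The `0 @I1@ INDI` line, the helper names and the choice and order of fields are assumptions made for this example, not anything the specification prescribes.

```python
import re

# The three fragments from the example; the INDI line is an assumed wrapper.
GEDCOM = """\
0 @I1@ INDI
1 BIRT
2 DATE 02 OCT 1822
2 PLAC Weston, Madison, Connecticut
2 SOUR @6@
3 PAGE Sec. 2, p. 45
0 @6@ SOUR
1 DATA
2 AGNC Madison County Court, State of Connecticut
1 TITL Madison County Birth, Death, and Marriage Records
1 REPO @7@
2 CALN 13B-1234.01
0 @7@ REPO
1 NAME Family History Library
"""

LINE = re.compile(r"^(\d+) (?:(@[^@]+@) )?(\w+)(?: (.*))?$")


def parse(text):
    """Turn GEDCOM lines into nested dicts using the level numbers."""
    root = {"tag": "ROOT", "xref": None, "value": None, "children": []}
    stack = [root]  # stack[n + 1] is the open structure at level n
    for raw in text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue
        level = int(m.group(1))
        node = {"tag": m.group(3), "xref": m.group(2),
                "value": m.group(4), "children": []}
        del stack[level + 1:]
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def child(node, tag):
    """First substructure with the given tag, or None."""
    return next((c for c in node["children"] if c["tag"] == tag), None)


def text_of(node, *path):
    """Value found by following a path of tags, or None if any step is missing."""
    for tag in path:
        node = child(node, tag) if node else None
    return node["value"] if node else None


def simple_citation(root, citation):
    """Join agency, title, where-in-source and repository name."""
    records = {n["xref"]: n for n in root["children"] if n["xref"]}
    source = records[citation["value"]]
    repo = records.get(text_of(source, "REPO"))
    parts = [text_of(source, "DATA", "AGNC"), text_of(source, "TITL"),
             text_of(citation, "PAGE"), text_of(repo, "NAME")]
    return "; ".join(p for p in parts if p)


root = parse(GEDCOM)
birth = child(root["children"][0], "BIRT")
print(simple_citation(root, child(birth, "SOUR")))
```

Note that the agency and the publication details come out as single undivided strings, which is exactly the granularity problem described above.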

mother10 commented 3 weeks ago

Hi @Norwegian-Sardines (sorry, but I could not find your first name :) )

First, I am very grateful that you do your utmost to explain. Not just for me, but hopefully for a lot more people when they read this! It helps a lot!! And yes, I too had to get out of my chair and go to a library for information, around 2003.

Second, I am beginning to understand. I only started with GEDCOM about 2 years ago, translating the user guide of our program from English (and French) into Dutch. I did not just translate, though; I tried here and there to add more info and examples to help people understand (trying to read and understand GEDCOM, as that is the base we work from). I have a lot of certificates and prints, but have not entered them yet.

Now the whole REPO SOURces thing, described here (https://docs.ancestris.org/books/user-guide/page/document-your-sources) was unclear for me, as until then I had never used it. I only translated. (Dutch version is here: https://docs.ancestris.org/books/gebruikershandleiding/page/leg-uw-stamboom-bronnen-vast )

Now that I understand what you just explained, I can see in the English version of our docs that the term for what is called "Source Property" there should have been "Source-Citation". That way it connects better to the GEDCOM specs themselves. I see many people asking about SOUR and REPO on our forum, so it is trouble for many.

The whole thing there (in our English guide) confused me as I tried to understand how to deal with a marriage certificate. (I did not really understand the GEDCOM specs myself at that time.)

In earlier years a marriage was just 1 line of text in a church book, mentioning the 2 people that got married, but later in time it became a certificate with the couple (names, ages and occupations), all 4 parents (names, ages, occupations and sometimes where they lived), witnesses and such. So 1 certificate contains info that should be mentioned in many places, not just one. So in my belief (at that time) that should be a source, as that's the only thing you can point to from different places. But that's wrong, if I understand you correctly.

The source should describe the book itself, mentioning all years of certificates that can be found inside, and the types of certificates too, etc. Like in the SOUR example of your last post. The SOURce does not describe the one certificate we found. The SOUR points to its REPO. The REPO describes where the book is. And the Source-Citation is in the part describing the Birth or other event.

But then there is one thing left: here, in a marriage certificate, we seem to have many Source-Citations, namely for all people and events mentioned in that certificate, as I said above. But then we have to write out that same info (Source-Citation) many times, haven't we? It would be really handy if that too were only one "thing" we can point to, instead of writing the same thing many times AND having the same info many times in different places in GEDCOM. That's not really efficient, is it?
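The repetition can be made visible with a small sketch that groups identical Source-Citations in a file and counts them. The sample data (two individuals and a family all citing the same made-up certificate) and the function name are invented for this illustration; a shared citation record, if GEDCOM had one, would collapse each repeated group into a single thing to point to.

```python
from collections import Counter

# Made-up file: one marriage certificate cited under three different facts.
SAMPLE = """\
0 @I1@ INDI
1 OCCU Farmer
2 SOUR @S1@
3 PAGE Certificate no. 12
0 @I2@ INDI
1 OCCU Seamstress
2 SOUR @S1@
3 PAGE Certificate no. 12
0 @F1@ FAM
1 MARR
2 SOUR @S1@
3 PAGE Certificate no. 12
0 @S1@ SOUR
1 TITL Example Town Marriage Register
"""


def citation_counts(text):
    """Count identical citations: same source pointer and same substructures."""
    lines = [ln.split(" ", 2) for ln in text.splitlines() if ln.strip()]
    counts = Counter()
    i = 0
    while i < len(lines):
        parts = lines[i]
        level = int(parts[0])
        # A citation is a SOUR line below level 0 that carries a pointer.
        if level > 0 and parts[1] == "SOUR" and len(parts) == 3:
            body = []
            j = i + 1
            while j < len(lines) and int(lines[j][0]) > level:
                sub = lines[j]
                value = sub[2] if len(sub) > 2 else ""
                body.append((int(sub[0]) - level, sub[1], value))
                j += 1
            counts[(parts[2], tuple(body))] += 1
            i = j
        else:
            i += 1
    return counts


counts = citation_counts(SAMPLE)
for (pointer, body), n in counts.items():
    if n > 1:
        print(f"{pointer} cited {n} times with identical details: {body}")
```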

So it was really great you wrote all this, it helped me a lot. And I will have to change my Dutch "translation" to make it more in accordance with GEDCOM.

Now I think it would be great if the example you gave, and maybe the example I mentioned with a lot more in one certificate, were somewhere in the Technical FAQ. That is the first place I look when I cannot find something in the GEDCOM specs themselves, as GEDCOMs from the internet do not necessarily have the correct format.

In your last note you gave an example of a full note citation. Where does that go? Does that go in SOUR.DATA.TEXT, maybe together with the transcription of the text on the certificate? Or does that go in SOUR.Note-Structure?

I keep on trying, one day I will understand everything! :) And when I do I will write it down so others do not have to search so long.

Norwegian-Sardines commented 3 weeks ago

In earlier years a marriage was just 1 line of text in a church book, mentioning the 2 people that got married, but later in time it became a certificate with the couple (names, ages and occupations), all 4 parents (names, ages, occupations and sometimes where they lived), witnesses and such. So 1 certificate contains info that should be mentioned in many places, not just one. So in my belief (at that time) that should be a source, as that's the only thing you can point to from different places. But that's wrong, if I understand you correctly.

IMO, this is where the concept of "levels" of data is wrong, or at least inconsistent with other data relationships. (This is why I suggest dismantling the current concept and looking for something different!)

But here is how I take things under the current concept!

The Repository is where the artifact source was found.
The Source of the artifact source
The Source_Citation is everything else!

But none of that is a Full Note Citation! It will contain parts of the citation (full note) but probably not in a form an automatic citation builder can use! (See elsewhere about creating a Citation_Record)

Repository: So if you found a church book on a website or at a library that website or library is the repository! That is easy.

Source:

The rest (Source_Citation): In most cases the rest is going to be "where found" information: page, URL of the certificate. If you actually have the certificate in hand you don't need a "where found". You might need a "date acquired", because some certificates are altered at a later date to reflect a personal need. (This could happen with birth certificates: removing or adding the father, altering the "sex" of the child.)

The big Issue

1) Reusing all of the data across multiple assertions is not available!
2A) Not all information has a place to be stored in the current design!
2B) Information currently retained in the records does not have enough granularity for use!

HOWEVER, as I've indicated before, if we want to create a "well-formed citation" for a work we are going to publish, this design leaves out a lot of detail and specificity! This is why I think a new design is needed! But not everyone needs to create a "well-formed citation", because they are not publishing anything. They are mostly doing this for family or their own fun! Just knowing that you found the artifact, where you found it, and where in the artifact you saw the information is good enough. GEDCOM has places for that information in most cases!

Norwegian-Sardines commented 3 weeks ago

In your last note you gave an example of a full note citation. Where does that go? Does that go in SOUR.DATA.TEXT, maybe together with the transcription of the text on the certificate? Or does that go in SOUR.Note-Structure.

This does not go anywhere. You create this from the elements found in the Source_Citation, Source_Record, and Repository_Record. As noted above, it cannot be created directly from the elements in these records but must be written by hand, due to the lack of data granularity! However, a reasonable simple citation could be created.

mother10 commented 3 weeks ago

Thanks for all the help!