FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

create a specification for citation templates #80

Closed stoicflame closed 11 years ago

stoicflame commented 13 years ago

The following commentary was submitted by @ttwetmore, including it here for discussion.

The GEDCOM X model includes citation strings, presumably as pre-formatted rich text that holds the citation string one would like to see in a bibliography or footnote. I don’t like this approach. I think most would agree that the best way to handle citation strings is to employ a set of templates for creating citation strings from different kinds of sources (e.g., books, microfilms, web pages, letters) in different contexts (e.g., mention in a bibliography, first mention in a footnote, later mention in a footnote). These templates would be used to generate citation strings directly from the source records, based on the fields/attributes found in those source records. With this approach a user of genealogical software can choose a set of preferred templates (e.g., Chicago Manual of Style, Elizabeth Shown Mills) and have the citations instantiated as needed.

stoicflame commented 13 years ago

I happen to know that RealTime Collaboration is currently working on a standard for source templating. I would really like to see it created as a new project, a peer to GEDCOM X. I'm looking to @dovy, @johnvilburn, @ttwetmore, @earlmott, etc. to get this initiated.

carpentermp commented 13 years ago

The GEDCOM X model includes citation strings, presumably as pre-formatted rich text that holds the citation string one would like to see in a bibliography or footnote. I don’t like this approach. I think most would agree that the best way to handle citation strings is to employ a set of templates...

I agree completely that citation templates are needed. However, when serving a Record from a web service, as part of the Record we need the actual citation(s) (by applying the template to the data in the Record before serving up the Record). Handing to clients a set of templates that they have to apply would seem onerous to me, since every client would have to know how to apply the templates for themselves.

ttwetmore commented 13 years ago

When serving up a person record a web service can pass the person's source information through a template that the service has chosen to support; that is, the citation should be generated on the fly when needed. Clearly the citation strings do not need to be recorded in the model, as they should be a generated string, not an input string. Any FS service can define their own templates and use them.

However, if GEDCOM X is planned to be a general data model to be used by different applications, then it should be template neutral, and applications using GEDCOM X should be free to use whatever templates they choose to implement, or whatever templates they allow their users to choose.

If a user has a choice of templates given them by an application, they would set their choice in a preference panel and never think about it again. I don't believe there is anything onerous about the template approach.

carpentermp commented 13 years ago

This statement confuses me a little:

Clearly the citation strings do not need to be recorded in the model, as they should be a generated string, not an input string.

Perhaps it would help me if you were a little clearer about what you propose. Do you propose removing "bibliographicCitation" from Record? If we did that, how would clients fetching a Record from a web service determine the citation of the Record (since it will no longer be supplied)? Would you add to the Record model a list of "templates" instead (thus delegating the responsibility to the client to apply the Record data to a template to produce the citation of his choice)? Or would you add a list of templates to RecordCollection? This would mean that clients wanting a citation would have to fetch the Record, then fetch RecordCollection, then apply the Record data to the template. Perhaps if you described exactly how you think it should work that would help.

ttwetmore commented 13 years ago

I define a citation as a formatted string that shows up in a few contexts -- footnotes in documents, bibliographic entries in articles or books predominantly. Or as a string that is returned by a service when it is serving up person records in a genealogical application. The format in a footnote might be different than the format in a bibliographic entry which might be different than a citation served up by a Family Search web application. But in all cases they are "just" strings formatted by certain rules. Often there are rules for the first time a citation is used in a footnote versus its second and later appearances. A citation is made up of things like the author's name, the title, the publication information, for books, and a few other attributes for other types of sources. The title might be in quotes or in italics, the author's name might be surname followed by a comma and initials. These rules are stylistic rules based on someone's views on what a citation should look like, e.g., the Chicago Manual of Style, Elizabeth Shown Mills, etc. If this is not your definition of a citation then what I say won't make any sense. These rules are generally expressed as templates to show how to format the attribute values into strings. For example:

[author.name.last], [author.name.firstinitial]. "[title]," [publisher], [publicationPlace], [publicationDate].

which would produce something like:

Wetmore, T. "The Descendants of John Wetmore, Revolutionary War Loyalist, 1767 to 1848," Barton Street Books, Boston, 2010.
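
A minimal sketch of how an application might apply such a template, assuming a hypothetical source record stored as nested dictionaries whose keys match the placeholder names. None of these names come from GEDCOM X; they only mirror the template above.

```python
import re

# Hypothetical source record; the keys mirror the placeholders in the
# template above and are not taken from any GEDCOM X schema.
SOURCE = {
    "author": {"name": {"last": "Wetmore", "firstinitial": "T"}},
    "title": "The Descendants of John Wetmore, Revolutionary War Loyalist, 1767 to 1848",
    "publisher": "Barton Street Books",
    "publicationPlace": "Boston",
    "publicationDate": "2010",
}

TEMPLATE = ('[author.name.last], [author.name.firstinitial]. "[title]," '
            "[publisher], [publicationPlace], [publicationDate].")


def lookup(record, dotted_path):
    """Follow a dotted path such as 'author.name.last' through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return str(value)


def render_citation(template, record):
    """Replace every [placeholder] in the template with the record's value."""
    return re.sub(r"\[([^\]]+)\]", lambda m: lookup(record, m.group(1)), template)


citation = render_citation(TEMPLATE, SOURCE)
print(citation)
```

Swapping in a different template string (say, a short form for later footnotes) changes the output without touching the source record, which is the separation of data and style argued for here.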

GEDCOM X data should be just that, data. It should not carry stylistic content. Data and style are two different things. When GEDCOM X data is presented to a user, some application must take the GEDCOM X's XML data and format it for whatever purpose it needs to be. For example, middle names might be converted to initials or surnames put in upper case. And citations should be generated by finding the necessary attributes in the source objects and formatting them into the correct strings. The point is that the GEDCOM X data is data, and any stylistic interpretation of that data must be done by the applications that use or serve up the data.

Think of the original purpose of HTML. It was to express the structure of documents, not the style of documents. You specify the title with [title], but not how the title should be rendered. You specify the paragraphs with [p], but not how the paragraphs are to be indented, or spaced, or what font they will be in. The job of applying styles to HTML as it is rendered is a task performed by the renderer. Now we use CSS to define that style in detail. Or think XML and XSLT, where XML holds the "data content" and XSLT stylesheets interpret the content into various forms for display.

Put GEDCOM X in the place of HTML. It should hold the structure of the data, not how the data should be displayed or formatted. A citation is stylistic information -- that is the key concept -- a citation is the stylistic rendering of data that should already be in the GEDCOM X data in its pure "data form".

So yes, I don't believe that bibliographicCitation has any place in GEDCOM X. GEDCOM X will have the source information where the title and author and other information will be found. Any application that wishes to present citation strings to a user will take those attributes from the source objects and format them into citation strings, either by hard coding or by applying a citation template.

The title of a book is data content. How that title is to be displayed is stylistic. The title of a source belongs in a GEDCOM X source object. How that title should be displayed in a citation has no business being in GEDCOM X. The publication date of a book is data that should be in a GEDCOM X source object. How that date will be formatted inside a full citation string should not be in the data.

This is not original with me. This is how it is done in most models.

You are concerned with where the templates are and who applies them. Imagine an FS search application returning data through an API. If the FS application wants to provide the citation string through that API, the application applies the template to the source data, creates the citation string, and passes that string back as part of the API. In this case the FS application has decided exactly what citation template it wants to standardize on and always uses them. So instead of templates the FS application can just hard code. So clients do not determine the citation -- the server creates it on the fly by combining info from the source objects and the standard citation template. Clients know nothing about this.

However, if the clients get their data directly in pure GEDCOM X XML, it will be their job to decide what to render and how to render it. But this is the case for ALL the data returned -- there is nothing special about the source object information. The client would have to format the names, the events, the notes, the relationships; so it also has to format citations from the source objects.

GEDCOM X needs to know nothing about templates. They are fully external to the GEDCOM X model. Just like how you might want to format a person's name is fully outside GEDCOM X. All GEDCOM X data does is keep the chosen attributes (e.g., title, author, page number, volume number, ...) in their source objects. The templates are applied by an application to these source objects to create the citation strings by looking up the values of the chosen attributes and applying stylistic rules to the values returned. So no templates in the model.

Imagine a template approach for persons' names. Some people like surname first, some in natural order. Some people like surnames in all caps, some in normal case. Some people want full middle names unless the string would be too long, in which case they want just initials. There is nothing fundamentally different with formatted name strings and citation strings. You don't put the specially formatted name strings in the GEDCOM X. Some application somewhere must convert the name data into the string to be displayed. Same deal on the citation strings.
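
The name analogy can be sketched the same way. This is only an illustration: the `NameStyle` options and the `format_name` helper are hypothetical, and a real application would need far more care with naming conventions.

```python
from dataclasses import dataclass


@dataclass
class NameStyle:
    """Display preferences; every option here is a hypothetical example."""
    surname_first: bool = False
    surname_caps: bool = False
    max_length: int = 40  # use middle initials when the full string is longer


def format_name(given, middle, surname, style):
    """Build a display string from name parts according to the chosen style."""
    if style.surname_caps:
        surname = surname.upper()

    def assemble(middle_part):
        front = " ".join(part for part in (given, middle_part) if part)
        return f"{surname}, {front}" if style.surname_first else f"{front} {surname}"

    full = assemble(middle)
    if len(full) > style.max_length and middle:
        # Too long: fall back to middle initials.
        initials = " ".join(word[0] + "." for word in middle.split())
        return assemble(initials)
    return full


natural = format_name("John", "Quincy", "Adams", NameStyle())
index_style = format_name(
    "John", "Quincy", "Adams",
    NameStyle(surname_first=True, surname_caps=True, max_length=15),
)
print(natural)
print(index_style)
```

The stored data is the same three name parts in both calls; only the style object differs.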

stoicflame commented 13 years ago

@carpentermp hates the idea of having a Source object as a layer of indirection between the conclusion and the real thing that is the source. And I don't blame him. Especially because there is an alternative pattern that, in my opinion, fits better. It's a pattern that has been well established by other industry standards, although it will be new to the genealogical development community.

First of all, I'd like to assert that having a string field called bibliographicCitation is very useful to clients who don't care about the complexities of source templating. And it doesn't hurt clients who do want to care about source templates as long as there is a reference to which template is being used for the value of the bibliographicCitation.

The pattern that has been established can be summarized in the following points:

  1. Source references generally reference the real source. For example, if an image on the web is referenced as a source, the URI to the image on the web is used; the reference is not a pointer to some Source object that in turn points to the image on the web.
  2. In a lot of cases, the real source can carry its own "source metadata". When I say "source metadata", I'm talking about all the data that is needed to create a proper citation string using a source template. All GEDCOM X objects, for example, carry their own source metadata.
  3. In a lot of other cases, the real source does not carry its own source metadata. For these cases, a description of the real source is provided. The description provides all the necessary source metadata that can be used to create a proper citation. The description is made using terms that are defined by Dublin Core.

So here's a really quick-and-dirty view of how this might look:

<person>
  <source resource="http://images.com/myimage"/>
</person>

<description about="http://images.com/myimage">
  <title>...</title>
  <issued>...</issued>
  ...
</description>

ttwetmore commented 13 years ago

This leaves me confused. The "real things" that are the sources are generally physical artifacts, so these physical artifacts MUST be represented by Source records that stand in for them. If we wish to refer to a physical artifact there must be some Source record to provide the necessary layer of indirection between the internal world of bits and bytes and the external world of paper and microfilm.

Frankly I think the DeadEnds model of a SourceReference and a Source is basically perfect. A Source represents a physical source, and a SourceReference is a reference to that Source that may contain additional information about a location within that Source. Full example just below.

In this model a Source is recursive. A Source to a journal article would contain a SourceReference to another Source object that represents the journal in abstract.

Here's an example. We have person 444333 referring to an article in the New England Historic Genealogical Register. This is done by using a SourceReference in the Person that refers to the article, while the reference also has the location, page 34, where the evidence in the article comes from. Now look at the Source record for the article. It has the title and the author of the article and then has a SourceReference to the journal as a whole. That Source Reference also contains the issue number and volume number in the journal where the article is located. And finally there is the Source record for the journal.

<person id="444333">
  ...
  <sourceReference id="12345">
    <page> 34 </page>
  </sourceReference>
</person>

<source type="journalArticle" id="12345">
  <title> Ancestors of James Marshall </title>
  <author> Fred Joseph Snurfbucket </author>
  <sourceReference id="45678">
    <issue> April </issue>
    <volume> xxvii </volume>
  </sourceReference>
</source>

<source type="journal" id="45678">
  <name>New England Historic Genealogical Register</name>
  ...
</source>

And not to put too fine a point upon it, there is no citation string in here anywhere, but all the information needed to generate the citation string is in here!
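
To make that concrete, here is a rough sketch that walks the Source/SourceReference chain from the example and assembles a citation string. It assumes the three fragments are wrapped in a hypothetical `<root>` element so they parse as one document, and the output format is an ad hoc choice for illustration, not any published style.

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the example above, wrapped in a hypothetical <root>
# element so that it parses as a single document.
DOC = """
<root>
  <person id="444333">
    <sourceReference id="12345"><page>34</page></sourceReference>
  </person>
  <source type="journalArticle" id="12345">
    <title>Ancestors of James Marshall</title>
    <author>Fred Joseph Snurfbucket</author>
    <sourceReference id="45678">
      <issue>April</issue>
      <volume>xxvii</volume>
    </sourceReference>
  </source>
  <source type="journal" id="45678">
    <name>New England Historic Genealogical Register</name>
  </source>
</root>
"""

root = ET.fromstring(DOC)
sources = {s.get("id"): s for s in root.findall("source")}


def text(element, tag):
    """Return the stripped text of a child element, or '' if it is absent."""
    child = element.find(tag)
    return child.text.strip() if child is not None and child.text else ""


def cite(reference):
    """Format one reference, recursing into the source it points at."""
    source = sources[reference.get("id")]
    if source.get("type") == "journalArticle":
        parent = source.find("sourceReference")
        return (f'{text(source, "author")}, "{text(source, "title")}," '
                f'{cite(parent)}: {text(reference, "page")}.')
    # Journal: its name plus the volume and issue carried by the reference.
    return (f'{text(source, "name")} {text(reference, "volume")} '
            f'({text(reference, "issue")})')


person = root.find("person")
citation = cite(person.find("sourceReference"))
print(citation)
```

The recursion mirrors the recursive Source structure: each level contributes its own attributes plus the location details stored on the reference that led to it.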

johnvilburn commented 13 years ago

Ryan,

I have taken the liberty of rearranging your last response to group subjects together for ease in responding.

On Sep 30, 2011, at 5:54 AM, Ryan Heaton wrote:

First of all, I'd like to assert that having a string field called bibliographicCitation is very useful to clients who don't care about the complexities of source templating. And it doesn't hurt clients who do want to care about source templates as long as there is a reference to which template is being used for the value of the bibliographicCitation.

I agree.

@carpentermp hates the idea of having a Source object as a layer of indirection between the conclusion and the real thing that is the source. And I don't blame him. Especially because there is an alternative pattern that, in my opinion, fits better. It's a pattern that has been well established by other industry standards, although it will be new to the genealogical development community.

The pattern that has been established can be summarized in the following points:

  1. Source references generally reference the real source. For example, if an image on the web is referenced as a source, the URI to the image on the web is used; the reference is not a pointer to some Source object that in turn points to the image on the web.
  2. In a lot of cases, the real source can carry its own "source metadata". When I say "source metadata", I'm talking about all the data that is needed to create a proper citation string using a source template. All GEDCOM X objects, for example, carry their own source metadata.
  3. In a lot of other cases, the real source does not carry its own source metadata. For these cases, a description of the real source is provided. The description provides all the necessary source metadata that can be used to create a proper citation. The description is made using terms that are defined by Dublin Core.

The Source object that is referred to in @carpentermp's paragraph sounds a lot like the Source Reference described afterwards. To help me understand the objection to the Source object, please describe the contents or purpose of this object and how that differs from the Source Reference.

Thank you, John

johnvilburn commented 13 years ago

Sounds to me like Tom and Ryan are saying the same thing with slight differences. The recursive Source seems to have the same function as the Dublin Core spec. Please help me understand the functional difference, if there is one.

Mahalo, John

On Sep 30, 2011, at 10:53 AM, Tom Wetmore wrote:

This leaves me confused. The "real things" that are the sources are generally physical artifacts, so these physical artifacts MUST be represented by Source records that stand in for them. If we wish to refer to a physical artifact there must be some Source record to provide the necessary layer of indirection between the internal world of bits and bytes and the external world of paper and microfilm.

Frankly I think the DeadEnds model of a SourceReference and a Source is basically perfect. A Source represents a physical source, and a SourceReference is a reference to that Source that may contain additional information about a location within that Source. Full example just below.

In this model a Source is recursive. A Source to a journal article would contain a SourceReference to another Source object that represents the journal in abstract. Take a look at the following example. We have person 444333 referring to an article in the New England Historic Genealogical Register. This is done by using a SourceReference in the Person that refers to the article, while the reference also has the location, page 34, where the evidence in the article comes from. Now look at the Source record for the article. It has the name and the author of the article and then has a SourceReference to the journal as a whole. That Source Reference also contains the issue number and volume number in the journal where the article is located. And finally there is the Source record for the journal.

<person id="444333">
 ...
 <sourceReference id="12345">
   <page> 34 </page>
 </sourceReference>
</person>

<source type="journalArticle" id="12345">
 <title> Ancestors of James Marshall </title>
 <author> Fred Joseph Snurfbucket </author>
 <sourceReference id="45678">
   <issue> 4 </issue>
   <volume> xxvii </volume>
 </sourceReference>
</source>

<source type="journal" id="45678">
 <title>New England Historic Genealogical Register</title>
 ...
</source>

And not to put too fine a point upon it, there is no citation string in here anywhere, but all the information needed to generate the citation string is in here!


carpentermp commented 13 years ago

It seems like there are 2 issues being discussed here:

  1. Whether or not "bibliographicCitation" belongs on Record.
  2. What is the right model for Sources and SourceCitations.

@ttwetmore I believe I now understand your position on item 1, though you made a couple of statements that seemed at odds with what I believe your position to be. Restated simply, you propose that "bibliographicCitation" be removed from Record. The implication of this position is that web services serving up Records will not return a citation, thus it is left to the data consumer (client) to construct a citation from the data and metadata provided. The statements you made that leave room for doubt about your position are:

I define a citation as...a string that is returned by a service when it is serving up person records in a genealogical application.

When serving up a person record a web service can pass the person's source information through a template that the service has chosen to support; that is, the citation should be generated on the fly when needed.

Both of these statements seem to suggest that the web service could apply the template rather than leaving citation generation up to the client. However, if "bibliographicCitation" is removed from Record, then the web service will have no place to put the citation it generates, thus no way of communicating it to the client. @ttwetmore if you could clarify this I would appreciate it.

As for item 2, I believe these are pretty deep waters and it may take some extended face-to-face discussion to arrive at a meeting of the minds on the issue. I would just point out that @stoicflame's example was to an "online resource", whereas @ttwetmore gave an "offline resource" example. For clarity, it will help if we try to keep everything apples to apples.

ttwetmore commented 13 years ago

John, I have never looked at the Dublin Core, so any similarities I hope can be chalked up to different people recognizing the same structure in the modeling.

ttwetmore commented 13 years ago

@carpentermp You are correct about my opinion, and to summarize: I don't think bibliographicCitation belongs in any model. So servers that provide pure GEDCOM X data cannot provide that string, and obviously, the clients will have to construct it. However, the Source records and SourceReferences in the GEDCOM X will have all that is necessary for the clients to do that.

My point about the server applying the template first and then providing the string was in the context of APIs that are not based on transmitting GEDCOM X, but based on the server first processing the GEDCOM XML and then sending the results over the API via some JSON, Protocol buffer, or custom interface. Sorry I wasn't clear on that.

I'm not sure that bringing on-line sources into the Source/SourceReference world is really going to change things very much. Maybe we should have a few standard examples: book, journal article, birth certificate, census record, naturalization index, on-line pedigree, on-line grave site, and so on, just to see how things shake out.

ranbo commented 12 years ago

On bibliographicCitation: I think everyone agrees that we should model the structured "parts" of the citation (title, author, etc.) when we have it. Sometimes we get citations that are simple text (from a GEDCOM file, or a UI that allows users to enter it that way), so there had better be somewhere to store that when it's all we have. And pragmatically, I believe we should include the simple bibliographicCitation as a "default, plain-text" citation a client could use if they don't want to do anything fancy. This is similar to how we serve up a full name string in addition to the parts so that clients don't have to be smart enough to know that Korean surnames go first, if they just want to be a simple "person display" widget, for example.
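
A small sketch of that fallback idea, assuming a hypothetical `SourceDescription` record with optional structured parts and an optional plain-text citation. The field names and the output format are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceDescription:
    """Hypothetical record: structured parts plus an optional plain-text citation."""
    title: Optional[str] = None
    author: Optional[str] = None
    bibliographic_citation: Optional[str] = None  # as entered or imported


def display_citation(source):
    """Prefer the structured parts; fall back to the stored plain text."""
    if source.title and source.author:
        return f'{source.author}, "{source.title}."'
    if source.bibliographic_citation:
        return source.bibliographic_citation
    return "(no citation available)"


structured = SourceDescription(
    title="Ancestors of James Marshall",
    author="Fred Joseph Snurfbucket",
)
text_only = SourceDescription(
    bibliographic_citation="Free-text citation imported from a GEDCOM file",
)
print(display_citation(structured))
print(display_citation(text_only))
```

A client that does not want to do anything fancy could skip `display_citation` entirely and show `bibliographic_citation` as served, much like using the full name string instead of the name parts.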

ranbo commented 12 years ago

On online/offline sources: A Source object is necessary as the online data structure that describes an offline source (like a book). It is also necessary for describing online sources (like web pages) that don't know how to describe themselves. However, "GedcomX-compliant" online resources are ones that know how to serve up their own metadata. For those resources, the idea is that you can point right at them, and they either include metadata right in the object you get back (e.g., in the cases of person or record objects); or you can ask for their metadata (e.g., in the case of online images being served up by a GedcomX-compliant API).

carpentermp commented 12 years ago

Personally, I would tweak what @ranbo wrote slightly:

It is also necessary for describing online sources (like web pages) that don't know how to describe themselves.

In my mind, you always point at "the real thing" directly. However, offline resources have no "pointer" beyond the citation. Thus, for these resources it is reasonable to (also) point at an "online proxy" that has all the metadata about what you are really pointing at. If this "online proxy" is a "durable resource", that is best. When pointing at an online proxy, I would still keep the actual citation to "the real thing" and consider that logically "what I'm pointing at."

Now what about "online resources?" Online resources are "real things" in and of themselves, even if they are often derivative of an offline resource. This is evidenced by the fact that an online resource has a citation that is distinct from the citation to the offline resource from which it is derived. Every online resource has its own unique identifier (URI). Some online resources may be able to serve up their own metadata (e.g. GedcomX resources), others may not (generic web pages, images, etc.). Either way, I feel strongly that we ought to "point at" the "real thing" via its unique identifier (URI).

For those resources that don't have the ability to furnish their own metadata, it will be useful to have a service that can furnish the missing metadata. It would be best if the service had the ability to fetch this metadata via the unique identifier (URI) of the resource--then there would be no need to remember 2 URI's: one to the "real thing," and another URI to the metadata.
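
As a rough illustration of looking metadata up by the resource's own URI, the sketch below uses an in-memory dictionary as a stand-in for such a service. The `DESCRIPTIONS` registry, the `describe` function, and the metadata values are all hypothetical; only the image URI comes from the earlier example in this thread.

```python
# Hypothetical in-memory stand-in for a metadata service. The URI is the one
# used earlier in this thread; the metadata values are invented placeholders.
DESCRIPTIONS = {
    "http://images.com/myimage": {"title": "Example image", "issued": "2011"},
}


def describe(uri, resource=None):
    """Return metadata for a source, keyed only by the source's own URI."""
    # Self-describing resources carry their metadata with them.
    if isinstance(resource, dict) and "metadata" in resource:
        return resource["metadata"]
    # Otherwise ask the registry, using the very same URI.
    return DESCRIPTIONS.get(uri)


image_metadata = describe("http://images.com/myimage")
print(image_metadata)
```

The caller only ever holds one identifier per source; whether the metadata came from the resource itself or from the registry is hidden behind the lookup.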

EssyGreen commented 12 years ago

I'm totally with ttwetmore on this ... the basic aim of citing a source is to enable the reader to find it for themselves ... this is already catered for by the source URI and metadata.

I do not agree that we should provide a cop-out field to simplify things since this creates ambiguity and duplication.

DallanQ commented 12 years ago

This is really a separate issue, but I'll highlight it here anyway. One thing I've found helpful in addition to templates is a database of source records that provide the fields to put into the templates, so the users don't have to. I've created a database of about 1M source records at WeRelate, which I'll be posting on github later this year, along with code to match user-declared sources in gedcom files to these "standard" sources. In case people are interested, you can see an example HTML page generated from the data: http://www.werelate.org/wiki/Source:Savage%2C_James._Genealogical_Dictionary_of_the_First_Settlers_of_New_England and you can find information about what the data looks like, which will be included in the open-source database, at http://gencontent.wikispaces.com/WeRelate+Source+Model

EssyGreen commented 12 years ago

One thing I've found helpful in addition to templates is a database of source records that provide the fields to put into the templates, so the users don't have to

I think that is a template and although I agree that they are useful I think they should be defined at the application level and not by GEDCOMX since it will be impossible to define all the fields that an application might deem useful.

DallanQ commented 12 years ago

I agree, defining a set of source templates seems like a separate effort. I believe that's what http://sourcetemplates.org/ is trying to do with the source templates donated by Millennia (LegacyFamilyTree.com).

jralls commented 12 years ago

There are two kinds of templates in play here:

The sort that Tom Wetmore objects to (and I agree) are those that take a set of data elements and format it for the end user. That isn't necessarily a citation string for a paper report, either: It could just as well be a BibTeX format for ingestion by a bibliographic program on the desktop or another webservice like Zotero. I agree with Tom that those don't belong in GedcomX.

The other kind, though, alluded to by Tom (though it seems he doesn't like to call them templates) and brought up as "off topic" by Dallan Quas, are sets of descriptors for the different sorts of sources. There are a bunch of these out there, but they tend to come in one of two flavors: Library or Archive. Since genealogists use materials from both, and from neither (I've never seen a grave marker in either a library or an archive), GedcomX will need either to have a (very large) vocabulary of its own or to include something in the source description that indicates what vocabulary it's using. The former requires a lot of work to put together a comprehensive vocabulary; the latter makes more work for client programs. I don't know at this point which way I prefer.

Another important point from Robert Raymond's Citations for Developers session at RootsTech: Descriptions of derivative sources should include a description of the source from which they are derived. Something like

    <SourceDescription id="foo" type="online_source">
        <URI>http://images.familysearch.org</URI>
        <derivative type="digital facsimile">
            <id>bar</id>
        </derivative>
    </SourceDescription>

    <SourceDescription id="bar" type="microfilm frame">
        <number>227</number>
        <part_of>baz</part_of>
    </SourceDescription>

    <SourceDescription id="baz" type="microfilm part">
        <number>3</number>
        <part_of>waldo</part_of>
        <derivative type="film facsimile">pepper</derivative>
    </SourceDescription>

    <SourceDescription id="waldo" type="microfilm">
        <FHLFilmNumber>0012345678</FHLFilmNumber>
        ...
    </SourceDescription>

    <SourceDescription id="pepper" type="birth register">
        <creator type="government agency">
            <name>County Clerks Office</name>
            <jurisdiction>placeId</jurisdiction>
        </creator>
        <repository>
        ...
        </repository>
        ...
    </SourceDescription>

EssyGreen commented 12 years ago

Descriptions of derivative sources should include a description of the source from which they are derived

I would prefer that the derivative source contains a pointer to the original which contains details of itself (see #136) ie N-tier sources

EssyGreen commented 12 years ago

Just to clarify from #136 ... could we have 2 specific properties in a source (both optional pointers to other sources):

Whilst I admit these are optional and hence could be omitted (either by the user or the application) their inclusion would provide a clear trail of how to get (a) the original and (b) the full context whilst also enabling the user/application to structure their sources in an N-tier/tree-like fashion.

Also a (preferably mandatory) RenditionType which describes the type of reproduction/derivative e.g. Original, Certified Copy, Image Copy, Transcription, Translation, Extract etc
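
One possible shape for this, as a sketch only: the property names `derived_from` and `part_of` are hypothetical stand-ins for the two optional pointers (to the original and to the wider context), and the enum values are the rendition types listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RenditionType(Enum):
    ORIGINAL = "Original"
    CERTIFIED_COPY = "Certified Copy"
    IMAGE_COPY = "Image Copy"
    TRANSCRIPTION = "Transcription"
    TRANSLATION = "Translation"
    EXTRACT = "Extract"


@dataclass
class Source:
    title: str
    rendition: RenditionType
    derived_from: Optional["Source"] = None  # hypothetical pointer to the original
    part_of: Optional["Source"] = None       # hypothetical pointer to the wider context


def trail_to_original(source):
    """Follow derived_from pointers until a source with no original is reached."""
    trail = [source]
    while trail[-1].derived_from is not None:
        trail.append(trail[-1].derived_from)
    return trail


register = Source("Birth register", RenditionType.ORIGINAL)
film = Source("Microfilm frame", RenditionType.IMAGE_COPY, derived_from=register)
online = Source("Online image", RenditionType.IMAGE_COPY, derived_from=film)

titles = [s.title for s in trail_to_original(online)]
print(titles)
```

Because both pointers are optional, a source with neither is simply the top of its own tree, and applications can build N-tier structures only where they have the information.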

DallanQ commented 12 years ago

@EssyGreen I like this approach.

stoicflame commented 11 years ago

We are moving our development of bibliographic metadata and citation templates over to FamilySearch/gedcomx-citation. Thanks for the great input on this thread.