FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

specify metadata association mechanism #21

Closed carpentermp closed 12 years ago

carpentermp commented 13 years ago

What is the general approach to getting the metadata of a GedcomX entity? I can think of a few options:

  1. Append “/meta” to the URL of an entity to fetch the metadata for that entity. I have grown to dislike this approach because it involves out-of-band manipulation of URLs, which is taboo for REST.
  2. Have a Link in the “Links” section, e.g.: <link rel="gx:metadata" …>
  3. Have the metadata for an entity be conceptually a different representation of the same entity, so you don’t modify the URL; you use the same URL with an “application/gedcomx-metadata” HTTP “Accept” header.
  4. Return the metadata in HTTP response headers. When only metadata is desired, do a HEAD request.
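
To make the four options concrete, here is a rough sketch of the request a client might build for each one. It is only an illustration under stated assumptions: the entity URL and the helper name `metadata_requests` are hypothetical, and modelling option 2 as "GET the entity, then follow its metadata link" is one reading of that option. The `application/gedcomx-metadata` media type is the one floated in option 3. Nothing is sent over the network; the requests are only constructed.

```python
from urllib.request import Request

# Hypothetical entity URL, used only for illustration.
ENTITY_URL = "http://localhost/entity/id"


def metadata_requests(entity_url):
    """Build (but do not send) one request per option in the list above."""
    return {
        # Option 1: derive the metadata URL by appending "/meta".
        "append_meta": Request(entity_url + "/meta", method="GET"),
        # Option 2: fetch the entity first, then follow its metadata link.
        "follow_link": Request(entity_url, method="GET"),
        # Option 3: same URL, different representation via the Accept header.
        "accept_header": Request(
            entity_url,
            headers={"Accept": "application/gedcomx-metadata"},
            method="GET",
        ),
        # Option 4: metadata travels in response headers; HEAD skips the body.
        "head_request": Request(entity_url, method="HEAD"),
    }


if __name__ == "__main__":
    for name, req in metadata_requests(ENTITY_URL).items():
        print(name, req.get_method(), req.full_url, dict(req.header_items()))
```

Only option 1 changes the URL. Options 2, 3, and 4 all start from the entity's own URL and differ in the HTTP method or headers.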

I believe the status quo of the model is option 2? There are a couple of things that still bother me about option 2:

First, it has the problem that you can't fetch the metadata without first fetching the entity. Often, you may want to inspect the metadata to decide if you want to fetch the (generally bigger) entity.

Second, there are a number of things that end up being redundant in the metadata with the original entity. As an “alternate representation” (option 3) this doesn’t bother me. But as “more information about this entity” (option 2), it seems wrong.
Let me give you some examples of the redundancy I am talking about:

• SourceReferences. Record and Person have a list of SourceReferences (called "sources"). The WWW version of these entities also has a list of "Links", which could also include them as <link rel="source" ...>. On top of that, the metadata has a list of dc:source elements. That's potentially 3 different places for the same information.
• Other Dublin Core "linking" elements that have GedcomX counterparts are: references (relationship.personReference), isReferencedBy (person.relationshipReference), replaces (alternateIds), contributor, isPartOf (record.collection, collection.collection, etc.), hasPart (collection.links.link{rel="content"}), identifier (person.persistentId, person.links.link{rel="self"})
• Other Dublin Core non-linking elements that have GedcomX counterparts are: bibliographicCitation (all entities need this), title (all entities need this), description (collection.description), coverage (collection.coverage, useful for all entities), publisher (collection.publisher), spatial (collection.coverage.spatial), temporal (collection.coverage.temporal)
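
The overlap listed above can be written down as a simple lookup table. The sketch below is only an illustration: the names `DC_TO_GEDCOMX` and `terms_with_counterparts` are made up here, and the entries are copied from the bullets above rather than from any GedcomX schema. Terms for which the bullets name no specific counterpart are kept with an empty list.

```python
# Dublin Core terms and the GedcomX counterparts named in the bullets above.
# The structure and names are illustrative only, not part of GedcomX.
DC_TO_GEDCOMX = {
    # "Linking" elements
    "references": ["relationship.personReference"],
    "isReferencedBy": ["person.relationshipReference"],
    "replaces": ["alternateIds"],
    "contributor": [],  # no specific counterpart named above
    "isPartOf": ["record.collection", "collection.collection"],
    "hasPart": ['collection.links.link{rel="content"}'],
    "identifier": ["person.persistentId", 'person.links.link{rel="self"}'],
    # Non-linking elements
    "bibliographicCitation": [],  # "all entities need this"
    "title": [],  # "all entities need this"
    "description": ["collection.description"],
    "coverage": ["collection.coverage"],
    "publisher": ["collection.publisher"],
    "spatial": ["collection.coverage.spatial"],
    "temporal": ["collection.coverage.temporal"],
}


def terms_with_counterparts(mapping):
    """Return the Dublin Core terms that already have a GedcomX counterpart."""
    return sorted(term for term, paths in mapping.items() if paths)


if __name__ == "__main__":
    for term in terms_with_counterparts(DC_TO_GEDCOMX):
        print(term, "->", ", ".join(DC_TO_GEDCOMX[term]))
```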

This redundancy is one of the reasons that in SoRD we embedded Metadata in each entity (the other main reason being that many of the different types of metadata are needed on virtually every request, so that having to make another request to get them is onerous). Embedding the metadata within the response has precedent in both HTTP and HTML. In HTTP, a response consists of the response headers and the response body. The response headers are metadata about the requested entity. Clients can get just the metadata by doing a HEAD request. In HTML, metadata is available inside of the head element. Interestingly, there was a need to get just HTML metadata (the stuff inside the head element), so the proposed approach was to prepend "WWW-" to the element name and return it as an HTTP response header on a HEAD request. For example, the link element becomes the "WWW-Link:" response header, and the title element becomes the "WWW-Title:" response header. We could potentially do something similar by prepending our own prefix to different metadata elements.</p> <p>I propose that we either:</p> <p>1) Embed a DublinCoreMetadata element in every GedcomX entity and support a way of fetching this metadata without fetching the entity itself, either by responding to an Accept header that specifically returns the DublinCoreMetadata as a root element (option 3), or by encoding these elements as HTTP response headers in a HEAD request (option 4).</p> <p>or...</p> <p>2) Don't embed the whole DublinCoreMetadata, but make a list of the most important metadata needed by all GedcomX entities and explicitly embed them in each entity. 
This list would probably include at least these:</p> <p>bibliographicCitation, title, coverage, contributor, modified, sources, isPartOf</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/stoicflame"><img src="https://avatars.githubusercontent.com/u/145838?v=4" />stoicflame</a> commented <strong> 13 years ago</strong> </div> <div class="markdown-body"> <p>I am strongly against adding metadata to the entities. It blurs the entity boundaries, violates a number of different design principles (interface segregation, single responsibility), and breaks with RESTful principles (e.g. cacheability).</p> <p>I agree that mechanisms for getting metadata need to be explored and well-documented, and I could support any of the suggestions above. I could see an implementation using all four of them. For example, an HTTP GET to <a href="http://localhost/entity/id">http://localhost/entity/id</a> results in an entity with a bunch of HTTP response headers that have its metadata in them. The entity itself references (via links) its metadata representation, which happens to be at <a href="http://localhost/entity/id/meta">http://localhost/entity/id/meta</a>. And an HTTP GET to <a href="http://localhost/entity/id">http://localhost/entity/id</a> with an Accept header value of "application/rdf+xml" results in a redirect response to <a href="http://localhost/entity/id/meta">http://localhost/entity/id/meta</a>.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/carpentermp"><img src="https://avatars.githubusercontent.com/u/782660?v=4" />carpentermp</a> commented <strong> 13 years ago</strong> </div> <div class="markdown-body"> <p>What I am trying to avoid is the need for a separate GET request to get the metadata of an entity. I want 1 request that returns the entity, and its metadata. 
Imagine how tedious HTTP would be if HTTP response headers had to be fetched independently in a separate request. Your arguments against this don't compute for me:</p> <blockquote> <p>It blurs the entity boundaries</p> </blockquote> <p>What? On the contrary, the metadata is not a new entity. Rather it is data about an entity. To create a new URL for it is akin to creating a new entity--THAT blurs entity boundaries. </p> <blockquote> <p>violates a number of different design principles (interface segregation, single responsibility)</p> </blockquote> <p>I'm going to need further clarification of this point, because I don't see this at all. The single responsibility of a GET request upon an entity is to return information about that entity--data AND metadata.</p> <blockquote> <p>and breaks with RESTful principles (e.g. cacheability).</p> </blockquote> <p>Once again, I believe cacheability is broken more by the separation than by the combination. In all of this the fundamental difference between our points of view is that you seem to be seeing an entity and its metadata as independent things, where I assert that an entity's metadata is (or ought to be) indivisible from the entity itself. A change to the metadata of an entity IS a change to the entity. </p> <p>Over the years, systems have been developed that store entity metadata separately from the entity itself. This is generally done, not because it is thought a good idea, but because no provision was made in the original specification for storing this kind of information within the entity itself. In such systems, data exchange is complicated by the possibility of the entity being separated from its metadata. 
Image formats, such as JPEG, have recognized this and have provided a way for metadata to be stored within the JPEG file.</p> <p>Your claims that providing a mechanism to describe metadata within GEDCOMX entities "blurs entity boundaries", "breaks with REST principles", and impairs "cacheability" would seem to imply that HTTP and HTML suffer from these problems, since they do just what I am urging for GEDCOMX. This is a very bold claim since HTML/HTTP is far and away the most successful RESTful system ever devised. To say that it "breaks with REST principles" is particularly hard to justify since REST was invented as a term to describe the web <em>as it exists</em> and to explain what made it so successful. I don't say it isn't possible to invent something even more "RESTful" than the web, only that if you are to claim that the web suffers from some pretty severe REST deficiencies you had better be ready to explain very clearly why this is so.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/stoicflame"><img src="https://avatars.githubusercontent.com/u/145838?v=4" />stoicflame</a> commented <strong> 13 years ago</strong> </div> <div class="markdown-body"> <blockquote> <p>What I am trying to avoid is the need for a separate GET request to get the metadata of an entity. I want 1 request that returns the entity, and its metadata.</p> </blockquote> <p>Sweet. Go for it. What's stopping you? Create a new resource that contains both the entity and its metadata and return it in a response.</p> <blockquote> <p>On the contrary, the metadata is not a new entity. Rather it is data about an entity. To create a new URL for it is akin to creating a new entity--THAT blurs entity boundaries.</p> </blockquote> <p>I disagree. The metadata is a different entity. </p> <p>To illustrate, let's take the versioning question. What happens if I just want to add a new title to a record in a different language? 
Or if I just want to update the bibliographic citation of a record? Does the version of the record change? I hope not. None of the data in the record changed. The version of the metadata changed, but not the version of the record.</p> <blockquote> <p>I'm going to need further clarification of [violates a number of different design principles]</p> </blockquote> <p>Interface segregation: "In a nutshell, no client should be forced to depend on methods it does not use." (Martin, Robert (2002). Agile Software Development: Principles, Patterns and Practices. Pearson Education.)</p> <p>So consumers who don't care about the metadata for the record should not be forced to depend on the whole metadata package.</p> <p>Single responsibility: Martin defines a "responsibility" as a <em>reason to change</em> and concludes that a class or module should have one, and only one, reason to change.</p> <p>IMO, it's a Bad Idea, for example, to have to rev the version of the record object just because we want to add some support for additional metadata. The more we break the single responsibility principle, the harder it becomes to keep models decoupled and therefore the harder it is to keep them living and breathing and changing to meet future needs.</p> <blockquote> <p>The single responsibility of a GET request upon an entity is to return information about that entity--data AND metadata.</p> </blockquote> <p>Sure. But I'm not talking about the single responsibility of a specific operation. I'm talking about the single responsibility of the model.</p> <blockquote> <p>Once again, I believe cacheability is broken more by the separation than by the combination.</p> </blockquote> <p>Not so. </p> <p>Let's take a different example in "the most successful RESTful system ever devised". An HTML page references a CSS stylesheet. Sure, you could take the CSS and inline it into every HTML page that references it, but what if you make an update to one of the styles? 
You would be updating every page that inlines the CSS stylesheet. You've broken the browser's ability to cache the CSS stylesheet so it can tell the difference between the HTML being updated and the CSS being updated.</p> <p>If storing metadata separately from the entity itself is "generally done, not because it is thought a good idea" then why didn't the people who designed HTML specify a way to inline binary image data into the HTML page? Then we could have included that metadata all in a single request and not have to make all those bothersome extra requests to get the images associated with the data.</p> <blockquote> <p>In all of this the fundamental difference between our points of view is that you seem to be seeing an entity and its metadata as independent things, where I assert that an entity's metadata is (or ought to be) indivisible from the entity itself. A change to the metadata of an entity IS a change to the entity. </p> </blockquote> <p>Yep, you're absolutely right. This is the fundamental difference between our points of view.</p> <blockquote> <p>Over the years, systems have been developed that store entity metadata separately from the entity itself. This is generally done, not because it is thought a good idea, but because no provision was made in the original specification for storing this kind of information within the entity itself. In such systems, data exchange is complicated by the possibility of the entity being separated from its metadata. Image formats, such as JPEG, have recognized this and have provided a way for metadata to be stored within the JPEG file.</p> </blockquote> <p>It totally depends on what type of metadata you're talking about.</p> <p>Let's take your example about a JPEG image file format. Sure, there is certain metadata about the image that is fundamental to the nature of the image. Geo codes, date taken, etc. These are the kinds of 'metadata' that are stored with the image. 
But there are other kinds of metadata that are <em>not</em> stored with the image. Let's say the image is stored at Flickr. The date the image was uploaded to Flickr, the labels applied to the image at Flickr, the comments that users make on the image... you gonna store <em>those things</em> in the metadata of the image?</p> <p>Same thing for this model. There are some things that are fundamental to the nature of the object that need to be included in its definition. Let's identify them, talk about whether they indeed belong to the model of the entity, then add them to the entity itself.</p> <p>Of the specific things you listed above, contributor, modified, sources, and isPartOf are currently defined as part of the entity. Bibliographic citation, title, and coverage are not. Why don't we start different threads for each of these things so we can discuss whether they need to be added?</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/carpentermp"><img src="https://avatars.githubusercontent.com/u/782660?v=4" />carpentermp</a> commented <strong> 13 years ago</strong> </div> <div class="markdown-body"> <p>My responses to your responses:</p> <blockquote> <blockquote> <p>What I am trying to avoid is the need for a separate GET request to get the metadata of an entity. I want 1 request that returns the entity, and its metadata.</p> </blockquote> <p>Sweet. Go for it. What's stopping you? Create a new resource that contains both the entity and its metadata and return it in a response.</p> </blockquote> <p>Your answer is not serious. Is the new resource I am to create part of the GedcomX standard or not? If not, what is the point? If so:</p> <ul> <li>Please add it, or tell me how to add it</li> <li>Explain how it fits into the model: Given a Person URL (perhaps previously bookmarked), how does a client fetch this new resource in a single request? 
What is the versioning/caching model for this "combo" entity?</li> </ul> <blockquote> <blockquote> <p>On the contrary, the metadata is not a new entity. Rather it is data about an entity. To create a new URL for it is akin to creating a new entity--THAT blurs entity boundaries.</p> </blockquote> <p>I disagree. The metadata is a different entity.</p> <p>To illustrate, let's take the versioning question. What happens if I just want to add a new title to a record in a different language? Or if I just want to update the bibliographic citation of a record? Does the version of the record change? I hope not. None of the data in the record changed. The version of the metadata changed, but not the version of the record.</p> </blockquote> <p>That same rationale could be used to say that Name should be a separate entity. When the name changes does the version of the Person change? The version of the Name changed, but not the version of the Person. If you presuppose that Name ought to be its own entity this makes perfect sense. Using the same logic, what would stop us from taking each piece of information and making it its own entity? Yes, that's preposterous, but why? Name doesn't really stand on its own. It's the name <em>of the Person</em>. It really has no meaning in isolation. The same is true of BibliographicCitation. It is the BibliographicCitation <em>of the Person</em>. If the Metadata really did stand alone, if it really were a separate entity, it would have a BibliographicCitation <em>for itself</em>.</p> <p>So my answer is yes, if the BibliographicCitation changes, the Person is changed, and that is actually desirable (Though the chance of it changing ought to be very small. 
Remember that BibliographicCitations are used to document where someone got their evidence and are meant to be constructed in such a way that they will be useful <em>forever</em>.)</p> <blockquote> <blockquote> <p>I'm going to need further clarification of [violates a number of different design principles]</p> </blockquote> <p>Interface segregation: "In a nutshell, no client should be forced to depend on methods it does not use." (Martin, Robert (2002). Agile Software Development: Principles, Patterns and Practices. Pearson Education.)</p> </blockquote> <p>This statement cannot be taken at face value. If I can find a single client who cares nothing about Names am I to create an interface for that client that has Name removed? Attempting to live by the letter of that statement would result in a ridiculous number of interface permutations. Perhaps it makes sense in a closed software system where there is one client or a small number of known clients. But we are creating a system for vast numbers of unknown clients that allows them to browse and search genealogical data. "The client" in this case, is the complete set of systems that will ever access the data. Of necessity, our interfaces are going to be general in nature and we have to strike a balance.</p> <blockquote> <p>Single responsibility: Martin defines a "responsibility" as a <em>reason to change</em> and concludes that a class or module should have one, and only one, reason to change.</p> </blockquote> <p>Yes, and I don't see how this principle is violated in my proposal. The job of "Person" is to model all we know about a given person. Genealogical "context" (metadata) can often be as genealogically interesting as the data itself.</p> <blockquote> <blockquote> <p>The single responsibility of a GET request upon an entity is to return information about that entity--data AND metadata.</p> </blockquote> <p>Sure. But I'm not talking about the single responsibility of a specific operation. 
I'm talking about the single responsibility of the model.</p> </blockquote> <p>So am I--the single responsibility of the model to express information about a given entity.</p> <blockquote> <blockquote> <p>Once again, I believe cacheability is broken more by the separation than by the combination.</p> </blockquote> <p>Not so.</p> <p>Let's take a different example in "the most successful RESTful system ever devised". An HTML page references a CSS stylesheet. Sure, you could take the CSS and inline it into every HTML page that references it, but what if you make an update to one of the styles? You would be updating every page that inlines the CSS stylesheet. You've broken the browser's ability to cache the CSS stylesheet so it can tell the difference between the HTML being updated and the CSS being updated.</p> </blockquote> <p>This is a great example, but one that supports my position rather more than yours. CSS stylesheets are not "information about a web page" but are actually entities in their own right. They do stand alone, as evidenced by the fact that they are intended to be used in several web pages.</p> <p>On the other hand, every web page has a &lt;head&gt; section for metadata about the web page. It is extensible, and interestingly, has a standard extension for expressing Dublin Core metadata. Something along these lines is what I am arguing for.</p> <blockquote> <p>If storing metadata separately from the entity itself is "generally done, not because it is thought a good idea" then why didn't the people who designed HTML specify a way to inline binary image data into the HTML page? Then we could have included that metadata all in a single request and not have to make all those bothersome extra requests to get the images associated with the data.</p> </blockquote> <p>Once again, images are not "information about a web page" but entities in their own right. 
They may be used in several web pages, or stand alone.</p> <blockquote> <blockquote> <p>In all of this the fundamental difference between our points of view is that you seem to be seeing an entity and its metadata as independent things, where I assert that an entity's metadata is (or ought to be) indivisible from the entity itself. A change to the metadata of an entity IS a change to the entity.</p> </blockquote> <p>Yep, you're absolutely right. This is the fundamental difference between our points of view.</p> <blockquote> <p>Over the years, systems have been developed that store entity metadata separately from the entity itself. This is generally done, not because it is thought a good idea, but because no provision was made in the original specification for storing this kind of information within the entity itself. In such systems, data exchange is complicated by the possibility of the entity being separated from its metadata. Image formats, such as JPEG, have recognized this and have provided a way for metadata to be stored within the JPEG file.</p> </blockquote> <p>It totally depends on what type of metadata you're talking about.</p> <p>Let's take your example about a JPEG image file format. Sure, there is certain metadata about the image that is fundamental to the nature of the image. Geo codes, date taken, etc. These are the kinds of 'metadata' that are stored with the image. But there are other kinds of metadata that are <em>not</em> stored with the image. Let's say the image is stored at Flickr. The date the image was uploaded to Flickr, the labels applied to the image at Flickr, the comments that users make on the image... you gonna store <em>those things</em> in the metadata of the image?</p> </blockquote> <p>Perhaps not everything Flickr keeps about its images is of general usefulness, but many things are. Your example of "comments that users make on the image" is one. Identifying the people in photos and identifying their faces is another. 
It would be great if this information were standardized and embedded in JPEG files. JPEG already has an extensible way to embed metadata, just no standard for face-tagging data. Such a standard would make the data much more portable and useful. As it is, users of systems such as Flickr and Picasa are "locked in" to their system of choice by proprietary data that doesn't go with the image when it is exported.</p> <blockquote> <p>Same thing for this model. There are some things that are fundamental to the nature of the object that need to be included in its definition. Let's identify them, talk about whether they indeed belong to the model of the entity, then add them to the entity itself.</p> <p>Of the specific things you listed above, contributor, modified, sources, and isPartOf are currently defined as part of the entity. Bibliographic citation, title, and coverage are not. Why don't we start different threads for each of these things so we can discuss whether they need to be added?</p> </blockquote> <p>Yes, I could fight for "every inch of ground" in this fashion and perhaps get most of the benefits I am looking for, but this would not give me an extensible metadata section which is what would be best, in my opinion.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/stoicflame"><img src="https://avatars.githubusercontent.com/u/145838?v=4" />stoicflame</a> commented <strong> 13 years ago</strong> </div> <div class="markdown-body"> <p>It's pretty clear that we're not aligned on what metadata is. I think I need something concrete to help me understand. Why don't you submit a pull request that we can use as something concrete to continue the conversation? 
Thanks for your patience with my bemusement.</p> </div> </div> <div class="comment"> <div class="user"> <a rel="noreferrer nofollow" target="_blank" href="https://github.com/stoicflame"><img src="https://avatars.githubusercontent.com/u/145838?v=4" />stoicflame</a> commented <strong> 12 years ago</strong> </div> <div class="markdown-body"> <p>As of 2a49d49, we believe that bibliographicCitation, coverage, contributor, sources, isPartOf are all accounted for in the model. If any other items are needed, we'll address them in separate issues.</p> <p>The documentation for these important elements has been added to the <a href="http://www.gedcomx.org/Developers-Guide.html">Developers Guide</a>, see the sections named "Source Metadata" and "Supplying Source Metadata".</p> <p>Initial options for applying metadata via HTTP headers are documented at <a href="http://www.gedcomx.org/RDF-HTTP-Headers.html">RDF HTTP Headers</a>.</p> <p>Metadata association mechanism has been documented at <a href="http://www.gedcomx.org/RDF-Integration.html">RDF Integration</a>.</p> <p>Representations of <a href="https://github.com/FamilySearch/gedcomx/blob/master/gedcomx-record-www/src/main/java/org/gedcomx/record/www/RecordWWW.java">record</a>, <a href="https://github.com/FamilySearch/gedcomx/blob/master/gedcomx-conclusion-www/src/main/java/org/gedcomx/conclusion/www/PersonWWW.java">person</a>, and <a href="https://github.com/FamilySearch/gedcomx/blob/master/gedcomx-conclusion-www/src/main/java/org/gedcomx/conclusion/www/RelationshipWWW.java">relationship</a> that include their metadata for the sake of the REST API have been added to the model.</p> <p>Initial placeholder documentation (still sparse) for the REST API has been added at <a href="http://www.gedcomx.org/WWW-Developers-Guide.html">WWW Developers Guide</a>.</p> </div> </div> </body> </html>