FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

Source specification is not sufficiently complete #165

Closed jralls closed 11 years ago

jralls commented 12 years ago

The specification for source merely recites the spec for an RDF triple, which also isn't a suitable citation. It goes on to say that "GEDCOM X recognizes the Dublin Core Metadata Terms as standard properties for a description of a resource", but says nothing about what to do with them. As an '''absolute minimum''' the spec should incorporate by reference the Dublin Core RDF Recommendation, but a detailed specification of how to map Mills's '''Evidence Explained''' into DC is really necessary to ensure consistent encoding.

EssyGreen commented 12 years ago

I wouldn't go so far as to re-write the whole of Evidence Explained but I agree re the need to explain and cross-reference RDF - detailing how each of the fields can be used in context in a user-friendly (non-technical) way.

stoicflame commented 12 years ago

We're already aware that the source specification is incomplete. We've been tracking that at #123. What's the difference between this issue and that one?

EssyGreen commented 12 years ago

Probably the same - I guess we haven't seen any movement there tho :)

jralls commented 12 years ago

#123 is more general about what should go into a source object, in particular that DC isn't adequate to describe the wide variety of sources used in genealogy. Yes, the last clause about mapping EE into DC overlaps that.

This issue is that the specification that you released last week doesn't adequately describe the components that are needed to describe even a DC description in XML or JSON.

stoicflame commented 12 years ago

> This issue is that the specification that you released last week doesn't adequately describe the components that are needed to describe even a DC description in XML or JSON.

Oh, I get it.

So you're just opening up a formal issue for the todo that's in there. Where it says "todo: list the dublin core terms here with their types and descriptions for convenience."

Okay, fair enough.

jralls commented 12 years ago

No, not quite. Description (para 3.1) is supposed to be a recitation of RDF, but RDF doesn't address DC (no surprise). One way to handle it would be to substitute the DC-RDF spec I pointed to in the original report, which fits in with your RDF-for-everything philosophy, but making the Description a plain representation of DC in XML or JSON would be a lot more efficient. In fact, if you elevate it to a top-level "Source Description" object, it might address Sarah's #144 '''and''' provide a framework for extending DC, needed for #123.
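To make the "plain representation of DC in XML or JSON" idea concrete, here is a minimal sketch in Python. The Dublin Core Terms namespace URI is real, but the `sourceDescription` element name, the flat JSON layout, and the sample values are assumptions made for illustration; nothing here is taken from the GEDCOM X specification.

```python
import json
import xml.etree.ElementTree as ET

# Dublin Core Terms namespace (real); the element layout below is hypothetical.
DCTERMS = "http://purl.org/dc/terms/"

# A hypothetical source description using a few Dublin Core terms.
description = {
    "title": "Evidence Explained",
    "creator": "Elizabeth Shown Mills",
    "type": "Text",
}

def to_json(desc):
    """Serialize the description as a flat JSON object."""
    return json.dumps(desc, sort_keys=True)

def to_xml(desc):
    """Serialize the description as one XML element per DC term."""
    ET.register_namespace("dcterms", DCTERMS)
    root = ET.Element("sourceDescription")  # hypothetical element name
    for term, value in desc.items():
        child = ET.SubElement(root, f"{{{DCTERMS}}}{term}")
        child.text = value
    return ET.tostring(root, encoding="unicode")

print(to_json(description))
print(to_xml(description))
```

Both forms carry the same term/value pairs, which is the point of the suggestion: the DC vocabulary is reused, but no RDF machinery is needed to read or write it.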

I wonder if, given that Conclusion and Record are now split, you should reconsider using RDF in the conclusion model. It makes some sense in a web services context but it's awfully cumbersome in a self-contained data interchange file where external links will be too fragile.

BTW, the thrice-cited "known resource types" anchor doesn't exist, the "Attribution" one does but doesn't seem to work, and the URLs in paras 3.1 and 3.2 point to other non-existent anchors rather than the URL that they say they do.

jralls commented 12 years ago

Umm, what's the precedence of the different documents in the specifications directory? It appears that file-format-specification.md is the top level, but I'm unclear about whether the others are authoritative or exemplary, and if authoritative, which govern in the event of conflict.

file-format-specification.md is silent about how the Zip archive should be organized, too. Is each GenealogicalResource supposed to be in a separate text file? Is there supposed to be a directory structure besides META-INF/MANIFEST?
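Since the layout question is open, the following is only a sketch of one possible answer, assuming one file per resource in per-type directories alongside a jar-style manifest. Every entry name except the `META-INF` manifest is hypothetical, and the `.MF` suffix is an assumption borrowed from the jar convention.

```python
import io
import zipfile

# Hypothetical layout: one file per resource plus a manifest. The entry names
# other than the META-INF manifest are assumptions, not taken from the spec.
entries = {
    "META-INF/MANIFEST.MF": "Manifest-Version: 1.0\n",
    "persons/p1.xml": "<person id='p1'/>",
    "relationships/r1.xml": "<relationship id='r1'/>",
}

def build_archive(files):
    """Write the given name -> text mapping into an in-memory Zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()

def list_resources(data):
    """Return the archive entries that are not under META-INF/."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return sorted(n for n in archive.namelist() if not n.startswith("META-INF/"))

archive_bytes = build_archive(entries)
print(list_resources(archive_bytes))
```

A reader of such an archive would need the specification to say which of these choices is normative; the code only shows that the question has to be answered somewhere.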

The conceptual model seems to lean rather heavily on RDF, but the RDF specification is mentioned only in passing; there's no dependencies section as there is in some of the other documents. There's a lot of extraction of RDF elements, but they seem not to be complete and are in any case missing context from the full spec. I'm sure it's all sorted out in your head...

EssyGreen commented 12 years ago

agreed - see #171

PS: Can you remind me how you do xrefs between issues again?

jralls commented 12 years ago

> PS: Can you remind me how you do xrefs between issues again?

Um, you just did...

EssyGreen commented 12 years ago

Ah - duh!

stoicflame commented 12 years ago

> I wonder if, given that Conclusion and Record are now split, you should reconsider using RDF in the conclusion model. It makes some sense in a web services context but it's awfully cumbersome in a self-contained data interchange file where external links will be too fragile.

I'm listening :-)

So I think you're right that there's too much RDF in your face making things more complicated than they need to be. I think that needs to be addressed, particularly wrt #144. We just finished an initial milestone using the RDF Description to model the FamilySearch source objects and it didn't go so well. Too complicated to explain and build and test and....

As always, you've been very astute in identifying a feature that we're trying to provide in GEDCOM X that a lot of people might not understand: it's not a "self-contained" data interchange format. The file format, as designed today, can be used to transfer "swaths" of data from, e.g. an online family tree, and still reference "edges" of the data that might be online.

As you might imagine, FamilySearch believes this is an important feature. I understand your difference in opinion, though. Is it fair to sum up the difference of opinion like this?

FamilySearch believes that the URI should be used for references between objects in order to enable references to objects external to the file. @jralls believes that the cost of using the URI doesn't justify the benefits.
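The trade-off can be made visible with a small sketch that separates references resolvable inside the file from those that depend on something online. The rule used here (no scheme and no host means local) is an assumption for illustration, and the `example.org` URL is a placeholder.

```python
from urllib.parse import urlparse

def is_external(reference):
    """Return True if the URI reference points outside the current file.

    Assumption: a reference with no scheme and no network location
    (for example "#p1" or "persons/p1.xml") is treated as local.
    """
    parts = urlparse(reference)
    return bool(parts.scheme or parts.netloc)

# Hypothetical references; the example.org URL is a placeholder.
references = ["#p1", "persons/p2.xml", "https://example.org/tree/persons/p3"]
external = [r for r in references if is_external(r)]
print(external)
```

An importer could use a check like this to report how many "edges" of a file depend on resources that may no longer exist.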

EssyGreen commented 12 years ago

I'd personally raise a cheer to the demise of RDF in the conclusion model but I don't think it will necessarily solve the issues of an overly complex structure which has not yet been clearly modelled to facilitate ease of use or tested to ensure viability.

stoicflame commented 12 years ago

Someday I've got to figure out what you guys mean when you say "remove RDF" because I still don't know exactly what all the tears are about. I have a hard time believing you're talking about the XML namespace (since that's just serialization details), so I'm assuming you're talking about either how pervasively RDF is integrated into the source description object (which I agree needs to be addressed) or using a URI for inter-entity references (which I acknowledge, but on which I differ in opinion, as stated above).

ttwetmore commented 12 years ago

Here's a point I've made in other contexts.

The GEDCOMX archival format can be a very simple tag-based format, with no namespaces and with no RDF or URIs. All that is needed is a very clear specification of what each element and attribute mean (thinking in XML terms). By clear specification I mean, if namespaces and RDF are embraced, then each archival tag is defined (outside the archival format) in terms of the namespace it comes from, the tag it has in that namespace, the URI to its type definition if needed, and so on and so forth.

Then, if anyone really wants to get the GEDCOMX data in its fully decorated, namespaced and URI'd and RDF'd format, all they have to do is push some button on a conversion program and the data is converted to that form. This should keep everybody happy, from those of us who like to read and understand the transfer files, to those of us who have to write import and export features in software, to the standards-lovin', RDF-toutin' web semanticist gurus who like everything so obfuscated that only they can understand it!

stoicflame commented 12 years ago

That sounds reasonable. I like the direction you've articulated. A few comments:

ttwetmore commented 12 years ago

Ryan,

Yes, I meant the empty name space.

You worry that using only the empty name space would force name clashes. However, I covered that, maybe a little clumsily, in my comment, "...the tag it has in that namespace ...". What I mean is that the tag used in the simple, empty namespace format, does not have to be the same tag it is in its external namespace. This takes care of any name clashes. There is nothing fundamentally confusing about this. All normal people would never know there was a difference, and those that do know are so used to the complexity of understanding namespaces, URIs and RDF, that the idea of mapping tag names from one set to another set is refreshingly trivial to grasp.
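The tag-mapping idea can be sketched as a lookup table plus a tree walk. This is an illustration only: the simple tag names, the mapping, and the `example.org` namespace URI standing in for a GEDCOM X namespace are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from simple, un-namespaced tags to (namespace, tag) pairs.
# The GEDCOM X namespace URI here is a placeholder, not the official one.
GX = "http://example.org/gedcomx/"
DCTERMS = "http://purl.org/dc/terms/"
TAG_MAP = {
    "person": (GX, "person"),
    "source": (GX, "sourceDescription"),
    "title": (DCTERMS, "title"),
}

def decorate(element):
    """Return a copy of the tree with each simple tag replaced by its qualified form."""
    namespace, tag = TAG_MAP[element.tag]
    qualified = ET.Element(f"{{{namespace}}}{tag}", element.attrib)
    qualified.text = element.text
    qualified.tail = element.tail
    for child in element:
        qualified.append(decorate(child))
    return qualified

simple = ET.fromstring("<source id='s1'><title>1900 census</title></source>")
full = decorate(simple)
print(ET.tostring(full, encoding="unicode"))
```

Note that the simple tag `source` maps to a differently named qualified tag, which is how the scheme sidesteps name clashes; the reverse conversion would use the inverted table.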

Yes I prefer JSON or rather something custom for genealogy over XML, but I am not greedy about that! If the model is right, the external file format doesn't really matter, and iconoclasts can easily write conversion programs if they feel they must.

stoicflame commented 12 years ago

> What I mean is that the tag used in the simple, empty namespace format, does not have to be the same tag it is in its external namespace.

Yeah, I read that but I didn't fully get it. The only thing that comes up in my head when you say "external namespace" is that you're proposing two distinct XML serialization formats. Is that accurate?

If so, then my response is why bother? If you hate namespaces and other XML noise, why not just use the JSON serialization format?

ttwetmore commented 12 years ago

Ryan,

Well, I guess I am suggesting two XML serialization formats. But one is an order and a half of magnitude smaller than the other and can be read by mere mortals.

I may be old fashioned, going along with my age, but here are two things important to me:

I want the external format to be as small and as easy to read as possible. You can certainly make arguments that these are silly goals, but they are important to me.

[Quick aside: the database in my LifeLines program is a custom B-tree in which records are pure GEDCOM. I don't enforce any GEDCOM semantic standards other than basic lineage-linking tags, so users are free to extend GEDCOM for any particular purposes. They can add new record types, new substructures in old record types, new tags, and so on. For example it has been used in medical and genetic applications. LifeLines users edit their data directly in GEDCOM format. People who have heard me say that, and have never used the program, call me and it crazy, and say that it would never work, even that such a database is impossible. Those people who use the program swear by the ultimate flexibility they have by doing this.]

Okay, you can call me crazy if I were to say that a simple program that allows users to browse and edit, including adding and deleting GEDCOMX files, would be very handy. But it would. It would be slightly easier to write that program if the simpler XML format were used. However, anyone who provided such a program (and such a program really should be written during the development of GEDCOMX just as a testing and experimental tool) could of course do the complex to simple translations, letting users see and modify the simpler form, and translate back to obfuscated form in the database.

The XML to JSON issue does not make this go away, as the JSON format can also have all the URI and RDF complications, or can also be simplified down.

jralls commented 12 years ago

Boy, a bunch of good arguments here.

First off, there are two ways RDF is used at present in GedcomX:

1. As a simple linking mechanism between XML elements (for example, a sourcereference element contains an RDF URI to identify the source description that the reference is pointing to). XLinks are simpler, and the use of RDF raises the temptation to expand into the other use of RDF.

2. As a description format which can be extensible on-the-fly and isn't necessarily localized. For this to work, the RDF URIs need to be resolvable, and they aren't necessarily local. This is the part that I think makes some sense in a web-services environment, though as I guess FamilySearch has discovered, it can have scalability issues if the extensibility part isn't kept under control. This isn't really an area that I have much experience with, so I'm not really comfortable expanding on the argument, but I'll observe that the most successful use of RDF, RSS, severely limits the depth of RSS graphs.

My concern with external links is that people put Gedcom files up on services where they exist, unmaintained, for years. Any external links in a file stored that way are likely to become stale, which has the potential of breaking the file. It might be appropriate in that case to have separate specifications for files and for web services.

jralls commented 12 years ago

Yeah, Gedcom uses what's often called the INI format. It also has a reputation for scalability issues, though Gedcom has successfully encoded some pretty large datasets. XML has proven scalability to several orders of magnitude larger documents (into the gigabyte range according to some folks on the BDXML Forum). It also has an extremely rich set of tools for data description, validation, manipulation, and presentation which are not available for INI files. ISTM the case for using XML is rather compelling.

ttwetmore commented 12 years ago

The INI format is used for configuration files on Microsoft systems. The syntax is quite different from that of GEDCOM. In the 24 years since I wrote my first GEDCOM parser I've never heard GEDCOM referred to as an INI file.

There is very little difference between GEDCOM and XML except for syntactic sugar. It is easy to convert back and forth between the two. They are both scalable, and since GEDCOM is more stingy on the character count, one would have to say that GEDCOM is more scalable than XML, though it's a moot point.
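As a rough illustration of how mechanical the GEDCOM-to-XML direction is, here is a minimal sketch. It assumes the usual "level, optional @xref@, tag, optional value" line shape, uses a made-up `gedcom` wrapper element, and deliberately ignores CONT/CONC continuation lines and character-set handling.

```python
import xml.etree.ElementTree as ET

def gedcom_to_xml(text):
    """Convert GEDCOM lines ("level [@xref@] TAG [value]") into nested XML elements.

    A minimal sketch: it ignores CONT/CONC continuation handling and character sets.
    """
    root = ET.Element("gedcom")  # hypothetical wrapper element
    stack = [root]               # stack[level] is the parent for that level
    for line in text.splitlines():
        if not line.strip():
            continue
        parts = line.strip().split(" ", 2)
        level = int(parts[0])
        attrib = {}
        if parts[1].startswith("@"):
            # Record line such as "0 @I1@ INDI": the xref id precedes the tag.
            attrib["id"] = parts[1].strip("@")
            rest = parts[2].split(" ", 1)
            tag, value = rest[0], (rest[1] if len(rest) > 1 else None)
        else:
            tag, value = parts[1], (parts[2] if len(parts) > 2 else None)
        del stack[level + 1:]
        element = ET.SubElement(stack[level], tag, attrib)
        element.text = value
        stack.append(element)
    return root

sample = "0 @I1@ INDI\n1 NAME John /Smith/\n1 BIRT\n2 DATE 1 JAN 1900\n0 TRLR"
print(ET.tostring(gedcom_to_xml(sample), encoding="unicode"))
```

The level numbers become nesting depth and the tags become element names; going the other way is a depth-first walk that prints the depth in front of each tag.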

People point out that the big advantage of XML is standard parsers on all platforms. What they fail to mention is that the DOM-based XML parsers, which are the only turn-key parsers that don't require customization, are exorbitant memory hogs that severely limit the sizes of the files that can be processed. Any XML project that has to deal with large files must be implemented by a parser either based on SAX interfaces, which requires quite a bit of custom code, or even a completely customized parser, which takes away all the advantages claimed for XML over ad hoc parsers.
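For a sense of how much custom code the SAX route involves, here is a minimal streaming sketch using Python's standard `xml.sax` module. The `person` element name and the sample document are hypothetical; the handler counts elements without ever holding the whole tree in memory.

```python
import xml.sax

class PersonCounter(xml.sax.ContentHandler):
    """Count <person> elements without building an in-memory tree."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per opening tag as the parser streams through the input.
        if name == "person":  # hypothetical element name
            self.count += 1

def count_persons(xml_bytes):
    """Stream the document through the handler and return the person count."""
    handler = PersonCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.count

document = b"<tree><person id='p1'/><person id='p2'/><family/></tree>"
print(count_persons(document))
```

Counting is the easy case; rebuilding nested records from start and end events is where the extra custom code mentioned above comes in.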

In the current specification of the jar file for GEDCOMX archival, however, there would be no problem using DOM-based parsers. But this is only because GEDCOMX has taken the peculiar step of making each conclusion-level entity, that is, each person and relationship, its own separate file, which, as has been pointed out a few times, brings along a number of problems.

There is so much momentum and downright near religious fervor behind XML, that there is no chance that it won't be the basis for the GEDCOMX archival format. However, its advantages are not overwhelming and it does have disadvantages. It is my hope that by using an XML format with the single empty namespace, one can minimize the disadvantages.

EssyGreen commented 12 years ago

I'm wholeheartedly with @ttwetmore on this:

> The GEDCOMX archival format can be a very simple tag-based format, with no namespaces and with no RDF or URIs. All that is needed is a very clear specification of what each element and attribute mean (thinking in XML terms).

++1!

> I want the external format to be as small and as easy to read as possible.

++1!

I believe the simplicity of the old GEDCOM 5 standard was key to its success. To move away from this will spell its demise. I'm all in favour of XML but keep it simple so that it's easy to understand and survives the passage of time regardless of the passing fads of technology.

jralls commented 12 years ago

So I just dug into RDF a little bit more. The GedcomX conceptual model, para 3.1, says:

> The Description data type defines a description of a resource. The Description is defined by the Resource Description Framework (RDF), but its definition is included here for convenience.
>
> identifier
>
> The identifier for the "Description" data type is:
>
> http://www.w3.org/1999/02/22-rdf-syntax-ns#Description

"Description" is the middle word in "Resource Description Framework", and is defined by the entire body of current RDF recommendations at http://www.w3.org/standards/techs/rdf. rdf:Description. Calling it a "data type" is I think a misnomer.

Do you really mean to include by reference all of RDF?

lkessler commented 12 years ago

Essy said: "I believe the simplicity of the old GEDCOM 5 standard was key to its success. To move away from this will spell its demise. I'm all in favour of XML but keep it simple so that it's easy to understand and survives the passage of time regardless of the passing fads of technology."

I totally agree.

Louis

alex-anders commented 12 years ago

As a non-technical genealogist, I have found it so easy to modify a GEDCOM file within Notepad++ to correct simple grammar, spelling, and similar faults on my part.

I would be disappointed if I was not able to continue some simple way of doing this.

So XML/GEDCOM format would work for me.

Alex

ttwetmore commented 12 years ago

Alex,

As you may know, XML editors exist. I think the only concern you would have is whether the XML archive format is based on the single empty namespace with very simple tags and no long URIs, or on the full XML/RDF/plethora of namespaces. I doubt anyone would want to edit the latter form.

Tom

pipian commented 11 years ago

I strongly favor retaining a namespace using the method the XML format currently uses. If such a namespace is not used, then it becomes impossible to mix GEDCOM X elements with other XML documents as other enclosing elements may already define the empty namespace (e.g. ) such that the GEDCOM X elements become associated with the enclosing namespace rather than with the GEDCOM X definitions.

This furthermore makes it impossible to validate such mixed documents as the enclosing schema may not define (or may appropriate different meaning to) the GEDCOM X elements. For example, would not validate as a valid HTML document because the person element is not defined in the html namespace.
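The namespace-capture effect described above can be reproduced with a short sketch. It is a hypothetical illustration, not the example the comment originally contained: the XHTML namespace URI is real, while the `example.org` URI stands in for a GEDCOM X namespace.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
GX = "http://example.org/gedcomx/"  # placeholder, not the official namespace URI

# Un-prefixed <person> inherits the enclosing default namespace (XHTML here).
unqualified = ET.fromstring(f'<html xmlns="{XHTML}"><body><person/></body></html>')

# Declaring a namespace on <person> keeps it distinct from the enclosing document.
qualified = ET.fromstring(
    f'<html xmlns="{XHTML}"><body><person xmlns="{GX}"/></body></html>'
)

def person_tag(root):
    """Return the fully qualified tag of the element nested inside <body>."""
    return root[0][0].tag

print(person_tag(unqualified))
print(person_tag(qualified))
```

In the first document the parser reports `person` as an XHTML element, which is exactly why a validator for the enclosing schema would reject it.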

Finally, no one should be playing with the internal XML in practice unless they are implementing their own editor application. Most people are going to exchange the files created by their genealogy programs, not write the XML by hand.

ttwetmore commented 11 years ago

Warning: this is a tongue in cheek response with logical smiley faces scattered throughout. I will leave it up to your imagination as to where those smilies belong. Whenever you disagree strongly with something I say here, imagine a nice smiley at that point smiling up at you.

Genealogical data is not complicated enough that it needs to mix namespaces from different application areas.

I know the argument. Let's get our source metadata from over there. Let's get our family history events and tags from over that other place. Let's get our date and place standards from over in that other direction. Let's add on some of the biographical vocabulary defined in that other spot where the family history vocabulary from the first place doesn't seem to cover everything. Let's use RDF strings to define our subjects, objects and predicates that are so darned long and so darned hard to understand for the mere mortals out there, of which I count myself one, we who hob nob (of whom I don't consider myself) must all seem like geniuses to be able to talk about them.

Yeah, this is the modern thing to do. And boy is it ugly and unnecessary and obfuscating as all heck, and a near impossible barrier for developers of genealogical applications to grasp and deal with.

And yeah I know this makes me a weird luddite because it seems I am refusing to acknowledge the ability to reuse XML parsers, and predefined schemas and the beauty and logic of RDF triples. Which in my weird upside down luddite world I see as nothing more than a bunch of overly restraining straitjackets that I wish to have nothing to do with.

Give me a simple set of event types, a simple set of fact types, a set of relation types, a simple obvious model for names, places, dates, persons, events, and sources, and let me write good, solid, journeyman software. In other words let's tune up GEDCOM and get to work.

Fortunately for you, I am in the teeny tiny minority of people who think this way so it would be best for you to just ignore my whinings from the peanut gallery.

EssyGreen commented 11 years ago

@pipian - I think the underlying question is whether GEDCOM is intended to encapsulate all data which a user/app might want/use or only the core. Personally, I don't see that the all-data approach has any benefits since there is no guarantee that a "foreign" application trying to import the data will know what to do with anything that is not in the standard. If particular apps want to share data it will be more accurate for them to develop their own interface which ensures they can understand what is being transferred and accommodate it accordingly. This interface may or may not be GEDCOM based. If we don't use XML namespaces then a non-GEDCOM interface will be pretty much essential. Is that a problem? Personally I don't think so. Having said that, the XML namespaces are the least of my worries - it's the RDF and Dublin core which I believe are the main culprits - coupled with an overly weighty and complex model in places.

@ttwetmore - I'm also in the teeny tiny minority :)

pipian commented 11 years ago

@ttwetmore I would argue that mixing namespaces is actually not simply possible, but in fact likely. Developers may wish to offer rich text editing for comments. This rich text is often accessible in HTML format via various APIs. As a result, it may be useful to have HTML used in tandem with the genealogical data itself.

That said, I can see your general complaint that it is (at least in principle) unlikely for many other namespaces to be used. To a certain extent, the plethora of namespaces is due to the acceptance of RDF, where dealing with a dozen namespaces in a document is par for the course, since a single document may use properties defined in a dozen different specialized vocabularies. On the other hand, as I mentioned elsewhere, such complexity is generally hidden from the end user as well...

The ultimate idea of RDF (in my opinion) is that it allows for specialists to define their own semantic meaning to relationships. No one is beholden to a particular semantic meaning to the term wedding. If I want to differentiate a common-law wedding from a religious wedding, but existing terminology is incapable of expressing such a difference, I have the power to simply coin a vocabulary to extend the meaning of wedding by creating a common-law wedding property and a religious wedding property. Provided that there's a standardized mechanism for digging up the semantic meaning to my specialized wedding properties (i.e. I follow an established policy and make my properties available in a publicly-accessible RDF schema file) then it's easy to "entail" (i.e. to deduce the presence of) the basic wedding property in an application that does not understand my specialized properties. (Of course this has its own downsides; applications then have to implement an RDFS entailment reasoning engine.)
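A toy version of that entailment step, assuming a lookup table in the spirit of rdfs:subPropertyOf, might look like the sketch below. The prefixes and property names are made up for illustration, and a real RDFS reasoner handles far more than this single rule.

```python
# Hypothetical vocabulary: each specialized property names its more general parent,
# in the spirit of rdfs:subPropertyOf. The property names are made up for illustration.
SUB_PROPERTY_OF = {
    "ex:commonLawWedding": "gx:wedding",
    "ex:religiousWedding": "gx:wedding",
}

def entail(triples):
    """Add the triples implied by following sub-property links upward."""
    inferred = set(triples)
    pending = list(triples)
    while pending:
        subject, predicate, obj = pending.pop()
        parent = SUB_PROPERTY_OF.get(predicate)
        if parent is not None:
            implied = (subject, parent, obj)
            if implied not in inferred:
                inferred.add(implied)
                pending.append(implied)
    return inferred

facts = {("person:1", "ex:commonLawWedding", "person:2")}
for triple in sorted(entail(facts)):
    print(triple)
```

An application that only knows `gx:wedding` can then work from the inferred triple and ignore the specialized one.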

An argument could certainly be made that GEDCOM X could define its own versions of the FOAF classes and properties (and simply reference the FOAF versions in the schema using the powerful owl:sameAs relation for the semantic web geeks, provided that everyone agreed that the semantic meaning was identical). That would cut down on the namespace bloat tremendously for most practical purposes, as namespaces would only be necessary for extensions that are not covered in the main GEDCOM X namespace.

The downside to this, of course, is that it takes a turn towards the "kitchen sink" approach. As many detail-oriented users may have noted, GEDCOM 5.5 does not actually support military service dates on its own... These common event types are simply standardized extensions to the language imposed through the dominance of one genealogical application or another. Thinking up every relevant class, event type, participant type, and so on, is a very difficult task to do, so to a certain extent there is a benefit to allowing a group who "specializes" in knowledge about the types of events to organize and define the set of accepted event types, so as to decouple the main data model from ontological nitpicking over whether a christening event is the same as (or different from) a baptism event. Of course this means they need their own namespace to control their terminology without stepping on someone else's toes...

Anyway, with respect to RDF, I have drunk the semantic web kool-aid in the past, and mostly gotten better from it. It's hardly perfect, and it's nice to have a little bit of flexibility for the end users (triples are basically impossible to work with from an evidential standpoint; I've tried time and time again to try to do something about modeling the old Gentech model in RDF, and it's very much a nightmare.)

That said, I think there is something to be said for being able to be parsed by and being able to reduce to an RDF model, especially as the power of depicting events and relationships in the form of an interconnected graph rather than in the context of a single family tree lends to a much more holistic view of genealogy that is missing from the extremely constrained models of genealogy software which still have something of a difficult time expressing evidential sources and the involvement of co-located families in a natural way.

The key, really, is to have a simple conclusion model that cuts the cruft, while allowing for flexibility and extension, especially in the much more free-form world of evidence-based genealogy models.

@EssyGreen I think that it should be POSSIBLE to use a GEDCOM interface for extensions (and thus, this would require XML namespaces), because it means that it is possible to infer certain properties about the extensions. For example, with my wedding properties, if I follow a GEDCOM model, it should at least be able to gracefully degrade and recognize that the "common-law" and "religious" wedding events are events, based on the context, even though it might not know the meaning.

EssyGreen commented 11 years ago

> I think that it should be POSSIBLE to use a GEDCOM interface for extensions

I'm not inherently against it providing that it doesn't add unnecessary complexity ... I agree that it should be possible to understand that something is, say, an event - I don't think anyone is disagreeing with that. The question is rather, are we prepared to have infinite complexity in order to support infinite extensions. XML vs GEDCOM vs whatever is neither the problem nor the solution ... the problem is what will an application do with the excess/non-standard data which it doesn't understand? There are two choices: throw it away or squash it in somewhere (e.g. a generic "Note"). It really doesn't matter whether the data was in XML or GEDCOM 5 or bits of paper.
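The two choices can be sketched as a single import routine with a switch. The element names and the record shape are hypothetical; the point is only that the decision is the importer's, whatever the file syntax.

```python
import xml.etree.ElementTree as ET

def import_person(element, keep_unknown=True):
    """Import known children; drop unknown ones or squash them into a note."""
    record = {"name": None, "notes": []}
    for child in element:
        if child.tag == "name":
            record["name"] = child.text
        elif child.tag == "note":
            record["notes"].append(child.text)
        elif keep_unknown:
            # Preserve the unrecognized element as serialized text in a generic note.
            record["notes"].append(ET.tostring(child, encoding="unicode").strip())
        # Otherwise the unknown element is silently discarded.
    return record

# Hypothetical input with one element the importer does not understand.
source = ET.fromstring(
    "<person><name>Ann Lee</name><shoeSize>7</shoeSize></person>"
)
print(import_person(source))
print(import_person(source, keep_unknown=False))
```

Either way the non-standard data loses its meaning on import, which is the concern raised above.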

thomast73 commented 11 years ago

With the refactoring that was merged with pull request #182, are there any issues that remain here?

jralls commented 11 years ago

Well, there's the huge ToDo for CitationTemplate, but you probably want a new issue for that. #182 otherwise addresses the original concerns, and the discussion about namespaces and such really belongs elsewhere.