FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

GedcomX Documents are 6x larger than equivalent Gedcoms #173

Closed jralls closed 11 years ago

jralls commented 12 years ago

As first noted by Tamura Jones and repeated in #134, GedcomX produces files which, even after compression, are several times larger than the Gedcom they're derived from using the Gedcom5 Conversion Utility. (Jones found the factor to be 8x; I found 6x. The difference is likely due to the different Gedcoms used for the test, and to my using a Mac while he used a Win32 system.)

Some of that excess size is the need to recite all of the namespace URIs in every document. For example:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
<gxc:relationship xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                            xmlns:foaf="http://xmlns.com/foaf/0.1/"
                            xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#"
                            xmlns:ns4="http://purl.org/dc/terms/"
                            xmlns:gx="http://gedcomx.org/" 
                            xmlns:gxc="http://gedcomx.org/conclusion/v1/" 
                            rdf:ID="F58-I147-I165">
    <rdf:type rdf:resource="http://gedcomx.org/ParentChild"/>
    <gxc:person1 rdf:resource="persons/I147"/>
    <gxc:person2 rdf:resource="persons/I165"/>
</gxc:relationship>

That's 545 characters, compared to 40 to describe a 2-person family in Gedcom5:

0 @F01@ FAM
1 HUSB @I001@
1 WIFE @I002@

That's 13.5X before compression.

If the namespace declarations are moved to a DTD called gedcomx.dtd and included in every GedcomX Zip:

<!ELEMENT gxc:relationship ANY>
<!ATTLIST gxc:relationship xmlns:rdf CDATA #FIXED "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<!ATTLIST gxc:relationship xmlns:foaf CDATA #FIXED "http://xmlns.com/foaf/0.1/">
<!ATTLIST gxc:relationship xmlns:ns4 CDATA #FIXED "http://purl.org/dc/terms/">
<!ATTLIST gxc:relationship xmlns:gx CDATA #FIXED "http://gedcomx.org/">
<!ATTLIST gxc:relationship xmlns:gxc CDATA #FIXED "http://gedcomx.org/conclusion/v1/">
<!ATTLIST gxc:relationship rdf:ID CDATA #IMPLIED>

That's obviously only the fragment of the DTD needed to encode the namespaces for the relationship element; the actual DTD would be much larger. Then our relationship element becomes:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE gxc:relationship SYSTEM "gedcomx.dtd">
<gxc:relationship rdf:ID="F58-I147-I165">
    <rdf:type rdf:resource="http://gedcomx.org/ParentChild"/>
    <gxc:person1 rdf:resource="persons/I147"/>
    <gxc:person2 rdf:resource="persons/I165"/>
</gxc:relationship>

Only 323 characters. Better, but still about 8x, and it carries the penalty that we have to use a validating parser. What might a minimal, RDF-free XML representation, where the XML is part of a single document and is written as tersely as possible, give us? Here's an example:

<gxc:relationship gxc:reltype="ParentChild" id="F58-I147-I165" gxc:person1="persons/I147" gxc:person2="persons/I165"/>

118 characters, or 3x. Considering that XML is a lot more verbose than Gedcom5, not unreasonable. It's partly down to the extra length of the identifiers, and to general redundancy: Consider that all of the information in this element is encoded in the id! That's in part because the example I chose doesn't have any SourceReferences, Notes, or an Attribution.

Not a very good Gedcom, but perhaps less contrived than Jones's, which, judging from the names, he generated with his Gedfan program and which apparently contain no actual data.

So to summarize, the increased footprint has three sources: Lots of tiny documents with redundant namespace URIs, RDF, and the general verbosity of XML. That's balanced against using existing standards and the flexibility to reference data outside of the GedcomX file.

Is it the right tradeoff?

ttwetmore commented 12 years ago

Nice introduction to the issue; thanks.

The archive format does not need namespaces; they can be supplied by a schema-like document which does not need to be part of every archive file (nor does your DTD). The external format does not need any prefixes specifying where ids come from. I would further simplify the example to:

<ParentChild id="F58" person1="I147" person2="I165"/>

That's 53 characters, for a lower expansion factor of 1.325X.

Relationships can be named by tags with no penalties.

The ids in this example are minimal. I believe the ids of top-level elements in a GEDCOMX file should be UUIDs, which are 128 bits long, which would add quite a bit to this example. (UUIDs are normally expressed as 36 characters, 32 of them hexadecimal digits with 4 interspersed hyphens. The specific UUID format that I use expresses the 128 bits as 22 alphanumeric characters [for the geeks out there, each 6 bits of the UUID is converted to a character], which is probably the minimum possible non-binary size. The advantage of UUIDs is that every top-level entity created by any GEDCOMX software, anywhere in the universe, at any time between now and the end of the universe, will be unique.)
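
Concretely, here is a sketch in Python of one such 6-bits-per-character packing, using URL-safe Base64 with the padding stripped (the exact alphabet I use differs slightly, but the idea is the same):

# Sketch: pack a UUID's 128 bits into 22 characters, 6 bits per character.
# URL-safe Base64 is assumed; its alphabet is alphanumerics plus '-' and '_'.
import base64
import uuid

def compact_uuid(u: uuid.UUID) -> str:
    # 16 bytes -> 24 Base64 characters, the last 2 of which are '=' padding
    return base64.urlsafe_b64encode(u.bytes).rstrip(b'=').decode('ascii')

u = uuid.uuid4()
print(len(str(u)))           # 36: the usual hex-and-hyphens form
print(len(compact_uuid(u)))  # 22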

jralls commented 12 years ago

The archive format does not need namespaces; they can be supplied by a schema-like document

Actually an XSLT style sheet, right?

Yes, that would work, though it would take careful elaboration to make sure that it won't get confused with more complicated examples.

EssyGreen commented 12 years ago

I like @ttwetmore's example and also support the use of GUIDs

ttwetmore commented 12 years ago

Actually an XSLT style sheet, right?

Hadn't thought of that, but yes, the archive-format to XML/RDF format converter could be implemented as an XSLT program. I would do it with SAX though. It would be simpler and more efficient, and maybe even understandable when done.
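
For instance, a bare-bones SAX converter might look like this sketch in Python (the terse element name and the persons/ prefix are carried over from the examples above, not from any spec):

# Sketch: stream the terse archive form and emit the namespaced XML/RDF form.
# Element and attribute names are illustrative only.
import xml.sax
from xml.sax.saxutils import quoteattr

NS = ('xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
      'xmlns:gxc="http://gedcomx.org/conclusion/v1/"')

class TerseToRdf(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        if name == 'ParentChild':
            print(f'<gxc:relationship {NS} rdf:ID={quoteattr(attrs["id"])}>')
            print('  <rdf:type rdf:resource="http://gedcomx.org/ParentChild"/>')
            print(f'  <gxc:person1 rdf:resource={quoteattr("persons/" + attrs["person1"])}/>')
            print(f'  <gxc:person2 rdf:resource={quoteattr("persons/" + attrs["person2"])}/>')
            print('</gxc:relationship>')

xml.sax.parseString(b'<ParentChild id="F58" person1="I147" person2="I165"/>',
                    TerseToRdf())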

ttwetmore commented 12 years ago

I know I said: <ParentChild id="F58" person1="I147" person2="I165"/>

But this might be better: <ParentChild id="F58" parent="I147" child="I165"/>

And if you were going to do that then why not: <Relation id="F58" parent="I147" child="I165"/>

We're down to 47 characters now, for an expansion factor of 1.175X.

EssyGreen commented 12 years ago

As long as we can also have, for example: <Relation id="123" person1="456" role1="uncle" person2="789" role2="nephew"/>

jralls commented 12 years ago

But this might be better: <ParentChild id="F58" parent="I147" child="I165"/>

And if you were going to do that then why not: <Relation id="F58" parent="I147" child="I165"/>

I don't have any problem with treating Parent-Child and Marital/Conjugal relationships as special cases, since our genealogy databases revolve around parsing those links to construct genealogies & pedigrees.

But Sarah's right that we must also support the general case:

As long as we can also have, for example: <Relation id="123" person1="456" role1="uncle" person2="789" role2="nephew"/>

And non-family relationship roles like "neighbor", "business partner", "attorney", "client", etc.

EssyGreen commented 12 years ago

And non-family relationship roles like "neighbor", "business partner", "attorney", "client", etc.

Absolutely - must be anything which the researcher deems useful.

jralls commented 12 years ago

I wanted a better test of gedcom5-conversion than is possible with the poor-quality Gedcom5 files I have handy (all made by mainstream programs like FTW and TMG, and generally not well sourced), so I tried the GeditCom Torture Test file. After conversion, the GedcomX file is 1/3 smaller than the original Gedcom5 file! Didn't take long to figure out why, either: The converter discards all of the text, both NOTE and TEXT tags. That's actually documented in the README. I found some other tags that aren't documented and wrote an issue for them.

ttwetmore commented 12 years ago

I agree with recent posts about roles.

lkessler commented 12 years ago

Excellent discussion! I love the way the three of you did quick development of a very simple but flexible and powerful structure for a Relation element. I could very well see a new standard, structured in a similar way with similar thinking, that developers would be willing to adopt.

Unfortunately, GEDCOM X and this are miles apart.

Louis

stoicflame commented 12 years ago

Thanks for opening up the issue.

Based on my analysis, I think the problem needs to be tackled on three fronts:

  1. The data blocking strategy, to be addressed at #183
  2. The archive mechanism, to be addressed at #184
  3. The serialization format, to be addressed at #185

An observation: the comments made so far on this thread are all about the serialization format. But if the goal of this issue is to really address the file size, the serialization format is the one (of the three) that has the least potential. Of course, I suspect that the major reason for the discussion is because everybody is all excited about reducing the current XML noise to make it look prettier. But that's kind of a distraction from this thread, don't you think?

jralls commented 12 years ago

the serialization format is the one (of the three) that has the least potential.

Well, we know that now thanks to your fine work.

I suspect that the major reason for the discussion is because everybody is all excited about reducing the current XML noise to make it look prettier.

That might be overstating the case a bit. It was the most obvious source of bloat, and you expressed surprise yourself at the effects of small block (i.e. file) size on the resulting size of the Zip file.

Because of that interaction I'm not sure that separating blocking strategy and archiving mechanism is appropriate.

stoicflame commented 12 years ago

Because of that interaction I'm not sure that separating blocking strategy and archiving mechanism is appropriate.

Fair enough. I almost lumped them together into one issue because they're very much tied together, but it seemed appropriate to separate them because it gives people a chance to talk about the two separately.

pipian commented 12 years ago

In a world where people have hard drives of 120GB and more, space is not an issue. Are users going to be really all that concerned that their genealogical information is going to balloon from 0.0001% of their hard drive to 0.0012%? Why is space so relevant when the size of the images and media attached to the GEDCOM will still dwarf the GEDCOM data by an order of magnitude? What's the real motivating use case here?

ttwetmore commented 12 years ago

Use case: I want to look at my data in its GEDCOM-X format and be able to understand it so I can be confident that it contains what I think it contains. I want the data to be as clear as possible to see and to read with no long URIs and no long obfuscating XML crap that makes it hard to read. To make this realistic the amount of stuff I have to look at should be as little as possible. Definitely no harder to read and understand than GEDCOM.

You will probably tell me that I shouldn't care about looking at the data and therefore this use case should be stricken from the books. And I will tell you that it is important, and that only a fool would trust their data to a (gosh darn it TEXTUAL) format that they couldn't quickly understand by looking at it.

You will probably tell me that there is no way I could understand the pictures inside a MIME-encoded jpeg image, and you're right, but I don't care about that, and I would never go and look at them in internal format.

It's not a size issue. It's an understanding issue. For me. And my brain has severe limitations on understanding complicated things.

EssyGreen commented 12 years ago

@pipian - excellent point! @ttwetmore - I totally agree

Maybe this issue is really about simplicity vs complexity rather than size

pipian commented 12 years ago

I completely agree that XML is not as human-readable as other formats, but part of that is due to its stricter sense of interpretation, trying to make XML less ambiguous and more flexible than GEDCOM.

On the other hand, you are correct in that I disagree with the practicality of that use case. While you may indeed question the data stored in any given textual format without confirming that it actually contains what you think it does, I would contend that the majority of the end users of the format simply would not care about how their data is represented as long as they can load it into their genealogical editor of choice. It could be binary or textual, be serialized in XML or YAML, or even use English or Spanish element names in XML, and they would not care as long as their program could open it.

Now I will grant you that this is a particularly narrow interpretation of the tolerance of end users. There is a large minority of users who switch between different genealogical editors, or who edit in one application (say Gramps) and display their findings using another (for example, phpGedView). For this, these applications require interoperability and thus need to understand a mutually intelligible language. This has traditionally been GEDCOM (5.5 or whatever previous version), and it is this role that GEDCOM X intends to fill (as I perceive it). But what I think should be remembered is that the average end user still does not care about the format (whether GEDCOM 5.5 or GEDCOM X) as long as it is supported by both applications. Thus, at least in principle, I don't see why the average user would actually look at the textual representation.

What's more critical then is making sure that the design of the specification is one which encourages programmers to add it to their program. What are the reasons why GEDCOM X is better than GEDCOM 5.5? Why would Gramps or other genealogy applications implement GEDCOM X? Why WOULDN'T they implement GEDCOM X? What is the value that is gained from adding GEDCOM X support than GEDCOM 5.5 support does not already offer?

In this way, I would contend that the exact format (XML, JSON, YAML, etc.) is only relevant to application takeup in so far as they are or are not difficult to add parsers for. XML has an advantage in that parsers for XML are ubiquitous, so it is much easier to write code to interpret an XML formatted file than a brand new serialization that is not supported by anything. The end-user probably isn't going to reject an application just because it uses a data format which is XML instead of YAML, so that's less of a concern.
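
For instance, reading the terse relationship element from earlier in this thread takes only a few lines with a stock parser (a Python sketch, using the illustrative attribute names from above):

# Sketch: a stock XML parser reads the terse form with no custom parsing code.
import xml.etree.ElementTree as ET

el = ET.fromstring('<Relation id="F58" parent="I147" child="I165"/>')
print(el.tag, el.get('id'), el.get('parent'), el.get('child'))
# Relation F58 I147 I165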

EssyGreen commented 12 years ago

I would contend that the exact format (XML, JSON, YAML, etc.) is only relevant to application takeup in so far as they are or are not difficult to add parsers for

I totally agree but if we follow that logic to its conclusion we have to ask why we're changing the serialisation format at all. Let's not go there! I actually agree with you re XML but let's keep the model reasonably simple.

Going back to John's original post, my vote is for his last/shortest version.

ttwetmore commented 12 years ago

My prejudices are definitely showing. And I am very glad FS has opened up the GEDCOM-X discussion to outsiders and that they are taking their (in my case, our) opinions seriously. Which means I will be happy with pretty much anything FS comes up with in the end.

The claim that developers will be more likely to use XML because there are standard parsers for it is simply one I cannot let go without a little pushback, as this argument is used INTERMINABLY by the XML aficionados as if it were some holy grail. What in the world did we poor engineers do before XML and its two parsing philosophies (DOM and SAX) existed? It must have been some horrible dark age.

XML is trivial to parse, JSON is trivial to parse, GEDCOM is trivial to parse, and any other regular format that would be chosen for genealogical data is trivial to parse. The languages are so simple and regular that you don't even need to use any type of sophisticated parsing technology.

What percentage of the code in any genealogical application is the import data parser? To get one data point on this question I just went back and analyzed the code breakdown on my ancient LifeLines program:

genealogical library -- 28.8 % (6397 LOC) [GEDCOM parser is 81 LOC in here]
graphical user interface -- 28.5 % (6317)
report interpreter -- 28.2 % (6260)
header files -- 4.8 % (1074)
utility library -- 4.8 % (1065)
database -- 4.8 % (1065) [hand-crafted B-Tree]

Total lines of code -- 22178

The GEDCOM parser is in one of the files in the genealogical library. The function that does the GEDCOM parsing is 81 lines of code, including comments and syntax error handling (and a few blank lines!). That is, the parsing code makes up ONE THIRD OF ONE PERCENT of the total code "mass" of the LifeLines project.

So though I will agree that the use of XML might allow a developer to write one third of one percent less code than they otherwise might have to write, I won't agree that this is a strong argument in favor of XML. I personally wouldn't even agree that it is an argument in favor of XML at all.

Here is the LifeLines GEDCOM parser. This code does no semantic checking, which is done elsewhere, but neither do XML parsers. So this is an apples-to-apples comparison.

/*================================================================
 * buffer_to_line -- Get GEDCOM line from buffer with <= 1 newline
 *==============================================================*/
static BOOLEAN buffer_to_line (p, plev, pxref, ptag, pval, pmsg)
STRING p;
INT *plev;
STRING *pxref, *ptag, *pval, *pmsg;
{
        static char zero = 0;
        INT lev;
        extern INT lineno;
        STRING p0 = p;
        static char scratch[MAXLINELEN+40];
        *pmsg = 0;
        *pxref = *pval = &zero;
        if (!p || *p == 0) {
                sprintf(scratch, reremp, lineno);
                *pmsg = scratch;
                return ERROR;
        }
        striptrail(p);
        if (strlen(p) > MAXLINELEN) {
                sprintf(scratch, rerlng, lineno);
                *pmsg = scratch;
                return ERROR;
        }

/* Get level number */
        while (iswhite(*p)) p++;
        if (chartype(*p) != DIGIT) {
                sprintf(scratch, rernlv, lineno);
                *pmsg = scratch;
                return ERROR;
        }
        lev = *p++ - '0';
        while (chartype(*p) == DIGIT)
                lev = lev*10 + *p++ - '0';
        *plev = lev;

/* Get cross reference, if there */
        while (iswhite(*p)) p++;
        if (*p == 0) {
                sprintf(scratch, rerinc, lineno);
                *pmsg = scratch;
                return ERROR;
        }
        if (*p != '@') goto gettag;
        *pxref = p++;
        if (*p == '@') {
                sprintf(scratch, rerbln, lineno);
                *pmsg = scratch;
                return ERROR;
        }
        while (*p != '@') p++;
        p++;
        if (*p == 0) {
                sprintf(scratch, rerinc, lineno);
                *pmsg = scratch;
                return ERROR;
        }
        if (!iswhite(*p)) {
                sprintf(scratch, rernwt, lineno);
                *pmsg = scratch;
                return ERROR;
        }
        *p++ = 0;

/* Get tag field */
gettag:
        while (iswhite(*p)) p++;
        if ((INT) *p == 0) {
                sprintf(scratch, rerinc, lineno);
                *pmsg = scratch;
                return ERROR;
        }
        *ptag = p++;
        while (!iswhite(*p) && *p != 0) p++;
        if (*p == 0) return OKAY;
        *p++ = 0;

/* Get the value field */
        while (iswhite(*p)) p++;
        *pval = p;
        return OKAY;
}

EssyGreen commented 12 years ago

@ttwetmore I don't disagree (except in one glaring omission re the preference for ANSEL) but it does seem an appropriate time to make an "upgrade" to XML - GEDCOM 5 is so nearly there anyway.

ttwetmore commented 12 years ago

@EssyGreen I can live with XML. I have used it many times with both DOM and SAX parsing; SAX can be a lot of fun (no pun intended). I merely wish to cleanse XML of its religious overtones and the frequently insufferable righteousness of its adherents.

(NOBODY uses ANSEL anymore. Thank goodness.)

EssyGreen commented 12 years ago

LOL! Like I said before, I don't disagree :) PS: Nobody uses ANSEL but the spec still says it's preferred. Picky point I know, but it sorta shows how things get out of date very quickly/easily, and they are especially a pain when home-grown.

ttwetmore commented 12 years ago

@EssyGreen You're right about ANSEL. In my code I felt I had to deal with ANSEL in case any of it ever came my way. First and foremost, it is difficult even to get a good definition of it. Second, some of the funny accented and otherwise distorted characters it contains are not defined well enough to know what they really are -- that is, there are some that you can't definitively map to a Unicode equivalent.

So I ended up with a method that takes a raw buffer of bytes and, by various heuristics and tabulations, tries to decide deterministically whether it is ASCII, Unicode, or ANSEL (or a couple of other common pre-Unicode 8-bit character formats). In many cases there is a deterministic answer, but in a few cases you just have to pick one and run with it.
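
The skeleton of that method looks something like this (a rough sketch only; the real decision needs the frequency tabulations mentioned above):

# Sketch: classify a raw byte buffer as ASCII, UTF-16, UTF-8, or legacy
# 8-bit (ANSEL or a code page). A rough illustration, not the real method.
def guess_encoding(buf: bytes) -> str:
    if buf.startswith((b'\xff\xfe', b'\xfe\xff')):
        return 'utf-16'            # byte-order mark present
    if all(b < 0x80 for b in buf):
        return 'ascii'             # pure 7-bit text
    try:
        buf.decode('utf-8')
        return 'utf-8'             # valid multi-byte sequences throughout
    except UnicodeDecodeError:
        # High bytes that are not valid UTF-8: ANSEL, Latin-1, MacRoman, ...
        # This is where the tabulations break the tie; sometimes you just
        # have to pick one and run with it.
        return 'ansel-or-other-8bit'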

lkessler commented 12 years ago

Tom,

I completely agree with your points. In particular, the existence of standard parsers for XML should have no bearing, given the simplicity of the parsing.

There are 500 programs out there now that have GEDCOM parsers built into them. There are a half-dozen open-source GEDCOM parsers in various languages out there. There is no reason why, if GEDCOM X became ubiquitous, programmers would not write their own parsers for it, resulting in customized open-source parsers as well.

In fact, most programmers would want to implement their own efficient parser for GEDCOM X because most of the DOM and SAX parsers are too darned slow.

If the reason why XML is preferred is because then a generalized XML reader could read the GEDCOM X file, then I say that's not a good reason. The data read will have no context and be unable to connect the dots or present the relationships properly. Only a genealogy program that knows what the data means will be able to do that.

I too don't mind an XML solution. But if pipian is worried about application takeup (and he and GEDCOM X should be), then as a programmer, the best way to convince me to start using GEDCOM X is with a simple format that is readable (as Tom says), has as little standardization crap as possible (headers in every file - ugh), avoids repeating closing tags whenever possible by using attributes rather than child elements, and is S-I-M-P-L-E!

Louis

jralls commented 12 years ago

All very well, but no namespaces and no RDF fails a critical use case: FamilySearch's own, which is for a semantic web format for their new Family Tree project. Let's keep that discussion in #185.

ttwetmore commented 12 years ago

@jralls Which also is very well. But this emphasizes the point that GEDCOM-X's real goal might be to be an internal FS model to enable its own applications, rather than a generic format to be used to form the backbone for sharing data between run of the mill genealogical applications.

Personally I don't mind whichever direction FS takes GEDCOM-X. However, if it does not have the goal of becoming the transport and archival format for the next generation of genealogical software systems, we must admit that there is a void that still needs to be filled.

EssyGreen commented 12 years ago

if it does not have the goal of becoming the transport and archival format for the next generation of genealogical software systems, we must admit that there is a void that still needs to be filled.

++1

EssyGreen commented 12 years ago

... or else give up and go home :)

ttwetmore commented 12 years ago

@pipian Thanks for explaining your ideas so well.

Before we accept that GEDCOM-X needs a number of namespaces, with capabilities for extension, it would be useful to understand GX's requirements. If the main goal of GX is to express names, vital events with dates and places, sources and biological relationships, this is an easy task, and a simple, customized vocabulary of tags with no namespace is sufficient. Old GEDCOM is already almost fully capable of this. If GX is to be a vehicle for expressing complex ideas about all aspects of a person's life, say as a way of formalizing a fully researched biographical document, where every nuance of the research, every nuance of the author's thought process, and every nuance of the resulting report must be semantically extractable, then things are clearly open-ended and who knows where things would lead. But such applications don't exist in the genealogical software space, and there doesn't seem to be a clamoring for them. If we had such a model, however, combining and extending namespaces at will, and with RDF vocabularies to drive semantic-based software, maybe the scales would fall from our eyes.

Reading between the lines, I believe that FS has two main goals for GX:

1. Being the model for its on-line pedigrees.
2. Being the model for digitally extracting field-based data from physical records.

Neither of these goals, in my opinion, calls for multiple XML namespaces or for the use of RDF as an explicit concept. Everyone and their grandma invokes Dublin Core for source material in this space. I say grab the obvious terms from there if you like and forget the namespace.

Maybe there are indeed more complex desires for GX. However, I believe that the pressures being put upon FS by their management are making it hard for them to stop, back up, and do a requirements document the justice it really deserves. As most of us would probably do, FS is just winging it wrt GX requirements.

I wish to close by pointing out, though I am sure you are absolutely aware of it, that any structured representation is already in RDF form. For example:

1 BIRT
2 DATE 18 December 1949

where BIRT is the subject, DATE is the predicate, and 18 December 1949 is the object. All these structures can be interpreted as describing things that are particular types of properties of other things. And it's recursive all the way up and down the structures. XML is structured that way; JSON is; and good ole GEDCOM is. We've never had to call this RDF before, but I guess now it becomes important, as a means of demonstrating that we don't have to add RDF to our notations because we already have it. If we specify the subject and predicate tags in the specifications, and we define the spaces that the objects come from, we get all the benefits of RDF without ever whispering its name. No matter what you put into an XML or JSON file, a simple automaton could generate the list of all the semantic triples it contains.
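
As a toy illustration of that last point, here is a sketch that extracts the triples from simple GEDCOM fragments (toy code: no cross-reference ids, and levels must nest one step at a time):

# Sketch: walk GEDCOM level-numbered lines and print (subject, predicate,
# object) triples.
def gedcom_triples(lines):
    stack = []                       # tags of the enclosing levels
    for line in lines:
        parts = line.split(' ', 2)
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else None
        del stack[level:]            # pop back out to this line's parent
        subject = stack[-1] if stack else 'record'
        if value is not None:
            print((subject, tag, value))
        stack.append(tag)

gedcom_triples(['1 BIRT', '2 DATE 18 December 1949'])
# prints ('BIRT', 'DATE', '18 December 1949')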

pipian commented 12 years ago

@ttwetmore I wholeheartedly agree that what's needed here is a firm requirements document. A lot of the whinging on here seems to be without a firm understanding of what is actually needed for the use cases FS has in mind. If we have a requirements document though, it would help to ground the debate and actually focus discussion in more useful debates than offering arguments for and against different serialization formats. Perhaps requesting or establishing such a document is a topic for a new Issue?

In the same vein, FS (or the GEDCOM X team) sorely needs to do more reaching out to developers to get their input as to their use cases, if they really want buy-in from the genealogy software community at large. A new "standard" is hardly going to actually BECOME the standard if it doesn't actually connect with the needs of the real stakeholders (e.g. developers of genealogy software). That might be needed even before talking about requirements documents.

EssyGreen commented 12 years ago

what's needed here is a firm requirements document [...] FS (or the GEDCOM X team) sorely needs to do more reaching out to developers to get their input as to their use cases

+1

stoicflame commented 11 years ago

With the latest decisions at #183, #184, and #185 and the update of the file format specification, the conversion library has been updated and 0.2.0 has been released. (A more formal announcement is pending.)

We believe the latest changes and updates have addressed this issue. A 24 MB GEDCOM file is converted to a 4.2 MB GEDCOM X file.