FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

settle on the file archive format #184

Closed stoicflame closed 11 years ago

stoicflame commented 12 years ago

Analysis of the GEDCOM X file format shows that the ZIP file format is significantly inefficient under certain scenarios. As defined today, the GEDCOM X file format takes a massive hit from those inefficiencies.

This issue is opened to discuss the alternatives and settle on one, weighing ZIP against any other formats proposed in this thread.

stoicflame commented 12 years ago

(Now that I've opened up the issue, I'll take the time to register my personal opinion.)

My current preference leans towards the (gzipped) MIME Multipart format, although I'd be sad to see the random-access feature of the ZIP file go away.

ttwetmore commented 12 years ago

Zip is fine. What is the issue with removing random access?

stoicflame commented 12 years ago

Zip is fine.

Even if it doubles the file size? So you're okay with the overhead?

What is the issue with removing random access?

I mean that if we don't use ZIP, we don't get random access anymore.
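
To make that concrete, here's a minimal sketch of the random-access behavior the zip central directory gives us; the file name and entry path are invented for illustration. With a gzipped tarball or MIME multipart stream you'd have to read sequentially until you reached the part you wanted.

import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class RandomAccessDemo {
    public static void main(String[] args) throws Exception {
        // The central directory at the end of the zip lets us jump straight
        // to one entry without touching the rest of the archive.
        try (ZipFile archive = new ZipFile("tree.gedx")) {          // hypothetical file name
            ZipEntry person = archive.getEntry("persons/I12345");   // hypothetical entry path
            if (person != null) {
                String data = new String(archive.getInputStream(person).readAllBytes(),
                        StandardCharsets.UTF_8);
                System.out.println(data);
            }
        }
    }
}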

ttwetmore commented 12 years ago

My concern is with all the namespaces and URLs that make the GEDCOMX records nearly indecipherable to mere mortals. This is a concern that won't be shared by those who believe that human beings have no business looking at GEDCOMX data, but I remain convinced it is important and that there will always be a need for low-level GEDCOMX editors.

It was the namespaces, URIs and other XML overhead that accounted for the majority of the size expansion going from GEDCOM to GEDCOMX, not zip overhead. I think we can live with an expansion of 2x over GEDCOM.

If the format were JSON I would predict that the expansion factor would be less.

stoicflame commented 12 years ago

@ttwetmore, I think you might be missing some context. Did you read this?

ttwetmore commented 12 years ago

Thanks, Ryan, I hadn't seen that. Nice write up.

A couple of comments. Your point about the efficiency of validating GEDCOM can be qualified a little. The GEDCOM validator I wrote for LifeLines is a two-pass system that can scale up much further than the types of validators you are referring to. Yes, it has to validate the entire object graph, as you so succinctly put it, but it doesn't have to hold the entire contents of the file in memory to do that. It does have to hold some information, of course, so it can't scale up to infinity. My current GEDCOM code can read and validate GEDCOM files that are hundreds of megabytes in size at a rate of over 40 thousand records per second.
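
As a rough sketch of that two-pass idea (my illustration, not the LifeLines code): pass one streams the records and keeps only their ids, pass two streams them again and checks that every reference resolves. The record and reference extraction is stubbed out because it depends on the serialization format.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TwoPassValidator {
    // Hypothetical streaming reader: yields records one at a time without
    // holding the whole file in memory.
    interface RecordSource {
        Iterable<Record> records();
    }

    record Record(String id, List<String> references) {}

    static List<String> validate(RecordSource source) {
        // Pass 1: collect only the ids.
        Set<String> ids = new HashSet<>();
        for (Record r : source.records()) {
            ids.add(r.id());
        }
        // Pass 2: stream again and check that every reference resolves.
        List<String> errors = new ArrayList<>();
        for (Record r : source.records()) {
            for (String ref : r.references()) {
                if (!ids.contains(ref)) {
                    errors.add(r.id() + " refers to missing " + ref);
                }
            }
        }
        return errors;
    }
}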

All that being said, of course, using a random access system like zip doesn't have any scaling problems since you can just reread what you need at any time. Of course, you're going to spend a lot more time in the file system library than I do with my fast GEDCOM library.

What do you feel is the real advantage of the random access feature? What is the downside of placing the entire contents of a GEDCOMX transmission into a single container file? Well, you've just pointed out the validation efficiency issue. Is there anything else? I mean, the only purpose of the file format is to archive and transmit information between applications (isn't it?). Once that data reaches an application, how hard is it to look inside a container file? How bad is the validation issue really? We can all answer that for ourselves. Reading a container file, we would obviously separate the data into a set of database objects and store them away. We could do that without validation and then run a validator on the database representation, using the same algorithms we would use to validate the random-access zip file.

I think the random access feature is fine, and I'd be willing to accept a doubling in archive file size for the feature. But I also don't believe that working around not having random access would be all that onerous. Maybe this is a long-winded way of saying that I don't think it's a very important issue.

Let me once again plug MongoDB. GEDCOMX objects map beautifully to MongoDB "documents". In fact, you could treat MongoDB as simply the persistent backing store that holds pure GEDCOMX objects. You could read a random-access GEDCOMX zip file, or a container-file version, directly into MongoDB, breaking it into objects as the file is read, and then run all your validation at database speeds rather than at file system speeds.
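
As a sketch of what that might look like with the MongoDB Java driver (the database name, the one-JSON-object-per-entry assumption, and the folder-to-collection mapping are all mine, not anything from the spec):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipToMongo {
    public static void main(String[] args) throws Exception {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017");
             ZipFile zip = new ZipFile(args[0])) {
            MongoDatabase db = client.getDatabase("gedcomx");    // database name is illustrative
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                // Skip folders and the manifest; they aren't JSON objects.
                if (entry.isDirectory() || entry.getName().startsWith("META-INF/")) {
                    continue;
                }
                // Assume each entry is one JSON object filed under a type folder,
                // e.g. ".../persons/I12345" -> collection "persons".
                String[] path = entry.getName().split("/");
                String collection = path.length > 1 ? path[path.length - 2] : "misc";
                String json = new String(zip.getInputStream(entry).readAllBytes(),
                        StandardCharsets.UTF_8);
                db.getCollection(collection).insertOne(Document.parse(json));
            }
        }
    }
}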

stoicflame commented 12 years ago

Thanks, @ttwetmore.

So I think I agree with you, hence my vote for moving to a gzipped MIME Multipart archive format.

So if you had to vote...?

nealmcb commented 12 years ago

Here is a bit more data on the way zip overhead stacks up. The zip file examples discussed here have more overhead than necessary, since they include extra-precision timestamp data and unix uid/gid values that we don't need, resulting in an overhead per file of perhaps 128 bytes plus twice the length of the filename. It looks to me like if the extra stuff is left out, the marginal overhead per file is 76 bytes (30 for File Header and 46 for Central Directory entry) plus twice the length of the filename.

To a large degree, even this extra overhead (compared to tar etc.) comes from having metadata like the file name and timestamps duplicated in the Central Directory at the end of the file (where it is useful for random access) and in the File Header before each file. A nice description is https://users.cs.jmu.edu/buchhofp/forensics/formats/pkzip-printable.html

But, note that the Central Directory might be pretty large for an archive with lots of little files in it. What would we expect the average size of a record to be? If it is really small, then it might not be much less space-efficient to read the whole archive in than it would be to read in the whole Central Directory to do random access, making the latter not as helpful as one would like....
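
For reference, the back-of-the-envelope arithmetic for the minimal case, using the entry count from the zipinfo output earlier in the thread and an assumed average name length:

public class ZipOverheadEstimate {
    public static void main(String[] args) {
        long entries = 196_741;   // entry count from the zipinfo output earlier in the thread
        int avgNameLen = 20;      // assumed average entry-name length
        // 30-byte local File Header + 46-byte Central Directory entry, with the
        // name stored in both places; this is a floor, ignoring any extra fields.
        long perEntry = 30 + 46 + 2L * avgNameLen;
        System.out.printf("~%.1f MB of minimum ZIP structure for %d entries%n",
                entries * perEntry / 1_000_000.0, entries);
    }
}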

ttwetmore commented 12 years ago

I would vote for JSON-based objects in a zip/jar format with each externally and uniquely identifiable object as a separate file in the zip.
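
Roughly what that would look like on the wire, as a sketch; the entry paths and JSON content are invented for illustration:

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class JsonPerObjectWriter {
    public static void main(String[] args) throws Exception {
        // Hypothetical person objects: id -> JSON document.
        Map<String, String> persons = Map.of(
                "I12345", "{\"names\":[{\"value\":\"John Doe\"}]}",
                "I67890", "{\"names\":[{\"value\":\"Jane Doe\"}]}");
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream("tree.gedx"))) {
            for (Map.Entry<String, String> p : persons.entrySet()) {
                // One zip entry per uniquely identified object, filed under its type.
                out.putNextEntry(new ZipEntry("persons/" + p.getKey()));
                out.write(p.getValue().getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
    }
}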

jralls commented 12 years ago

From the blog post:

But the results were still disappointing given the fact that the original GEDCOM 5.5 file was 15 MB. ... Take note: the file size is 52506493 bytes. The compressed size of the data is 24223883 bytes. The overhead of the ZIP file format more than doubles the size of the data.

Ryan, can you walk through how you calculated the size of the actual GedcomX data? Also, how much of the 15MB did the Gedcom5-GedcomX conversion tool discard, given that it doesn't convert most notes? You'll probably need to extend the conversion tool to round-trip the data to figure that out.

How big is the same GedcomX file if it's encoded in JSON+MIME-multipart?

-rw---- 2.0 fat 17682587 bl defN 12-Jul-02 10:15 META-INF/MANIFEST.MF

Given that ZIP maintains its own catalog, is the manifest getting us anything besides 30% of the file size? What's the compression factor if you take it out?

EssyGreen commented 12 years ago

I'd vote for zip because it's so commonplace and easily recognised ... not the same criteria as you guys, I know, but my money's going where I think it will be durable.

stoicflame commented 12 years ago

Ryan, can you walk through how you calculated the size of the actual GedcomX data?

I just pulled it from the zipinfo:

196741 files, 52227556 bytes uncompressed, 24223883 bytes compressed: 53.6%

Also, how much of the 15MB did the Gedcom5-GedcomX conversion tool discard, given that it doesn't convert most notes? You'll probably need to extend the conversion tool to round-trip the data to figure that out.

Yeah, that's a good point. It would take some more work to include that, but presumably the ZIP overhead would be less significant in that case.

How big is the same GedcomX file if it's encoded in JSON+MIME-multipart?

Again, I don't know for sure. I'd have to write a MIME-multipart writer. My best estimate is that it would be comparable to the gzipped tarball. I.e. close to 10 MB as opposed to 50 MB for the equivalent ZIP file.

Given that ZIP maintains its own catalog, is the manifest getting us anything besides 30% of the file size?

Yes, that's where the self-description metadata sits.

What's the compression factor if you take it out?

Archive:  G675.jar
Zip file size: 53307847 bytes, number of entries: 141
-rw----     1.0 fat        0 b- stor 12-Jul-13 14:49 G675/
-rw----     1.0 fat        0 b- stor 12-Jul-13 14:49 G675/descriptions/
-rw----     2.0 fat       91 bl defN 12-Jul-13 14:27 G675/descriptions/S004200-1033e
-rw----     2.0 fat       90 bl defN 12-Jul-13 14:27 G675/descriptions/S005074-7a85
...
-rw----     2.0 fat      629 bl defN 12-Jul-13 14:27 G675/persons/I17233
-rw----     2.0 fat      216 bl defN 12-Jul-13 14:27 G675/persons/I25131
-rw----     2.0 fat      537 bl defN 12-Jul-13 14:27 G675/persons/I37431
-rw----     2.0 fat      307 bl defN 12-Jul-13 14:27 G675/persons/I08156
196749 files, 34545040 bytes uncompressed, 23056983 bytes compressed:  33.3%

So removing it doesn't buy you much.

jralls commented 12 years ago

Given that ZIP maintains its own catalog, is the manifest getting us anything besides 30% of the file size?

Yes, that's where the self-description metadata sits.

Which consists of 17 MB of

Name: descriptions/SOURCE1-2
Content-Type: application/x-gedcom-conclusion-v1+xml

Compressed to 1.4 MB, admittedly, but still...

196749 files, 34545040 bytes uncompressed, 23056983 bytes compressed: 33.3%

So removing it doesn't buy you much.

Yeah, that's what I was expecting: Most of the compression is from that one file.
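
Incidentally, those per-entry manifest attributes are straightforward to read back with the standard jar APIs; a sketch, assuming the archive name and the Content-Type attribute quoted above:

import java.util.Map;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

public class ReadEntryTypes {
    public static void main(String[] args) throws Exception {
        try (JarFile jar = new JarFile("G675.jar")) {
            Manifest manifest = jar.getManifest();
            // Per-entry attribute blocks from META-INF/MANIFEST.MF, keyed by the "Name:" value.
            for (Map.Entry<String, Attributes> e : manifest.getEntries().entrySet()) {
                System.out.println(e.getKey() + " -> " + e.getValue().getValue("Content-Type"));
            }
        }
    }
}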

nilsbrummond commented 12 years ago

Hi, I'm new here and have just been browsing to see what is going on so far, but...

I don't think this should be a priority 1 issue. I see a number of things you need to settle first.

  1. The logical data model is changing still. Which leads to:
  2. Your sample data format is changing.
  3. Your sample data is too simple and does not seem to include data from the GPS process as far as I can tell.
    • Support for the GPS is the only requirement for this project I have found so far. Did I miss a requirements doc? I could not find any formal one.
  4. You need to determine if random access is a requirement or not.
    • This, I think, falls out of whether GEDCOMX files need to be optimized for applications to run directly out of the archive, or whether they are expected to be imported into an application's native, optimized DB.

The data you're basing this analysis on is invalid, and you don't have the requirements to guide your choices. This issue seems dependent on resolving those first.

ttwetmore commented 12 years ago

@nilsbrummond Good to have more people involved.

The model is the priority 1 issue, but the archive issue is nearly orthogonal. No matter the final model, there will be "top level" objects (objects with unique ids that are referred to by other top-level objects) of relatively few sorts. The requirements as to what the data in those object types means, or how it should be formatted, are nearly orthogonal to the archive issue.

The main issue regarding the archive format seems to be that of allowing random access to the top level objects. Random access is useful in a number of ways:

  1. To allow a validation program or validation front end to efficiently validate the structure of the objects in the archive before taking the step of importing the data into a database. To do this it would be very convenient to have random access using the unique ids as the keys. The issue is one of scaling the validation up to very large archive files (100s of millions of objects) without having the objects pre-inserted in a database to help you out.
  2. To allow an importing filter to selectively choose which data from an archive to actually import (a sketch of this follows below). For this it would be very convenient to have a name index as well as random access via the unique ids.
  3. To allow programs to provide a summary of the contents of archive files by only referring to a few indexes in the archive.

Random access is not a requirement for these features to work, but it is a convenience that makes them easier (in some cases much easier) to implement. As you implied, with the proper indexes and random access, the archive format can be used as its own fairly primitive, but still very effective, database file, which would be very useful for utility programs of many sorts.

The requirements around the archive format are more performance requirements than functional requirements tied to the details of the model.
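
A sketch of the selective-import case in point 2: start from one object, follow its references, and pull only the entries actually needed, using the zip directory as the id index. Reference extraction is stubbed out since it depends on the serialization, and the entry names are assumed to be the object ids.

import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class SelectiveImport {
    // Placeholder: parse the serialized object and return the ids it references.
    static List<String> referencesOf(String serializedObject) {
        return List.of();
    }

    static Set<String> importSubtree(ZipFile archive, String rootEntryName) throws Exception {
        Set<String> imported = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(rootEntryName));
        while (!queue.isEmpty()) {
            String name = queue.poll();
            if (!imported.add(name)) {
                continue;   // already pulled this object
            }
            ZipEntry entry = archive.getEntry(name);   // random access by id/entry name
            if (entry == null) {
                continue;
            }
            String data = new String(archive.getInputStream(entry).readAllBytes(),
                    StandardCharsets.UTF_8);
            queue.addAll(referencesOf(data));          // only follow what we actually need
        }
        return imported;
    }
}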

nealmcb commented 12 years ago

Another option is to allow a bunch of basic objects to be bundled into a single file, at the discretion of the application that produces the archive, and also to allow one object per file to facilitate complete random access. Apps that want to produce tight data files, for which random access is deemed unimportant, can get them, and other apps producing large archives, perhaps full of multimedia for which random access really would be helpful, can do it that way.

ttwetmore commented 12 years ago

@nealmcb There can be different archive formats, and each would have to be fully specified and every GEDCOMX-compliant program would have to be able to import data from every format. It sounds like you are thinking that it might be good to have both a random-access format and a purely serial format.

The decisions to go random or non-random and many files versus one file are independent; that is, all four combinations are possible. Random access is possible in a single file implementation by using an index of byte offsets (well, yeah, that index file would be a second file). And non-random access is obviously possible in a multi-file implementation by not creating an index.

If GEDCOMX goes multi-file (within some container I hope we all assume), the difference between random and non-random access is really just the presence of extra files that provide the index/manifest. One could easily imagine GEDCOMX specifying the format of those extra files, but not making it a requirement that they exist.
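
For illustration, a sketch of the single-file-plus-byte-offset-index combination; the in-memory index here could just as well be serialized out as the extra index file mentioned above.

import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class OffsetIndexedFile {
    public static void main(String[] args) throws Exception {
        Map<String, long[]> index = new HashMap<>();   // id -> {offset, length}

        // Serial write: append each record and remember where it landed.
        try (FileOutputStream out = new FileOutputStream("records.dat")) {
            long offset = 0;
            for (String id : new String[] {"I1", "I2", "I3"}) {
                byte[] record = ("{\"id\":\"" + id + "\"}").getBytes(StandardCharsets.UTF_8);
                index.put(id, new long[] {offset, record.length});
                out.write(record);
                offset += record.length;
            }
        }

        // Random read: seek straight to the one record we want.
        try (RandomAccessFile in = new RandomAccessFile("records.dat", "r")) {
            long[] pos = index.get("I2");
            byte[] buf = new byte[(int) pos[1]];
            in.seek(pos[0]);
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
    }
}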

nealmcb commented 12 years ago

@ttwetmore If we stick to the current idea of a zip file (which still seems advantageous to me), there is no option to not have an index of sorts ("Central Directory") - that is what defines the zip file. My point is just that the archive doesn't have to have a separate file (and duplicated space-consuming metadata) for each individual entry. You just need to allow multiple entries per file. Of course requiring apps to handle that is a bit of an extra burden and not quite as clean, but if file space is a significant issue, this approach would mitigate it a lot.

EssyGreen commented 12 years ago

welcome @nilsbrummond :)

I don't think this should be a priority 1 issue. I see a number of things you need to settle first

++1 Excellent points!

On the subject of random access - I personally don't believe this is a big issue since GEDCOM X is invariably going to be imported and hence the app will have to read all the data anyway (even if only to discard/warn about non-support). Even if the data is being selectively imported, the app will have to know what's in it and what the dependencies are to be able to intelligently present options for the user.

nealmcb commented 12 years ago

We've discussed this before, but let me reiterate that one of the big reasons for wanting random access is when we distribute multimedia along with genealogy files. The pictures and audio recordings and such can dominate the overall size of the archive, but are not needed for all uses of the archive. Some apps will sometimes either want to specifically access a particular multimedia file, or won't support multimedia, and will want to efficiently ignore it.

I'd love to, for example, be able to bundle information about an individual or a family, in a single archival-quality, signed, properly-sourced, complete and consistent way that could be shared easily. A random-access gedcom x archive would be great for that, and some apps might just use gedcom x as their own default format. Such data would be useful for areas outside the traditional genealogy market. If it was a zip format (or other widely-used one), perhaps with some embedded html or other explanatory material, it would be more likely to be adopted for those sorts of purposes.

On a separate note, I wonder if we really want to require a MANIFEST file. How hard would it be to define an "extra" field within the Central Directory or File Header for the mime type?

And conversely, existing apps that use zip for shipping stuff around
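
Coming back to the extra-field question: one way to experiment with it from Java is to hand-build a well-formed extra block (2-byte header id, 2-byte data length, then the data, all little-endian) and attach it to the entry. The header id and entry path below are invented for illustration, and whether readers would honor such a field is exactly the open question.

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class MimeTypeExtraField {
    // Build one extra-field block: 2-byte header id, 2-byte data length, data (little-endian).
    static byte[] extraBlock(int headerId, byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headerId & 0xFF);
        out.write((headerId >> 8) & 0xFF);
        out.write(data.length & 0xFF);
        out.write((data.length >> 8) & 0xFF);
        out.writeBytes(data);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] type = "application/x-gedcom-conclusion-v1+xml".getBytes(StandardCharsets.US_ASCII);
        ZipEntry entry = new ZipEntry("persons/I12345");       // hypothetical entry path
        entry.setExtra(extraBlock(0x4758, type));              // 0x4758 is an invented header id
        try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream("typed.gedx"))) {
            out.putNextEntry(entry);
            out.write("<person/>".getBytes(StandardCharsets.UTF_8));  // placeholder content
            out.closeEntry();
        }
    }
}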

EssyGreen commented 12 years ago

@nealmcb - I see your point of view (and I agree about the neatness of the mini-archive) but it's not a biggy for me - I'm guessing FS will be providing a free reader which will do the sort of thing you're asking so I'm happy to leave it to that app. to do the hard work. I'm not voting against it so much as just not voting for it :)

ttwetmore commented 12 years ago

@nealmcb Thanks for the point about zip files always having a directory. Though I did know this, I had erroneously assumed that random access would require an additional index file. However, if the internal references are actually the file names (or conversely the file names are the unique ids) then the directory file is the index file.

It could also be very useful for there to be a name-based file in the zip, a file that maps names to persons.

However, as you have made clear, you get random access for free just by using zip. Once again, my vote is for zip.

pipian commented 12 years ago

To pitch in something from the RDF side of things, one typical practice when offering linked data (especially ontologies) is to make use of distinct fragment identifiers relative to a single URI path. This (usually) means that multiple resources can be described in a single file, rather than having hundreds or thousands of smaller files (especially when one resource may be linked to many others). Of course this results in the same issue as GEDCOM 5.5, where you now have to parse the entire RDF graph in order to make sense of the links.

The problem with ZIP is (I believe) that it compresses each entry independently of all the others. This means that you actually get the explosion in file size, since the ZIP format cannot share a compression dictionary across entries. Of course, the flip side is that files compressed as a single stream cannot make as good use of random access, since they often don't have indices into the compressed data; TAR doesn't support an index (at least not without doing some jiggery-pokery and adding an extra constraint to the TAR format for our use to ensure that an index file is stored first). MIME multipart has the same problem as TAR: unless you iterate through the entire file, you have no idea what the offsets of the individual parts are.

That said, employing something like zran.c (from the zlib examples) or seek-bzip to create indexes into gzipped or bzipped data might be appropriate if you still want random access. Not sure how expensive that would be, though.
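
As a quick way to see the per-entry compression effect described above: deflate a set of similar records one at a time (as zip does per entry), then deflate the same records as one concatenated stream (as a gzipped tarball effectively does), and compare the totals. The sample record is made up; the gap grows with the number of similar records.

import java.util.zip.Deflater;

public class PerEntryVsStream {
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        int n;
        while ((n = deflater.deflate(buf)) > 0) {
            total += n;
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // A made-up record; real GEDCOM X records would share even more structure.
        String record = "<person xmlns=\"http://gedcomx.org/\"><name>Person</name></person>";
        byte[] recordBytes = record.getBytes();
        int perEntryTotal = 0;
        StringBuilder stream = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            perEntryTotal += deflatedSize(recordBytes);   // compressed independently, as in zip
            stream.append(record);
        }
        int streamTotal = deflatedSize(stream.toString().getBytes());  // one shared stream, as in tar.gz
        System.out.println("per-entry total: " + perEntryTotal + " bytes");
        System.out.println("single stream:   " + streamTotal + " bytes");
    }
}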

joeflint commented 12 years ago

As a .NET programmer it took me a while to realize that I was looking at a JAR file format. That's cool. I personally wouldn't mind if all the files in the persons folder, for example, were combined into one file, but I think my preference is really one of habit. I understand the need to keep multimedia files separate.

I definitely support the use of ZIP over other less common formats.

Also, this format of folders allows for easy extensibility. If there is a need for data (e.g. DNA) not yet in GEDCOM X, it could be placed in extensibility-specific folders. Only those interested in the data would access the folder. If there was enough interest in the data, it could be promoted into the GEDCOM X standard.

I really don't get this obsession with file size. The era of floppy disks and 300 bit/s modems is over. I am much more interested in the functionality I get from an archive or serialization format.

stoicflame commented 11 years ago

In preparation for the pending milestone 1 release of GEDCOM X, we are making the final decisions on the nature of the file format. The file format specification has been updated to reflect our decisions.

For this particular question, ZIP has been ratified as the archive format.