FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

Base file format on something like ZIP rather than MIME #140

Closed by nealmcb 12 years ago

nealmcb commented 12 years ago

I love how gedcomx is generally based on modern formats and api approaches. But I'm struck in GEDCOM X: File Format by the use of the ancient MIME format, which was designed to be compatible with email formats that date back many decades. I'm a fan of MIME for email, but not as a generic container.

MIME has many disadvantages. For example, it:

The availability of tools for MIME compared to

Please move to a format that resolves these issues. The first one that comes to mind is the ZIP (file format) used by Open Document, and also used for Java (JAR), for Android applications in the form of APK (file format), and Microsoft's Office Open XML documents (OPC). The ZIP format is supported by a huge variety of tools and APIs and workflows.

(Corrected to focus on ZIP rather than JAR, as discussed below)

EssyGreen commented 12 years ago

Not my area of expertise so forgive my naivety here but why not just bog standard XML?

jralls commented 12 years ago

Please move to a format that resolves these issues. The first one that comes to mind is the JAR (file format) used by Open Document, and also used for Android applications in the form of APK (file format), and Microsoft's Office Open XML documents. It is based on the ubiquitous ZIP format, for which a huge variety of tools and APIs and workflows is available.

That's silly. Jar and apk are just names attached to zip archives containing regular java programs and Android programs respectively. Neither Open Document nor OpenXML use either, since a) they're not java and b) single documents.

GedcomX uses mime's multipart feature to stitch together a bunch of separate XML documents. I think that's for RDF, which wants a separate document for each reference.

Your third point, about expanding the contents for storage, can be addressed by using any compression utility on the GedcomX file. Not a bad idea, but there are a lot of deeper issues to deal with first.

I suspect what you really want to propose is having a separate file for each GedcomX entity document and then zipping them into an archive, similar to what java and Android do with their jar and apk files. Yes?

What I don't know is whether that would work with RDF. Ryan?

nealmcb commented 12 years ago

@EssyGreen, the main challenge is including photos, audio, and other external file content. I'm sure there are ways to do that in XML, but I dare say they're flawed since they're huge and uncommon. And XML itself is verbose, so compression is a good idea anyway. Following a model like Open Document (ODF) just seems to make sense.

I can appreciate the goal of making files easy to view in a text editor, though XML is already a bit of a stretch there.

@jralls, I called what Open Document uses a JAR based on this quote at http://books.evc-cit.info/odbook/ch01.html:

Although the XML file format is human-readable, it is fairly verbose. To save space, OpenDocument files are stored in JAR (Java Archive) format. A JAR file is a compressed ZIP file that has an additional “manifest” file that lists the contents of the archive. Since all JAR files are also ZIP files, you may use any ZIP file tool to unpack an OpenDocument file and read the XML directly.

In the Open Document 1.1 standard itself it is described in section 17 on "Packages" as a ZIP archive. It has a manifest file, like a JAR file does, but with a different extension and format, so it looks like the book may be wrong, depending on the exact definition of a JAR.

Microsoft's Office Open XML and OpenXPS, along with a host of other formats, also use ZIP, according to the Wikipedia article on Open Packaging Conventions.

I agree that the issue of references to specific parts of the archive is important. The Open Document spec above talks about how to use IRIs (an internationalized generalization of a URI) for internal references, so I'd hope it would work fine for GEDCOM X, but I'm not positive.

EssyGreen commented 12 years ago

@jralls

GedcomX uses mime's multipart feature to stitch together a bunch of separate XML documents. I think that's for RDF, which wants a separate document for each reference.

Aha! Thanks for the illumination! So ... if the constraint is RDF ... that sort of begs the question why are we using RDF?

@nealmcb

the main challenge is including photos, audio, and other external file content

I see your point but that is just the packaging mechanism for uploading (and possibly downloading) ... we could just as easily use a zip file with all media having a GUID filename which can be referenced from within the main XML (data) file.
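
A minimal sketch of that zip-plus-GUID idea, assuming a hypothetical layout in which a single `tree.xml` data file refers to media by its path inside the archive. The element names, the `href` attribute, and the `media/` folder are illustrative assumptions, not anything defined by GEDCOM X.

```python
import io
import uuid
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical layout: one data file plus media stored under GUID names.
photo_bytes = b"\x89PNG placeholder bytes"   # stand-in for a real image
media_name = f"media/{uuid.uuid4()}.png"

root = ET.Element("tree")
person = ET.SubElement(root, "person", id="p1")
ET.SubElement(person, "media", href=media_name)  # reference by archive path

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("tree.xml", ET.tostring(root, encoding="unicode"))
    archive.writestr(media_name, photo_bytes)

# A reader resolves the reference against the same archive.
with zipfile.ZipFile(buffer) as archive:
    parsed = ET.fromstring(archive.read("tree.xml"))
    href = parsed.find("./person/media").get("href")
    recovered = archive.read(href)
```

Because the reference is just a path within the archive, the application never has to unpack the media to a configured folder before it can follow the link.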

jralls commented 12 years ago

the main challenge is including photos, audio, and other external file content. I'm sure there are ways to do that in XML, but I dare say they're flawed since they're huge and uncommon

Neither, actually. You can put in a reference of some sort and bundle up the binary into an archive of some sort (as you point out, zip is widely used) or you can base64-encode the binary data and package it as CDATA in an XML document. Either way will compress to around the same size.
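
The inline option can be sketched roughly as follows. The `media` element and its `encoding` attribute are made up for illustration. The sketch puts the base64 text directly in element content rather than a CDATA section, which is equivalent here because the base64 alphabet contains no characters that XML needs to escape.

```python
import base64
import xml.etree.ElementTree as ET

image_bytes = bytes(range(256))  # stand-in for binary media

# Inline option: base64 text is XML-safe, so it can sit in element content.
media = ET.Element("media", {"encoding": "base64"})
media.text = base64.b64encode(image_bytes).decode("ascii")
document = ET.tostring(media, encoding="unicode")

# Decoding recovers the original bytes exactly.
decoded = base64.b64decode(ET.fromstring(document).text)
encoded_size = len(media.text)   # 4 output chars per 3 input bytes, padded
```

Before compression the encoded text is about a third larger than the raw bytes (344 characters for 256 bytes here).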

I agree that the issue of references to specific parts of the archive is important. The Open Document spec above talks about how to use IRIs (an internationalized generalization of a URI) for internal references, so I'd hope it would work fine for GEDCOM X, but I'm not positive.

There are a zillion ways to do internal references in XML documents, from simply having a unique id attribute to the element and retrieving it with code to XLinks to RDF. RDF is ugly, wordy, and inefficient for internal references, but it supports the RESTful behavior that web applications need.
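
The simplest of those approaches, a unique id attribute resolved by the consuming code, might look like the sketch below. The element names and the `#id` reference style are assumptions chosen for the example, not the GEDCOM X schema.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<tree>
  <person id="p1"><name>Ann</name></person>
  <person id="p2"><name>Bob</name></person>
  <relationship type="parent" from="#p1" to="#p2"/>
</tree>
""")

# Build an id -> element index once, then resolve "#id" references in code.
by_id = {el.get("id"): el for el in doc.iter() if el.get("id")}

def resolve(ref):
    """Return the element a local '#id' reference points at."""
    return by_id[ref.lstrip("#")]

rel = doc.find("relationship")
parent_name = resolve(rel.get("from")).findtext("name")
child_name = resolve(rel.get("to")).findtext("name")
```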

Sarah, I'm speculating that RDF is the reason for breaking the GedcomX document into lots of little XML documents. I don't know for sure. I speculated above about the why of RDF, but I know FamilySearch is very much interested in web apps, far more so than they are in the stand-alone apps that you and I care about. See for example many of @carpentermp's posts.

EssyGreen commented 12 years ago

I understand the web focus and I can understand the commercial preference for packaging everything into one file but this is just the transport mechanism.

I would also just like to add a word regarding copyright and privacy .... personally, the vast majority of my media files would be unsuitable for publication since they would breach copyright or privacy. I believe that GEDCOMX should discourage these breaches rather than encourage wholesale reproduction of sources. I understand that this is more of a business logic point but the focus on packaging all the media files in with the data file is, I believe, a step in the wrong direction.

As a genealogist, when I am taking sources from the web I rarely bulk download media files since I need to inspect each source individually. So the only time when bulk inclusion of files is relevant is when the researcher is publishing their data to the web (whether that be to share with a selected group of contributors or to make it publicly available).

When publishing, great respect must be paid to copyright and privacy. The reason that citation styles (or source meta data if you prefer it that way) are so important is so that:

(a) credit is given where it is due (b) the reader can find the (re)sources quoted without the need to reproduce them all (and hence breach copyright)

If I buy a book which references other books, I don't expect it to come with copies of all the books it references, so why should GEDCOMX focus on doing that?

jralls commented 12 years ago

copyright and privacy

Good points both.

At present GedcomX provides no mechanism for including images and audio, though Ryan has indicated it as a goal elsewhere.

I suspect that one of FamilySearch's motivations for GedcomX is to provide a vehicle for you to download from them the image, the source information, and the interpretation (to use your preferred term ;-) ) of the image's content. I suspect that they also want to be able to make an RDF reference to that bundle to insert into their online tree.

Anyway, back to copyright: If GedcomX does grow the ability to include media inline it will need to tag that media with copyright and licensing metadata.

nealmcb commented 12 years ago

There is some related discussion in #74. See e.g. the references to Message Transmission Optimization Mechanism in https://github.com/FamilySearch/gedcomx/issues/74#issuecomment-2139139 which show more modern and efficient usage of MIME. As a side note it also led me to XML-binary Optimized Packaging for efficient transmission of binary content even if it is in the XML infoset by pulling base64-encoded stuff out of the XML during transit.

@EssyGreen I would be less concerned if this was only for use as a transport mechanism for the online API. But the first purpose defined at GEDCOM X: File Format is using it for file storage. I'd rather see a file storage standard that handled indexing, random access and the like also.

EssyGreen commented 12 years ago

@jralls

I suspect that one of FamilySearch's motivations for GedcomX is to provide a vehicle for you to download from them the image, the source information, and the interpretation

That was my assumption too tho' personally I would throw away their "interpretation" and do my own.

@nealmcb

the first purpose defined at GEDCOM X: File Format is using it for file storage

If you mean it should be used as the data source for all genealogical applications then I believe this is a pipe dream - all application providers will want to model their data to their own USPs and will merely use GEDCOM to import/export. (Since GEDCOMX is, I believe, sponsored by FamilySearch then they may be the one exception). It is not feasible for GEDCOMX to provide a model which will give everyone what they want. It can only provide a standard minimum spec. which (we hope) will be upheld (and for it to be upheld it must be seen as "best practice").

nealmcb commented 12 years ago

@EssyGreen "File storage" is a pretty broad term. As defined now, of course programs wouldn't use this as their native file storage format. Even PAF doesn't use GEDCOM except as an export/import format.

But I'd certainly hope that GEDCOM X would define a file storage format suitable for general-purpose interchange, and that it would be useful for more than just API transport. E.g. it should be suitable for use like GEDCOM is now - when I want to package up the stuff I've put together and send it to my mom. I'd think we wouldn't always have to use some web site operating as an intermediary, which can only be accessed using a program that knows the API.

And it seems clear that it should include multimedia objects also. That seems a logical extension of GEDCOM. Which I'm suggesting leads to wanting a format with an index so it doesn't require scanning the whole file just to find out what's in it.

EssyGreen commented 12 years ago

for general-purpose interchange, and that it would be useful for more than just API transport

Yes agreed

And it seems clear that it should include multimedia objects also

If I wanted to include multimedia I'd just use a zip. What I don't want is a package which auto-includes all my media and hence breaches copyright and privacy (like FTM Sync does now). Admittedly this is an application issue and not a package issue but I can't see the benefits of packaging all the files in a non-standard way when bog standard zip is so widespread.

nealmcb commented 12 years ago

If I wanted to include multimedia I'd just use a zip

The trick is having the references work inside the zip, like they do in Open Document. So the application needs to be aware of the zipping. You don't want people to have to zip and unzip the contents and do some sort of configuration on where media files are expected to be just to get their application (or some other application) to be able to show the sources and photos.

Also note that we already have mechanisms for marking some parts of a tree private, for specifying that some branches or aspects should be included or not, etc. Of course people should have good tools for controlling what goes into a particular archive, and they should be able to deal with aspects of copyright and licensing.

If you just zip the data up yourself, outside your application, you get no help from the software for dealing with those copyright issues or privacy issues. So surely we want to specify how things all hold together and how the users and rights holders get to manage things.

EssyGreen commented 12 years ago

The trick is having the references work inside the zip

Fair enough. I see your point. And I guess it's up to the application to ensure it uses the AccessRights appropriately.

stoicflame commented 12 years ago

Hi.

MIME Multipart has been around for a long time, yes. But it's still used all over the place. HTTP has been around a long time, too. So for the purposes of this discussion, it would be great if we could focus on specific requirements that are(n't) being met by the current proposal.

The only real problem being cited that I could find was a lack of support for "efficient byte-offset indexes and selective reading of the file". Fair enough. How big of a deal is that, really? The intent of the file format is for file-transport purposes, so presumably you'd be processing the file in its entirety anyway. Why do you need selective reading of the file?

As for the lack of an index, that's actually not true. MIME Multipart is designed to support an index, if needed. We'd need to define it, but we can address that in another issue; it's not a limitation of the MIME format.

As for compression, each part of the document can be compressed. There are already defined ways to do that with the "Content-Encoding" header.

For the record, RDF had nothing to do with the selection of MIME Multipart as the initial proposal. Much of the selection process was considering how much effort we're putting into making this standard accommodating for web services. We're defining MIME types, URIs, id references, etc. for the purposes of using these things in web services and it all just fit so perfectly into MIME Multipart. MIME Multipart even defines ways for objects in the file to reference other objects with URIs. Sweet!
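
For readers unfamiliar with the mechanism, here is a rough sketch of parts referencing each other inside a `multipart/related` bundle via `Content-ID` headers and `cid:` links, using Python's standard `email` package. The XML element names, ids, and the `text/xml` part type are illustrative assumptions, not the proposed GEDCOM X layout.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Element names, ids, and the cid: link are illustrative only.
person = MIMEText('<person id="p1"><source href="cid:source1"/></person>', "xml")
person.add_header("Content-ID", "<person1>")
source = MIMEText('<source id="s1">1880 census</source>', "xml")
source.add_header("Content-ID", "<source1>")

bundle = MIMEMultipart("related")
bundle.attach(person)
bundle.attach(source)
text = bundle.as_string()   # plain text, readable in any editor

# Reading it back: index the parts by Content-ID, then follow the cid: link.
parsed = message_from_string(text)
parts = {p["Content-ID"].strip("<>"): p for p in parsed.get_payload()}
target = parts["source1"].get_payload(decode=True).decode("utf-8")
```

Note that the reader has to parse the whole message to build the `parts` lookup, since nothing in the bundle records where each part starts.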

If we went with something else, we'd have to go through the mess of defining those mechanisms ourselves. It certainly can be done, but I must say that it would sure be nice to not have to do it.

And I must say that as a developer, I kinda like the fact that I can read a file with a standard text editor.

jralls commented 12 years ago

If we went with something else, we'd have to go through the mess of defining those mechanisms ourselves. It certainly can be done, but I must say that it would sure be nice to not have to do it.

No, XML has several ways to define links between elements or documents, from simple ID attribute references connected by the consuming application through complex XLinks. RDF itself is an XML linking mechanism.

But it's not really important. It's trivial to strip out the MIME Multipart boundaries and the extra DOCTYPE elements to produce a single XML document for parsing and a trivial XSLT stylesheet can reverse it.

nealmcb commented 12 years ago

@stoicflame, Thanks. I agree that focusing on the requirements is the proper approach. I had already managed to convince myself that MIME now handles binary data just fine, including efficiently compressed data, unlike the original way it was standardized. So we agree about the remaining requirement from what I've written that the current gedcomx file spec doesn't talk about, which I'd summarize as supporting efficient random access to objects in the file.

I don't see why we'd assume that the receiver would necessarily process the whole file. A recipient is often interested in only a particular part of a family tree, and may well want to selectively import data. Having to wade through images, videos, etc which are unrelated to the specified part of the tree would just slow things down.

I think most of the current requirements are handled well by ZIP, as defined in the Open Document and ZIP specs: indexes, internal URLs, multimedia objects, etc.
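
As a rough illustration of the random-access point, the sketch below lists a ZIP archive's contents from its central directory and then reads a single entry, leaving the large one untouched. The file names are invented for the example.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("tree.xml", "<tree/>")
    archive.writestr("media/large-video.bin", b"\0" * 1_000_000)
    archive.writestr("media/portrait.bin", b"portrait bytes")

with zipfile.ZipFile(buffer) as archive:
    # infolist() comes from the central directory at the end of the file;
    # no entry data is decompressed to produce it.
    index = {info.filename: info.file_size for info in archive.infolist()}
    # Only the requested entry is read and decompressed.
    portrait = archive.read("media/portrait.bin")
```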

I'm not sure about how MIME types fit in here. Should the MIME type of a multimedia object be a part of the metadata about that object in the gedcomx data model? That would make sense to me, offhand. In that case, the container format itself wouldn't need to have a separate mechanism for specifying MIME type. If not, while there may be common ways in which that is done for cross-platform ZIP archives, I'm not sure what they are or how robust they are. Using file extensions isn't very robust in my experience.

I haven't run across indexes for MIME - could you point to something along those lines?

stoicflame commented 12 years ago

I think most of the current requirements are handled well by ZIP, as defined in the Open Document and ZIP specs: indexes, internal URLs, multimedia objects, etc.

Fair enough.

I guess what I'd need is a more detailed alternate proposal. Basically, a page to compare to the page for the existing proposal. I don't have enough information to comment on what's better and why.

What do you say? Want to take a stab?

I'm not sure about how MIME types fit in here.

The requirement is that each "entry" in the file be able to describe itself in terms of its MIME type. This is so applications can know how to read it and interpret it. This is a pretty well-established and well-proven pattern which allows for things like versioning, extensibility, and alternate representations of the same resource.

A very cursory look at the zip/jar file format shows that each "entry" contains some bytes called "extra" that might be able to be used for such a thing. But we'd have to describe the specific way that those bytes are used and need to be interpreted.

I haven't run across indexes for MIME - could you point to something along those lines?

I'd need to dig a bit. I'll try to find it for you.

EssyGreen commented 12 years ago

Can I just ask if the intention is for the GEDCOMX file to embed original media (e.g. scanned image) or to cross-reference them?

stoicflame commented 12 years ago

Can I just ask if the intention is for the GEDCOMX file to embed original media (e.g. scanned image) or to cross-reference them?

There is definitely an intention for the file to be able to embed original media, yes. Do you have a concern with that?

There is also definitely an intention for the file to "link to" (is that what you mean by "cross reference"?) original media.

EssyGreen commented 12 years ago

What I'm trying to figure out is how the editing of the content will be managed ... for example if I am a publisher of a GEDCOMX file then I will want/need to ensure people don't mess with it and breach copyright or alter the contents. Conversely, if I'm a researcher scanning a certificate then I want to add my own info but the original is under copyright. And again if I am scanning my own family photos then I have copyright/privacy and want to be able to edit at will. Can you advise how this will happen? (Feel free to transfer to a different post if that helps)

stoicflame commented 12 years ago

Hey @nealmcb, I've been warming up to your idea.

It turns out the jar file format does specify a way to define per-entry attributes, including a Content-Type attribute that describes the MIME type of the entry. Cool!
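
A small sketch of what such per-entry attributes look like in a JAR-style manifest, and how an application might read them. The entry names and the `application/x-example-gedcomx+xml` media type are placeholders invented for the example, and the parser deliberately ignores the manifest format's line-continuation rule to stay short.

```python
MANIFEST = """\
Manifest-Version: 1.0

Name: tree.xml
Content-Type: application/x-example-gedcomx+xml

Name: media/portrait.jpg
Content-Type: image/jpeg
"""

def entry_attributes(manifest_text):
    """Map each entry name to its attributes (ignores line continuations)."""
    entries = {}
    sections = manifest_text.strip().split("\n\n")
    for section in sections[1:]:          # sections[0] is the main section
        attrs = dict(line.split(": ", 1) for line in section.splitlines())
        entries[attrs.pop("Name")] = attrs
    return entries

content_types = {
    name: attrs["Content-Type"]
    for name, attrs in entry_attributes(MANIFEST).items()
}
```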

I've started a page to fill in the details for the alternate proposal:

https://github.com/FamilySearch/gedcomx/wiki/File-Format---Alternate-Propsal

I've put in some todo: notes in there that need to be filled in before we can have a viable proposal that we can take to the community and gather opinions on the two proposals.

Can you help me out and fill some of those in?

nealmcb commented 12 years ago

Another requirement to consider is support for signatures and encryption.

People or organizations might want to sign the whole GEDCOM, or individual objects or subsets, to provide a clear and effective statement of the authenticity of an archival document, an official version of a transcription, or even what their official research conclusions are.

They may want to encrypt the whole file, or individual objects, in order to allow delivery or storage of content to be separated from authorization to access all or part of it. I don't have a lot of helpful scenarios for this one yet, which tend to deal with complicated areas like payment and copyright, but I imagine there will be some compelling scenarios.

I've always thought the standard MIME encryption and signature stuff (based on either S/MIME or PGP) were really complicated by the underlying desire for backward-compatibility with ancient constraints and practice. I think it is handled more cleanly by ZIP/JAR, and know it is supported for Open Document and Android packages, but haven't done a detailed comparison.
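
To make the per-object idea concrete, here is a sketch of the digest step that JAR-style signing builds on: each entry gets its own recorded digest, so a single object can be checked without touching the rest. This is only the integrity check, under the assumption that the recorded digests would then be signed as a whole; it performs no actual cryptographic signing, and the entry names are invented.

```python
import base64
import hashlib

entries = {
    "tree.xml": b"<tree/>",
    "media/portrait.jpg": b"portrait bytes",
}

def digest(data):
    """Base64 SHA-256 digest, the style JAR manifests use per entry."""
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")

# What a publisher would record (and then sign as a whole).
recorded = {name: digest(data) for name, data in entries.items()}

def unmodified(name, data):
    """True if the data still matches the digest recorded for that entry."""
    return recorded.get(name) == digest(data)

ok = unmodified("tree.xml", b"<tree/>")
tampered = unmodified("tree.xml", b"<tree><person/></tree>")
```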

@stoicflame Thanks for digging into JAR some more and fleshing something out! Good reference on JAR.

nealmcb commented 12 years ago

Hmm - actually, the latest version of the JAR spec seems to be from Java SE version 7: Java 7 JAR spec. But the differences from the 1.4.2 spec you pointed to all seem related to signatures and verification.

stoicflame commented 12 years ago

Hmm - actually, the latest version of the JAR spec seems to be from Java SE version 7

Fair enough. Let's update the link, then.

EssyGreen commented 12 years ago

Ryan,

Could you answer my question regarding the intention (forget the how if that helps) of how to manage copyright vs ability to edit:

for example if I am a publisher of a GEDCOMX file then I will want/need to ensure people don't mess with it and breach copyright or alter the contents. Conversely, if I'm a researcher scanning a certificate then I want to add my own info but the original is under copyright. And again if I am scanning my own family photos then I have copyright/privacy and want to be able to edit at will. Can you advise how this will happen? (Feel free to transfer to a different post if that helps)

nealmcb commented 12 years ago

@stoicflame I updated the link and made another addition on the alternate proposal wiki page. When I get a chance I'll dig into trying to answer some of the questions there.

@EssyGreen It seems to me that from a file format standpoint, we have requirements for both approaches: links to external multimedia objects and embedding them in the file itself. There should also be requirements on web services and applications to allow both approaches, which are of course affected by issues like copyright, but also efficiency, caching, etc. And of course a requirement to be able to express both copyright and licensing metadata, e.g. Creative Commons. I imagine the latter is covered by Dublin Core metadata in the xml.

EssyGreen commented 12 years ago

@nealmcb

It seems to me that from a file format standpoint, we have requirements for both approaches

Yup I got that but I just wondered how the embedded format will work (or is intended to work) given the needs for (a) publishers to restrict editing to preserve their data and copyright vs (b) researchers to control/edit/document their own source media. Or is it intended that the embedded format would only apply to say publishers and researchers would use the linked format?

The reason this is important to me is in the context of interpretations and derivatives ... if I have a file with an embedded image copy which I want to add my own transcription/interpretations to then will I be physically able to do this in the embedded format or will I be forced to create a derivative source which references the original? Similarly, if I receive an embedded file from say Ancestry which has a bad transcription, will I be physically able to correct that within the file or will I be forced to create a derivative and if so how do I separate the image copy which I want from the bad transcription which I want to throw away? These are the sorts of things I'm trying to understand.

nealmcb commented 12 years ago

@EssyGreen Indeed. But I guess I didn't actually say what I initially meant to say. Which is that your question belongs in its own issue, not here in the file format issue, because we've already established the related requirements on the format. Now it is mainly a question of how services and applications work. I think it would be very helpful for you to open a new one to discuss exactly the sorts of practical usage scenarios you raise. They will be applicable to many other design choices here. Documenting a "use case" is another big help at this stage of standardization.

EssyGreen commented 12 years ago

@nealmcb - Yes I sort of agree just wasn't sure it was worth a new post. Have created it as #151 now.

stoicflame commented 12 years ago

With many thanks to @nealmcb, we've adopted his suggestion for basing the GEDCOM X file format on ZIP/Jar.

Personally, I'm really happy with the new direction. I'm convinced that the ZIP-based format is a much better choice than a flat MIME-based format.

Here's the specification.

Here's a blog post mentioning the decision.

Here's a new GEDCOM 5.5 to GEDCOM X file conversion utility.

nealmcb commented 12 years ago

Sweet - glad to hear it has been adopted! The new spec looks good! Thanks for a great discussion.

stoicflame commented 12 years ago

Quick update on this. We're reconsidering this issue after doing some analysis of the initial implementation. Your comments on #183, #184, and/or #185 are welcome.

nealmcb commented 11 years ago

I'm delighted to see gedcomx moving forward, and that Zip was chosen! (see #184 and gedcomx/specifications/file-format-specification). And I'm impressed with the discussion and input from so many perspectives. Thanks.