FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

settle on the granularity of the file format data blocks #183

Closed stoicflame closed 11 years ago

stoicflame commented 12 years ago

Analysis of the GEDCOM X file format shows that the efficiency of the ZIP file format degrades as the entries get smaller and more numerous, because every entry carries its own header overhead and is compressed on its own. As defined today, the GEDCOM X file format specifies a large number of small files, the side effect being that the ZIP container itself almost doubles the size of the file.

So we need to rethink whether we want to decrease the number of entries and increase their size by bundling up the entities together into some kind of data blocking strategy. This issue is opened to discuss that strategy.

On one side of the spectrum is what we have today: each entity (person, relationship, source) is its own entry. This strategy was selected because it gives processors a lot of flexibility in deciding how to divide up the processing. It also lets the self-description mechanism apply at the entity level, so that processors can perform more powerful analysis of the file without parsing any of the entries.

On the other side of the spectrum, everything (except maybe multimedia files) is put into a single file.

There are other strategies in the middle. For example, we could bundle all the persons into a block, all the relationships into another block, etc.
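To make the trade-off concrete, here is a rough sketch that measures the three blocking strategies described above using Python's standard `zipfile` module. The XML-like records, entry names, and counts are made up for illustration and are not taken from the GEDCOM X specification; real numbers will depend on the actual data.

```python
import io
import zipfile


def zip_size(entries):
    """Return the size in bytes of an in-memory ZIP holding `entries` (name -> bytes)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return len(buf.getvalue())


# Hypothetical payload: small XML-like records (not real GEDCOM X markup).
persons = [
    f'<person id="p{i}"><name>Person {i}</name></person>'.encode()
    for i in range(1000)
]
relationships = [
    f'<relationship id="r{i}" person1="#p{i}" person2="#p{i + 1}"/>'.encode()
    for i in range(999)
]

# Strategy 1: one ZIP entry per entity.
per_entity = {f"persons/p{i}.xml": d for i, d in enumerate(persons)}
per_entity.update({f"relationships/r{i}.xml": d for i, d in enumerate(relationships)})

# Strategy 2: one block per entity type.
per_type = {
    "persons.xml": b"\n".join(persons),
    "relationships.xml": b"\n".join(relationships),
}

# Strategy 3: everything in a single entry.
single = {"tree.xml": b"\n".join(persons + relationships)}

raw_size = sum(len(d) for d in persons + relationships)
sizes = {
    "per entity": zip_size(per_entity),
    "per type": zip_size(per_type),
    "single": zip_size(single),
}

print(f"raw payload: {raw_size} bytes")
for label, size in sizes.items():
    print(f"{label}: {size} bytes")
```

With data like this, the per-entity archive comes out larger than the uncompressed payload, because each entry pays for a local header and a central directory record and its few dozen bytes compress poorly in isolation. The bundled variants let the compressor exploit repetition across records.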

stoicflame commented 12 years ago

(Now that I've opened up the issue, I'll take the time to register my personal opinion.)

I like one entry per entity. I want the processing flexibility and the self-description granularity.

I guess that shouldn't be too much of a surprise to anyone :-).

EssyGreen commented 12 years ago

+1

ttwetmore commented 12 years ago

I agree that one file per entity is okay, as long as we can get rid of all the namespaces and other boilerplate.

stoicflame commented 11 years ago

In preparation for the pending milestone 1 release of GEDCOM X, we are making the final decisions on the nature of the file format. The file format specification has been updated to reflect our decisions.

For this particular question, the decision was made to give implementations the flexibility to decide how big they want their data blocks. Blocks can be as small or as large as an implementation wants, because the format provides a mechanism for both "same-document" references and "relative path" references. The default implementation will encourage large-grained data blocks. Implementations that choose smaller data blocks won't get as much compression optimization.
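As a rough illustration of how the two reference styles could coexist, the sketch below resolves a reference found inside one entry to an entry path plus a fragment id. The function name `resolve_reference`, the entry names, and the resolution rules (a bare `#id` means same document, anything else is a path relative to the referring entry) are assumptions made for this example, not the behavior defined by the file format specification.

```python
import posixpath
from urllib.parse import urldefrag


def resolve_reference(current_entry, ref):
    """Return (entry_path, fragment_id) for `ref` as seen from `current_entry`.

    Assumed rules, for illustration only:
    - "#id" points at an element inside the same entry
    - "some/path.xml#id" is resolved relative to the referring entry's directory
    """
    path, fragment = urldefrag(ref)
    if not path:
        # Same-document reference: stay in the current entry.
        return current_entry, fragment
    base_dir = posixpath.dirname(current_entry)
    return posixpath.normpath(posixpath.join(base_dir, path)), fragment


# Large block: the relationship and the person live in the same entry.
same_doc = resolve_reference("persons/block1.xml", "#p1")

# Small blocks: the referenced source lives in a different entry.
relative = resolve_reference("persons/block1.xml", "../sources/s1.xml#s1")

print(same_doc)
print(relative)
```

Under these assumptions, an implementation that bundles everything into one block only ever needs the first form, while one that splits entities across entries relies on the second.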