FamilySearch / gedcom5-java

Gedcom parsers
Apache License 2.0

Here is a parser that converts GEDCOM files to a de facto object model.

De Facto object model

De Facto means "In fact or in practice; in actual use or existence, regardless of official or legal status."

The parser converts GEDCOM files to an object model that includes all of the information found in the majority of GEDCOMs in the wild. The model includes additional tags over those in the official GEDCOM standard, because they are commonly used in GEDCOM files, and excludes a few tags from the official GEDCOM standard that are never or only rarely used.

Further, most of the information from GEDCOMs that cannot be represented directly in the model is stored as extensions, so it is not lost.

The object model includes classes and attributes for every GEDCOM tag sequence appearing in more than 4% of the 7000 GEDCOMs submitted to WeRelate.org over the past five years, with the exception of four software-specific schema tags: _SCHEMA, _EVENT_DEFN, _PLAC_DEFN, and _EVDEF, generated by Family Tree Maker, Personal Ancestral File, Legacy, and RootsMagic respectively.

Additional information found in the GEDCOMs, such as the schema tags mentioned above, is represented in the model by extending model objects with the ability to store lists of additional tags.
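As a rough illustration of that idea, the sketch below shows a model object that keeps a typed field for a well-known tag and a list for everything else. The class names (ExtraTagsSketch, RawTag, Person), the field names, and the _XYZ tag are hypothetical and chosen for this example only; they are not the project's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these are NOT the project's model classes.
public class ExtraTagsSketch {

    /** A raw GEDCOM tag that the model has no dedicated field for. */
    static final class RawTag {
        final String tag;
        final String value;
        final List<RawTag> children = new ArrayList<>();

        RawTag(String tag, String value) {
            this.tag = tag;
            this.value = value;
        }
    }

    /** A model object: one typed field plus a list of unmodeled tags. */
    static final class Person {
        String name;
        final List<RawTag> extraTags = new ArrayList<>();
    }

    /** Builds a person carrying one custom tag with one child tag. */
    static Person buildExample() {
        Person person = new Person();
        person.name = "John /Smith/";
        RawTag custom = new RawTag("_XYZ", "some vendor-specific value"); // made-up tag
        custom.children.add(new RawTag("NOTE", "nested detail"));
        person.extraTags.add(custom);
        return person;
    }

    public static void main(String[] args) {
        Person person = buildExample();
        System.out.println(person.name + " carries " + person.extraTags.size() + " unmodeled tag(s)");
    }
}
```

Keeping the unmodeled tags as a tree attached to the object they appeared under means an exporter can write them back in place.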

The result is that the object model directly represents all of the information found in nearly 50% of the GEDCOMs. This may not sound like a large percentage, but because the standard has not been updated in over 10 years, nearly everyone adds their own custom tags. So having a relatively simple object model represent all tags found in nearly 50% of GEDCOMs is an accomplishment.

If we also include the additional tags storable on model objects, the model is able to represent all of the information found in roughly 98% of the GEDCOMs submitted to WeRelate.

The object model has the normal classes you'd expect for a GEDCOM-based object model: people, families, source citations, sources, notes, repositories, etc. The purpose of this project is not to propose a new object model, but to expose the object model that is currently used by genealogists and make it easy to work with.

A new proposed object model could use this project to convert existing GEDCOM files to the new model by first converting them to the de facto object model, then transforming the objects into the proposed object model.

For more information about the object model, see the wiki.

Extendible

Developers can add custom extensions to the model. For example, an extension might annotate people with warnings about suspicious dates.
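A minimal sketch of such an extension follows. It assumes a person object that exposes a string-keyed extension map; the class DateWarningSketch, the key "example.warnings", the Person fields, and the 120-year threshold are all assumptions made for illustration, not part of the project's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: NOT the project's extension API.
public class DateWarningSketch {

    /** Made-up key under which warnings are stored on a person. */
    static final String WARNINGS_KEY = "example.warnings";

    /** Simplified person: years only, null when unknown. */
    static final class Person {
        final String name;
        final Integer birthYear;
        final Integer deathYear;
        final Map<String, Object> extensions = new HashMap<>();

        Person(String name, Integer birthYear, Integer deathYear) {
            this.name = name;
            this.birthYear = birthYear;
            this.deathYear = deathYear;
        }
    }

    /** Attaches warnings to the person as an extension and returns them. */
    static List<String> annotate(Person person) {
        List<String> warnings = new ArrayList<>();
        if (person.birthYear != null && person.deathYear != null) {
            int age = person.deathYear - person.birthYear;
            if (age < 0) {
                warnings.add("death before birth");
            } else if (age > 120) { // arbitrary threshold for this example
                warnings.add("age at death over 120");
            }
        }
        if (!warnings.isEmpty()) {
            person.extensions.put(WARNINGS_KEY, warnings);
        }
        return warnings;
    }

    public static void main(String[] args) {
        Person person = new Person("John /Smith/", 1900, 1890);
        System.out.println(person.name + ": " + annotate(person));
    }
}
```

Because the warnings live in the extension map rather than in the core fields, the core model stays unchanged and other tools can ignore the annotation.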

Parsers

The project includes three parsers:

as well as a GEDCOM export tool:

Round-trippable

It is possible to do a round-trip: parse a GEDCOM file into the object model, save it to JSON, read it back from JSON, and export it back to GEDCOM, without any loss of information for the majority (over 94%) of GEDCOM files.

The round-trip capability allows anyone to write programs that read GEDCOM files, do something interesting with them (for example, generate warnings for suspicious dates and let the user correct them), and save the result back as a GEDCOM file. For the vast majority of GEDCOM files, nothing from the original is lost.
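The round-trip property can be illustrated with a toy example. The sketch below is not the project's parser or exporter and skips the JSON step entirely: it assumes well-formed input lines of the form "level [@id@] TAG [value]", reads them into a generic tree, writes the tree back out, and compares the result with the input. All names (RoundTripSketch, Node, parse, export) and the _CUSTOM tag are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of a lossless round trip; NOT the project's parser.
public class RoundTripSketch {

    static final String SAMPLE =
          "0 HEAD\n"
        + "1 CHAR UTF-8\n"
        + "0 @I1@ INDI\n"
        + "1 NAME John /Smith/\n"
        + "1 BIRT\n"
        + "2 DATE 1 JAN 1900\n"
        + "1 _CUSTOM kept as-is\n"
        + "0 TRLR\n";

    static final class Node {
        final String id;    // cross-reference id such as "@I1@", or null
        final String tag;
        final String value; // null when the line has no value
        final List<Node> children = new ArrayList<>();

        Node(String id, String tag, String value) {
            this.id = id;
            this.tag = tag;
            this.value = value;
        }
    }

    /** Parses well-formed GEDCOM lines into a list of level-0 records. */
    static List<Node> parse(String gedcom) {
        List<Node> roots = new ArrayList<>();
        List<Node> open = new ArrayList<>(); // open.get(n) is the current node at level n
        for (String line : gedcom.split("\n")) {
            if (line.isEmpty()) continue;
            String[] parts = line.split(" ", 3);
            int level = Integer.parseInt(parts[0]);
            String id = null;
            String tag;
            String value = null;
            if (parts[1].startsWith("@")) { // "level @id@ TAG [value]"
                id = parts[1];
                String[] rest = parts[2].split(" ", 2);
                tag = rest[0];
                if (rest.length > 1) value = rest[1];
            } else {                        // "level TAG [value]"
                tag = parts[1];
                if (parts.length > 2) value = parts[2];
            }
            Node node = new Node(id, tag, value);
            while (open.size() > level) open.remove(open.size() - 1);
            if (level == 0) roots.add(node);
            else open.get(level - 1).children.add(node);
            open.add(node);
        }
        return roots;
    }

    /** Writes the records back out in the same line format. */
    static String export(List<Node> roots) {
        StringBuilder out = new StringBuilder();
        for (Node root : roots) write(root, 0, out);
        return out.toString();
    }

    private static void write(Node node, int level, StringBuilder out) {
        out.append(level);
        if (node.id != null) out.append(' ').append(node.id);
        out.append(' ').append(node.tag);
        if (node.value != null) out.append(' ').append(node.value);
        out.append('\n');
        for (Node child : node.children) write(child, level + 1, out);
    }

    public static void main(String[] args) {
        String exported = export(parse(SAMPLE));
        System.out.println("round trip lossless: " + exported.equals(SAMPLE));
    }
}
```

The toy tree keeps every tag generically, so the unknown _CUSTOM line survives the trip. A typed object model has to work harder for the same guarantee, which is why storing unmodeled tags alongside the typed fields matters.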

Building

You'll need Maven. Running mvn install creates the jar file.

Tools

The tools can be run using the gedcom.jar archive from the target directory: java -cp target/gedcom.jar org.folg.gedcom.tools.<tool name> <args>

For example: java -cp target/gedcom.jar org.folg.gedcom.tools.Gedcom2Json -i mytree.ged -o mytree.json

Roadmap