frizbog / gedcom4j

Java library for reading/writing genealogy files in GEDCOM format
http://gedcom4j.org

Robust string to date convertor #114

Closed frizbog closed 8 years ago

frizbog commented 8 years ago

It would be helpful to have a utility to take string representations of dates and turn them into java.util.Date values, for doing sorting, age calculations (with varying degrees of precision), etc.

Date strings can be imprecise. For example, "Aug 2016" could be interpreted as 2016-08-01, 2016-08-31, or something in between. "Btw 1946 and 1948" could be interpreted as 1946-01-01, or 1948-01-01, or something in between.

This interpreter of date strings should have some hints for how to resolve imprecise dates - whether to favor earliest, latest, or midpoints.
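A minimal sketch of how such a hint might work, assuming the imprecise date has already been reduced to an inclusive range. The class, enum, and method names here are hypothetical and not part of gedcom4j; `java.time.LocalDate` is used instead of `java.util.Date` only to keep the arithmetic short.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;

/** Hypothetical illustration; not part of the gedcom4j API. */
final class ImpreciseDateSketch {

    /** Hint for collapsing a range of possible dates into a single date. */
    enum Preference { EARLIEST, MIDPOINT, LATEST }

    /** Picks one date from the inclusive range [start, end] according to the hint. */
    static LocalDate resolve(LocalDate start, LocalDate end, Preference pref) {
        switch (pref) {
            case EARLIEST:
                return start;
            case LATEST:
                return end;
            default:
                // MIDPOINT: advance by half the day count, rounded down
                long days = ChronoUnit.DAYS.between(start, end);
                return start.plusDays(days / 2);
        }
    }

    /** Resolves a month-only date such as "Aug 2016" (year 2016, month 8). */
    static LocalDate resolveMonth(int year, int month, Preference pref) {
        YearMonth ym = YearMonth.of(year, month);
        return resolve(ym.atDay(1), ym.atEndOfMonth(), pref);
    }
}
```

With this sketch, `resolveMonth(2016, 8, Preference.MIDPOINT)` lands on 2016-08-16, while the other two hints give the first and last day of the month.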

Formats of dates that should be supported:

Months should allow numbers, three-letter abbreviations, certain four-letter abbreviations, and full spellings. Abbreviations should allow periods to be present or omitted. Date separators may be slashes, periods, hyphens, commas, or whitespace.
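One way the month and separator rules could be sketched is below. The names are hypothetical, and the sketch is more lenient than the text: it accepts any prefix of an English month name that is at least three letters long, which is an assumption standing in for "certain four-letter abbreviations".

```java
import java.time.Month;
import java.util.Locale;

/** Hypothetical illustration; not part of the gedcom4j API. */
final class MonthTokenSketch {

    /** Splits a date string on slashes, periods, hyphens, commas, or whitespace. */
    static String[] splitFields(String text) {
        return text.trim().split("[\\s/.,-]+");
    }

    /** Interprets a month given as a number, an abbreviation, or a full name. */
    static Month parseMonth(String token) {
        String t = token.trim().toUpperCase(Locale.ROOT);
        if (t.endsWith(".")) {
            t = t.substring(0, t.length() - 1); // period after an abbreviation is optional
        }
        if (t.isEmpty()) {
            throw new IllegalArgumentException("Empty month token");
        }
        if (t.chars().allMatch(Character::isDigit)) {
            return Month.of(Integer.parseInt(t)); // rejects values outside 1..12
        }
        if (t.length() >= 3) {
            for (Month m : Month.values()) {
                // Assumption: any prefix of three or more letters is accepted
                if (m.name().startsWith(t)) {
                    return m;
                }
            }
        }
        throw new IllegalArgumentException("Unrecognized month: " + token);
    }
}
```

Because the first three letters of each English month name are unique, the prefix match cannot be ambiguous.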

Dates prefixed with "Abt.", "About", "Appx", or "Approximately" and missing either a day value or a month value should be interpreted as a range of dates, then returned based on the earliest/midpoint/latest hint.
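As a rough sketch of the prefix handling, the following covers only the simplest case: an approximate prefix followed by a bare four-digit year, expanded to the whole year. The class and method names are hypothetical, and month-only or day-only inputs are left out.

```java
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration; not part of the gedcom4j API. */
final class ApproximateDateSketch {

    // Accepts "Abt", "Abt.", "About", "Appx", "Appx.", or "Approximately", then a year
    private static final Pattern APPROX_YEAR = Pattern.compile(
            "^(?:abt\\.?|about|appx\\.?|approximately)\\s+(\\d{4})$",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns {first day, last day} of the year for input such as "Abt. 1850",
     * or null when the text is not an approximate year-only date.
     */
    static LocalDate[] approximateYearRange(String text) {
        Matcher m = APPROX_YEAR.matcher(text.trim());
        if (!m.matches()) {
            return null;
        }
        int year = Integer.parseInt(m.group(1));
        return new LocalDate[] { LocalDate.of(year, 1, 1), LocalDate.of(year, 12, 31) };
    }
}
```

The resulting pair of dates can then be collapsed to a single value using whichever earliest/midpoint/latest hint the caller supplied.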

If two dates are supplied, the dates should again be interpreted as a range, then returned based on the earliest/midpoint/latest hint.

There would be no way to convert an interpreted date back into string form and get the original form.

frizbog commented 8 years ago

There is a lot of material in both the 5.5 and the 5.5.1 spec on the formats of dates. Amazing what you discover when you read. 😄 It's remarkably and unexpectedly specific. In particular, dates in the form mm/dd/yyyy or yyyy-mm-dd or dd.mm.yyyy are not compliant with the spec. Dates should look like 17 JUL 2016.

It's also pretty specific about how to do ranges, approximate dates, etc.

Since the spec is so specific (no pun intended), this will simplify things significantly and I can limit support to spec-compliant date values, at least for a first pass. Later, I could possibly expand/relax things if the need is really there. A quick non-scientific scan of a large number of GEDCOM files seems to indicate that most dates are being written correctly.
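For illustration, a strict parser for the plain `17 JUL 2016` shape might look like the sketch below. It is not the gedcom4j implementation: the class name is hypothetical, it assumes a four-digit year, and it ignores ranges, approximations, and non-Gregorian calendars. The month names are supplied explicitly so parsing does not depend on the JVM's locale data.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; not part of the gedcom4j API. */
final class GedcomStyleDateSketch {

    private static final String[] MONTHS = {
        "JAN", "FEB", "MAR", "APR", "MAY", "JUN",
        "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"
    };

    private static final DateTimeFormatter FORMAT = buildFormat();

    private static DateTimeFormatter buildFormat() {
        // Map month numbers 1..12 to the three-letter month codes
        Map<Long, String> names = new HashMap<>();
        for (int i = 0; i < MONTHS.length; i++) {
            names.put((long) (i + 1), MONTHS[i]);
        }
        return new DateTimeFormatterBuilder()
                .parseCaseInsensitive()
                .appendValue(ChronoField.DAY_OF_MONTH)
                .appendLiteral(' ')
                .appendText(ChronoField.MONTH_OF_YEAR, names)
                .appendLiteral(' ')
                .appendValue(ChronoField.YEAR, 4) // assumption: exactly four digits
                .toFormatter();
    }

    /** Parses "d MMM yyyy"; throws DateTimeParseException for any other shape. */
    static LocalDate parse(String text) {
        return LocalDate.parse(text.trim(), FORMAT);
    }
}
```

Inputs such as `07/17/2016` are rejected with a `DateTimeParseException`, which matches the decision to support only spec-style values on a first pass.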

frizbog commented 8 years ago

Remaining work: French Republican calendar, Hebrew calendar, and support around the English calendar change of 1752 (where the years of dates can have slashes).
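A small sketch of what handling a slashed year could involve, assuming the common double-dating form in which the old-style year is followed by a two-digit new-style suffix (for example `1731/32`). The class and method names are hypothetical, and choosing the new-style year as the result is an assumption about the desired behavior.

```java
/** Hypothetical illustration; not part of the gedcom4j API. */
final class DoubleDatedYearSketch {

    /** Returns the new-style year for "1731/32", or the plain year when there is no slash. */
    static int newStyleYear(String yearField) {
        int slash = yearField.indexOf('/');
        if (slash < 0) {
            return Integer.parseInt(yearField.trim());
        }
        int oldStyle = Integer.parseInt(yearField.substring(0, slash).trim());
        int suffix = Integer.parseInt(yearField.substring(slash + 1).trim());
        // Assumption: the suffix is the last two digits of the new-style year
        int candidate = (oldStyle / 100) * 100 + suffix;
        // Roll into the next century for cases such as "1699/00"
        return candidate < oldStyle ? candidate + 100 : candidate;
    }
}
```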

frizbog commented 8 years ago

Hebrew calendar complete, English Gregorian Calendar change and double-date support complete. Remaining: French Republican Calendar.

frizbog commented 8 years ago

Remaining work: Ranges and Periods for French Republican calendar.

frizbog commented 8 years ago

3.0.1-SNAPSHOT as of 2016-07-21T19:33:44-04:00

frizbog commented 8 years ago

Released in v3.0.1