FamilySearch / gedcomx

An open data model and an open serialization format for exchanging genealogical data.
http://www.gedcomx.org
Apache License 2.0

Dates should not be composed of parts #130

Closed jralls closed 12 years ago

jralls commented 12 years ago

There are two kinds of dates:

EssyGreen commented 12 years ago

If the "Evidence dates" were part of the raw data held in the Record I would totally agree with you. However, the Facts within the Record are actually interpretations of the raw data (rather than the raw data itself), confined within the scope and context of the record (rather than anything in the Conclusion Model). The raw data would be in the "Transcription" (see issue #121).

However, I agree that the (Record Model) Date as a collection of Parts is not good, since there is no indication of how to make sense of the parts as a whole. For example, if I get 2 different Year parts, does this imply a range (from/to) or an approximation (between x and y)? I would prefer a much more structured approach which defines all the possible date phrases, e.g. Calendar, Day1, Month1, Year1, Day2, Month2, Year2, IsApproximate.

Re the Conclusion Dates - since the Record Model Dates are already an interpretation, I think the same approach should be taken with both.

jralls commented 12 years ago

Structured approach: Both approaches are structured. Yours uses more fields. Not having an encyclopedic knowledge of calendars (and not having time at the moment to check my copy of Calendrical Calculations), I'm not certain that all calendars map to the day-month-year format of western ones.

I do agree that Dates also need to accommodate a range and need a multi-value precision flag (before, after, from/to, between, aboutDay, aboutWeek, aboutMonth, aboutSeason, aboutYear).
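As an illustration only, the multi-value precision flag proposed here could be sketched as an enum attached to a date value. The names `Precision` and `QualifiedDate`, the `EXACT` member, and the rule that only from/to and between carry a second date are assumptions made for this sketch; none of them come from the GEDCOM X specification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Precision(Enum):
    """Hypothetical qualifier values, taken from the list in the comment above."""
    EXACT = "exact"  # assumption: added so that an unqualified date has a value
    BEFORE = "before"
    AFTER = "after"
    FROM_TO = "from/to"
    BETWEEN = "between"
    ABOUT_DAY = "aboutDay"
    ABOUT_WEEK = "aboutWeek"
    ABOUT_MONTH = "aboutMonth"
    ABOUT_SEASON = "aboutSeason"
    ABOUT_YEAR = "aboutYear"


# Qualifiers that need two dates to make sense (assumption for this sketch).
_TWO_DATE_QUALIFIERS = {Precision.FROM_TO, Precision.BETWEEN}


@dataclass(frozen=True)
class QualifiedDate:
    """One date string, or two when the qualifier describes a span."""
    start: str
    precision: Precision = Precision.EXACT
    end: Optional[str] = None

    def __post_init__(self) -> None:
        needs_end = self.precision in _TWO_DATE_QUALIFIERS
        if needs_end and self.end is None:
            raise ValueError(f"{self.precision.value} requires an end date")
        if not needs_end and self.end is not None:
            raise ValueError(f"{self.precision.value} takes a single date")
```

With an explicit flag, two year values are no longer ambiguous: `QualifiedDate("1850", Precision.FROM_TO, "1860")` is a range and `QualifiedDate("1850", Precision.BETWEEN, "1860")` is an approximation.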

I intentionally did not specify where in the model "Evidence" and "Conclusion" apply. You're saying that a Record contains conclusions, and I agree. I think that it's important to preserve both the original form and the interpreted form.

Whether or not it is a good idea to shred the evidence contained in a document into multiple objects is a matter for another debate -- and it's a debate that has been going on for all of the 13+ years since the Gentech model was first released. That debate belongs in another issue, since it's rather broader in scope than date representation.

EssyGreen commented 12 years ago

Mine ("Calendar, Day1, Month1, Year1, Day2, Month2, Year2, IsApproximate") was an example only - not an attempt at a definition. You are quite right in that some calendars would not fit the Day/Month/Year format (although I believe that most do) and in my example, the Calendar (e.g. Gregorian, Julian etc) would be the thing which enables the validity of its components to be ascertained. I'm guessing that this is how the FormalValue in GEDCOMX is intended to work but I struggle to make sense of it.

I was not saying that the Record contains conclusions, I was saying that I would prefer there to be just one Date object definition, rather than the two currently defined in GEDCOMX:

stoicflame commented 12 years ago

I believe this issue has been addressed. Please consider the recipes that have been added to the recipe book. I believe using ISO 8601 is using a "calendar-independent number" as you articulate above.

EssyGreen commented 12 years ago

Just to clarify ... "original" means "as interpreted by the author of the gedcom file" as opposed to "as stated in the original source document"? I think this should be clarified somewhere in the spec (if it isn't already) since it could easily be misunderstood and assumed to be the latter.

stoicflame commented 12 years ago

The spec currently describes it as:

The original value of the date as supplied by the contributor.

What if we changed it to:

The original value of the date as interpreted by the contributor.

Would that be clear enough?

EssyGreen commented 12 years ago

No need - your phrase fits your usage - my fear was that people would think it was the original date in the source.

jralls commented 12 years ago

Well, I'm not going to buy ISO8601 just to discuss it, so I'll rely on http://en.wikipedia.org/wiki/ISO_8601. That uses a proleptic Gregorian calendar as its basis. It's not culturally neutral, but it certainly is widely understood and is moreover an international standard, so it's acceptable to me as the required date representation in conclusional data.

However, unless the recipes are authoritative (I sincerely hope that's not planned!), mentioning date representations there isn't sufficient. The statement of standard-header-set-specification.md para 3.3 should be repeated in conceptual-model-specification.md para 5.5 as the required FormalValue date representation.

jralls commented 12 years ago

Thinking a bit more on that, it would be useful to extend ISO8601 to specify a representation for a specific date range (two date strings? A date string followed by a "period" string?) and some precision operators as I proposed in the second comment.

EssyGreen commented 12 years ago

Quite frankly I think this is another area where we are in danger of building in way too much complexity. The only differences between genealogy and any other app are that we need to cater for (a) old dates and (b) vague dates. Neither of these seems to be exemplified in the recipes but I'm assuming (praying) they are still catered for since these seem to me to be the critical factors.

Given GEDCOM X is basically still in a text format why not just keep the good old GEDCOM 5 standard (which included a custom date)?

jralls commented 12 years ago

Constraining dates to a single representation in the interchange format simplifies the exchange. Gedcom5 has no separation between evidence and conclusion -- and neither do most genealogy programs, which usually assume that whatever you put in is proleptic Gregorian unless you double-date for a day in the first 3 months of the year. That's wrong.

We genealogists deal with a variety of dates. If you're worried only about "old" dates, you must have only English ancestors after 1583 and no Muslims, Jews, or Catholics. Not all genealogists have that luxury, and keeping dates straight when dealing with records in numerous calendars can be quite a challenge. Gedcom5 was indeed designed with the same Anglo-centric view that you just expressed, but there's no need to perpetuate the error.

EssyGreen commented 12 years ago

[...] which usually assume that whatever you put in is proleptic Gregorian

That's just ignoring the GEDCOM standard - we can't ever prevent that

If you're worried only about "old" dates, you must have only English ancestors after 1583 and no Muslims, Jews [..]

You missed my point ... normal apps also have to consider these too ... and btw Hebrew was already included in GEDCOM 5

Gedcom5 was indeed designed with the same Anglo-centric view that you just expressed

What's Anglocentric about allowing any custom date format?

My point is if we try to cover everything we'll be paralysed by analysis and never get anything done.

stoicflame commented 12 years ago

Thanks for your comments, guys.

Since there's obviously more work to be done here with dates (at the very least with docs), I'd like to have an open issue to track the work.

I'd be happy to re-open this one, but I'd like to rename the issue since it really isn't about date parts anymore.

Or we can open a new issue.

Thoughts?

EssyGreen commented 12 years ago

New one please :)

jralls commented 12 years ago

Or we can open a new issue.

No real preference.

jralls commented 12 years ago

What's Anglocentric about allowing any custom date format?

Nothing. The Anglocentric bit was "old dates and vague dates".

My point is if we try to cover everything we'll be paralysed by analysis and never get anything done.

If everything is custom then the user is required to interpret every datum at import. That's not likely to be widely adopted.

EssyGreen commented 12 years ago

The Anglocentric bit was "old dates and vague dates".

Surely every culture has old dates and vague dates?

If everything is custom then the user is required to interpret every datum at import.

Only if the target application is hell bent on coding everything up. When/why does the application need to do this? I'm not saying it never does but what are the functions that require coding of dates? Top of my head:

(a) to sort/order facts/events within a Person

(b) to auto-search for potential matches in other on-line sources

It is perfectly possible (and maybe preferable to some users - self included) to have a genealogy application that requires neither of the above, leaving the user to choose (a) and do (b) manually. Yet GEDCOM standard will insist that the app. codes stuff up just for the sake of the other apps out there that insist on taking the power/responsibility away from the user in preference to robotic genealogy.

Yes, I'm being a bit extreme here but do you see my point?

As a user I frequently find myself frustrated by existing applications which attempt to do (a) and (b) for me and don't allow for the many situations where it won't work. A couple of examples ... (a) I know my grandmother's birthday but not the year she was born ... In your situation I suspect I wouldn't be able to enter anything at all for her birth date. (b) The auto-find feature in several apps has a tendency to spit out lots of irrelevant matches such as index entries or my own submissions to other people's trees which detract from the real work of finding quality stuff (which is invariably off-line).

Give me a first class genealogy program with a structure which supports evidence and conclusion based research and a UI which enables easy navigation and I'd be a happy bunny. Give me an app which codes everything up for the sake of it and I become a data-entry slave.

jralls commented 12 years ago

Yes, I see your point. In fact, I half agree with it: A program which uses normalized dates must have normalization routines to deal with user input, so if GedcomX specifies that dates should be encoded as a date string and calendar and approximation (before, after, about, etc.) designators then the receiving program should be able to do its own normalization. Most programs unfortunately normalize on input and trash the original input value, so they'd just output a convenient string and designate it Gregorian.

However, some parsers may be better than others, and approximation and calendar designations likely will be localized. Having a designated string format of YYYY-MM-DD (ISO8601) where just YYYY and YYYY-MM are acceptable greatly simplifies the parsing at import with no real penalty at export except in the unusual case of a program that does no date parsing at all. Similarly specifying enum values for approximation and calendar designators ensures that those values can be exchanged even in the presence of localization and other variation.

Another advantage to a specified set of values and a single format: It allows validation of the file for error detection.
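A minimal sketch of the import-side parsing and validation argued for above, assuming the only accepted shapes are YYYY, YYYY-MM and YYYY-MM-DD. The names `PartialDate` and `parse_partial_iso` are hypothetical, and the sketch covers only this reduced-precision subset, not the whole of ISO 8601.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Year, optionally followed by month, optionally followed by day.
_PARTIAL_ISO = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


class PartialDate(NamedTuple):
    year: int
    month: Optional[int] = None
    day: Optional[int] = None


def parse_partial_iso(text: str) -> PartialDate:
    """Parse YYYY, YYYY-MM or YYYY-MM-DD; raise ValueError for anything else."""
    match = _PARTIAL_ISO.fullmatch(text)
    if match is None:
        raise ValueError(f"not YYYY, YYYY-MM or YYYY-MM-DD: {text!r}")
    year = int(match.group(1))
    month = int(match.group(2)) if match.group(2) else None
    day = int(match.group(3)) if match.group(3) else None
    # Constructing a date rejects impossible values such as month 13 or
    # 30 February; missing parts are filled with 1 for this check only.
    date(year, month or 1, day or 1)
    return PartialDate(year, month, day)
```

Because the format is fixed, a file can be checked mechanically: `parse_partial_iso("1676-03")` succeeds, while a free-text value such as "7 Mar 1676" is rejected rather than guessed at.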

EssyGreen commented 12 years ago

Most programs unfortunately normalize on input and trash the original input value, so they'd just output a convenient string and designate it Gregorian.

Is this because GEDCOM 5 has over-specified the date format I wonder? ie if GEDCOM had always just specified the date as a string I suspect more programs would leave it as such. Maybe, maybe not - but regardless it proves that we cannot enforce a particular standard whatever we may wish.

It allows validation of the file for error detection.

Only in some very specific situations e.g. ensuring the birth is the first event in someone's life (common tho' not necessarily wanted - say I want to record conception events), checking the age at marriage (a bad idea in my opinion since this is very time/culture dependent). I'm sure there are others but I can't think of any that aren't application-specific. If a developer wants to include their own validation then they can have their own special date parser - which could be culture/time dependent if the app. was designed for a particular target audience. That leaves them to do the hard work and us to leave it well alone :)

EssyGreen commented 12 years ago

One of my pet hates is having to convert from Q1/Q2/Q3/Q4 format into something else ... In the UK the quarter of the year is frequently used in sources and it is easier for the user to match these up if they are left well alone. I'm not expecting GEDCOM to come up with a specific format to cater for this - just don't mess with my input :)

GeneJ commented 10 years ago

Hope I'm adding to an old, but not superseded thread.

Reading through the comments above, I find so many good suggestions. Unaware of the status, I am boldly suggesting some functional priorities:

(1) What is the status and/or have you implemented support for quarterly dates? (See EssyGreen's request, just above.)

(2) What is the status and/or have you implemented support for double dates? (Comment follows regarding "hacked" dates.)

(3) What is the status and/or have you implemented support for partial dates, other than YYYY and/or MMM YYYY? In other words, date information such as DD XXX YYYY or DD MMM XXXX (where "X" is used to specify missing information)?

(4) What is the status and/or have you implemented support for what I'll call an "uninterpreted date?" By this I mean a record date such as "35 April 1822"; for many, ecclesiastical dates (not further rendered); "22: 12m: 1675" (not further rendered).

(5) What is the status of expanding the terminology for approximated dates? For example, in addition to "about," "before," etc., qualifiers for "probably 6 June 1875," "perhaps 6 June 1875"--even "probably before 6 June 1875."

There at least seem to be simple solutions for the first few of the functional requirements suggested above.

I hope the content/record providers have requested this support as well, as I find many dates that have been so-to-speak "hacked" into indexed records. More and more I am finding this to be the case with double-dates; entries like "22: 12m: 1675," too.

stoicflame commented 10 years ago

Hi Gene.

What is the status and/or have you implemented support for quarterly dates?

Yes, I think that's an example of an approximate date range.

What is the status and/or have you implemented support for double dates?

Can you elaborate? What's a "double date"? Apologies for my ignorance...

have you implemented support for partial dates?

So for cases like this, there exists an "original" field on the date (just text) where the date can be input as stated on the record. The "formal" date would be either omitted or broadly scoped (e.g. sometime generally between x and y).

support for what I'll call an "uninterpreted date?"

Again, uninterpreted implies that there's text in the original place, but it hasn't been "interpreted" into a "formal" date yet.
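To make the "original" versus "formal" split concrete, here is a hedged sketch that leaves the entered text untouched and fills in a formal range only when it recognises a quarter. The class `RecordedDate`, the function `interpret`, the "Q1 1875" input shape and the representation of the formal value as a pair of `datetime.date` objects are all assumptions for illustration; they are not the GEDCOM X formal date syntax.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple

_QUARTER = re.compile(r"Q([1-4])\s+(\d{4})")


@dataclass
class RecordedDate:
    original: str                               # text exactly as entered
    formal: Optional[Tuple[date, date]] = None  # inclusive range; None = uninterpreted


def interpret(original: str) -> RecordedDate:
    """Attach a formal range to quarter dates; leave everything else uninterpreted."""
    match = _QUARTER.fullmatch(original.strip())
    if match is None:
        return RecordedDate(original)
    quarter, year = int(match.group(1)), int(match.group(2))
    first_month = 3 * (quarter - 1) + 1
    start = date(year, first_month, 1)
    # The quarter ends the day before the next quarter starts.
    if quarter == 4:
        next_start = date(year + 1, 1, 1)
    else:
        next_start = date(year, first_month + 3, 1)
    return RecordedDate(original, (start, next_start - timedelta(days=1)))
```

A value such as "35 April 1822" comes back with `formal` set to `None`, so the text survives the round trip even though nothing could be made of it.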

expanding the terminology for approximated dates?

So my feeling is that this kind of vocabulary belongs on the conclusion that contains the date. So the confidence of the "birth" (or whatever) is low (or medium or high or whatever) and has a date of "about 6 June 1875" (or "before 6 June 1875" or whatever).

I hope the content/record providers have requested this support as well

Well, to tell the truth, they haven't. At least not to my knowledge. If there is demand from industry, I personally haven't perceived it.

GeneJ commented 10 years ago

Most helpful link, Ryan. Thank you. I probably got the message when I read, "Dates MUST be specified using the proleptic Gregorian calendar."

I at least suspect it is the conversion from alternate calendars (or date styles) to the Gregorian that becomes the functional issue. Experts may not have issues with these conversions, but others do, especially those getting started.

Link below is to some transcribed and digitized published early vital records from Salem, Mass. http://ma-vitalrecords.org/MA/Essex/Salem/aBirthsB.shtml

(a) Third entry on the page is a "double date" example. Double dates appear in records or compilations of same about events occurring prior to the conversion from Julian to Gregorian. In the example, dear old Mary was born before the conversion. The clerk who created these records entered her birth as "Mar. 7, 1676-7." That date might be seen also written "Mar. 7, 1676/7." I find a reasonable number of dates like this indexed erroneously, as "07 March 1676," would be in this case.

(b) Fourth entry on the page is an "old style date." Bouncing baby Ruth Babadge was born "21: 1m: 1663." I'm finding a reasonable number of "old style dates" indexed erroneously, as "21 January 1663" would be for this entry.

Other thoughts on this, but having read the associated information (thank you, again), I'll just hope a refinement comes in someday so that the 'puter is doing a little more of the work for us, but still consistently across products and services.

stoicflame commented 10 years ago

I at least suspect it is the conversion from alternate calendars (or date styles) to the Gregorian that becomes the functional issue.

Maybe so.

Experts may not have issues with these conversions, but others do, especially those getting started.

Indeed. So it's up to the experts to make sure conversion happens correctly so the data can be exchanged accurately.

I find a reasonable number of dates like this indexed erroneously, as "07 March 1676," would be in this case.

Fascinating. Thanks for the education.

I'm a bit embarrassed that I have to ask, but oh well: what is the right index for that value?

I'm finding a reasonable number of "old style dates" indexed erroneously, as "21 January 1663" would be for this entry.

And what's the right index for that value?

GeneJ commented 10 years ago

Hi Ryan,

"And what's the right index for that value?"

An old-style date "9 : 11 : 1658" translates directly as "9 January 1658/9". The old-style date "5: 2 : 1662" translates as "5 April 1662".

A reference follows. Knowing the significance of March 25 (aka Lady Day) has more than once led to my own logic error in recording these values, so I devised a little rhyme--"Plus two takes you from old to new." Thus, the entry for our bouncing baby's birth, "21: 1m: 1663," would be entered "21 March 1663/4"

"The 1752 Calendar Change," Connecticut State Library, 2008 (http://www.cslib.org/CalendarChange.htm : accessed 2013).

P.S. Separately, because years had different numbers of days, etc. there are mathematical schemes to more literally convert a Julian calendar date to its modern calendar equivalent. Here's the link to an article about this from a department of the U.S. Naval Observatory, http://aa.usno.navy.mil/faq/docs/JD_Formula.php . The article also mentions the work of "Fliegel and van Flandern (1968)." They "published compact computer algorithms for converting between Julian dates and Gregorian calendar dates ..."
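The "plus two" rule from this comment can be written down as a small sketch. It assumes, as the three examples here do, that the recorded year is an old-style year beginning on 25 March, so any date from 1 January through 24 March is shown as a double date. The function name `old_style_to_double_date` and the way the second year is abbreviated are choices made for this illustration.

```python
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def old_style_to_double_date(day: int, month_number: int, year: int) -> str:
    """Render an old-style numbered-month date with a modern month name.

    month_number counts from March (1m = March, 11m = January), and year is
    the year as written in the record.
    """
    if not 1 <= month_number <= 12:
        raise ValueError(f"month number out of range: {month_number}")
    # "Plus two takes you from old to new": 1 -> 3, 2 -> 4, 11 -> 1, 12 -> 2.
    month = (month_number + 1) % 12 + 1
    before_lady_day = month in (1, 2) or (month == 3 and day < 25)
    if before_lady_day:
        next_year = year + 1
        # Simplification: one trailing digit unless the decade rolls over.
        suffix = str(next_year % 10) if next_year % 10 != 0 else str(next_year)
        year_text = f"{year}/{suffix}"
    else:
        year_text = str(year)
    return f"{day} {MONTH_NAMES[month - 1]} {year_text}"
```

Called on the three records discussed in this comment, it gives "9 January 1658/9", "5 April 1662" and "21 March 1663/4".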

jralls commented 10 years ago

A reference follows. Knowing the significance of March 25 (aka Lady Day) has more than once led to my own logic error in recording these values, so I devised a little rhyme--"Plus two takes you from old to new." Thus, the entry for our bouncing baby's birth, "21: 1m: 1663," would be entered "21 March 1663/4"

Note, however, that the Gregorian calendar was offset from the Julian one by 10 days from its inception in 1582 until 1699 (the difference between Gregorian and Julian calendars being that the former doesn't have leap years in years evenly divisible by 100 unless they are also evenly divisible by 400, and the change was propagated back to the beginning of the common era), so those dates should be encoded in GedcomX as 1659-01-19, 1662-04-15, and 1664-03-31.
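The 10-day shift can be checked with a short sketch that goes through the Julian Day Number, using a widely published integer-arithmetic formulation (not necessarily the exact algorithm of any reference cited in this thread). The function names are hypothetical, and the input year must already be the new-style year, i.e. 1659 for "1658/9".

```python
def julian_calendar_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number for a date in the Julian calendar (year starts 1 January)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def jdn_to_gregorian(jdn: int) -> tuple:
    """(year, month, day) in the proleptic Gregorian calendar for a Julian Day Number."""
    a = jdn + 32044
    b = (4 * a + 3) // 146097
    c = a - (146097 * b) // 4
    d = (4 * c + 3) // 1461
    e = c - (1461 * d) // 4
    m = (5 * e + 2) // 153
    day = e - (153 * m + 2) // 5 + 1
    month = m + 3 - 12 * (m // 10)
    year = 100 * b + d - 4800 + m // 10
    return year, month, day


def julian_to_gregorian_iso(year: int, month: int, day: int) -> str:
    """Convert a Julian calendar date to a YYYY-MM-DD proleptic Gregorian string."""
    g_year, g_month, g_day = jdn_to_gregorian(julian_calendar_to_jdn(year, month, day))
    return f"{g_year:04d}-{g_month:02d}-{g_day:02d}"
```

For the three dates above, `julian_to_gregorian_iso(1659, 1, 9)`, `julian_to_gregorian_iso(1662, 4, 5)` and `julian_to_gregorian_iso(1664, 3, 21)` return 1659-01-19, 1662-04-15 and 1664-03-31.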

The definitive reference is C.R. Cheney, Handbook of Dates for Students of English History. The definitive book and more or less FOSS software (the original work is part of Gnu Emacs, and the authors are quite happy to give permission for use in FOSS projects) for calendar conversions, covering most of the calendar systems used throughout history, is Reingold and Dershowitz, Calendrical Calculations.

Note as well that it's the application's job to do the conversion to proleptic Gregorian for insertion into the formal field of Date.

GeneJ commented 10 years ago

Hi John,

I acknowledged above that as date standards go, the heavy lifting does seem left to the different applications and/or content providers. Nonetheless, users struggle, even with some of the slickest UIs, and I observe above average error rates with regard to these dates.

It's up to you all to decide over time whether the underlying functions and/or requirements ever rise to a level of interest or importance with regard to the standard.

jralls commented 10 years ago

Yes, roger, most apps do it badly. The only way to get that fixed is to bitch at the developers. GedcomX isn't a spec for writing applications, it's a spec for exchanging data between applications. If the application writing the GedcomX stream messes up, there's nothing GedcomX can do to fix it. Garbage in, garbage out and all that.