DavidMStraub / python-gedcom7

GEDCOM 7 parser for Python
MIT License
10 stars 2 forks source link

wrong parsing of DatePeriod #2

Closed geostag closed 1 year ago

geostag commented 1 year ago

The DatePeriod parser unintentionally parses the keywords of DatePeriods as month in some cases.

import gedcom7.types as T
T.DatePeriod('FROM 21 JAN 1990 TO 22 FEB 2000').parse()
T.DatePeriod('FROM  1990 TO  2000').parse()

The first DatePeriod ist recognized correctly as period from 1999-01-21 to 2000-02-22. The second one however, which is a valid DatePeriod according to the GEDCOM 7 specification, is parsed as

{
   "from":{
      "calendar":"None",
      "day":1990,
      "month":"TO",
      "year":2000,
      "epoch":"None"
   }
}

The root cause is, that the orginal specification of GEDCOM 7 does not exclude the keywords from the month pattern, thus leading to this ambiguous interpretation, which is formal valid but not intended.

With the following change in grammar.py, you get the intended interpretation:

- month = f'({stdtag}|{exttag})'
+ no_months = f'FROM|TO|BET|AND|AFT|BEF'
+ month = f'(?!{no_months})({stdtag}|{exttag})'

Perhaps this should be fixed in GEDCOM 7 as well.

DavidMStraub commented 1 year ago

Interesting, thanks! Yes, I think this should be raised at the Gedcom repository as well, can you open an issue there?

geostag commented 1 year ago

I submitted the question to gedcom-L, the german mailing list for GEDCOM specification.

geostag commented 1 year ago

No reaction so far. Therefore I checked out the issues in https://github.com/FamilySearch/GEDCOM/issues/ . It seems, that their approach is not to reflect everything in the ABNF rules. Some of the rules, which have to be followed to parse GEDCOM correctly, are written in the text part only. E.g.

In addition to the constraints above:
...
No calendar names, months, or epochs match dateRestrict .

in the GEDCOM definition seems to be regarded as sufficient.

Thus I suggest to just add the dateRestrict definition in the grammar and use this in the negative lookahead for the month rules as well in the rules for calendar and epoch.

geostag commented 1 year ago

PR modified to use daterestrict to exclude the keywords from month, epoch and calendar. BCE gets a special handling, because it is a valid epoch.

geostag commented 1 year ago

epoch handling: https://github.com/FamilySearch/GEDCOM/issues/273

DavidMStraub commented 1 year ago

Fixed by #3 :rocket: