briot / geneapro

Genealogical software based on the GenTech data model
http://briot.github.com/geneapro/
GNU General Public License v2.0

Fix for invalid DATE import appears to have introduced new fails for import. #41

Closed — changeling closed this issue 5 years ago

changeling commented 5 years ago

Previously, when the code that failed on invalid DATE tag line was in place, other tags with similar issues were gracefully handled without aborting import.

On my last attempt to import my primary testing file, which has many invalid RESI tag lines like:

1 RESI abt 1950

the import process would log the anomaly and continue. Here's a snippet from a previous run:

Line 12894576 Unexpected text value after RESI
Line 12894594 Unexpected text value after RESI
Line 6657850 Unexpected text value after RESI
Line 6657857 Unexpected text value after RESI
Line 6657886 Unexpected text value after RESI
Line 6657888 Unexpected text value after RESI
Line 6657894 Unexpected text value after RESI
Line 6660096 Unexpected text value after RESI
Line 6660099 Unexpected text value after RESI
Line 12903734 Unexpected text value after RESI
Line 12903765 Unexpected text value after RESI
Line 12903771 Unexpected text value after RESI

With the latest fix, it errors out upon the first encounter:

2019-02-12 05:29:43,652 parse Exception while parsing GEDCOM:<stdin>:2908 Line 2906 Unexpected text value after RESI
briot commented 5 years ago

> Previously, when the code that failed on invalid DATE tag line was in place, other tags with similar issues were gracefully handled without aborting import.
>
> On my last attempt to import my primary testing file, which has many invalid RESI tag lines like:
>
> 1 RESI abt 1950
>
> the import process would log the anomaly and continue. Here's a snippet from a previous run:
>
> Line 12894576 Unexpected text value after RESI

Interesting. I know exactly the change that does that… And indeed a warning would be better here. The GEDCOM is invalid, and data will be ignored, but we can still recover easily…
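The recoverable-error pattern described here might look like the following sketch. The names (`GedcomError`, `report`) are illustrative, not geneapro's actual API: tolerable GEDCOM violations are logged as warnings and parsing continues; only unrecoverable problems raise.

```python
class GedcomError(Exception):
    """Raised only for unrecoverable parse problems."""
    pass

def report(message, line, fatal=False):
    """Log a parse anomaly; abort only when it is unrecoverable."""
    text = "Line %d %s" % (line, message)
    if fatal:
        raise GedcomError(text)
    print("WARNING: %s" % text)

# An unexpected text value after RESI is invalid GEDCOM, but recoverable:
report("Unexpected text value after RESI", line=2906, fatal=False)
```

The point of the `fatal` flag is exactly the trade-off discussed in this thread: the file is invalid and some data will be ignored, but the import as a whole can still succeed.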

changeling commented 5 years ago

Alas, I think this one isn't quite ready to be Closed.

With the new fixes in place, we make it past RESI, only to run headlong into _MILT, who now insists on having a TYPE, and was another tag I used to watch scroll by as the import proceeded, but now kills the process.

2019-02-12 17:31:03,391 _process_FILE First pass: parse gedcomfile
<stdin>:2906 Unexpected text value after RESI, expected 'Y'
2019-02-12 17:31:03,912 parse Exception while parsing GEDCOM:<stdin>:3571 Missing 1 occurrence of TYPE in _MILT
1 _MILT
2 DATE 10 Jun 1780
2 PLAC North Carolina, United States
3 MAP
4 LATI 35.5
4 LONG -80.0
changeling commented 5 years ago

I'm afraid this one isn't ready for closing just yet.

On Tue, Feb 12, 2019 at 2:15 AM Emmanuel Briot wrote:

> Closed #41 https://github.com/briot/geneapro/issues/41 via 532f06e https://github.com/briot/geneapro/commit/532f06e556acbca18064d772d522161babe32fa9


briot commented 5 years ago

> Alas, I think this one isn't quite ready to be Closed.
>
> With the new fixes in place, we make it past RESI, only to run headlong into _MILT, who now insists on having a TYPE, and was another tag I used to watch scroll by as the import proceeded, but now kills the process.

Hard to decide when an error should be fatal or not.

I made a commit that simply skips custom tags (those starting with underscore), when they are malformed. The GEDCOM file is technically incorrect, but I guess better to ignore things (and let the user know) than just do nothing.
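As a hedged sketch of that policy (the function name and messages are hypothetical, not geneapro's code), the decision rule is: a malformed vendor extension (any tag starting with an underscore) is skipped with a warning, while a malformed standard tag remains an error.

```python
def handle_malformed(tag, linenum):
    """Decide what to do with a malformed tag (illustrative policy only)."""
    if tag.startswith('_'):
        # Vendor extension: warn, drop the record, keep parsing.
        print("<stdin>:%d Skipping %s (malformed custom tag)" % (linenum, tag))
        return "skipped"
    # Standard tag: the file is invalid in a way we cannot safely interpret.
    return "error"
```

Under this rule the `_MILT` records from the log above would be skipped with a warning rather than aborting the first pass.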

changeling commented 5 years ago

That was the behavior prior to these changes. My stress test file is ~15 million lines, and would make it all the way through the first pass parsing, through the second pass, and was only failing during insertion, as mentioned in some of our earlier interactions.

As far as this fix goes, perhaps re-opening this issue as we figure it out?

With this last fix, I get:

<stdin>:488812 Unexpected text value after RESI, expected 'Y'
<stdin>:490797 Skipping _MILT, missing 1 occurrence of TYPE
<stdin>:491203 Skipping _MILT, missing 1 occurrence of TYPE
<stdin>:491855 Unexpected text value after RESI, expected 'Y'
<stdin>:491861 Unexpected text value after RESI, expected 'Y'
<stdin>:495223 Unexpected text value after RESI, expected 'Y'
<stdin>:501305 Unexpected text value after RESI, expected 'Y'
<stdin>:501317 Unexpected text value after RESI, expected 'Y'
<stdin>:501329 Unexpected text value after RESI, expected 'Y'
2019-02-12 20:24:50,787 parse Exception while parsing GEDCOM:<stdin>:501491
Too many NICK in NAME


briot commented 5 years ago

> That was the behavior prior to these changes. My stress test file is ~15 million lines, and would make it all the way through the first pass parsing, through the second pass, and was only failing during insertion, as mentioned in some of our earlier interactions.

Would you like to share that file with me while we fix this issue? You can use gedcom_anon.py, at the root of this repository, to at least remove some of the data. I will of course not share this file...

changeling commented 5 years ago

I'm thinking on how to best share the file with you. I'll probably just email it, but it would be quite something to anonymize. Understand that if I send you the whole thing, I'll have to swear you to secrecy and it will make us blood brothers and you'll be adopted into my very, very extended family. :D

As the code stands now, with the latest commit, it reaches a point after which it's a continuous stream of:

2019-02-13 18:59:05,600 parse Exception while parsing GEDCOM:<stdin>:2688583 Too many INDI in  (skipped)

I made a slight temporary change at line 388 of backend/geneaprove/utils/gedcom.py:

                if cdescr.max < count:
                    # Temporary: only treat INDI overflow as fatal, so the
                    # run stops at the first occurrence for inspection.
                    fatal = 'INDI' in ctag
                    lexical.error(
                        'Too many %s in %s (skipped)' % (ctag, tag),
                        line=clinenum,
                        fatal=fatal)

to trap the first occurrence.

As an aside, here's how failure appears in the browser. Note that it incorrectly reports success upon abort:

[Screenshot, 2019-02-13: the browser reports a successful import even though the process aborted]
briot commented 5 years ago

> As the code stands now, with the latest commit, it reaches a point after which it's a continuous stream of:
>
> 2019-02-13 18:59:05,600 parse Exception while parsing GEDCOM:<stdin>:2688583 Too many INDI in  (skipped)

More than 10_000 individuals. A bad hard-coded limit in the code, now removed.

I was able to parse your file (with a lot of warnings, already discussed here). Parsing took 260s.

Importing fails though, apparently with an incorrect date error (you already reported a similar one and I thought it was fixed…)

Re-reading the GEDCOM standard, I think the warning about _MILT missing a TYPE child is wrong.

changeling commented 5 years ago

Is the hard-coded limit new? In early versions, before the DATE issue fix, the program would run through the entirety and only fail upon SQL INSERT issues, most recently when it came to sources, if I recall. I'll try reverting to the previous incarnation and re-running to confirm.

Regarding '_MILT', the tag was an internal tag apparently created by FTM, so doesn't exist in the 5.5.1 standard definition. There's some debate in discussions about how to handle it, which seems to hinge on whether it should be rendered as an 'EVEN' or a 'FACT', but in either case, TYPE is definitely called for.

I'm attaching the invaluable annotated 5.5.1 Standard from Tamura Jones, on the off-chance you aren't familiar with it. In the question of _MILT and TYPE in particular, search for {Size=1:90}EVENT_OR_FACT_CLASSIFICATION:= in that file.

TFG551SAE Rev1 2018-07-05.pdf

changeling commented 5 years ago

I'll send you the _MILT Structures I've encountered as a separate email.

briot commented 5 years ago

> Is the hard-coded limit new? In early versions, before the DATE issue fix, the program would run through the entirety and only fail upon SQL INSERT issues, most recently when it came to sources, if I recall. I'll try reverting to the previous incarnation and re-running to confirm.

So far, this was just a constant “unlimited” used to describe the grammar, but no check was enforced by the parser. I added those checks earlier this week.
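Those occurrence checks can be sketched as follows, with `unlimited` as a sentinel for "no upper bound". The function name and message formats are illustrative only, chosen to mirror the warnings seen earlier in this thread:

```python
unlimited = float('inf')  # sentinel: no upper bound on occurrences

def check_count(tag, count, min_occ, max_occ):
    """Return a warning message if `count` violates the (min, max) bounds,
    or None if the count is within the grammar's limits."""
    if count > max_occ:
        return "Too many %s" % tag
    if count < min_occ:
        return "Missing %d occurrence of %s" % (min_occ - count, tag)
    return None
```

With `unlimited` actually honored, an INDI count of half a million no longer trips the check, while a second NICK under NAME still does.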

> Regarding '_MILT', the tag was an internal tag apparently created by FTM, so doesn't exist in the 5.5.1 standard definition. There's some debate in discussions about how to handle it, which seems to hinge on whether it should be rendered as an 'EVEN' or a 'FACT', but in either case, TYPE is definitely called for.

I was checking my code, and in fact I had already added a special case for _MILT! This tag is also used by Gramps, which adds the TYPE tag. So here's the grammar I am currently expecting:

F("_MILT", 0, unlimited, '', ADDR_STRUCT + [  # ??? gramps extension
      F("TYPE", 1, 1),
      F("DATE", 1, 1),
      F("NOTE", 0, unlimited),
      F("PLAC", 0, 1, '', PLACE_STRUCT),
      F("SOUR", 0, unlimited, "SOUR", SOURCE_CITATION),
]),

Since this is a custom tag anyway, I’ll make TYPE optional and that will fix the warning. I might have to adapt the importer a little, but should be fine.
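Illustratively, relaxing the minimum occurrence from 1 to 0 is what stops a _MILT record without TYPE from triggering the "Missing 1 occurrence" error. Here a namedtuple stands in for geneapro's real `F` descriptor, whose actual signature takes more arguments:

```python
from collections import namedtuple

# Hypothetical stand-in for geneapro's field descriptor.
F = namedtuple('F', 'tag min max')

def satisfied(field, count):
    """True if `count` occurrences of the child tag meet the grammar bounds."""
    return field.min <= count <= field.max

mandatory_type = F("TYPE", 1, 1)  # before: a _MILT without TYPE is an error
optional_type = F("TYPE", 0, 1)   # after: zero or one TYPE is accepted
```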

briot commented 5 years ago

> I was able to parse your file (with a lot of warnings already discussed here) Parsing took 260s
>
> Importing fails though, apparently an incorrect date error (you already reported a similar one and I thought it was fixed…)

The full file is now successfully imported. It takes a while :-)

   Done importing (1962.426 s)    =>  32min

The resulting sqlite database is large:

    803_237_888  bytes     geneaprove.sqlite

but I have seen worse :-)

I can likely do some optimizations there, but that’s not urgent in the sense that not many people have such large files, and those who do will not import them every day anyway...

The first time we load a view (like a pedigree) in the GUI, it takes a while, because geneaprove loads a graph of the database to find all the ancestor/descendant relationships. Later queries are reasonably fast for pedigrees. There seems to be a bug there, though: you are person number 1, I believe, and yet no father/mother information was found for you. Then again, the file doesn't seem to contain FAM entries for family number 1, which would explain it.

As expected, the list of persons fails: it takes more than 2min to prepare the response, so the connection is aborted, and the client retries later, still with no success. I will have to improve on that.

The Stats page shows that there are 525_606 persons in the database. Very extensive family there.

Emmanuel

changeling commented 5 years ago

On my machine with all the debugging turned on, it took 118 mins. :O. But yes, this shouldn't be any sort of priority. I'll have to look at whether I inadvertently eliminated my immediate family. That would be unfortunate. :)

And yeah, it's pretty extensive. I'll see about sending you the 32-generation one.

And with the latest _MILT fixes, alas, it now dies on an apparently arbitrary LATI element. Weird.

<stdin>:79363 Skipping _MILT, missing 1 occurrence of DATE
2019-02-14 22:19:54,405 Exception while parsing GEDCOM:<stdin>:80103 Unexpected tag: LATI
2019-02-14 22:19:54,405 parse Exception while parsing GEDCOM:<stdin>:80103 Unexpected tag: LATI

I've made some, I think, unrelated cleanups in the big .ged file, and I'll send you the current version. It should only be some consistency changes to PLAC lines.

changeling commented 5 years ago

The strange thing is that it has survived many, many LATI and LONG tags by line 80103.

changeling commented 5 years ago

The record in question is:

0 @I2315@ INDI
1 NAME Someone /Theirname/
2 NOTE @N6954@
1 SEX F
1 BURI
2 DATE 30 Oct 1614
2 PLAC Chislett, Kent, England
3 MAP
4 LATI 51.344329
4 LONG 1.202212
2 NOTE @N6956@
1 DEAT
2 DATE Abt 1614
2 PLAC Dedham, Essex, England
3 MAP
4 LATI 51.95888  <--- The failure line
4 LONG 0.99341
2 NOTE @N6955@
1 FAMS @F1678@
changeling commented 5 years ago

An attempt to import the big ancestors file fails with:

2019-02-14 22:59:46,925 execute (0.001) UPDATE "researcher" SET "name" = 'Random Renamed /Whomever/', "place_id" = NULL, "comment" = NULL WHERE "researcher"."id" = 1; args=('Random Renamed /Whomever/', 1)
2019-02-14 22:59:47,540 (0.000) INSERT INTO "source" ("higher_source_id", "subject_place_id", "jurisdiction_place_id", "researcher_id", "subject_date", "subject_date_sort", "medium", "title", "abbrev", "biblio", "comments", "last_change") VALUES (1, NULL, NULL, 1, NULL, NULL, NULL, 'Untitled', 'Untitled', 'Untitled', NULL, NULL); args=[1, None, None, 1, None, None, None, 'Untitled', 'Untitled', 'Untitled', None, None]
2019-02-14 22:59:47,540 execute (0.000) INSERT INTO "source" ("higher_source_id", "subject_place_id", "jurisdiction_place_id", "researcher_id", "subject_date", "subject_date_sort", "medium", "title", "abbrev", "biblio", "comments", "last_change") VALUES (1, NULL, NULL, 1, NULL, NULL, NULL, 'Untitled', 'Untitled', 'Untitled', NULL, NULL); args=[1, None, None, 1, None, None, None, 'Untitled', 'Untitled', 'Untitled', None, None]
2019-02-14 22:59:47,555 Unexpected Exception during parsing
2019-02-14 22:59:47,555 parse Unexpected Exception during parsing
Total time: 818.530136s  /  queries time (9000 queries): 0.039000s

I updated the header since the one I sent you several minutes ago. I'll send the updated version now. This is the ManyMany... file.

briot commented 5 years ago

> On my machine with all the debugging turned on, it took 118 mins. :O. But yes, this shouldn't be any sort of priority. I'll have to look at whether I inadvertently eliminated my immediate family. That would be unfortunate. :)

I don’t see any FAM record in your file…

> stdin>:79363 Skipping _MILT, missing 1 occurrence of DATE
> 2019-02-14 22:19:54,405 Exception while parsing GEDCOM:<stdin>:80103 Unexpected tag: LATI
> 2019-02-14 22:19:54,405 parse Exception while parsing GEDCOM:<stdin>:80103 Unexpected tag: LATI

This one is yours :-) When I look at the file using “gedcom_view.py” or “less”, I see the following:

80101 1 BURI
80102 2 PLAC Sanders Chapel Cemetery, Winn, Louisiana , 3 MAP
80103 4 LATI 31.95306
80104 4 LONG -92.823329

Note how “3 MAP” is on the same line as “2 PLAC”. Wrong line terminator.
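A symptom like "3 MAP" fused onto the "2 PLAC" line is typical of a stray line terminator, for example a lone CR that an LF-only reader does not treat as a line break. A quick, geneapro-independent check for mixed terminators:

```python
def line_terminators(data):
    """Count each line-terminator style present in a byte string."""
    crlf = data.count(b'\r\n')
    cr = data.count(b'\r') - crlf   # lone CR, invisible to an LF-only reader
    lf = data.count(b'\n') - crlf   # lone LF
    return {'CRLF': crlf, 'CR': cr, 'LF': lf}

# Hypothetical fragment mimicking the damaged record above:
sample = b'2 PLAC Sanders Chapel Cemetery\r3 MAP\n4 LATI 31.95306\n'
```

A nonzero 'CR' count alongside 'LF' (or 'CRLF') means the file mixes terminators, which is exactly what a careless editor or regex replacement can introduce.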

briot commented 5 years ago

> An attempt to import the big ancestors file fails with:
>
> 2019-02-14 22:59:47,555 parse Unexpected Exception during parsing
> Total time: 818.530136s  /  queries time (9000 queries): 0.039000s

Not sure why you aren't getting an exception backtrace. Anyway, I fixed two small errors, and now your file is imported. It took 17min on my machine.

Getting a nice Fanchart (just showing 11 generations)...

changeling commented 5 years ago

Ack! Yeah, that was a poorly constructed BBEdit regex replacement on my part. And I completely missed seeing it.

Not sure about the stacktrace, but I'll fix these LATI issues and try myself with your latest commits.

I have turned on as much logging as I could in settings.py. I'll attach that file here. Dunno if or why that might pre-empt the stacktrace.
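For what it's worth, a Django-style LOGGING dict mostly controls where records go and at what level; whether a traceback appears depends on the logging call itself (`logging.exception()` or `error(..., exc_info=True)` emit it, plain `error()` does not). A minimal stdlib sketch, not geneapro's actual settings:

```python
import logging
import logging.config

# Illustrative dictConfig; names and format are assumptions, not the
# project's real settings.py.
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'verbose': {'format': '%(asctime)s %(levelname)s %(message)s'},
    },
    'handlers': {
        'console': {'class': 'logging.StreamHandler',
                    'formatter': 'verbose'},
    },
    'root': {'handlers': ['console'], 'level': 'DEBUG'},
}
logging.config.dictConfig(LOGGING)
```

So verbose logging alone shouldn't suppress a traceback; the `except` block doing the reporting would have to pass exception info to the logger.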
