frizbog / gedcom4j

Java library for reading/writing genealogy files in GEDCOM format
http://gedcom4j.org

Parsing fails for lines not starting with a digit #100

Closed. haralduna closed this issue 8 years ago

haralduna commented 8 years ago

As already discussed by e-mail, MyHeritage and other tools export GEDCOM files containing lines that do not start with a digit. See the example below. I have made a workaround that I am using myself, and it appears to solve my issue. The patch is included below.

The fix simply discards such lines. I realize that users will lose information, but as my tool only reads and never writes GEDCOM files, I do not think that is a big deal.

Example:

2 CONC NBN:no-a1450-kb20061023030327.jpg</a></span></p>
<p>Gift 1769 med Haagen Anonsen Røstad</p>
1 RIN MH:I500144

0001-Workaround-Skipping-lines-not-starting-with-a-digit.txt
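
For readers without access to the attachment, the idea of discarding such lines can be sketched roughly as follows. This is only an illustration and not the attached patch: the class `SkipNonDigitLines` and the method `keepLevelLines` are hypothetical names, and the sketch filters a list of raw lines rather than hooking into gedcom4j's parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the "discard" workaround; not part of gedcom4j.
final class SkipNonDigitLines {

    /** Returns only the lines whose first character is an ASCII digit (the GEDCOM level). */
    static List<String> keepLevelLines(List<String> rawLines) {
        List<String> kept = new ArrayList<>();
        for (String line : rawLines) {
            if (!line.isEmpty() && line.charAt(0) >= '0' && line.charAt(0) <= '9') {
                kept.add(line);
            }
            // Anything else (stray text after an unescaped line break) is dropped,
            // which is where the loss of information comes from.
        }
        return kept;
    }

    public static void main(String[] args) {
        // Abbreviated version of the example in this issue.
        List<String> raw = Arrays.asList(
                "2 CONC NBN:no-a1450-kb20061023030327.jpg</a></span></p>",
                "<p>Gift 1769</p>",
                "1 RIN MH:I500144");
        for (String line : keepLevelLines(raw)) {
            System.out.println(line);
        }
    }
}
```

The trade-off is the one described above: the file loads, but the text on the skipped lines is lost.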

frizbog commented 8 years ago

Thanks Harald. To recap for others who have not been part of the email thread: when the text portion of a GEDCOM line contains a line break, some tools write the break literally instead of emitting a CONT line as the GEDCOM spec requires. Tools that write data this way are technically not producing a spec-compliant file... and until issue #97 was completed, gedcom4j had the same issue (this is now fixed).

There was a follow-up question about touching up the parser so it would be able to read files written incorrectly this way without throwing an exception. This is a somewhat less-straightforward problem...what is the "right" way to read a malformed file? If the line following the line break begins with a single digit, it will look like a malformed tag line and not some continuation of the previous line.

Harald's solution is to simply discard the malformed lines, on the idea that a partial import is better than no import, which makes sense.

Still analyzing the issue and considering approaches.

frizbog commented 8 years ago

Release 2.2.6-SNAPSHOT now has a fix for this. The GedcomParser is getting a new flag/switch named "strictLineBreaks".

When true (the default), the parser will throw a GedcomParserException when a line in the GEDCOM file uses line breaks without escaping them as CONT or CONC lines like the spec requires.

When false, the parser attempts to make CONT lines on-the-fly and add them to the prior tagged line, as if the file had been written correctly in the first place, but it adds a warning to the GedcomParser.warnings collection. If it cannot make the CONT line, it ignores the non-standard line and adds a different warning.
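
To make the non-strict behavior concrete, here is a minimal standalone sketch of folding stray lines into CONT lines. It is not the gedcom4j implementation. The class `LenientLineFolder`, its fields, and the rule for choosing the CONT level (same level after a CONT/CONC line, one level deeper after any other tagged line) are assumptions made for illustration, as is treating "no prior tagged line" as the case where a CONT line cannot be made.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; this is not gedcom4j's GedcomParser.
final class LenientLineFolder {
    final List<String> lines = new ArrayList<>();    // repaired GEDCOM lines
    final List<String> warnings = new ArrayList<>(); // one entry per non-standard line

    // Level for a synthesized CONT line; -1 until a tagged line has been seen.
    private int contLevel = -1;

    void accept(String raw) {
        if (startsWithAsciiDigit(raw)) {
            lines.add(raw);
            String[] tokens = raw.trim().split("\\s+");
            String tag = tokens.length > 1 ? tokens[1] : "";
            int level = leadingLevel(raw);
            boolean continuation = tag.equals("CONT") || tag.equals("CONC");
            // Assumed rule: continue at the same level after CONT/CONC, else one deeper.
            contLevel = continuation ? level : level + 1;
        } else if (contLevel < 0) {
            // No prior tagged line to attach to, so the line is ignored.
            warnings.add("Ignored non-standard line: " + raw);
        } else {
            lines.add(raw.isEmpty() ? contLevel + " CONT" : contLevel + " CONT " + raw);
            warnings.add("Treated as CONT of previous line: " + raw);
        }
    }

    private static boolean startsWithAsciiDigit(String s) {
        return !s.isEmpty() && s.charAt(0) >= '0' && s.charAt(0) <= '9';
    }

    // Parses the run of digits at the start of the line; assumes a small level number.
    private static int leadingLevel(String line) {
        int i = 0;
        while (i < line.length() && line.charAt(i) >= '0' && line.charAt(i) <= '9') {
            i++;
        }
        return Integer.parseInt(line.substring(0, i));
    }

    public static void main(String[] args) {
        LenientLineFolder folder = new LenientLineFolder();
        // Abbreviated version of the example in this issue.
        folder.accept("2 CONC NBN:no-a1450-kb20061023030327.jpg</a></span></p>");
        folder.accept("<p>Gift 1769</p>");
        folder.accept("1 RIN MH:I500144");
        folder.lines.forEach(System.out::println);
        folder.warnings.forEach(w -> System.out.println("WARNING: " + w));
    }
}
```

As noted earlier in the thread, a stray line that happens to begin with a digit cannot be recognized this way and will still be read as a (possibly malformed) tagged line.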