haralduna closed this issue 8 years ago
Thanks, Harald. To recap for others who were not part of the email thread: some tools, when the text portion of a GEDCOM line contains a line break, do not emit that break as a CONT line as the GEDCOM spec requires. Tools that write data this way are technically not writing a spec-compliant file. Until issue #97 was completed, gedcom4j had the same problem (this is now fixed).
There was a follow-up question about touching up the parser so that it can read files written incorrectly this way without throwing an exception. This is a less straightforward problem: what is the "right" way to read a malformed file? If the line following the line break begins with a single digit, it will look like a malformed tag line rather than a continuation of the previous line.
Harald's solution is to simply discard the malformed lines, on the idea that a partial import is better than no import, which makes sense.
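For readers who want to see the shape of the discard approach, here is a minimal sketch. It is not Harald's actual patch and not gedcom4j code; the class name `SkipMalformedLines` and the method `keepLevelLines` are hypothetical, and the only rule it applies is the one described in this thread: keep a line if it starts with a digit, otherwise drop it and record a warning.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of gedcom4j: filters raw lines before parsing.
class SkipMalformedLines {

    /**
     * Returns only the lines whose first non-blank character is a digit
     * (i.e. lines that appear to begin with a GEDCOM level number).
     * Every dropped line adds one message to the warnings list.
     */
    static List<String> keepLevelLines(List<String> rawLines, List<String> warnings) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < rawLines.size(); i++) {
            String line = rawLines.get(i);
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && Character.isDigit(trimmed.charAt(0))) {
                kept.add(line);
            } else {
                // Partial import: the text on this line is lost.
                warnings.add("Line " + (i + 1) + " skipped: does not start with a level number");
            }
        }
        return kept;
    }
}
```

Note that this loses the text on the dropped lines, and a continuation line that happens to start with a digit would still be passed through to the parser as if it were a tag line.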
Still analyzing the issue and considering approaches.
Release 2.2.6-SNAPSHOT now has a fix for this. The GedcomParser is getting a new flag/switch named "strictLineBreaks".
When true (the default), the parser will throw a GedcomParserException when a line in the GEDCOM file uses line breaks without escaping them as CONT or CONC lines like the spec requires.
When false, the parser attempts to make CONT lines on-the-fly and add them to the prior tagged line, as if the file had been written correctly in the first place, but it adds a warning to the GedcomParser.warnings collection. If it cannot make the CONT line, it ignores the non-standard line and adds a different warning.
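The behaviour of the two modes can be illustrated with a small standalone sketch. This is not the gedcom4j implementation: the class `LenientLineBreaks`, the method `normalize`, the warning texts, and the use of `IllegalArgumentException` in place of `GedcomParserException` are all assumptions made for illustration. It also assumes that any line starting with digits followed by a space is a tagged line, so it cannot detect a continuation line that happens to begin with a number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the strict/lenient behaviour; not gedcom4j code.
class LenientLineBreaks {

    // Assumption: "<digits><space><token>" at the start means a tagged line.
    private static final Pattern TAGGED = Pattern.compile("^(\\d+) (\\S+)");

    static List<String> normalize(List<String> rawLines, boolean strictLineBreaks,
                                  List<String> warnings) {
        List<String> out = new ArrayList<>();
        int lastLevel = -1; // level of the most recent line that is not CONT/CONC
        for (int i = 0; i < rawLines.size(); i++) {
            String line = rawLines.get(i);
            Matcher m = TAGGED.matcher(line);
            if (m.find()) {
                String tag = m.group(2);
                if (!tag.equals("CONT") && !tag.equals("CONC")) {
                    lastLevel = Integer.parseInt(m.group(1));
                }
                out.add(line);
            } else if (strictLineBreaks) {
                // Stand-in for the parser exception thrown in strict mode.
                throw new IllegalArgumentException(
                        "Line " + (i + 1) + " does not begin with a level and tag");
            } else if (lastLevel >= 0) {
                // Synthesize the CONT line the writer should have produced.
                out.add((lastLevel + 1) + " CONT " + line);
                warnings.add("Line " + (i + 1) + " treated as a CONT of the previous tagged line");
            } else {
                // No earlier tagged line to attach to, so the line is ignored.
                warnings.add("Line " + (i + 1) + " ignored: nothing to continue");
            }
        }
        return out;
    }
}
```

With the input lines `0 @N1@ NOTE first line`, `second line`, `1 CHAN`, the lenient mode rewrites the middle line to `1 CONT second line` and records one warning, while the strict mode throws on that same line.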
As already discussed by e-mail, MyHeritage and other tools export GEDCOM files containing lines that do not start with a digit. See the example below. I have made a workaround that I am using myself, and it appears to solve my issue. The patch is included below.
The fix simply discards such lines. I realize users will lose some information, but as my tool only reads and never writes GEDCOM files, I think that is not a big deal.
Example:
0001-Workaround-Skipping-lines-not-starting-with-a-digit.txt