Add support for Gedcom files starting with BOM

amandasaurus / gedcompy

Python library to parse and work with GEDCOM (genealogy/family tree) files

GNU General Public License v3.0

39 stars, 18 forks

Add support for Gedcom files starting with BOM #10

Status: Open. Opened by BioGeek 9 years ago.

BioGeek commented 9 years ago

Sites like geni.com let you export Gedcom files that start with a Byte Order Mark (BOM).

Currently the regex fails for such files and you get a NotImplementedError.

See this detailed article for more about GEDCOM & the Unicode Byte Order Mark.

I'm currently toying with a solution like the one described here to remove the BOM and then encode/decode the string, but I still get strange characters in the output.
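For illustration, here is a minimal sketch of stripping a UTF-8 BOM while decoding. It assumes the file really is UTF-8; the helper name `strip_utf8_bom` and the sample data are made up for this example and are not part of gedcompy. Strange characters in the output are often a sign that the bytes were decoded with the wrong codec, or that the BOM survived as a leading `\ufeff` character, though that may not be the cause here.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present.

    Hypothetical helper for illustration only; not part of gedcompy.
    """
    # 'utf-8-sig' skips a leading EF BB BF and otherwise behaves like 'utf-8'.
    return raw.decode("utf-8-sig")


# Made-up sample: a tiny GEDCOM-like snippet prefixed with a UTF-8 BOM.
sample = codecs.BOM_UTF8 + "0 HEAD\n1 NAME Zoë\n".encode("utf-8")
text = strip_utf8_bom(sample)
```

With the BOM gone, the first line starts with `0 HEAD` again, so a line-matching regex anchored at the start of the line has a chance to match.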

amandasaurus commented 9 years ago

I've added some support for BOMs in the new unicode-support branch. It should use the BOM (if present) to pick the correct encoding. Can you try it out on the files that you have?

There are a few other parts to this task that I haven't done yet:

[x] Support BOM
[x] Add HEAD.CHARACTER SET head tag
[ ] Parse and use the HEAD.CHARACTER SET tag if there is no BOM
[ ] Support ANSEL (?!)
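The remaining items could look roughly like the sketch below. This is not the code in the unicode-support branch. It is a guess at the general approach: check for a BOM first, then fall back to the character set declared in the header (written as `1 CHAR <name>` in GEDCOM 5.5 files). The function name `guess_encoding`, the `default` parameter, and the lookup table are assumptions made for this example. Python ships no ANSEL codec, so the sketch only signals that case instead of handling it.

```python
import codecs
import re

# Longest BOMs first, so UTF-32 LE (FF FE 00 00) is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Assumed mapping from header values to Python codec names (deliberately small).
_CHAR_TAGS = {"UTF-8": "utf-8", "ASCII": "ascii"}

# Matches a level-1 CHAR line in ASCII-compatible input, e.g. b"1 CHAR UTF-8".
_CHAR_RE = re.compile(rb"^\s*1\s+CHAR\s+(\S+)", re.MULTILINE)


def guess_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return a codec name for raw GEDCOM bytes (hypothetical helper)."""
    # 1. A BOM, if present, decides the encoding.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # 2. Otherwise look for the declared character set near the top of the file.
    match = _CHAR_RE.search(raw[:4096])
    if match is None:
        return default
    declared = match.group(1).decode("ascii", "replace").upper()
    if declared == "ANSEL":
        # No ANSEL codec in the standard library; a custom codec would be needed.
        raise NotImplementedError("ANSEL is not supported in this sketch")
    return _CHAR_TAGS.get(declared, default)
```

Usage would be along the lines of `text = raw.decode(guess_encoding(raw))`. The `utf-8-sig`, `utf-16`, and `utf-32` codecs consume the BOM themselves, so the decoded text does not start with `\ufeff`.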