andy-z / ged4py

GEDCOM tools for Python
MIT License

Combining characters in ANSEL documents not handled properly #7

Closed haney closed 5 years ago

haney commented 5 years ago

Description

Combining characters in ANSEL documents do not appear to be handled correctly. In the ANSEL encoding, a combining character precedes the character it modifies, whereas in Unicode it follows that character. This reordering does not appear to happen when ged4py reads ANSEL GEDCOM documents.
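To make the expected reordering concrete, here is a minimal sketch of moving combining marks from the ANSEL position (before the base character) to the Unicode position (after it). It assumes the bytes have already been mapped to Unicode code points one for one; `reorder_combining` is a hypothetical helper written for illustration, not part of ged4py and not necessarily how a fix would be implemented there.

```python
import unicodedata


def reorder_combining(text: str) -> str:
    """Move each run of combining marks from before its base character to after it.

    Hypothetical helper for illustration only; not a ged4py API.
    """
    out = []
    pending = []  # combining marks still waiting for their base character
    for ch in text:
        if unicodedata.combining(ch):
            # Non-zero combining class: hold the mark until the base arrives.
            pending.append(ch)
        else:
            # Base character: emit it first, then the marks that preceded it.
            out.append(ch)
            out.extend(pending)
            pending.clear()
    # Marks with no following base character are kept at the end unchanged.
    out.extend(pending)
    return "".join(out)


# U+030A is COMBINING RING ABOVE; here it sits before the "a" it should modify.
ansel_order = "P\u030aal"
unicode_order = reorder_combining(ansel_order)
print(unicodedata.normalize("NFC", unicode_order))
```

A step like this would presumably need to see the base character that follows each mark, so it could not be applied to a line that ends in a combining mark without also looking at the text that continues it.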

What I Did

import io

import ged4py

doc = b"""
0 HEAD
1 CHAR ANSEL
0 TAG P\xea
1 CONC al
""".strip()

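with io.BytesIO(doc) as file:
    with ged4py.parser.GedcomReader(file) as reader:
        # Byte offset 20 is where the "0 TAG" record starts
        # ("0 HEAD\n" is 7 bytes, "1 CHAR ANSEL\n" is 13 bytes).
        record = reader.read_record(20)
        print(record.value)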

Given the document, I would have expected the output:

Pål

Instead I'm seeing:

P̊al

This implies that the position of the combining character was left unchanged when the text was translated to Unicode. Because a Unicode combining character applies to the character before it, the ring ends up on the first character (P) instead of the second (a).
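One way to check this diagnosis is to inspect the code points with Python's standard `unicodedata` module. The two strings below are my reconstruction of the observed and expected outputs as code point sequences (an assumption; they are not captured from ged4py itself).

```python
import unicodedata

observed = "P\u030aal"  # ring left where ANSEL had it: directly after "P"
expected = "Pa\u030al"  # ring moved so that it follows the "a" it modifies

for label, text in (("observed", observed), ("expected", expected)):
    # NFC composes a base character and its marks when a precomposed form exists.
    composed = unicodedata.normalize("NFC", text)
    names = [unicodedata.name(ch) for ch in composed]
    print(label, names)
```

In the expected string the ring composes with "a" into LATIN SMALL LETTER A WITH RING ABOVE. In the observed string it stays a separate COMBINING RING ABOVE after "P", since Unicode has no precomposed "P with ring above".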

andy-z commented 5 years ago

The issue is fixed in v0.1.11, which should already be on PyPI. Thanks @haney for the bug report and especially for providing a fix! I have not built a Windows binary of ged2doc with this fix. I presume you do not need one, but let me know if you want it rebuilt.