ANSEL input and misplaced diacritic characters

feloy commented 8 years ago

When I load an ANSEL-encoded GEDCOM file containing accented characters, the accents are placed on the character before the expected one.

For example: input: ae output: should be: aè is: àe

frizbog commented 8 years ago

Thanks feloy. I'm having difficulty fully understanding the issue you are reporting. I would expect that an input of "ae" would be preserved exactly as "ae", so I think there must be something you are trying that I don't understand.

Are you reading a file containing accented data and it is not reading correctly? Are you creating a file via code, but it is not writing correctly? Or are you reading a file, re-writing it, and the accents are changing during the process? Or is it something else entirely?

feloy commented 8 years ago

Thanks for your quick reply.

Unfortunately, some characters disappeared from my GitHub message.

I'm reading a file with accented characters, which is not read correctly.

In the input file: a 0xE1 e (for aè). The 0xE1 is missing from my first GitHub message.

But it is read as àe by the gedcom4j decoder.

I've made some tests in the code. I think the problem is that the decoder writes diacritic Unicode characters (U+0300 in this case) that combine with the character before them, not the one after.

I've tried swapping this diacritic character with the character after it, but there are some special cases where this causes errors, especially when the diacritic character is found at the end of a LINE/CONC line.

frizbog commented 8 years ago

Ok, I think I understand now...thanks for clarifying that. Would you be willing to share your GEDCOM file with me so I can try to reproduce and correct the issue? If you could email it to support@gedcom4j.org I could take a look. If you do, please let me know specifically which data element is affected (person's name, place name, etc.) and what the expected value is.

If you're not willing to do that, perhaps you can post here a cut/paste of a portion of the file that contains the data that is not reading correctly, redacted as needed, so I can make my own test file that replicates the situation you describe?

frizbog commented 8 years ago

Thanks for sending your file, feloy. I have confirmed your issue and am working on it. I will keep progress notes here on this issue.

frizbog commented 8 years ago

The problems we are seeing are with combining diacritics, in the 0xE0-0xFE range. According to the ANSEL spec and the MARC 21 spec which supersedes it, combining diacritics are supposed to appear as one-byte characters preceding the one-byte character they modify. In Unicode and UTF-16 (which is the format for Java strings), combining diacritics must follow the character being modified.

gedcom4j is failing to reverse this order when converting ANSEL-encoded data into UTF-16 encoded data for internal String representation. The result is that the combining diacritics currently modify the preceding base character rather than the character that is supposed to be modified.

As an example, a lowercase e with a grave accent should be represented in ANSEL encoding by the bytes 0xE1 0x65: the 0xE1 is the accent, and the 0x65 is the lowercase e. When loaded by gedcom4j, the 0xE1 accent character is mapped/converted to a UTF-16 value of \u0300 (since UTF-16 is the internal format for Java strings), and is then followed by the e. However, the converted accent character is not moved to follow the e as required by Unicode/UTF-16, and therefore modifies whatever character appeared before the e.
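The misplacement can be reproduced outside gedcom4j with a minimal sketch. The class `AnselOrderDemo` and its `naiveDecode` helper are hypothetical names written only for illustration; they are not gedcom4j code and handle just ASCII plus the single byte 0xE1. Mapping the bytes in file order leaves U+0300 (the combining grave accent) directly after the `a`, so the accent attaches to the wrong letter:

```java
import java.text.Normalizer;

class AnselOrderDemo {
    // Hypothetical helper: converts each ANSEL byte in file order, with no reordering.
    // Assumption: only ASCII and 0xE1 occur in the input.
    static String naiveDecode(int[] anselBytes) {
        StringBuilder sb = new StringBuilder();
        for (int b : anselBytes) {
            sb.append(b == 0xE1 ? '\u0300' : (char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] input = {0x61, 0xE1, 0x65}; // a, combining diacritic, e (as in the report)
        String decoded = naiveDecode(input); // 'a', U+0300, 'e'
        // Composing makes the problem visible: the accent has merged with the 'a'.
        String composed = Normalizer.normalize(decoded, Normalizer.Form.NFC);
        System.out.println(composed.equals("\u00E0e")); // prints "true"
    }
}
```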

There are two main approaches to solve this problem:

1) Move the byte of the character being modified so it appears before the byte(s) of the combining diacritic(s) when putting it into the String, and reverse this process when writing ANSEL files back out.

2) Complete the conversion (wherever possible) of combination characters (diacritic+character) into precomposed Unicode characters. To extend our example, when a 0xE1 0x65 pair is found in an input file representing a lowercase e with a grave accent, it should be converted to Unicode U+00E8, which is the lowercase e with the accent already combined into a single glyph. This process would need to be reversed when writing ANSEL data from UTF-16 Java strings.
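Both approaches can be sketched in a few lines. This is an illustration under stated assumptions, not the gedcom4j implementation: `AnselDecodeSketch` and its methods are hypothetical names, the byte mapping covers only two diacritics, and the precomposition step uses `java.text.Normalizer` with NFC as a stand-in for whatever lookup the library actually uses.

```java
import java.text.Normalizer;

class AnselDecodeSketch {
    // ANSEL combining diacritics occupy the 0xE0-0xFE range.
    static boolean isCombining(int b) {
        return b >= 0xE0 && b <= 0xFE;
    }

    // Hypothetical, deliberately tiny mapping table (assumption: a real decoder needs the full ANSEL table).
    static char toUnicode(int b) {
        if (b < 0x80) return (char) b;      // ASCII passes through unchanged
        if (b == 0xE1) return '\u0300';     // combining grave accent
        if (b == 0xE2) return '\u0301';     // combining acute accent
        throw new IllegalArgumentException("byte not covered by this sketch: " + b);
    }

    // Approach 1: hold diacritics back until their base character has been written.
    static String decodeReordered(int[] bytes) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder();
        for (int b : bytes) {
            if (isCombining(b)) {
                pending.append(toUnicode(b));
            } else {
                out.append(toUnicode(b)).append(pending);
                pending.setLength(0);
            }
        }
        // A diacritic with no following base (e.g. at the end of a line) is simply
        // appended here; a real decoder would have to treat that case specially.
        return out.append(pending).toString();
    }

    // Approach 2: additionally compose base+diacritic into a single code point where one exists.
    static String decodePrecomposed(int[] bytes) {
        return Normalizer.normalize(decodeReordered(bytes), Normalizer.Form.NFC);
    }
}
```

For the bytes 0x61 0xE1 0x65, `decodeReordered` yields the three chars a, e, U+0300, and `decodePrecomposed` yields the two chars a, U+00E8.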

Approach 1 has the advantage of being simpler and more likely to preserve the fidelity of the data read from the file: the number of characters in a string will be the same as the number of bytes read from the file for that same string. However, since the ordering is already being adjusted and the extended ANSEL characters are being mapped to Unicode characters anyway, fidelity is already compromised at the outset for all practical purposes.

Approach 2 has the advantage of “normalizing” the GEDCOM data to Unicode/UTF-16 format internally. Rendering is more likely to be correct since the glyphs are precomposed and do not require overlay rendering, and if the user wants to write the data back out as UTF-8 or another Unicode encoding, no further conversion is needed to get the preferable precomposed form.
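The trade-off between the two approaches shows up in string length and in the bytes written on output. A small sketch using only standard JDK calls (the class name `PrecomposedVsDecomposed` and the `hex` helper are hypothetical):

```java
import java.nio.charset.StandardCharsets;

class PrecomposedVsDecomposed {
    // Formats bytes as space-separated uppercase hex for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String decomposed = "ae\u0300";   // approach 1: base letter followed by combining mark
        String precomposed = "a\u00E8";   // approach 2: single precomposed code point

        System.out.println(decomposed.length());   // prints 3, same as the 3 ANSEL bytes
        System.out.println(precomposed.length());  // prints 2

        // Writing as UTF-8 needs no extra step; each form is encoded as it stands.
        System.out.println(hex(decomposed.getBytes(StandardCharsets.UTF_8)));  // prints 61 65 CC 80
        System.out.println(hex(precomposed.getBytes(StandardCharsets.UTF_8))); // prints 61 C3 A8
    }
}
```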

For this reason, I plan to go with approach 2. It will take a bit longer to fix this way, however, as the coding is more involved.

As a side note: Neither approach will solve the potential issue of string comparison and matching in these circumstances. String comparisons between the diacritic+character pair will not match unicode strings that use precomposed characters. The java.text.Normalizer class may assist here if it comes up at some point.
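A minimal sketch of how `java.text.Normalizer` could help with such comparisons. The class `NormalizedCompare` and the method `sameText` are hypothetical names, and the choice of NFC as the common form is an assumption; NFD would work equally well as long as both sides use the same form.

```java
import java.text.Normalizer;

class NormalizedCompare {
    // Compares two strings after bringing both to the same normalization form.
    static boolean sameText(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String decomposed = "ae\u0300";  // e followed by a combining grave accent
        String precomposed = "a\u00E8";  // precomposed e with grave accent

        System.out.println(decomposed.equals(precomposed));    // prints "false"
        System.out.println(sameText(decomposed, precomposed)); // prints "true"
    }
}
```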

frizbog commented 8 years ago

Work continues. The reader seems to be working ok in the latest snapshot build, but the writer still isn't working correctly.

feloy commented 8 years ago

If this can help, I've found this discussion about converting from ANSEL to Unicode: http://heiner-eichmann.de/gedcom/charintr.htm

frizbog commented 8 years ago

Yep, I used that as a source...great site. I did find some problems there but I have a working solution I will check in as soon as I get to wifi. Thanks for that tip!

On Sep 17, 2015, at 6:31 PM, Philippe MARTIN notifications@github.com wrote:

If this can help, I've found this discussion about converting from ANSEL to Unicode: http://heiner-eichmann.de/gedcom/charintr.htm

frizbog commented 8 years ago

Fixed in 2.2.2-SNAPSHOT at https://oss.sonatype.org/content/repositories/snapshots/org/gedcom4j/gedcom4j/2.2.2-SNAPSHOT/

If you would, could you please confirm that this snapshot build fully addresses your issue? If so, or if I don't hear back in a few days, I will proceed with promoting the 2.2.2 snapshot to a full release.

Thanks for your patience - this was a tricky one to fix and was an interesting problem!

feloy commented 8 years ago

Yes, I don't see any incorrect character with this snapshot. Many thanks!

frizbog commented 8 years ago

Fixed in release 2.2.2.

frizbog / gedcom4j

ANSEL input and misplaced diacritic characters #81