ShammyLevva / FTAnalyzer

Family Tree Analyzer - Finds hidden details in your family tree. Install at
http://www.ftanalyzer.com/install
Apache License 2.0

International characters issues #48

Closed GeorgesDesilets closed 5 years ago

GeorgesDesilets commented 5 years ago

Describe the bug: The location tabs do not recognize French characters anymore since the update to version 7.0.3.1. This used to work in previous versions. For instance, Montréal, Canada shows up as MontrÃ©al, Canada.

To reproduce the characters issue:

On Windows 10 (running the Chrome browser), import a GEDCOM file containing French characters (from Ancestry, for instance).

ShammyLevva commented 5 years ago

Very weird. The update to 7.0.3.1 was to fix a problem with it not recognising accented characters.

ShammyLevva commented 5 years ago

The sample data I have therefore doesn't show this issue. Can you attach a single-person GEDCOM so I can see why it's different with your file, please?

GeorgesDesilets commented 5 years ago

Unfortunately, I am having a hard time finding a tool to export a set of persons without modifying the data. I tried a few, and each time the resulting GEDCOM file does not show the problem once loaded back into FTAnalyzer; the special characters either show up OK or are replaced by regular characters (é becomes e). It looks related specifically to the GEDCOM file exported from Ancestry. I wish I could share my whole file with you, but it contains private data I do not wish to share with the whole planet...

ShammyLevva commented 5 years ago

You can just use Notepad.

There's a header record, then each individual record starts with a zero. So just cut and paste one individual's record into a new text file: the header (the bit before the first 0 @Inumber@ INDI record) and one individual.
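For anyone unfamiliar with the layout, here is a stripped-down illustration of what such a single-person file looks like (the source, name and place values below are placeholders, not taken from any real file):

```
0 HEAD
1 SOUR Ancestry.com Family Trees
1 GEDC
2 VERS 5.5.1
1 CHAR UTF-8
0 @I1@ INDI
1 NAME Jean /Tremblay/
1 BIRT
2 PLAC Montréal, Québec, Canada
0 TRLR
```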

GeorgesDesilets commented 5 years ago

Hi, I just realised that simply opening the GEDCOM file downloaded from Ancestry in notepad.exe and re-saving it is enough to fix the problem in FTAnalyzer. Using a diff tool, there is no difference between the files, except that the new file is 3 bytes bigger and the character set differs: 'UTF-8' versus 'UTF-8 BOM'. See the file diff headers below.

[screenshots: file diff headers]

[screenshot: UTF-8 file in FTAnalyzer (original from Ancestry)]

[screenshot: UTF-8 BOM file in FTAnalyzer (after opening and saving in Notepad)]
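For reference, those 3 extra bytes are the UTF-8 byte order mark (the bytes EF BB BF) that Notepad writes at the start of the file when saving as 'UTF-8 BOM'. A minimal check, with an illustrative file name:

```csharp
using System;
using System.IO;

class BomCheck
{
    static void Main()
    {
        byte[] start = new byte[3];
        int read;
        using (var fs = File.OpenRead("tree_from_notepad.ged"))
            read = fs.Read(start, 0, 3);

        bool hasBom = read == 3 &&
                      start[0] == 0xEF && start[1] == 0xBB && start[2] == 0xBF;
        Console.WriteLine(hasBom ? "UTF-8 BOM present" : "No UTF-8 BOM");
    }
}
```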

Georges Desilets, Montréal, Québec


GeorgesDesilets commented 5 years ago

It seems I cannot send you the screen captures using GitHub....


Here is the diff file, and a sample from the GEDCOM file.


ShammyLevva commented 5 years ago

Hmm, I wonder what's going on there. I suspect the change I made to use code page 1252 rather than ISO Western European. I'm on holiday this week so will check when I get back. I suspect the solution is to add an option to let the user select an encoding format, to find one that works for them.
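For what it's worth, decoding UTF-8 bytes with code page 1252 reproduces exactly the kind of garbling reported above; a minimal illustration (not FTAnalyzer's actual import code):

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // In UTF-8 the 'é' of "Montréal" is stored as the two bytes 0xC3 0xA9.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("Montréal");

        // Decoding those bytes as Windows-1252 turns each byte into its own
        // character, which is how "Montréal" becomes "MontrÃ©al".
        string asWin1252 = Encoding.GetEncoding(1252).GetString(utf8Bytes);

        Console.WriteLine(asWin1252); // MontrÃ©al
    }
}
```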

ShammyLevva commented 5 years ago

At the top of the GEDCOM file there will be a line that indicates encoding. Is this what is changing?

ShammyLevva commented 5 years ago

Btw you could try pasting the images on the Facebook user group. Facebook.com/groups/FTAnalyzer

GeorgesDesilets commented 5 years ago

The code in the header is UTF-8. I have posted the images on the Facebook group.


amerkasem85 commented 5 years ago

There is a problem with the exported CSV file: it does not support the Arabic language, and all the exported names come out garbled. Thanks.

ShammyLevva commented 5 years ago

Do you have a sample GEDCOM file with Arabic characters in it, amerkasem85? I wouldn't know how to go about creating a test file with them.

Other than the Arabic characters that every tree contains (digits 1-9).

fire-eggs commented 5 years ago

I created a simple "Arabic" GED by editing one from the interwebs and replacing a couple of surnames with a string of random Arabic characters copied from Character Map.

(I apologize in advance to any Arabic readers, I have no idea what I wrote ...)

test_arabic_utf8.zip

It is a UTF-8 file because I couldn't figure out how to get matching characters using ANSI.

Issue 1: the names are shown as input in the lists. [screenshots]

Issue 2: exported to CSV, the names are not shown as input. [screenshot]

Note: changing FTA's export to CSV file format from "Western European" to the default (UTF-8) made no real difference, which suggests to me the issue is on the import side.

I see the same behavior with another file I found on the Interwebs, which uses Hebrew characters: hebrew05122008_utf8.zip

A related note: Excel will not open a UTF-8 CSV file correctly. The file has to be imported instead, and in the Text Import wizard you select "Unicode (UTF-8)" as the "File origin" option.
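A common workaround for that Excel quirk (a sketch only, not necessarily how FTAnalyzer writes its CSV; WriteCsvWithBom is a made-up helper name) is to write the export with a UTF-8 BOM, which Excel does honour when a CSV is double-clicked:

```csharp
using System.IO;
using System.Text;

static class CsvExport
{
    // Writing with new UTF8Encoding(true) emits the EF BB BF byte order mark,
    // so Excel opens the file as UTF-8 instead of the system ANSI code page.
    public static void WriteCsvWithBom(string path, string csvContent)
    {
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(true)))
        {
            writer.Write(csvContent);
        }
    }
}
```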

fire-eggs commented 5 years ago

It took me another break to realize what is going on.

FTA (currently) imports using the Windows-1252 character set, unless overridden by the BOM!

I should have realized it when @GeorgesDesilets mentioned switching on the BOM.
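A minimal sketch of the .NET behaviour being described, assuming the import goes through a StreamReader (the file name is illustrative): the reader is asked for Windows-1252, but a leading BOM silently switches it to UTF-8.

```csharp
using System;
using System.IO;
using System.Text;

class BomOverrideDemo
{
    static void Main()
    {
        // detectEncodingFromByteOrderMarks is true by default; when a BOM is
        // present, the requested 1252 encoding is ignored and UTF-8 is used.
        using (var reader = new StreamReader("test_arabic_utf8.ged",
                                             Encoding.GetEncoding(1252),
                                             detectEncodingFromByteOrderMarks: true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // With a BOM the Arabic names decode correctly; without one,
                // each UTF-8 multi-byte sequence is split into 1252 characters.
                Console.WriteLine(line);
            }
        }
    }
}
```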

Taking my test_arabic_utf8.ged from earlier and switching it to UTF-8 (BOM), this is what I get in FTA:

[screenshot]

and the export to CSV is fine:

[screenshot]

As I said before, Excel will not open the resulting CSV file correctly, but can import it correctly, which might be what @amerkasem85 is actually reporting?

ShammyLevva commented 5 years ago

Hmm, interesting. So perhaps the key fix here is to look for a BOM and add one if missing, so that the current importing routine works with both types. Thoughts?

fire-eggs commented 5 years ago

I don't think you can "add a BOM" - you'd be modifying the input file. The BOM is optional; a UTF-8 file should be read as UTF-8 even without the BOM. Here, .NET just happens to be changing behavior when a BOM is seen.

The "usual" approach is to read the file as ASCII until the HEAD.CHAR tag is found. If that tag specifies UTF-8 or Unicode or something else, the file has to be re-opened using the appropriate encoding. The existence of the BOM short-circuits this process: it means the file is UTF-8, regardless of what the HEAD.CHAR tag reads.

Tamura Jones goes into lots of detail.

Myself, I'd suggest always importing as UTF-8. I think ASCII files will import OK if you do. The question would be whether you have users with Unicode, ANSEL, or non-Windows-1252 codepage files who might be impacted.

ShammyLevva commented 5 years ago

Yes, I'm aware of what the GEDCOM says. However, I don't believe the issue has anything to do with the GEDCOM tags in the file. I much more strongly suspect that files are being written with a standard HEAD.CHAR field that has nothing to do with the actual content.

So I think that program (or website) X will always put HEAD.CHAR as UTF-8, for example, regardless of what character set is used. Many don't even have a HEAD.CHAR value.

I had loads of issues with reading as UTF-8 simply not understanding a lot of special characters; this was all largely resolved by using the 1252 character set and ignoring the HEAD.CHAR setting.

Changing the default from 1252 to UTF-8 works for Arabic but fails for Hebrew and for French. There's a hideous combination of HEAD.CHAR settings and file encodings going on here.

I'm wondering if there's some way to test if the import failed to translate the characters. At present I feel I'm going round in circles.
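One possible way to detect a failed translation (a sketch under the assumption that the raw bytes are available, not FTAnalyzer's current code): decode with a strict UTF-8 decoder and only fall back to Windows-1252 when that throws.

```csharp
using System.IO;
using System.Text;

static class EncodingProbe
{
    public static string ReadGedcomText(string path)
    {
        byte[] raw = File.ReadAllBytes(path);

        // Strict UTF-8: invalid byte sequences throw instead of silently
        // turning into U+FFFD replacement characters.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
        try
        {
            // A leading BOM, if present, survives as U+FEFF and can be trimmed.
            return strictUtf8.GetString(raw);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so treat the file as Windows-1252.
            return Encoding.GetEncoding(1252).GetString(raw);
        }
    }
}
```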

ShammyLevva commented 5 years ago

I've sent George a PM on Facebook asking for his sample file. Then I think I'll build a series of test cases that load sample data with different variants of encodings and HEAD.CHAR settings, and add a function to measure success.

Once there's a series of test cases it's easier to tweak things and see what works and what doesn't with various settings. For now I'll commit what I've got and build some tests.
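For illustration, a couple of such test cases might look like the sketch below (NUnit is assumed purely as an example framework, and the Decode stub stands in for whatever import routine ends up being used):

```csharp
using System.Text;
using NUnit.Framework;

[TestFixture]
public class EncodingImportTests
{
    // Stand-in for the import routine under test: strict UTF-8 first,
    // Windows-1252 as the fallback (see the earlier sketch).
    static string Decode(byte[] raw)
    {
        var strict = new UTF8Encoding(false, true);
        try { return strict.GetString(raw); }
        catch (DecoderFallbackException)
        {
            return Encoding.GetEncoding(1252).GetString(raw);
        }
    }

    [Test]
    public void Utf8WithoutBomKeepsAccents()
    {
        byte[] raw = Encoding.UTF8.GetBytes("1 NAME Georges /Désilets/");
        Assert.That(Decode(raw), Does.Contain("Désilets"));
    }

    [Test]
    public void Windows1252FileKeepsAccents()
    {
        byte[] raw = Encoding.GetEncoding(1252).GetBytes("1 NAME Georges /Désilets/");
        Assert.That(Decode(raw), Does.Contain("Désilets"));
    }
}
```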

ShammyLevva commented 5 years ago

Right, thanks for the file. I'm fairly sure I've got a fix, which I'll release as v7.0.4.0 soon. I'll leave this open in the meantime. It seems to fix the Arabic and French imports. Note: Excel imports need to use the import routine in Excel, not just double-clicking (this is an Excel issue, as fire-eggs pointed out).

The Hebrew file is still being annoying though so more work to do.

fire-eggs commented 5 years ago

My mistake then - I thought Win-1252 and UTF-8 would be interchangeable for ASCII. Seems I have some more research to do ...

ShammyLevva commented 5 years ago

This does seem to be ok now. Can reopen if something else arises.

amerkasem85 commented 5 years ago

@ShammyLevva I'm sorry, I was not available. The last version seems not to be working either. Sample: https://www.mediafire.com/file/ufkcaocg8dzdr70/Desktop.zip/file


ShammyLevva commented 5 years ago

OK, I've had another look and found the issue.

ShammyLevva commented 5 years ago

[screenshot]

amerkasem85 commented 5 years ago

It works now. The exported Arabic CSV file is OK. Thanks!

ShammyLevva commented 5 years ago

Great, I'll close the issue.