NationalMuseumAustralia / Collection-API

The public web API of the National Museum of Australia

Encoding issues #144

Closed · f27wood closed this issue 5 years ago

f27wood commented 5 years ago

Special characters are not displaying correctly. These are entered into EMu using Unicode, or copied and pasted, but are not transferring across to the XML file or API correctly.

Example records:

241389 Yolŋu in Description. Entered into EMu as: \u{014B}

241440 Yolŋu in Description. In EMu: copied and pasted from Word

241389 ‘grass yam’ in Description. Entered into EMu as: \u{2018}grass yam\u{2019}

241440 ‘raspberry-jam leaf’ in Description. In EMu: copied and pasted from Word

All of the above were entered on 21/3/19

We think the issue is that the encoding in the XML is ISO-8859-1 instead of UTF-8, even though it claims to be UTF-8.

The first step to troubleshoot is to change the encoding declaration in the XML to ISO-8859-1 at the API programming level. If this turns out to be the issue, we will then correct it in the XML.

Conal-Tuohy commented 5 years ago

It turns out it's not that the Unicode characters have been output in an encoding different from the declared one (which might have been fixed just by changing the declaration).

It's worse than that. In fact the word "Yolŋu" from object 241389 has been output as "YolÅ‹u", where "Å‹" actually specifies two Unicode characters: the character whose codepoint is hexadecimal C5, namely Å, and the character whose codepoint is hexadecimal 8B, namely ‹ (the "partial line forward" character). Obviously what was intended was something else. It turns out that the two-byte sequence hex C5 8B is the UTF-8 encoding of the hex number 014B, which is the codepoint of the character ŋ: http://www.unicode-symbol.com/u/014B.html

So it looks like a bug in the code which produces the XML file; it has 2 bytes which are the UTF-8 encoding of a character, and it just needs to write those two bytes to the XML file; it doesn't need to encode them as XML character entities.
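For what it's worth, the double encoding is easy to reproduce. Here is a minimal PHP sketch (just an illustration, not part of the export script) showing that applying utf8_encode to text which is already UTF-8 produces exactly the mojibake above:

<?php
// "ŋ" (U+014B) arrives already UTF-8 encoded, as the two bytes C5 8B
$input = "\u{014B}";
echo bin2hex($input), "\n";   // prints: c58b

// utf8_encode() assumes its input is ISO-8859-1, so it treats each of those
// bytes as a separate character and encodes each one again:
// C5 becomes C3 85 (Å) and 8B becomes C2 8B (the invisible U+008B control,
// which renders as ‹ when the bytes are viewed as Windows-1252)
$double = utf8_encode($input);
echo bin2hex($double), "\n";  // prints: c385c28b
echo $double, "\n";           // displays as "Å‹"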

My guess is that it's probably a bug in Simon's code, though it might possibly be a bug in EMu.

f27wood commented 5 years ago

Here are the relevant scripts, noting that I changed the extensions to txt in order to upload them.

runemuexport.sh - wrapper that initiates the export process
dbconfig.php - sets global variables
full_object_api_export.php - creates the XML file for objects

runemuexport.txt
dbconfig.txt
full_object_api_export.txt

Conal-Tuohy commented 5 years ago

When the export script outputs the PhyDescription XML element, it first encodes it using the utf8_encode PHP function before passing it to the XMLWriter::writeElement method, as shown here:

$XMLout->writeElement("PhyDescription",utf8_encode($row['PhyDescription']));

According to the PHP documentation, the XMLWriter class always expects to be given UTF-8 encoded text, so that's not the problem.

The documentation for the PHP function utf8_encode says that it converts a string from the ISO-8859-1 encoding into UTF-8. The fact that the end result is in error shows that the input text cannot actually have been in ISO-8859-1. On the assumption that the text was in fact already in UTF-8, I visited https://www.charset.org/utf8-to-latin-converter and input the character ŋ, then selected the option to say that the text was an ISO-8859-1-encoded string that I wanted converted to UTF-8, and posted the form (which then actually transmitted the character encoded in UTF-8). In response I got the two-character string Å‹, which replicates the problem we're seeing.

So what this shows is that the text in EMu is in fact already UTF-8 encoded. It seems to me that the code I excerpted above should be changed to:

$XMLout->writeElement("PhyDescription",$row['PhyDescription']);

... and presumably throughout the rest of the export script.
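If there is any worry that some rows might genuinely still be ISO-8859-1-encoded, a more defensive option would be to convert only when the value is not already valid UTF-8. A sketch follows: ensure_utf8 is a hypothetical helper of mine, not something in the current script, built from the standard mbstring functions.

<?php
// Hypothetical helper: convert only if the text is not already valid UTF-8
function ensure_utf8(string $text): string {
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;   // already UTF-8, leave it untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// ... which would then be used in the export script like this:
$XMLout->writeElement("PhyDescription", ensure_utf8($row['PhyDescription']));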

f27wood commented 5 years ago

Thanks Conal, I think you are onto something there! I will get it tested and see how we go.

f27wood commented 5 years ago

BTW am I correct in thinking that this line needs to stay as it is:

$XMLout->startDocument('1.0', 'UTF-8')

Conal-Tuohy commented 5 years ago

Correct; that specifies the encoding used in the final output document. For instance, if you put 'ISO-8859-1' in there, then the XMLWriter object would convert all the UTF-8 text you gave it into ISO-8859-1, whereas if it says 'UTF-8' then no further conversion is needed.
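To illustrate, here is a minimal self-contained sketch of the two calls working together (writing to memory purely for the demonstration; the real script writes to the export file):

<?php
$XMLout = new XMLWriter();
$XMLout->openMemory();                    // demo only; the real script opens a file
$XMLout->startDocument('1.0', 'UTF-8');   // declares the encoding of the output document
$XMLout->writeElement("PhyDescription", "Yol\u{014B}u");  // text passed in must already be UTF-8
$XMLout->endDocument();
echo $XMLout->outputMemory();
// prints:
// <?xml version="1.0" encoding="UTF-8"?>
// <PhyDescription>Yolŋu</PhyDescription>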

f27wood commented 5 years ago

We have updated the PHP script, and we are working out how to test it on the test environment.

There are three folders under apidata: etl, full and incremental. Where do you get the XML files from? And what are the names of the XML files you get?

As there are currently only files in the etl folder, I'm not sure if full and incremental are used, or if they are copied from there once done?

Conal-Tuohy commented 5 years ago

Sorry I missed this question last week. To confirm what I said verbally this morning: the target folder is called full, since it's a "full" export of the EMu data (the incremental folder is no longer used since we ditched the incremental operation of the ETL pipeline many months ago now -- we should probably tidy that up and remove it). The etl folder is an output folder of the ETL pipeline (it saves the EMu data there after it has been imported into the API's own database).

f27wood commented 5 years ago

An update on this - we installed the updated PHP scripts on prod yesterday and Rick ran these manually, which all worked AOK, and the XML was updated as expected with the correct characters. I was expecting the API to update with the correct characters last night when the ETL ran. However this is not the case. I am thinking that the ETL did not run last night, as there was no ETL log emailed.

BTW the characters are coming through AOK on the test API. Example: https://csapi-test.nma.gov.au/object/241389

Examples to use for testing follow.

Objects: 241389 241440

Narratives: 3311 3314

Party: 2171 – in title. Copy and paste, updated in lookup list: 69114

Sites / Place 1835, lookup list: 92452 St Paul’s in Value 6

Conal-Tuohy commented 5 years ago

Yes, the ETL failed last night. The ETL over the weekend had failed because Friday's source data files were not where the ETL expected them, but last night's files were there, and the ETL script did therefore launch the XProc pipeline to load the data into the graph store; however, it has stopped at that point. I am looking into it now.

f27wood commented 5 years ago

OK thx, Rick thinks it may have been because the XML files were already there (from when he manually ran the php scripts).. I'm not so sure about that.

This is the error it threw up last night:



Conal-Tuohy commented 5 years ago

Yes, I don't think the previous files were the problem either. The ETL script never saw that edition of the data files, which had already been replaced by the next time the script looked. I'm still puzzled though; the bash script etl-run-all.sh has clearly seen that the files were there, since it lists them by name in its log:

2019-05-13 21:00:02 BEGIN ETL - mode=incremental, job=job_20190513_2100_incremental
Checking for existence of source data files ...
Objects file exists
Narratives file exists
Accession lots file exists
Sites file exists
Parties file exists
Piction file exists
2019-05-13 21:00:02 START ETL STEP 1 - full load to Fuseki SPARQL store
2019-05-13 21:00:02 Source files: /mnt/emu_data/full/2019-05-13_19-30_accessionlots_4025_FULL.xml /mnt/emu_data/full/2019-05-13_19-30_narratives_1371_FULL.xml /mnt/emu_data/full/2019-05-13_19-30_objects_86312_FULL.xml /mnt/emu_data/full/2019-05-13_19-30_parties_26842_FULL.xml /mnt/emu_data/full/2019-05-13_19-30_sites_4285_FULL.xml /mnt/dams_data/solr_prod1.xml
2019-05-13 21:00:02 Loading files to Fuseki public dataset

... but the log breaks off there because the XProc ETL is still running (it should definitely have finished by now). Oddly, though, the data files which the XProc script is supposedly still processing are no longer visible in that folder /mnt/isilon/apidata/full/, even though the XProc script treats them as read-only (they are only moved by the bash script when the XProc has finished).

f27wood commented 5 years ago

Hmm.. so not much you can do/see until the XProc stops running and you can see the logs?

BTW there was an issue with the permissions when Rick first ran the PHP scripts, as he copied them as the root user, which gave them different permissions. He has since changed them to have the same permissions as the previous files. Prob not related to this issue but thought I would mention it just in case.

Conal-Tuohy commented 5 years ago

I have re-read your comment above about the file permissions and realise I misinterpreted it. I think it is in fact possible that was the root of the problem; that the bash script could see the files were there, but that the XProc script was not able to open the files. The oddity is that the files are not there now in the full folder, even though the bash script never got up to the point where it would have deleted them. Did Rick delete them? I have killed the ETL process, BTW, since it was clearly going nowhere. I would have relaunched it, but because the full folder is empty, there's no point. So I'm hoping that when tonight's files are copied over, it will work again, or at least give me something more to go on.

f27wood commented 5 years ago

Yes, Rick deleted the files. So yes, let's wait and see if it runs OK tonight. Fingers crossed!

Conal-Tuohy commented 5 years ago

This issue is sorted now, is it not?

f27wood commented 5 years ago

Yes fixed! Will close.