Some UTF 8 characters with accents not appearing properly in visualizations

ctschroeder commented 9 years ago

UTF-8 characters in metadata are not appearing properly in the visualizations (e.g., Chaîne’s name, Amélineau’s name, and other foreign words with accents or diacritics).

See Coptic_edition here http://data.copticscriptorium.org/texts/ap/ap005unidsenses for an example of Chaîne being mangled.

Luke Hollis said the week of June 23 that the problem was with the text formatting in ANNIS (?). He also suggested changing the font in the “theme development” aspect of the SQL database: in the static templates, which use the Jinja templating language (???) and are saved as HTML files. (This was getting into boutique languages that I could not understand, so I may not have accurately described them here.)

dcbriccetti commented 8 years ago

I see this, running locally: Coptic_edition: Chaîne (1960), § 5 p. 2

I see no faulty characters in the metadata in my local database (which happens to be SQL Lite [I refuse to call it SQLite because it’s silly to leave a letter out like that]). The next thing to check might be whether the production database and tables need to have a UTF-8 setting enabled.

dcbriccetti commented 8 years ago

They’re wrong in the production database for some reason. In the experiment below, from the mysql client, I can save and recall a row with a column with a value of Tést.

mysql> select * from texts_textmeta where value like '%1960%' limit 1;
+--------+----------------+--------------------------------+-------+-------------+
| id     | name           | value                          | pre   | corpus_name |
+--------+----------------+--------------------------------+-------+-------------+
| 108000 | Coptic_edition | ChaĂŽne (1960), Â§ 5 p. 2      | 23313 |             |
+--------+----------------+--------------------------------+-------+-------------+
1 row in set (0.00 sec)

mysql> show full columns from texts_textmeta;
+-------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| Field       | Type         | Collation       | Null | Key | Default | Extra          | Privileges                      | Comment |
+-------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| id          | int(11)      | NULL            | NO   | PRI | NULL    | auto_increment | select,insert,update,references |         |
| name        | varchar(200) | utf8_unicode_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| value       | varchar(200) | utf8_unicode_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| pre         | varchar(200) | utf8_unicode_ci | NO   |     | NULL    |                | select,insert,update,references |         |
| corpus_name | varchar(200) | utf8_unicode_ci | NO   |     | NULL    |                | select,insert,update,references |         |
+-------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
5 rows in set (0.00 sec)

mysql> insert into texts_textmeta(name, value, pre, corpus_name) values('test', 'Tést', '', '');
Query OK, 1 row affected (0.02 sec)

mysql> select * from texts_textmeta order by id desc limit 1;
+--------+------+-------+-----+-------------+
| id     | name | value | pre | corpus_name |
+--------+------+-------+-----+-------------+
| 115128 | test | Tést  |     |             |
+--------+------+-------+-----+-------------+
1 row in set (0.00 sec)

mysql> delete from texts_textmeta where value = 'Tést';
Query OK, 1 row affected (0.02 sec)
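
For reference, the mangled value in row 108000 is exactly what results when the UTF-8 bytes of the correct string are decoded as ISO-8859-2 (Latin-2). A minimal Python sketch of that check follows. That a Latin-2 mis-decoding actually happened somewhere in the pipeline is an assumption, not something established in this thread, and the helper names `mangle` and `repair` are hypothetical.

```python
# Hypothetical helpers: reproduce and reverse the mangling seen in texts_textmeta.
# Assumption: the bad rows are UTF-8 bytes that were decoded as ISO-8859-2.

def mangle(text: str, wrong_encoding: str = "iso-8859-2") -> str:
    """Encode as UTF-8, then decode with the wrong single-byte encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(text: str, wrong_encoding: str = "iso-8859-2") -> str:
    """Reverse mangle(). May raise UnicodeError if the text was not mangled this way."""
    return text.encode(wrong_encoding).decode("utf-8")


good = "Chaîne (1960), § 5 p. 2"
bad = mangle(good)

# ascii() keeps the output printable on consoles that are not UTF-8.
print(ascii(bad))
print(repair(bad) == good)
```

If the round trip holds for the affected rows, the stored text is recoverable and the fault lies in a decoding step rather than in the source data.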

dcbriccetti commented 8 years ago

Here’s a new twist. Some of the text metadata is correct and some is not. From the test system: INSERT INTO texts_textmeta (name, value, pre, corpus_name) VALUES ('Coptic_edition', ...

'ChaĂŽne (1960), Â§ 6 p.  ', '23461', 'AP.006.n196.worms')
'ChaĂŽne (1960), Â§ 18 (Fragments de Naples),p. 4', '23463', 'AP.018.n372.anger')
'ChaĂŽne (1960), Â§ 19 (Fragments de Naples), p. 4', '23465', 'AP.019.presbyter')
'Chaîne (1960), § 22 pp. 4-5', '23467', 'AP.022.isaac-cells.08')
'Chaîne (1960), § 23 (Fragments de Vienne),pp. 4-5', '23469', 'AP.023.isaac-cells.07')
'Chaîne (1960), § 24 (Fragments de Vienne),p. 5', '23471', 'AP.024.isaac-cells.07')
'Chaîne (1960), § 25 (Fragments de Vienne), p. 5', '23473', 'AP.025.isaac-cells.12')
'Chaîne (1960), § 23,24 (Fragments de Vienne),p. 5', '23475', 'AP.026.cassian.07')
'ChaĂŽne (1960)', '23477', 'AP.027.pistamon.01')
'ChaĂŽne (1960)', '23479', 'AP.028.serapion.02')
'ChaĂŽne (1960)', '23481', 'AP.029.syncletica.05')
'ChaĂŽne (1960)', '23483', 'AP.030.hyperechios.06')
'ChaĂŽne (1960)', '23485', 'AP.031.philagrios.01')
'ChaĂŽne (1960)', '23487', 'AP.032.benjamin.05')
'ChaĂŽne (1960)', '23489', 'AP.033.bessarion.06')
'Chaîne (1960), § 34 (Fragments de Vienne), p. 6', '23491', 'AP.034.theodore-pherme.02')
'ChaĂŽne (1960)', '23493', 'AP.035.theodore-pherme.24')
'ChaĂŽne (1960), Â§ 53 (Fragment de Londres), p. 12', '23495', 'AP.053.unid.ammona')
'Chaîne (1960), § 90 (Fragments de Vienne)', '23497', 'AP.090.olympius.01')
'ChaĂŽne (1960), Â§ 114 (Fragments de Naples), p. 26', '23499', 'AP.114.theophilus.02')
'ChaĂŽne (1960), Â§ 125 (Fragments de Naples), p. 28', '23501', 'AP.125.orsisius.01')
'Chaîne (1960), § 157 (Fragments de Naples)', '23503', 'AP.157.paphnutius.02')
'Chaîne (1960), § 172 (Fragment de Naples)', '23505', 'AP.172.unid.antony')
'Chaîne (1960), § 177 (Fragments de Paris), p. 42', '23507', 'AP.177.ephrem.01')
'Chaîne (1960), § 216 (Fragments de Naples)', '23509', 'AP.216.bessarion.01')
'Chaîne (1960), § 217 (Fragments de Naples), p. 63', '23511', 'AP.217.bessarion.02')

Could the ANNIS data be different?

ctschroeder commented 8 years ago

@amir-zeldes will need to chime in on this. I don't see a difference on a visual inspection in ANNIS. I also tried searching for Chaîne in the metadata and got hits that correspond both to documents that appear correct and to documents that appear incorrect in data.copticscriptorium.org.

https://corpling.uis.georgetown.edu/annis/scriptorium#_q=bWV0YTo6Q29wdGljX2VkaXRpb249L0NoYcOubmUuKi8gJiBwYl94bWxfaWQ&_c=YXBvcGh0aGVnbWF0YS5wYXRydW0&cl=5&cr=5&s=10&l=10&_seg=bm9ybV9ncm91cA|

amir-zeldes commented 8 years ago

OK, this is very strange. I've figured out one correlate of the errors, which is that files that are stored as .xls don't have the error (e.g. AP22), but .xlsx files do (AP6). That seems pretty consistent. The strange thing is, when I paste the symbol into a text editor, both produce the same result.

So I went and checked the PAULA data and relANNIS: the symbols look exactly the same. I ran a binary comparison in Python, and both files have byte 0xc3 at that position. There doesn't seem to be any difference in the files that ANNIS is being fed. Here's the PAULA XML that corresponds to the ANNIS input:

(good file)

<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="Coptic_edition" xml:base="AP.AP.022.isaac-cells.08.anno.xml">
<feat xlink:href="#anno_1" value="Chaîne (1960), § 22 pp. 4-5"/>
</featList>

(bad file)

<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="Coptic_edition" xml:base="AP.AP.006.n196.worms.anno.xml">
<feat xlink:href="#anno_1" value="Chaîne (1960), § 6 p.  "/>
</featList>

My text editor sees no difference between the special characters here. Any idea, Dave? Is it maybe applying some special encoding to these entries because of the presence of some other character? I only looked at the i with the circumflex.
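
A byte-level comparison like the one described above can be sketched in Python as follows. It lists every non-ASCII character with its Unicode name and UTF-8 bytes, which also distinguishes a precomposed î (bytes c3 ae) from a visually identical "i" plus combining circumflex. The function name `describe_non_ascii` is hypothetical, and the two values are copied from the PAULA snippets quoted in this comment.

```python
import unicodedata
from typing import List, Tuple


def describe_non_ascii(text: str) -> List[Tuple[str, str, str]]:
    """Return (character, Unicode name, UTF-8 bytes as hex) for each non-ASCII character."""
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), ch.encode("utf-8").hex(" "))
        for ch in text
        if ord(ch) > 127
    ]


good_value = "Chaîne (1960), § 22 pp. 4-5"  # AP22, displays correctly
bad_value = "Chaîne (1960), § 6 p.  "       # AP6, displays mangled

for label, value in (("good", good_value), ("bad", bad_value)):
    # Print only names and hex bytes so the output is plain ASCII.
    print(label, [(name, hexbytes) for _, name, hexbytes in describe_non_ascii(value)])

# A decomposed form looks the same on screen but has a different byte sequence.
decomposed = unicodedata.normalize("NFD", "î")
print([(name, hexbytes) for _, name, hexbytes in describe_non_ascii(decomposed)])
```

If both values report the same names and bytes, the inputs really are identical and the difference must be introduced later, when the data is fetched or stored.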

dcbriccetti commented 8 years ago

Do the two files have XML headers with an encoding set, like this?

<?xml version="1.0" encoding="UTF-8"?>

amir-zeldes commented 8 years ago

Yeah, they're both UTF-8 in the header, and in the file itself as well (UTF-8, no BOM). I was talking to Carrie and Beth, and we actually have another theory: it seems that another distinguishing factor between the two types of files is the value of a different metadatum, called Greek_source.

In the working documents, this field has the name of the document in polytonic Greek (with accents), which requires Unicode to render correctly. The broken documents have only English describing the Greek source. Is it possible that the presence of the Greek symbols in a different metadatum kicks the browser into a different mode, in which it treats the accented 'i' in the other field differently? Some sort of auto-encoding recognition behavior perhaps?

dcbriccetti commented 8 years ago

Is it possible that the presence of the Greek symbols in a different metadatum kicks the browser into a different mode ...

I think we can rule out the browser because after the data is fetched from ANNIS it is stored into the database, where it appears wrong when viewing it with a MySQL client.

amir-zeldes commented 8 years ago

What about the headless browser which harvests the data in the first place?

dcbriccetti commented 8 years ago

The headless browser gets the visualizations, right? The metadata fetch is a separate operation through a simple HTTP GET, then using Beautiful Soup.
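
One way to take encoding guesswork out of the metadata step is to hand the raw response bytes to an XML parser that honors the declared encoding. The sketch below uses the standard library's ElementTree rather than Beautiful Soup, so it is not the project's actual code. The response body is an assumed stand-in modeled on the PAULA featList quoted earlier in this thread, since the real ANNIS response format is not shown here, and `feat_values` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumed stand-in for the HTTP response body, as raw UTF-8 bytes.
raw_response = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="Coptic_edition">\n'
    '<feat xlink:href="#anno_1" value="Chaîne (1960), § 22 pp. 4-5"/>\n'
    '</featList>'
).encode("utf-8")


def feat_values(raw: bytes) -> List[str]:
    """Parse raw XML bytes and return the value attribute of every <feat> element.

    Passing bytes (not an already-decoded str) lets the parser use the
    encoding named in the XML declaration instead of a guessed one.
    """
    root = ET.fromstring(raw)
    return [feat.get("value") for feat in root.iter("feat")]


# ascii() keeps the output printable on consoles that are not UTF-8.
print(ascii(feat_values(raw_response)))
```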

amir-zeldes commented 8 years ago

Oh, right... In that case I continue to be stumped. I can't see a difference in the ANNIS table dumps though, which is where the data ultimately comes from. Do you have some diagnostic tools that could maybe see something I'm missing?

Here's where the raw data enters ANNIS, and I cannot spot any difference between the working and non-working rows:

https://github.com/CopticScriptorium/corpora/blob/master/AP/apophthegmata.patrum_ANNIS/corpus_annotation.annis

Do you see something? Compare AP6 and AP22 on lines 31 and 107 of that file - looks identical, no?

dcbriccetti commented 8 years ago

I’ve added logging of the XML metadata response to the test system, and I’m running that now.

amir-zeldes commented 8 years ago

Another thought: can you implant a Greek character from the Coptic_edition field into one of the broken documents' requests and see if that fixes it? If so, we'll know we're on the right track. For example AP22 in the apophthegmata corpus will have polytonic Greek in this field.

dcbriccetti commented 8 years ago

I haven’t seen the problem recur since I switched from urllib.request to requests: https://github.com/CopticScriptorium/cts/commit/6540dbfca1187fc2349b882a5ac9b7e571b6e443
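
As an illustration only (this is not the code in the commit above), a fetch with the requests library can sidestep encoding detection by decoding the raw bytes as UTF-8 explicitly. Treating the response as UTF-8 is an assumption based on the files being reported as UTF-8 in this thread. The function names and the `url` parameter are hypothetical.

```python
def decode_utf8_strict(raw: bytes) -> str:
    """Decode bytes as UTF-8, raising UnicodeDecodeError instead of guessing."""
    return raw.decode("utf-8")


def fetch_metadata_xml(url: str, timeout: float = 30.0) -> str:
    """Fetch a URL and return its body decoded as UTF-8.

    Uses response.content (raw bytes) rather than response.text, so the
    result does not depend on the encoding requests infers from headers.
    """
    import requests  # third-party; imported here so the helper above works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return decode_utf8_strict(response.content)
```

Failing loudly on invalid bytes is a deliberate choice here: a decoding error at fetch time is easier to trace than mangled rows found later in the database.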

Observe in the test system.

ctschroeder commented 8 years ago

Works in the test system in Chrome and Firefox. HURRAY!!!

dcbriccetti commented 8 years ago

It’s not working in production.

ctschroeder commented 8 years ago

hmmm

ctschroeder commented 8 years ago

So this means you've deployed the code on the production server but it's not working there? It looks like the difference is still what Amir noticed: the ones with Greek in other fields work. E.g. http://data.copticscriptorium.org/texts/ap/ap022isaac-cells08/norm and http://data.copticscriptorium.org/texts/ap/ap023isaac-cells07/norm but not http://data.copticscriptorium.org/texts/ap/ap029syncletica05/norm

ctschroeder commented 8 years ago

@oneericjohnson can you think of a reason why the test instance and the production instance would be displaying characters differently?

ctschroeder commented 8 years ago

@oneericjohnson says that the main things that are different between the test and production servers are the MySQL setup, browser setup, and Python setup. Since the data is corrupt in the MySQL database, he suggests updating the browser software on the production server. (Perhaps updating all the software?) And then restarting the server.

amir-zeldes commented 8 years ago

Could it be a default locale issue? Either of the OS or of the MySQL installation?
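
A rough way to compare defaults between two servers is to collect the Python-side encodings and the MySQL character-set variables on each and diff them. The sketch below assumes a DB-API cursor that returns rows as tuples (the default for common MySQL drivers) and takes the cursor as a parameter rather than naming a specific driver. All function names are hypothetical.

```python
import locale
import sys
from typing import Dict, Optional, Tuple


def python_encoding_defaults() -> Dict[str, str]:
    """Collect the encodings Python falls back on when none is given explicitly."""
    return {
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None) or "unknown",
    }


def mysql_charset_settings(cursor) -> Dict[str, str]:
    """Read character-set and collation variables through a DB-API cursor.

    Assumes the cursor is connected to MySQL and yields (name, value) tuples.
    """
    cursor.execute(
        "SHOW VARIABLES WHERE Variable_name LIKE 'character_set%' "
        "OR Variable_name LIKE 'collation%'"
    )
    return {name: value for name, value in cursor.fetchall()}


def differing_settings(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return only the keys whose values differ, as (value_in_a, value_in_b)."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


print(python_encoding_defaults())
```

Running the collectors on the test and production servers and passing both results to `differing_settings` would show at a glance whether a default such as the server or connection character set differs.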

ctschroeder commented 8 years ago

@dcbriccetti: this is working on the test server, right? The VM Eric set up at home (using the code in the master branch here) still has probs with the UTF-8 characters. Is the fix for this issue merged into master or on a separate branch? Thanks.

dcbriccetti commented 8 years ago

I never saw the problem appear on the test server. All my work to date on this issue is in master. So it looks like the problem persists. :-(

Still, the new production server was worth doing for the additional memory and in-place DBMS.

amir-zeldes commented 8 years ago

Seeing this on the go, but quick thought - can we compare db locales? Maybe they have different defaults

ctschroeder commented 8 years ago

The test server is running from the code on the master branch?

ctschroeder commented 8 years ago

DB settings and version all identical. http://52.27.80.198/ Browser on this one is a newer version of Chrome, so to test if it's the browser, he has to install an older one, and that would take some investigation.

ctschroeder commented 8 years ago

Working. See Chaîne: http://52.27.80.198/texts/ap/ap006n196worms/analytic and Amélineau: http://52.27.80.198/texts/acephalous_work_22/a22ya517-518/norm

Will redirect DNS to this server.

CopticScriptorium / cts

Some UTF 8 characters with accents not appearing properly in visualizations #84