Closed ctschroeder closed 8 years ago
I see this, running locally: Coptic_edition: Chaîne (1960), § 5 p. 2
I see no faulty characters in the metadata in my local database (which happens to be SQL Lite [I refuse to call it SQLite because it’s silly to leave a letter out like that]). The next thing to check might be whether the production database and tables need a UTF-8 setting enabled.
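If the app talks to MySQL through Django (an assumption on my part — the texts_textmeta table name suggests it, but the names below are placeholders, not from this repo), one common fix is to force UTF-8 on the connection itself:

```python
# Hypothetical Django settings fragment (NOT from this repo): force the MySQL
# connection to UTF-8 so accented values survive the round trip.
# Database name and user are placeholders.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "coptic_db",   # placeholder
        "USER": "coptic_user", # placeholder
        "OPTIONS": {"charset": "utf8mb4"},  # utf8mb4 covers all of Unicode
    }
}
```

Even when the tables themselves are utf8, a connection negotiated in latin1 will mangle data on the way in or out.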
They’re wrong in the production database for some reason. In the experiment below, from the mysql client, I can save and recall a row whose value column contains Tést.
mysql> select * from texts_textmeta where value like '%1960%' limit 1;
+--------+----------------+--------------------------------+-------+-------------+
| id | name | value | pre | corpus_name |
+--------+----------------+--------------------------------+-------+-------------+
| 108000 | Coptic_edition | ChaÎne (1960), § 5 p. 2 | 23313 | |
+--------+----------------+--------------------------------+-------+-------------+
1 row in set (0.00 sec)
mysql> show full columns from texts_textmeta;
+-------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+-------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
| id | int(11) | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | |
| name | varchar(200) | utf8_unicode_ci | NO | | NULL | | select,insert,update,references | |
| value | varchar(200) | utf8_unicode_ci | NO | | NULL | | select,insert,update,references | |
| pre | varchar(200) | utf8_unicode_ci | NO | | NULL | | select,insert,update,references | |
| corpus_name | varchar(200) | utf8_unicode_ci | NO | | NULL | | select,insert,update,references | |
+-------------+--------------+-----------------+------+-----+---------+----------------+---------------------------------+---------+
5 rows in set (0.00 sec)
mysql> insert into texts_textmeta(name, value, pre, corpus_name) values('test', 'Tést', '', '');
Query OK, 1 row affected (0.02 sec)
mysql> select * from texts_textmeta order by id desc limit 1;
+--------+------+-------+-----+-------------+
| id | name | value | pre | corpus_name |
+--------+------+-------+-----+-------------+
| 115128 | test | Tést | | |
+--------+------+-------+-----+-------------+
1 row in set (0.00 sec)
mysql> delete from texts_textmeta where value = 'Tést';
Query OK, 1 row affected (0.02 sec)
Here’s a new twist. Some of the text metadata is correct and some is not. From the test system:
INSERT INTO texts_textmeta
(name
, value
, pre
, corpus_name
) VALUES ('Coptic_edition', ...
'ChaÎne (1960), § 6 p. ', '23461', 'AP.006.n196.worms')
'ChaÎne (1960), § 18 (Fragments de Naples),p. 4', '23463', 'AP.018.n372.anger')
'ChaÎne (1960), § 19 (Fragments de Naples), p. 4', '23465', 'AP.019.presbyter')
'Chaîne (1960), § 22 pp. 4-5', '23467', 'AP.022.isaac-cells.08')
'Chaîne (1960), § 23 (Fragments de Vienne),pp. 4-5', '23469', 'AP.023.isaac-cells.07')
'Chaîne (1960), § 24 (Fragments de Vienne),p. 5', '23471', 'AP.024.isaac-cells.07')
'Chaîne (1960), § 25 (Fragments de Vienne), p. 5', '23473', 'AP.025.isaac-cells.12')
'Chaîne (1960), § 23,24 (Fragments de Vienne),p. 5', '23475', 'AP.026.cassian.07')
'ChaĂŽne (1960)', '23477', 'AP.027.pistamon.01')
'ChaĂŽne (1960)', '23479', 'AP.028.serapion.02')
'ChaĂŽne (1960)', '23481', 'AP.029.syncletica.05')
'ChaĂŽne (1960)', '23483', 'AP.030.hyperechios.06')
'ChaĂŽne (1960)', '23485', 'AP.031.philagrios.01')
'ChaĂŽne (1960)', '23487', 'AP.032.benjamin.05')
'ChaĂŽne (1960)', '23489', 'AP.033.bessarion.06')
'Chaîne (1960), § 34 (Fragments de Vienne), p. 6', '23491', 'AP.034.theodore-pherme.02')
'ChaĂŽne (1960)', '23493', 'AP.035.theodore-pherme.24')
'ChaÎne (1960), § 53 (Fragment de Londres), p. 12', '23495', 'AP.053.unid.ammona')
'Chaîne (1960), § 90 (Fragments de Vienne)', '23497', 'AP.090.olympius.01')
'ChaÎne (1960), § 114 (Fragments de Naples), p. 26', '23499', 'AP.114.theophilus.02')
'ChaÎne (1960), § 125 (Fragments de Naples), p. 28', '23501', 'AP.125.orsisius.01')
'Chaîne (1960), § 157 (Fragments de Naples)', '23503', 'AP.157.paphnutius.02')
'Chaîne (1960), § 172 (Fragment de Naples)', '23505', 'AP.172.unid.antony')
'Chaîne (1960), § 177 (Fragments de Paris), p. 42', '23507', 'AP.177.ephrem.01')
'Chaîne (1960), § 216 (Fragments de Naples)', '23509', 'AP.216.bessarion.01')
'Chaîne (1960), § 217 (Fragments de Naples), p. 63', '23511', 'AP.217.bessarion.02')
Could the ANNIS data be different?
@amir-zeldes will need to chime in on this. I don't see a difference on visual inspection in ANNIS. I also tried searching for |Chaîne in the metadata and got hits corresponding both to documents that appear correct and to documents that appear incorrect on data.copticscriptorium.org
OK, this is very strange. I've figured out one correlate of the errors: files stored as .xls don't have the error (e.g. AP22), but .xlsx files do (AP6). That seems pretty consistent. The strange thing is, when I paste the symbol into a text editor, both produce the same result.
So I went and checked the PAULA data and relANNIS: the symbols look exactly the same. I tried a binary compare in Python and found that both begin with the same byte, 0xC3. There doesn't seem to be any difference in the files that ANNIS is being fed. Here's the PAULA XML that corresponds to the ANNIS input:
(good file)
<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="Coptic_edition" xml:base="AP.AP.022.isaac-cells.08.anno.xml">
<feat xlink:href="#anno_1" value="Chaîne (1960), § 22 pp. 4-5"/>
</featList>
(bad file)
<featList xmlns:xlink="http://www.w3.org/1999/xlink" type="Coptic_edition" xml:base="AP.AP.006.n196.worms.anno.xml">
<feat xlink:href="#anno_1" value="Chaîne (1960), § 6 p. "/>
</featList>
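A quick way to do that byte-level check is to hex-dump the two value strings; any hidden difference (combining characters, odd codepoints) shows up as differing hex. A sketch, with the strings copied from the XML above:

```python
# Dump the UTF-8 bytes of the metadata values so any hidden difference
# between the 'good' and 'bad' files becomes visible.
good = "Chaîne (1960), § 22 pp. 4-5"  # from the good PAULA file
bad = "Chaîne (1960), § 6 p. "        # from the bad PAULA file

def utf8_hex(s):
    """Return the UTF-8 bytes of s as space-separated hex pairs."""
    return s.encode("utf-8").hex(" ")

print(utf8_hex(good[:6]))  # 'Chaîne' → 43 68 61 c3 ae 6e 65
print(utf8_hex(bad[:6]))
```

If both lines print the same bytes, the files really are identical at that point and the corruption happens later in the pipeline.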
My text editor sees no difference between the special characters here. Any ideas, Dave? Is it maybe applying some special encoding to these entries because of the presence of some other character? I only looked at the i with the circumflex.
Do the two files have XML headers with an encoding set, like this?
<?xml version="1.0" encoding="UTF-8"?>
Yeah, they're both UTF-8 in the header, and in the file itself as well (UTF-8, no BOM). I was talking to Carrie and Beth, and we actually have another theory: it seems that another distinguishing factor between the two types of files is the value of a different metadatum, called Greek_source.
In the working documents, this field has the name of the document in polytonic Greek (with accents), which requires Unicode to render correctly. The broken documents have only English describing the Greek source. Is it possible that the presence of the Greek symbols in a different metadatum kicks the browser into a different mode, in which it treats the accented 'i' in the other field differently? Some sort of auto-encoding recognition behavior perhaps?
Is it possible that the presence of the Greek symbols in a different metadatum kicks the browser into a different mode ...
I think we can rule out the browser because after the data is fetched from ANNIS it is stored into the database, where it appears wrong when viewing it with a MySQL client.
What about the headless browser which harvests the data in the first place?
The headless browser gets the visualizations, right? The metadata fetch is a separate operation through a simple HTTP GET, then using Beautiful Soup.
Oh, right... In that case I continue to be stumped. I can't see a difference in the ANNIS table dumps though, which is where the data ultimately comes from. Do you have some diagnostic tools that could maybe see something I'm missing?
Here's where the raw data enters ANNIS, and I cannot spot any difference between the working and non-working rows:
Do you see something? Compare AP6 and AP22 on lines 31 and 107 of that file - looks identical, no?
I’ve added logging of the XML metadata response to the test system, and I’m running that now.
Another thought: can you implant a Greek character from the Coptic_edition field into one of the broken documents' requests and see if that fixes it? If so, we'll know we're on the right track. For example AP22 in the apophthegmata corpus will have polytonic Greek in this field.
I haven’t seen the problem recur since I switched from urllib.request to requests: https://github.com/CopticScriptorium/cts/commit/6540dbfca1187fc2349b882a5ac9b7e571b6e443
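One plausible reason the switch helped: urllib.request hands back raw bytes, so any implicit or wrong decode downstream produces exactly this kind of mojibake, while requests decodes using the charset the server declares. A sketch of the safe pattern (the function name is mine, not from the repo):

```python
# Decode an HTTP response body defensively: prefer the charset the server
# declared, fall back to UTF-8, and never rely on a platform default.
def decode_response(body, declared_charset=None):
    return body.decode(declared_charset or "utf-8")

print(decode_response("Chaîne (1960)".encode("utf-8")))  # → Chaîne (1960)
```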
Observe in the test system.
Works in the test system in Chrome and Firefox. HURRAY!!!
It’s not working in production.
hmmm
So this means you've deployed the code on the production server but it's not working there? It looks like the difference is still what Amir noticed: the ones with Greek in other fields work. E.g. http://data.copticscriptorium.org/texts/ap/ap022isaac-cells08/norm and http://data.copticscriptorium.org/texts/ap/ap023isaac-cells07/norm but not http://data.copticscriptorium.org/texts/ap/ap029syncletica05/norm
@oneericjohnson can you think of a reason why the test instance and the production instance would be displaying characters differently?
@oneericjohnson says the main differences between the test and production servers are the MySQL, browser, and Python setups. Since the data is corrupt in the MySQL database, he suggests updating the browser software on the production server (perhaps updating all the software?) and then restarting the server.
Could it be a default locale issue? Either of the OS or of the MySQL installation?
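Comparing the two servers' defaults is quick; running the same three lines on test and production would expose any mismatch (output varies by machine, so none is shown):

```python
# Print the encoding defaults that most often differ between machines.
import locale
import sys

print(sys.getdefaultencoding())       # Python's str/bytes default
print(locale.getpreferredencoding())  # locale-driven default for open()
print(sys.stdout.encoding)            # what the terminal receives
```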
@dcbriccetti: this is working on the test server, right? The VM Eric set up at home (using the code in the master branch here) still has probs with the UTF-8 characters. Is the fix for this issue merged into master or on a separate branch? Thanks.
I never saw the problem appear on the test server. All my work to date on this issue is in master. So it looks like the problem persists. :-(
Still, the new production server was worth doing for the additional memory and in-place DBMS.
Seeing this on the go, but quick thought - can we compare db locales? Maybe they have different defaults
The test server is running from the code on the master branch?
DB settings and version all identical. http://52.27.80.198/ Browser on this one is a newer version of Chrome, so to test if it's the browser, he has to install an older one, and that would take some investigation.
Working.
See Chaîne: http://52.27.80.198/texts/ap/ap006n196worms/analytic
See Amélineau: http://52.27.80.198/texts/acephalous_work_22/a22ya517-518/norm
Will redirect DNS to this server.
UTF-8 characters in metadata are not appearing properly in the visualizations (e.g., Chaîne’s name, Amélineau’s name, and other foreign words with accents or diacritics).
See Coptic_edition here http://data.copticscriptorium.org/texts/ap/ap005unidsenses for an example of Chaîne being mangled.
Luke Hollis, the week of June 23, said the problem was with the text formatting in ANNIS (?). He also suggested changing the font in the “theme development” aspect of the SQL database: in the static templates, which use the Jinja templating language (???), saved as HTML files. (This was getting into boutique languages that I could not understand, so I may not have accurately described them here.)