Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

UnicodeEncodeError #77

Closed tpmccallum closed 9 years ago

tpmccallum commented 9 years ago

Hi, I get an error when running the Makefile; I have pasted the message below. It seems to be an encoding issue, but I have been very careful to encode everything as UTF-8 (including filenames), so I am not sure the problem is with my content. Any help will be greatly appreciated.

python OneClick.py database_metadata
Making a SQL table to hold the catalog data
loading data into catalog using LOAD DATA LOCAL INFILE...
/var/www/html/mccallum/bookworm/CreateDatabase.py:79: Warning: Row 3516 doesn't contain data for all columns
  cursor.execute(sql)
/var/www/html/mccallum/bookworm/CreateDatabase.py:84: Warning: Row 3516 doesn't contain data for all columns
  cursor.execute(sql)
Traceback (most recent call last):
  File "OneClick.py", line 211, in <module>
    getattr(program,method)()
  File "OneClick.py", line 88, in database_metadata
    Bookworm.load_book_list()
  File "/var/www/html/mccallum/bookworm/CreateDatabase.py", line 180, in load_book_list
    self.variableSet.loadMetadata()
  File "/var/www/html/mccallum/bookworm/variableSet.py", line 682, in loadMetadata
    db.query(loadcode)
  File "/var/www/html/mccallum/bookworm/CreateDatabase.py", line 84, in query
    cursor.execute(sql)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 176, in execute
    if not self._defer_warnings: self._warning_check()
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 92, in _warning_check
    warn(w[-1], self.Warning, 3)
  File "/usr/lib/python2.7/warnings.py", line 29, in _show_warning
    file.write(formatwarning(message, category, filename, lineno, line))
  File "/usr/lib/python2.7/warnings.py", line 38, in formatwarning
    s =  "%s:%s: %s: %s\n" % (filename, lineno, category.__name__, message)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 43-44: ordinal not in range(128)
make: *** [files/targets/database_metadata] Error 1
bmschmidt commented 9 years ago

This seems to involve not the ingested data itself, but the MySQL query used to load it in.

Is it possible that one of the metadata fields in your bookworm is identified by a Unicode key ("nóm_de_plume" or something)? It looks like the current code may require those keys to be ASCII, which is neither ideal nor documented.
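
If you want to check, a quick sketch along these lines would flag any non-ASCII field names (I'm assuming here that your descriptions file is called field_descriptions.json; adjust the path to wherever yours lives):

import json

# json.load gives back unicode strings; encoding to ASCII will
# fail loudly on any field name containing non-ASCII characters.
for entry in json.load(open("field_descriptions.json")):
    try:
        entry["field"].encode("ascii")
    except UnicodeEncodeError:
        print "Non-ASCII field name:", repr(entry["field"])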

tpmccallum commented 9 years ago

Hi Ben, thanks for the speedy reply. I really appreciate your assistance. How would I check the metadata field? I am not sure how to resolve this at present. FYI, as previously mentioned, I converted all of the data to UTF-8 (do I need to perform a hack to get around this error?)

Please see examples of my files below; I hope this helps.

This is my JSON descriptions file

[
{"field":"date","datatype":"time","type":"numeric","unique":true,"derived":[{"resolution":"year"}]},
{"field":"searchstring","datatype":"searchstring","type":"text","unique":true}
]

This is the JSON descriptions derived file

[{"datatype": "searchstring", "field": "searchstring", "unique": true, "type": "text"}, {"datatype": "time", "field": "date_year", "unique": true, "type": "integer"}]

This is an excerpt from the catalog.txt file

1       httpsdoajorgarticlece1cd34775a44f9ba6dca9579bcdd60a     <a href="https://doaj.org/article/ce1cd34775a44f9ba6dca9579bcdd60a" target="_blank">Causes, epidemiology, and long-term outcome of traumatic cataracts in children in rural India - 2012</a>    2012

This is an excerpt from the jsoncatalog.txt

Thanks again Tim

tpmccallum commented 9 years ago

I managed to get this to work with a tiny number of hand-selected records. I need to figure out why it breaks with the complete data set; my guess is that a character somewhere in my data is the cause. Any suggestions for cleaning all the files/data (a script that throws away anything that will break the Makefile)?

tpmccallum commented 9 years ago

Ok, I found the culprit. It seems that when I was creating the jsoncatalog file, json.dumps was returning non-UTF-8 output.

import json

def createJsonCatalogTxt(year, filename, searchString):
    jsonObject1 = {u"date": int(year), u"filename": filename, u"searchstring": searchString}
    return json.dumps(jsonObject1)

All of the other text was written to files using codecs etc., so it is very clean. json.dumps seems to make it impossible to return UTF-8. I tried the following:

json.dumps(jsonObject1, ensure_ascii=False).encode('utf8')

but that gives me a closely related error:

UnicodeDecodeError: 'ascii' codec can't decode byte ...
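
My guess (I may be wrong here) is that with ensure_ascii=False json.dumps returns unicode, and joining that with my non-ASCII byte-string values forces an implicit ascii decode somewhere inside the dumps call. A minimal reproduction of what I'm seeing (Python 2; the café string is just an illustration):

import json

# A unicode key plus a byte-string value containing non-ASCII bytes
# makes json.dumps mix unicode and str internally, which triggers an
# implicit ascii decode of the byte string.
json.dumps({u"searchstring": "caf\xc3\xa9"}, ensure_ascii=False)
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...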

Can anyone offer advice on how to ensure utf8 is returned from the function above?

Other than actually knowing what encoding I am fetching from the web when harvesting (and decoding/encoding explicitly), I am out of ideas. However, I did come up with a way to clean the files before running the bookworm Makefile.

The code, in a nutshell (note that it changes Python's default encoding, so I recommend running it in the console as a one-off):

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import os
import shutil
import codecs

def convert2utf8(fullPathy):
    # Back up the original file, then rewrite it as UTF-8, replacing
    # any bytes that are not valid UTF-8 with U+FFFD.
    newfilename = fullPathy + '.bak'
    shutil.copy(fullPathy, newfilename)
    ready = codecs.open(newfilename, 'r', 'utf-8', 'replace')
    writy = codecs.open(fullPathy, 'w', 'utf-8')
    for line in ready:
        writy.write(line)
    ready.close()
    writy.close()

currentDir = os.getcwd()
textFilesDir = os.path.join(currentDir, "textFiles")
for root, dirs, files in os.walk(textFilesDir):
    for file in files:
        # Join against root, not textFilesDir, so files in
        # subdirectories resolve correctly.
        fullPathy = os.path.join(root, file)
        print fullPathy
        convert2utf8(fullPathy)
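
With the "replace" error handler on the read side, any byte that isn't valid UTF-8 comes out as � instead of crashing the script, and the .bak copy keeps the original bytes in case something important gets mangled.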
bmschmidt commented 9 years ago

OK, trying to remember my unicode rules.

By default, json.dumps produces ASCII output, with any non-ASCII characters represented as \uXXXX escape points. So here's a dict with one byte that isn't valid UTF-8 (a Windows-1252 en dash, byte 0x96) and one genuine Unicode character (a smiley, U+263A):

import json

# \x96 is a Windows-1252 en dash; it is not a valid UTF-8 byte.
mixed_encodings = {"Windows-1252": "\x96", "Unicode": "☺"}
json.dumps(mixed_encodings)  # raises UnicodeDecodeError on the \x96 byte

There's no way to interpret the byte \x96 as UTF-8, so the best choice is just to replace it with the � character (U+FFFD), which you can do by decoding as UTF-8 with the "replace" option.

def coerce_to_utf(mystery_string):
    # Decode as UTF-8, swapping any invalid byte for U+FFFD, then re-encode.
    return mystery_string.decode("utf-8", "replace").encode("utf-8")

mixed_encodings = {"Windows-1252": "\x96", "Unicode": "☺"}

for key, val in mixed_encodings.iteritems():
    mixed_encodings[key] = coerce_to_utf(val)

print json.dumps(mixed_encodings)
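
Run under Python 2, that prints something like {"Windows-1252": "\ufffd", "Unicode": "\u263a"} (key order may vary), since json.dumps escapes the now-valid characters back to ASCII.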

So, to plow through data of unknown character encoding, what about just something like this: strip out all the bytes that aren't valid UTF-8 during the decode, and then re-encode as UTF-8.

def createJsonCatalogTxt(year, filename, searchString):
    jsonObject1 = {u"date": int(year),
                   u"filename": filename.decode("utf-8", "ignore").encode("utf-8"),
                   u"searchstring": searchString.decode("utf-8", "ignore").encode("utf-8")}
    return json.dumps(jsonObject1)
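
For instance (a made-up filename, just to illustrate), a stray Windows-1252 byte now gets silently dropped instead of blowing up the dump:

print createJsonCatalogTxt("2012", "cataracts\x96india.txt", "cataracts")
# Valid JSON; the invalid \x96 byte has been dropped (key order may vary):
# {"date": 2012, "searchstring": "cataracts", "filename": "cataractsindia.txt"}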

I don't know if this kind of coercion should be done automatically. We currently try to do it on the texts, but not on the metadata, because it's technically illegal to create JSON with invalid characters.

Just a note for the record: this is vaguely connected to the closed backslash issue, because it has to do with the validity of the jsoncatalog.txt file.

tpmccallum commented 9 years ago

Thanks Ben. I have started modifying my code; I am spending some time gathering more data and will re-run this in the next few days. Chat soon, Tim