HazyResearch / deepdive

DeepDive
deepdive.stanford.edu

Character encoding problem #639

Open mcavdar opened 7 years ago

mcavdar commented 7 years ago

Hello,

I'm trying to process a French corpus and keep getting results with an encoding problem, such as 'amÃ©ricaine' instead of 'américaine', even after changing the LC_ALL variable in the 'shell/deepdive' file to a value appropriate for French.

Any kind of help would be greatly appreciated. Thanks

edit: I added some debug lines to the tsj2corenlp-http-reqs file and found that every sentence already has the encoding problem before the CoreNLP request is made.

edit2: I've tracked it down to database/db-driver/postgresql/db-query-tsj. I think the problem is in the psycopg2 module.

edit3: I think the problem is Python 2. When I query with psycopg2 under Python 3, the result has no encoding problem. But after modifying database/db-driver/postgresql/db-query-tsj to use Python 3 (#!/usr/bin/env python -> #!/usr/bin/env python3), I get another error:

...
2017-05-25 15:00:20.582789 Loading dd_tmp_sentences from /home/mc/quaer-encode/run/process/ext_sentences_by_nlp_markup/deepdive-compute-execute.la1vGzv/output_computed-1 (tsj format)
2017-05-25 15:00:20.669839 Traceback (most recent call last):
2017-05-25 15:00:20.669914   File "/home/mc/local/util/db-driver/postgresql/db-query-tsj", line 6, in <module>
2017-05-25 15:00:20.669939     import psycopg2, psycopg2.extras, ujson
2017-05-25 15:00:20.669961   File "/home/mc/local/lib/bundled/python-lib/prefix/lib/python2.7/site-packages/psycopg2/__init__.py", line 50, in <module>
2017-05-25 15:00:20.669981     from psycopg2._psycopg import ( # noqa
2017-05-25 15:00:20.670002 ImportError: /home/mc/local/lib/bundled/python-lib/prefix/lib/python2.7/site-packages/psycopg2/_psycopg.so: undefined symbol: PyUnicodeUCS4_DecodeUTF8
2017-05-25 15:00:20.690613 /home/mc/local/util/compute-driver/local/compute-execute: ligne 129 : kill: (10546) - No such process
2017-05-25 15:00:20.693066 [ERROR] deepdive-unload: PID 10546: finished with non-zero exit status (1)
...

I don't know why it still tries to use the bundled Python 2.7 packages. Any ideas?
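For anyone reproducing this: the traceback shows a psycopg2 build from the bundled python2.7 site-packages directory being imported. A Python 2 extension compiled against a wide (UCS-4) interpreter refers to symbols such as PyUnicodeUCS4_DecodeUTF8, which Python 3.3 and later do not export, so the import fails when a different interpreter picks up that directory. Why the directory is on the search path is not established in this thread; one possibility (an assumption) is that it is added through PYTHONPATH or similar. The sketch below is a small diagnostic, not part of DeepDive, and the helper name `foreign_site_packages` is hypothetical. It lists search-path entries that name a Python version other than the one running.

```python
import re
import sys


def foreign_site_packages(path_entries, running=None):
    """Return the path entries that mention a pythonX.Y directory
    different from the running interpreter's X.Y version."""
    if running is None:
        running = "%d.%d" % sys.version_info[:2]
    foreign = []
    for entry in path_entries:
        match = re.search(r"python(\d+\.\d+)", entry)
        if match and match.group(1) != running:
            foreign.append(entry)
    return foreign


if __name__ == "__main__":
    # Report which interpreter is actually executing the script.
    print("interpreter:", sys.executable)
    print("version:", sys.version.split()[0])
    # 0xffff on a narrow (UCS-2) Python 2 build, 0x10ffff otherwise.
    print("sys.maxunicode:", hex(sys.maxunicode))
    for entry in foreign_site_packages(sys.path):
        print("built for another Python:", entry)
```

Running this with the same interpreter and environment as db-query-tsj should show whether the python2.7 bundle is visible to a python3 process.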

manning commented 7 years ago

I'd retitle this "Deepdive character encoding problem". As you've already determined, the problem isn't with CoreNLP, which handles French and character encodings just fine….
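As a point of reference, garbling of this kind is what typically appears when UTF-8 bytes are decoded with a single-byte codec such as Latin-1 somewhere along the pipeline. The sketch below only illustrates that general mechanism in Python 3; whether this is exactly what happens inside db-query-tsj is not confirmed in this thread, and the function names are hypothetical.

```python
def mojibake(text, wrong_encoding="latin-1"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled, wrong_encoding="latin-1"):
    """Undo mojibake() by reversing the wrong decode and decoding as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


# 'é' is the two bytes C3 A9 in UTF-8; read as Latin-1 they become 'Ã' and '©'.
garbled = mojibake("américaine")   # 'amÃ©ricaine'
restored = repair(garbled)         # 'américaine'
```

If stored text round-trips cleanly through `repair`, the data itself is probably intact and the wrong decode is happening on read, which would point at the client or driver configuration rather than the database contents.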