JoKnopp / dmoz2db

A database importer for the open directory project (aka dmoz) data
20 stars 18 forks source link

Charset issue #2

Open bsgreenb opened 12 years ago

bsgreenb commented 12 years ago

I followed the instructions at http://cameronyule.com/2008/07/configuring-mysql-to-use-utf-8 to change my charset, and it looks like it changed them:

mysql> SHOW VARIABLES LIKE 'character_set%'; +--------------------------+--------------------------------------------------------+ | Variable_name | Value | +--------------------------+--------------------------------------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/local/mysql-5.5.24-osx10.6-x86_64/share/charsets/ | +--------------------------+--------------------------------------------------------+ 8 rows in set (0.00 sec)

However, when I run your script with python dmoz2db.py it trips part way through:

[INFO]: Reading configuration from: db.conf [INFO]: Missing port configuration. Trying to connect with default values. [INFO]: Missing driver configuration. Trying to connect with default values. [INFO]: Successfully connected to database [INFO]: Dropping existing tables [INFO]: Initialising tables [INFO]: Starting first parse of structure.rdf.u8 .................... - 200000 Topics parsed .................... - 400000 Topics parsed .................... - 600000 Topics parsed ................. [INFO]: done - added all Topics to the database (took 0:26:48.701962) [INFO]: Generating father ids

[INFO]: Father id generation successful (took 0:00:00.001045) [INFO]: Starting second parse of structure.rdf.u8 .Traceback (most recent call last): File "dmoz2db.py", line 339, in parse(xmlstream, structure_handler) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/init.py", line 33, in parse parser.parse(source) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 107, in parse xmlreader.IncrementalParser.parse(self, source) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse self.feed(buffer) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 207, in feed self._parser.Parse(data, isFinal) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 301, in start_element self._cont_handler.startElement(name, AttributesImpl(attrs)) File "/Users/ben/sources/dmoz2db/src/lib/handler.py", line 173, in startElement self.topic.store_in_db(self.engine) File "/Users/ben/sources/dmoz2db/src/lib/structure.py", line 244, in store_in_db self._store_related(engine, conn) File "/Users/ben/sources/dmoz2db/src/lib/structure.py", line 168, in _store_related rtop_dbdata = conn.execute(sel_by_top, tname=rel_top) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1191, in execute params) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement return self.__execute_context(context) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context context.parameters[0], context=context) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute context) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1394, in _cursor_execute context) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/default.py", line 299, in do_execute cursor.execute(statement, parameters) File "build/bdist.macosx-10.6-intel/egg/MySQLdb/cursors.py", line 174, in execute File "build/bdist.macosx-10.6-intel/egg/MySQLdb/connections.py", line 36, in defaulterrorhandler sqlalchemy.exc.OperationalError: (OperationalError) (1267, "Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='") 'SELECT categories.catid, categories.Topic, categories.Title, categories.Description, categories.lastupdate, categories.letterbar, categories.fatherid \nFROM categories \nWHERE categories.Topic = %s' ('Top/World/Arabic/\xd9\x81\xd9\x86\xd9\x88\xd9\x86/\xd8\xa3\xd8\xaf\xd8\xa8',)

bsgreenb commented 12 years ago

Ah, looks like the issue is likely with collation:

mysql> show variables like 'collation%'; +----------------------+-------------------+ | Variable_name | Value | +----------------------+-------------------+ | collation_connection | utf8_general_ci | | collation_database | latin1_swedish_ci | | collation_server | utf8_general_ci | +----------------------+-------------------+ 3 rows in set (0.00 sec)

bsgreenb commented 12 years ago

Hmm, even with the variables right it's running into the issue later on:

Top/World/Arabic/فنون/أدب Traceback (most recent call last): File "dmoz2db.py", line 339, in parse(xmlstream, structure_handler) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/init.py", line 33, in parse parser.parse(source) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 107, in parse xmlreader.IncrementalParser.parse(self, source) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse self.feed(buffer) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 207, in feed self._parser.Parse(data, isFinal) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/sax/expatreader.py", line 301, in start_element self._cont_handler.startElement(name, AttributesImpl(attrs)) File "/Users/ben/sources/dmoz2db/src/lib/handler.py", line 173, in startElement self.topic.store_in_db(self.engine) File "/Users/ben/sources/dmoz2db/src/lib/structure.py", line 245, in store_in_db self._store_related(engine, conn) File "/Users/ben/sources/dmoz2db/src/lib/structure.py", line 169, in _store_related rtop_dbdata = conn.execute(sel_by_top, tname=rel_top) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1191, in execute params) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement return self.__execute_context(context) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context context.parameters[0], context=context) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute context) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/base.py", line 1394, in _cursor_execute context) File "/Library/Python/2.7/site-packages/sqlalchemy/engine/default.py", line 299, in do_execute cursor.execute(statement, parameters) File "build/bdist.macosx-10.6-intel/egg/MySQLdb/cursors.py", line 174, in execute File "build/bdist.macosx-10.6-intel/egg/MySQLdb/connections.py", line 36, in defaulterrorhandler sqlalchemy.exc.OperationalError: (OperationalError) (1267, "Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='") 'SELECT categories.catid, categories.Topic, categories.Title, categories.Description, categories.lastupdate, categories.letterbar, categories.fatherid \nFROM categories \nWHERE categories.Topic = %s' ('Top/World/Arabic/\xd9\x81\xd9\x86\xd9\x88\xd9\x86/\xd8\xa3\xd8\xaf\xd8\xa8',)

Here are the (now correct) variables:

mysql> show variables like "%character%";show variables like "%collation%"; +--------------------------+--------------------------------------------------------+ | Variable_name | Value | +--------------------------+--------------------------------------------------------+ | character_set_client | utf8 | | character_set_connection | utf8 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | utf8 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/local/mysql-5.5.24-osx10.6-x86_64/share/charsets/ | +--------------------------+--------------------------------------------------------+ 8 rows in set (0.00 sec)

+----------------------+-----------------+ | Variable_name | Value | +----------------------+-----------------+ | collation_connection | utf8_general_ci | | collation_database | utf8_general_ci | | collation_server | utf8_general_ci | +----------------------+-----------------+ 3 rows in set (0.00 sec)

What's going on?

JoKnopp commented 12 years ago

I am sorry if the code isn't working as is but I am no longer maintaining it. I just leave it here for anyone who is interested in it. If you have something to pull from I'd be happy to do that.

sindreu commented 9 years ago

I had the same issue I believe.

Try droping and creating the table after the utf8 changes, if the tables were created before they will have the old settings.