OneZoom / OZtree

OneZoom Tree of Life Explorer

Set the 4 `name` fields to utf8mb4 #885

Open hyanwong opened 1 month ago

hyanwong commented 1 month ago

In https://github.com/OneZoom/OZtree/blob/f90d82ccab3b168fedeada4ee732c8d1641cddb2/OZprivate/ServerScripts/SQL/create_db_indexes.sql#L39 we set a number of search fields to utf8mb4. However, we also search in vernacular_by_name.name, vernacular_by_ott.name, ordered_nodes.name and ordered_leaves.name. Although these should only contain scientific ("latin") names, and so should not need full 4-byte unicode, it might be as well to add those columns to the list of MakeFullUnicode() calls too.
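
For illustration, that would mean appending something like this to create_db_indexes.sql (a sketch based on the columns listed above):

call MakeFullUnicode('vernacular_by_name', 'name');
call MakeFullUnicode('vernacular_by_ott', 'name');
call MakeFullUnicode('ordered_nodes', 'name');
call MakeFullUnicode('ordered_leaves', 'name');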

I assume that using utf8mb4 won't make searching these fields noticeably slower, but perhaps it's worth testing?
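
A rough manual check would be to run a representative name search before and after the conversion and compare the client-reported timings, e.g. (assuming the standard OneZoom tables):

SELECT COUNT(*) FROM vernacular_by_name WHERE name LIKE 'panth%';
SELECT COUNT(*) FROM ordered_leaves WHERE name LIKE 'Panthera%';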

lentinj commented 1 month ago

This seems worth doing, if only for consistency's sake.

I assume that using utf8mb4 won't make searching these fields noticeably slower, but perhaps it's worth testing?

It really shouldn't, but more tests are always good.

hyanwong commented 1 month ago

Thanks @lentinj: really helpful to know that it shouldn't make a difference (e.g. when matching on name).

more tests are always good.

I'm not sure how we would automatically test performance here. I don't think we are doing perf tests at the moment.

lentinj commented 1 month ago

I don't think we are doing perf tests at the moment.

We're not, and doing so in an automated fashion is hard work. But an API exerciser that produces timings would be useful here, as well as for pre/post-upgrade validation (for example).

Using apachebench to do this rather than a Python script would mean it could also be used for load testing at some point.
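
Something along these lines, say (the endpoint below is just a placeholder for whichever search API we want to exercise):

ab -n 500 -c 10 "http://127.0.0.1:8000/<search_endpoint>?query=panthera"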

wolfmanstout commented 1 month ago

I tested utf8mb4 and oddly, searches stopped working (no results show up at all). Here's what I ran (successfully):

call MakeFullUnicode('vernacular_by_name', 'name');
call MakeFullUnicode('images_by_name', 'name');
call MakeFullUnicode('ordered_leaves', 'name');
call MakeFullUnicode('ordered_nodes', 'name');

Notably, test_database_settings.py did pass with this change.

When I changed those all to utf8 instead, my searches became fast for the first time. One of the tests started failing, though:

  File "/opt/web2py/applications/OZtree/tests/site_setup/test_database_settings.py", line 72, in test_full_unicode_for_vernaculars
    assert v[0] == "utf8mb4", "vernacular columns should be full 4 byte unicode"

Also, in order to run these SQL queries successfully, I had to adjust server-side timeouts in /etc/mysql/my.cnf:

[mysqld]
net_read_timeout = 600
net_write_timeout = 600
interactive_timeout = 600
wait_timeout = 600
max_allowed_packet = 64M

I'm not sure exactly which of these config changes were truly needed (ChatGPT suggested these ...), but I can confidently say that client-side settings in MySQL Workbench weren't enough.
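
If it's useful, one way to narrow that down is to check the server's effective values directly (these variable names match the settings above):

SHOW VARIABLES WHERE Variable_name IN
  ('net_read_timeout', 'net_write_timeout', 'interactive_timeout', 'wait_timeout', 'max_allowed_packet');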

I can take a closer look later to try to see what might have caused utf8mb4 to fail. Apparently utf8mb3 is deprecated and utf8mb4 is the new default, so it seems worthwhile to make the change: https://dev.mysql.com/blog-archive/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/

hyanwong commented 1 month ago

Thanks for digging. Does your DB connection string in appconfig.ini specify the utf8mb4 encoding:

mysql://oz:passwd@127.0.0.1/OneZoom?set_encoding=utf8mb4

?

wolfmanstout commented 1 month ago

Yes:

; db configuration - using MySQL and the default web2py connector, we need to add 
;  ?set_encoding=utf8mb4 to the end of the string to cope with adding all e.g. chinese
;  characters etc into the database (e.g. for vernacular names)
[db]
uri       = mysql://oz:passwd@localhost/OneZoom?set_encoding=utf8mb4
migrate   = 1
pool_size = 1

wolfmanstout commented 1 month ago

How odd: in order to try to debug this, I reverted to utf8mb4 ... but now searches continue to work properly. In all cases I've taken care to make sure the queries completed successfully. The only rational explanation (aside from a bug somewhere) is that the change to utf8 truncated some values outside that charset which were causing trouble, so that when I went back to utf8mb4 they were no longer a problem. I can try recreating my container to reproduce this.

hyanwong commented 1 month ago

Ah, it could be the truncation thing, yes!

wolfmanstout commented 1 month ago

Turns out the SQL query was failing but the exception was swallowed. It points directly to the issue:

OperationalError(1267, "Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_0900_ai_ci,IMPLICIT) for operation '='")
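
To see where the mismatch was coming from, a query along these lines against information_schema lists the collation of every utf8mb4 column in the schema:

SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.`COLUMNS`
WHERE table_schema = SCHEMA() AND character_set_name = 'utf8mb4'
ORDER BY table_name, column_name;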

There were a few bugs I had to solve in create_db_indexes.sql to fix this. First, it only updated the collation of a column if the column wasn't already utf8mb4, so it could leave an inconsistent collation in place. Once I fixed that conditional it was still failing, because ALTER TABLE ... CONVERT TO CHARACTER SET affects every column in the table and resets their collation (reference). Hence, each MakeFullUnicode call was resetting the collations set by previous calls. I think the intention here was actually ALTER TABLE ... DEFAULT CHARACTER SET; when I changed to that, it worked. Here's the full amended function:

CREATE PROCEDURE MakeFullUnicode(tablename CHAR(50), columnname CHAR(50))
  BEGIN
    DECLARE char_set TEXT;
    DECLARE vtype TEXT;
    DECLARE coll_name TEXT;

    -- Look up the column's current character set, type and collation
    SELECT character_set_name, column_type, collation_name INTO char_set, vtype, coll_name FROM information_schema.`COLUMNS`
        WHERE table_schema = SCHEMA() AND table_name = tablename AND column_name = columnname;
    -- Convert if either the character set or the collation is not the one we want
    IF char_set != 'utf8mb4' OR coll_name != 'utf8mb4_unicode_ci' THEN
      -- Change only the table default; unlike CONVERT TO, this leaves the other columns' collations alone
      SET @sql_cmd = CONCAT('ALTER TABLE ', tablename,' DEFAULT CHARACTER SET utf8mb4;');
      PREPARE stmt FROM @sql_cmd;
      EXECUTE stmt;
      -- Convert just the requested column, preserving its original column type
      SET @sql_cmd = CONCAT('ALTER TABLE ', tablename,' CHANGE ', columnname, ' ', columnname, ' ', vtype, ' CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;');
      PREPARE stmt FROM @sql_cmd;
      EXECUTE stmt;
      SET @sql_cmd = CONCAT('REPAIR TABLE ', tablename, ';');
      PREPARE stmt FROM @sql_cmd;
      EXECUTE stmt;
      SET @sql_cmd = CONCAT('OPTIMIZE TABLE ', tablename, ';');
      PREPARE stmt FROM @sql_cmd;
      EXECUTE stmt;
    END IF;
  END //

With that change, searches work fine with utf8mb4.
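
To illustrate the difference between the two ALTER TABLE forms on one of the affected tables:

-- rewrites every character column in the table to utf8mb4, resetting each column's collation to the target default
ALTER TABLE vernacular_by_name CONVERT TO CHARACTER SET utf8mb4;
-- only changes the table's default character set (used for columns added later); existing columns are untouched
ALTER TABLE vernacular_by_name DEFAULT CHARACTER SET utf8mb4;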

I do want to note that this other collation, "utf8mb4_0900_ai_ci", is newer than the one being explicitly set here, and is the new default for utf8mb4. Among other things, it makes string comparisons insensitive to accents on letters. Hence, the simplest solution might be to drop the explicit collation, leave the CONVERT TO CHARACTER SET query as-is, and rely on the defaults for consistency. I suppose there is a benefit in being explicit, though: defaults change between MySQL versions, and being explicit also fixes any columns that might have been changed by earlier queries.
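
For reference, the server-wide defaults can be checked with:

-- a MySQL 8 server reports utf8mb4 / utf8mb4_0900_ai_ci here out of the box
SELECT @@character_set_server, @@collation_server;
-- the default collation for a given character set
SHOW COLLATION WHERE Charset = 'utf8mb4' AND `Default` = 'Yes';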

If you'd like me to submit the most conservative fix, I'm happy to send a PR for the update shown above.

hyanwong commented 1 month ago

I think explicit is good, but I'd be happy to go with the new MySQL default, or whatever seems most sensible. FYI, it looks like we are running mysql 5.7.42-log on the server, which could make a difference, I guess. A PR with a fix would be good. Do you think we'll have to change the db connection string? (I presume not.)

wolfmanstout commented 4 weeks ago

I'm going to send out a pull request shortly.

It's good to know that you're running an earlier version of MySQL -- the new collation I mentioned was introduced in MySQL 8 (which is what the Docker image uses), so we will need to stick with the existing one, but we can at least apply it consistently. Regardless, I don't think we would need to change the DB connection string, which only mentions the charset, not the collation.

I am curious why the charset mismatch that I originally reported doesn't affect your prod instance (for me, it caused both slowness and a failed test). It shouldn't have anything to do with the MySQL version: it's a direct consequence of the "ALTER TABLE ... CONVERT TO CHARACTER SET" query, which is documented to affect every column in the table, so it would have unintentionally set some columns to utf8mb4. Did you not run the create_db_indexes.sql script in prod, or perhaps you ran follow-up queries to fix this? I would be curious to see what these queries return:

SELECT character_set_name, column_type, collation_name FROM information_schema.`COLUMNS`
WHERE table_schema = SCHEMA() AND table_name = 'vernacular_by_name' AND column_name = 'name';
SELECT character_set_name, column_type, collation_name FROM information_schema.`COLUMNS`
WHERE table_schema = SCHEMA() AND table_name = 'ordered_leaves' AND column_name = 'name';