digitalmethodsinitiative / dmi-tcat

Digital Methods Initiative - Twitter Capture and Analysis Toolset
Apache License 2.0
367 stars 114 forks source link

Visualization module(s?) fail because of unexpected collation #440

Open xmacex opened 3 years ago

xmacex commented 3 years ago

Describe the bug

At least one of the visualization functionalities, Bipartite hashtag-mention graph implemented by mod.mention_graph.php fails and produces no output for the user.

To Reproduce

Steps to reproduce the behavior:

  1. Go to the analysis interface and select a bin
  2. Scroll down to Networks section
  3. Locate the Bipartite hashtag-mention graph functionality
  4. Click » launch
  5. Observe that the results page doesn't offer a network to download

Expected behavior

A .gexf network file to download should be offered.

Screenshots

Screenshot 2021-09-04 at 17 32 36

Environment:

Additional context

Seems to be collation issue. From Apache log:

[Thu Sep 02 12:25:19.771019 2021] [php7:error] [pid 3633248] [client 10.28.48.169:61208] PHP Fatal error:  Uncaught PDOException: SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'utf8_bin' is not valid for CHARACTER SET 'utf8mb4' in /var/www/tcat-tip/analysis/mod.mention_graph.php:56\nStack trace:\n#0 /var/www/tcat-tip/analysis/mod.mention_graph.php(56): PDOStatement->execute()\n#1 {main}\n  thrown in /var/www/tcat-tip/analysis/mod.mention_graph.php on line 56

Forensic analysis via the code to the database schema

Looking at where the error is coming from https://github.com/digitalmethodsinitiative/dmi-tcat/blob/9654fe3ff489fd3b0efc6ddcf7c19adf8ed7726d/analysis/mod.mention_graph.php#L56

we see that the SQL query is https://github.com/digitalmethodsinitiative/dmi-tcat/blob/9654fe3ff489fd3b0efc6ddcf7c19adf8ed7726d/analysis/mod.mention_graph.php#L46

The variable collation is retrieved a little bit before https://github.com/digitalmethodsinitiative/dmi-tcat/blob/9654fe3ff489fd3b0efc6ddcf7c19adf8ed7726d/analysis/mod.mention_graph.php#L33

The function current_collation() is defined at https://github.com/digitalmethodsinitiative/dmi-tcat/blob/9654fe3ff489fd3b0efc6ddcf7c19adf8ed7726d/analysis/common/functions.php#L853

In there we find first that

https://github.com/digitalmethodsinitiative/dmi-tcat/blob/9654fe3ff489fd3b0efc6ddcf7c19adf8ed7726d/analysis/common/functions.php#L863

and then on lines 868-873

    while ($res = $rec->fetch(PDO::FETCH_ASSOC)) {
        if (array_key_exists('Collation', $res) && ($res['Collation'] == 'utf8mb4_unicode_ci' || $res['Collation'] == 'utf8mb4_general_ci')) {
            $is_utf8mb4 = true;
            break;
        }
    }

So, the collation for the visualization module defaults to utf8_bin if the *_hashtags table does not have collation of utf8mb4_unicode_ci or utf8mb4_general_ci for any of the rows.

Now looking back at the Apache error message 1253 COLLATION 'utf8_bin' is not valid for CHARACTER SET 'utf8mb4', it seems that that utf8_bin default is used, but it's not appropriate for the table and therefore the hickup.

Let's investigate the

SHOW full columns FROM very_interesting_research_topic_hashtags;

describes the table as

+----------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
| Field          | Type         | Collation          | Null | Key | Default | Extra          | Privileges                      | Comment |
+----------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
| id             | int          | NULL               | NO   | PRI | NULL    | auto_increment | select,insert,update,references |         |
| tweet_id       | bigint       | NULL               | NO   | MUL | NULL    |                | select,insert,update,references |         |
| created_at     | datetime     | NULL               | YES  | MUL | NULL    |                | select,insert,update,references |         |
| from_user_name | varchar(255) | utf8mb4_0900_ai_ci | YES  | MUL | NULL    |                | select,insert,update,references |         |
| from_user_id   | bigint       | NULL               | YES  |     | NULL    |                | select,insert,update,references |         |
| text           | varchar(255) | utf8mb4_0900_ai_ci | YES  | MUL | NULL    |                | select,insert,update,references |         |
+----------------+--------------+--------------------+------+-----+---------+----------------+---------------------------------+---------+
6 rows in set (0.00 sec)

Right, either null of utf8mb4_0900_ai_ci collation. TCAT collation defaults to utf8_bin and isn't valid for utf8mb4* as Apache says.

A local fix kludge

If I add utf8mb4_0900_ai_ci to the line 869 of /analysis/common/functions.php by making it

if (array_key_exists('Collation', $res)
    && ($res['Collation']    == 'utf8mb4_unicode_ci'
        || $res['Collation'] == 'utf8mb4_general_ci'
        || $res['Collation'] == 'utf8mb4_0900_ai_ci')) {

(reflowed for legibility)

the visualization module works as expected, yielding a .gexf file and no errors in Apache logs. This kludge fixes it for that bin and for that function. I haven't had the change to investigate what are all the possible collations our database has for those %_hashtag columns, and where do they come from originally and if the migration scripts do something to them. There might be various collations, we've been running TCAT for some years.

Screenshot 2021-09-04 at 18 00 33

Suggestions for fixes

The open question is what are possible, expected and preferred values of that Collation column, ie. what should alternatives should be listed on the line 869 to satisfy the proposition that $is_utf8mb4 = true, or should the default on line L863 be something that covers a set of different utf8mb4* collations rather then utf8_bin? My expertise on MySQL collation is weak.

Does someone else have this? Please don't hesitate to ask for clarifications or otherwise collaborate for sorting¹ this out.

Have a good day ✨

¹ pun intended

xmacex commented 3 years ago

A cool iridescent bug landed on my keyboard just as I finished the debugging session for today

IMG_20210904_183201__01.jpg

alexandreteles commented 3 years ago

The open question is what are possible, expected and preferred values of that Collation column, ie. what should alternatives should be listed on the line 869 to satisfy the proposition that $is_utf8mb4 = true, or should the default on line L863 be something that covers a set of different utf8mb4* collations rather then utf8_bin? My expertise on MySQL collation is weak.

On the current MySQL 8 release utf8 is still an alias for utf8mb3, so utf8_bin doesn't actually cover any of the utf8mb4* collations and the code should default to utf8mb4_bin instead. The problem I see here is that if there's any utf8mb3 collations hanging around, this change will instead break the code for those instances, in which case the best approach would be to modify those columns to utf8mb4.

xmacex commented 3 years ago

Hi. Should I make a PR out of the A local kludge fix above as a starting point, or something else?