Open xmacex opened 3 years ago
A cool iridescent bug landed on my keyboard just as I finished the debugging session for today
The open question is what are possible, expected and preferred values of that
Collation
column, ie. what should alternatives should be listed on the line 869 to satisfy the proposition that$is_utf8mb4 = true
, or should the default on line L863 be something that covers a set of differentutf8mb4*
collations rather thenutf8_bin
? My expertise on MySQL collation is weak.
On the current MySQL 8 release utf8
is still an alias for utf8mb3
, so utf8_bin
doesn't actually cover any of the utf8mb4*
collations and the code should default to utf8mb4_bin
instead. The problem I see here is that if there's any utf8mb3
collations hanging around, this change will instead break the code for those instances, in which case the best approach would be to modify those columns to utf8mb4
.
Hi. Should I make a PR out of the A local kludge fix above as a starting point, or something else?
Describe the bug
At least one of the visualization functionalities, Bipartite hashtag-mention graph implemented by
mod.mention_graph.php
fails and produces no output for the user.To Reproduce
Steps to reproduce the behavior:
Expected behavior
A
.gexf
network file to download should be offered.Screenshots
Environment:
Additional context
Seems to be collation issue. From Apache log:
Forensic analysis via the code to the database schema
Looking at where the error is coming from https://github.com/digitalmethodsinitiative/dmi-tcat/blob/9654fe3ff489fd3b0efc6ddcf7c19adf8ed7726d/analysis/mod.mention_graph.php#L56
we see that the SQL query is https://github.com/digitalmethodsinitiative/dmi-tcat/blob/9654fe3ff489fd3b0efc6ddcf7c19adf8ed7726d/analysis/mod.mention_graph.php#L46
The variable
collation
is retrieved a little bit before https://github.com/digitalmethodsinitiative/dmi-tcat/blob/9654fe3ff489fd3b0efc6ddcf7c19adf8ed7726d/analysis/mod.mention_graph.php#L33The function
current_collation()
is defined at https://github.com/digitalmethodsinitiative/dmi-tcat/blob/9654fe3ff489fd3b0efc6ddcf7c19adf8ed7726d/analysis/common/functions.php#L853In there we find first that
https://github.com/digitalmethodsinitiative/dmi-tcat/blob/9654fe3ff489fd3b0efc6ddcf7c19adf8ed7726d/analysis/common/functions.php#L863
and then on lines 868-873
So, the collation for the visualization module defaults to
utf8_bin
if the*_hashtags
table does not have collation ofutf8mb4_unicode_ci
orutf8mb4_general_ci
for any of the rows.Now looking back at the Apache error message
1253 COLLATION 'utf8_bin' is not valid for CHARACTER SET 'utf8mb4'
, it seems that thatutf8_bin
default is used, but it's not appropriate for the table and therefore the hickup.Let's investigate the
describes the table as
Right, either null of utf8mb4_0900_ai_ci collation. TCAT collation defaults to
utf8_bin
and isn't valid forutf8mb4*
as Apache says.A local fix kludge
If I add
utf8mb4_0900_ai_ci
to the line 869 of/analysis/common/functions.php
by making it(reflowed for legibility)
the visualization module works as expected, yielding a .gexf file and no errors in Apache logs. This kludge fixes it for that bin and for that function. I haven't had the change to investigate what are all the possible collations our database has for those
%_hashtag
columns, and where do they come from originally and if the migration scripts do something to them. There might be various collations, we've been running TCAT for some years.Suggestions for fixes
The open question is what are possible, expected and preferred values of that
Collation
column, ie. what should alternatives should be listed on the line 869 to satisfy the proposition that$is_utf8mb4 = true
, or should the default on line L863 be something that covers a set of differentutf8mb4*
collations rather thenutf8_bin
? My expertise on MySQL collation is weak.Does someone else have this? Please don't hesitate to ask for clarifications or otherwise collaborate for sorting¹ this out.
Have a good day ✨
¹ pun intended