cpan-testers / cpantesters-web

A new CPAN Testers web application. The primary interface for CPAN Testers data
Other
7 stars 5 forks source link

Case insensitive collation on Mysql can lead to incorrect results #6

Closed glasswalk3r closed 6 years ago

glasswalk3r commented 7 years ago

While migrating data from Mysql to SQLite to enable more autonomous unit testing, I noticed the following on Mysql DB:

select * from testers.address where email = 'DCOLLINS@cpan.org'
union all
select * from testers.address where email = 'dcollins@cpan.org';

2303    530 DCOLLINS@cpan.org   dcollins@cpan.org
2303    530 DCOLLINS@cpan.org   dcollins@cpan.org
2 row(s) returned

That happens because the table testers.address is set to use utf8 case insensitive collation. On SQLite3 such results cannot be reproduced and I had to use lower on the respective queries to have the same results between the databases.

One real example where the current configuration can have problems is the following query:

SELECT mte.email, tp.name, tp.pause, tp.contact, ta.*
FROM metabase.testers_email mte 
LEFT JOIN testers.address ta ON ta.email=mte.email 
LEFT JOIN testers.profile tp ON tp.testerid=ta.testerid 
WHERE mte.resource='metabase:user:a0e144f2-373b-11e6-8edd-c6848980b093'
ORDER BY tp.testerid DESC
limit 1;

The LEFT JOIN on testers.address will fail on SQLite3, and the corresponding unit test 01-cpantesters-web-legacy-html.t will fail as well.

preaction commented 7 years ago

I don't understand what part of the website needs to match e-mail addresses case-insensitively, especially since e-mail addresses are case-sensitive.

barbie commented 7 years ago

I think you mean the other way around :) We want to match all cases, because in the real world email address are case-insensitive. However, when testers submit their address they can be in any case they choose, so we want to match 'dcollins@cpan.org' against any case match we find in the DB, as we know they are the same tester.

These tables are mostly used by the Statistic site.

preaction commented 7 years ago

I mostly just don't know what this ticket's goal is: What is the problem for the end-user, how can it be fixed, and what must happen for this ticket to be completed/closed?

glasswalk3r commented 7 years ago

I mostly just don't know what this ticket's goal is

Maybe it is my English that is not good enough, but an issue is usually reported when someone sees a problem. If it needs to be fixed or it was a design decision, that's another matter. In the worst case this goes into a FAQ or something like.

What is the problem for the end-user?

Taking long cycles to see reports for recent releases on CPAN or wait a few seconds to see a report finish, all that because we have performance bottlenecks over the way. For the end user, the impact is low, but we can improve the DB performance a bit with those changes and making the database related code more portable.

how can it be fixed?

We can simply lower all e-mail address before saving in the database. It will be done a single time and I can't see impacts for the end user.

Or we can lower when comparisons are needed. Doing that directly on the DB is not portable between databases as I already demonstrated.

what must happen for this ticket to be completed/closed?

Implement one of the two choices that I suggested, proposing a third one or consider that this is not an issue for the project (and accepting that testing with SQLite will be a problem).

preaction commented 6 years ago

This will have to be solved in the schema, if it can be solved at all. E-mail addresses are case-sensitive, but it's a problematic system that allows e-mail accounts to differ only by case (so I haven't seen one that does it). We also can't easily fix this just by changing the collation on the table: That might break a whole bunch of things that are currently working (relying on inaccurate joins to fetch the data they require). So, for now, the solution is potentially more dangerous than leaving the problem as it is.