cpan-testers / cpantesters-web

A new CPAN Testers web application. The primary interface for CPAN Testers data
Other
7 stars 5 forks source link

Duplicated data on testers.address #7

Closed glasswalk3r closed 6 years ago

glasswalk3r commented 7 years ago

The table testers.address has several duplicated data in terms of testerid:

select testerid, count(testerid) as total
from testers.address
group by testerid
having count(testerid) > 2
order by count(testerid) desc;

910 19312
0   1021
20  60
34  53
247 47
27  32
102 32
72  31
530 27
8   24
32  24
392 20
42  20
108 19
26  19
52  18
367 18

10 row(s) returned

Currently, most of this data is being ignore since the query that makes use of it for view-report.cgi is the following:


SELECT mte.*,tp.* FROM metabase.testers_email mte \
LEFT JOIN testers.address ta ON ta.email=mte.email \
LEFT JOIN testers.profile tp ON tp.testerid=ta.testerid \
WHERE mte.resource=? ORDER BY tp.testerid DESC

Only the first row is used after all, so the query is inefficient. Currently I added LIMIT 1 to recover only the first row, but the DB is going over those registries anyway.

I'm not sure why those Id's got duplicated, but we are not using them for this scenario. Maybe we should be doing UPSERTS for those registries instead just to keep adding them. Even better, we should generate testers Id from the web page, to guarantee we won't have duplicated data.

Beware that the top level duplicated testerid's are somehow isolated, the vast majority of duplicates are much shorter than that:

foobar

barbie commented 7 years ago

Note that testers have and do use multiple (email) addresses. A tester id relates to the address they used to submit a report. The tester profile is the unique identity that the testers wishes to have displayed on the websites, e.g on the Statistics website.

Now the reason you may find there are 19k of one is that each address used by the tester is subtly different. e.g. "Barbie barbie@missbarbell.co.uk" and "Barbie CPAN Testers barbie@missbarbell.co.uk".

I did have a script that looks for some of these and merges them (see CPAN::Testers::Data::Addresses), but I haven't run it a long time.

Also the ones with a tester id of 0 are typically "root" and "admin" addresses that we can't resolve to the original testers.

glasswalk3r commented 7 years ago

Thanks for answering Barbie, I commented your answers below:

Note that testers have and do use multiple (email) addresses. A tester id relates to the address they used to submit a report. The tester profile is the unique identity that the testers wishes to have displayed on the websites, e.g on the Statistics website.

I believe this is another good reason for not keeping those duplicates. At least for this part of the website, we are using the latest one. We could eliminate all others and "upsert" the available registry if a different e-mail address is used.

Now the reason you may find there are 19k of one is that each address used by the tester is subtly different. e.g. "Barbie barbie@missbarbell.co.uk" and "Barbie CPAN Testers barbie@missbarbell.co.uk".

I guess that user is absurdly more creative than I... 19K of variations is not easy you know... I'll look again against those registries... most of the variations look like garbage to me.

I did have a script that looks for some of these and merges them (see CPAN::Testers::Data::Addresses), but I haven't run it a long time.

Thanks again... I guess it might be worth to execute it at least to reduce the numbers. On the other hand... I don't see why we can't keep a single record. Even more, I think we could got everything at the testers.profile table, even if we need to add another column over there.

Also the ones with a tester id of 0 are typically "root" and "admin" addresses that we can't resolve to the original testers.

I'm not sure if we should "fix" those ones somehow... maybe applying the same logic I described for the repeated testerid's.

preaction commented 7 years ago

Only one of the e-mails in the metabase.testers_email is used when creating the rest of the data in the cpanstats table. The most important field in that table is the resource field, which holds a GUID that they send to us in their reports. .

We do have lots of duplicate GUIDs, though I have already started solving that problem: @bingos asked that I look at it during the toolchain summit, and in order to have a new Metabase API, I need those users. The new table does not allow duplicate resource GUIDs, at the least: https://github.com/cpan-testers/cpantesters-schema/commit/451a8ac81b45fba27a89d43a6b1d2749129c97b8 and https://github.com/cpan-testers/cpantesters-backend/blob/master/lib/CPAN/Testers/Backend/Migrate/MetabaseUsers.pm and https://github.com/cpan-testers/cpantesters-api/blob/master/bin/cpantesters-legacy-metabase#L130-L151

But people are still allowed to have multiple e-mail addresses. We might eventually be able to clean up the parts that aren't the actual e-mail address, but I'm putting off changes like that until they can be made once, safely, cleanly, and only if absolutely necessary.

preaction commented 6 years ago

The new Metabase API uses a different table that deduplicates the testers by their Metabase user GUID. This is not always accurate, and problems will need to be fixed on a case-by-case basis, but this is all we can do for now.