cncf / devstats.archive

📈CNCF-created tool for analyzing and graphing developer contributions
https://devstats.cncf.io/
Apache License 2.0

[bug] [feature request] wrong developer username displayed if GitHub username was changed #384

Closed nbusseneau closed 1 year ago

nbusseneau commented 1 year ago

Hi, wasn't sure if I should flag this as bug or feature request since it's kind of an edge case, feel free to edit title.

I have noticed that my old GitHub username (skymirrh) shows up in devstats in place of my current one (nbusseneau). I suppose I'm not the only one in this case.

FWIW, I'm 99% sure that when I changed it a while back, my GitHub user ID (as returned by https://api.github.com/users/{USERNAME}) did not change. So maybe this can be used as a basis for reconnecting the dots when users change their name?
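To illustrate the idea of using the numeric user ID as the stable key, here is a minimal sketch. It assumes events are available as (user ID, login, timestamp) tuples; the function name `current_logins` and the dates in the sample data are made up for illustration and are not taken from devstats.

```python
from datetime import datetime


def current_logins(events):
    """Map each numeric user ID to the login seen in its most recent event."""
    latest = {}  # user_id -> (timestamp, login)
    for user_id, login, seen_at in events:
        if user_id not in latest or seen_at > latest[user_id][0]:
            latest[user_id] = (seen_at, login)
    return {user_id: login for user_id, (_, login) in latest.items()}


# Hypothetical sample: same ID, two logins; the dates are invented.
events = [
    (4659919, "skymirrh", datetime(2019, 5, 1)),
    (4659919, "nbusseneau", datetime(2022, 3, 1)),
]
print(current_logins(events))
```

Because the lookup is keyed by ID, a rename only changes which login is considered the latest one; the contributions themselves stay attached to the same key.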

lukaszgryglicki commented 1 year ago

I will investigate when I can and will let you know.

lukaszgryglicki commented 1 year ago

Ok so I see your actors in the database:

allprj=# select * from gha_actors where login in ('nbusseneau', 'skymirrh');
         id          |   login    | name | country_id | sex | sex_prob | tz | tz_offset | country_name | age 
---------------------+------------+------+------------+-----+----------+----+-----------+--------------+-----
             4659919 | nbusseneau |      |            |     |          |    |           |              |    
 -879362899139435332 | skymirrh   |      | fr         |     |          |    |           | France       |    
(2 rows)

allprj=# select * from gha_actors_affiliations where actor_id in (select id from gha_actors where login in ('nbusseneau', 'skymirrh'));
      actor_id       | company_name | original_company_name |       dt_from       |        dt_to        | source 
---------------------+--------------+-----------------------+---------------------+---------------------+--------
 -879362899139435332 | Isovalent    | Isovalent             | 1900-01-01 00:00:00 | 2100-01-01 00:00:00 | manual
             4659919 | Isovalent    | Isovalent             | 1900-01-01 00:00:00 | 2100-01-01 00:00:00 | manual
(2 rows)

But I don't see the 2nd actor in github_users.json:

root@darkstar:~/go/src/github.com/cncf/gitdm/src# grep nbusseneau github_users.json 
root@darkstar:~/go/src/github.com/cncf/gitdm/src# 

I will add it there and see if that helps.

I also see that the previous GitHub profile is still there (https://github.com/skymirrh) and is not redirecting to https://github.com/nbusseneau, so they are different users from GitHub's POV and also from GHA's POV. If you make contributions as a new user, they should show up under the new one, but old contributions will remain on the old user. I also see that the old user has a negative ID, which means that this ID comes from very old contributions (before the migration to the new format). I've checked the allprj database (all projects combined). Once I update github_users.json, the next affiliations import should include your new user.

On Kubernetes database (gha) I can only see the "new" user:

root@devstats-node-2:~/go/src/github.com/cncf/devstats-k8s-lf/util# k exec -itn devstats-prod devstats-postgres-3 -- psql gha
psql (13.1 (Debian 13.1-1.pgdg100+1))
Type "help" for help.

gha=# select * from gha_actors where login in ('nbusseneau', 'skymirrh');
   id    |   login    | name | country_id | sex | sex_prob | tz | tz_offset | country_name | age 
---------+------------+------+------------+-----+----------+----+-----------+--------------+-----
 4659919 | nbusseneau |      |            |     |          |    |           |              |    
(1 row)

gha=# select * from gha_actors_affiliations where actor_id in (select id from gha_actors where login in ('nbusseneau', 'skymirrh'));
 actor_id | company_name |       dt_from       |        dt_to        | original_company_name | source 
----------+--------------+---------------------+---------------------+-----------------------+--------
  4659919 | Isovalent    | 1900-01-01 00:00:00 | 2100-01-01 00:00:00 | Isovalent             | manual
(1 row)

Affiliations are also OK.

So can you LMK which projects you are looking at and what exactly you expect to be fixed? As long as we are processing GitHub archives, we get the historical state from the time each event was created. GitHub archives just pack all GitHub events from a single hour, across all repos, into a big zipped array of JSONs and store it. So, if you changed your name on, say, 2021-01-01, then all archive files from before that date WILL have your previous name, and this is something that cannot be changed: we cannot change the past. New archives will have the new name. You are checking against the CURRENT state at GitHub, while devstats is constantly processing historical data, and events carry the state from the time when they were created.
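The snapshot behaviour described above can be sketched roughly as follows. This is only an illustration and not devstats code: it assumes an hourly archive is gzip-compressed with one JSON event per line, and that each event has an `actor` object with `id` and `login` fields plus a `created_at` timestamp. The helper `make_archive` and the sample events are hypothetical.

```python
import gzip
import io
import json


def actor_logins(archive_bytes):
    """Yield (actor id, login, created_at) for each event in one hourly archive."""
    with gzip.open(io.BytesIO(archive_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            event = json.loads(line)
            actor = event["actor"]
            yield actor["id"], actor["login"], event["created_at"]


def make_archive(events):
    """Build a small in-memory archive for the demo (hypothetical helper)."""
    payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    return gzip.compress(payload)


# Invented events on either side of the example rename date 2021-01-01.
old_hour = make_archive([
    {"type": "PushEvent", "created_at": "2020-06-01T10:00:00Z",
     "actor": {"id": 4659919, "login": "skymirrh"}},
])
new_hour = make_archive([
    {"type": "PushEvent", "created_at": "2021-06-01T10:00:00Z",
     "actor": {"id": 4659919, "login": "nbusseneau"}},
])

for archive in (old_hour, new_hour):
    for row in actor_logins(archive):
        print(row)
```

The older file keeps reporting the old login no matter what the account is called today, while the ID is the same in both files.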

All I can do now is add your new ID to the github_users.json file.

lukaszgryglicki commented 1 year ago

I've updated the JSON file; it will be processed the next time the affiliations import runs. This happens at a different time for each CNCF project, so it will be updated gradually in all of them.

nbusseneau commented 1 year ago

I also see that the previous GitHub profile is still there (https://github.com/skymirrh) and is not redirecting to https://github.com/nbusseneau, so they are different users from GitHub's POV and also from GHA's POV. If you make contributions as a new user, they should show up under the new one, but old contributions will remain on the old user. I also see that the old user has a negative ID, which means that this ID comes from very old contributions (before the migration to the new format). I've checked the allprj database (all projects combined). Once I update github_users.json, the next affiliations import should include your new user.

This assumption is incorrect: I can confirm that the user you had registered under skymirrh is the account currently named nbusseneau and NOT another account. The account currently named skymirrh is a new one I had created in order to maintain control over redirections for repos with GitHub Pages (there are edge cases in the usual GitHub repo redirection, used when an account name is changed, that made this necessary).

I apologize for not mentioning this in the OP, because I think this created a lot of confusion. I will try to clear up the confusion.

If you look at that new skymirrh account, you'll see it has 0 contributions whatsoever, and its ID 80928860 is far more recent than ID 4659919, which belongs to the older nbusseneau (formerly skymirrh) account and is the ID you have registered in devstats. As per your data, all of the contributions registered in devstats are correctly associated with the nbusseneau account.

So what's the issue then?

As I was reporting in the OP, it seemed to me that devstats failed to adjust to the name change, because it displayed my old name skymirrh when I was looking at the dev activity for France.

So, if you changed your name on, say, 2021-01-01, then all archive files from before that date WILL have your previous name, and this is something that cannot be changed: we cannot change the past. New archives will have the new name. You are checking against the CURRENT state at GitHub, while devstats is constantly processing historical data, and events carry the state from the time when they were created.

I think that is not the case, please correct me if the following theory does not hold: it looks to me like actors had properly picked up the new name for account 4659919. Even though skymirrh was renamed to nbusseneau, devstats still properly attaches contributions from nbusseneau to the skymirrh name (because it's the same user ID: the username was even properly updated in actors on name change).

So, I would say the issue is the following: if the user ID is the unique ID, then all is well because the GitHub username would not matter, devstats can just display whatever is the current user name. However this is evidently not the case, as even though the actors database had properly picked up the new name for account 4659919, devstats is still showing its contributions under the name skymirrh.

Now here's my theory: from your data, I think devstats already is capable of handling name changes (except maybe for github_users.json), because contributions are processed against GitHub user ID and not GitHub username. In other words: if a GitHub user changes their username, devstats properly picks up the change and updates actors accordingly. I would even suspect that devstats actually updates the username displayed (except maybe due to github_users.json).

The issue here is that my contributions are not only attached to my GitHub user ID, they are also attached to a bogus entry (the skymirrh with -879362899139435332 ID). It looks possible to me that the name displayed is actually the one from the negative ID entry, just because it would show up first in query results (due to being negative).

If this theory is correct, then the issue comes from the migration remnant data, and I think what needs to be done is to make sure that the name associated with entries with a negative ID is ignored in favor of the name associated with entries with an actual user ID (and maybe make sure github_users.json gets updated on name change, just like actors).
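The proposed fix could look roughly like the sketch below. It is only an illustration of the theory, not how devstats actually selects names: it assumes the (id, login) rows belonging to one person have already been grouped somehow (this thread does not show how devstats would link them), and the function name `display_login` is hypothetical.

```python
def display_login(actor_rows):
    """Pick the login to display for one person from (id, login) rows.

    Rows with a positive ID (an actual GitHub user ID) win over rows
    with a negative ID (legacy entries from before the migration).
    """
    if not actor_rows:
        raise ValueError("no actor rows given")
    real = [row for row in actor_rows if row[0] > 0]
    candidates = real or actor_rows
    # Among the candidates, take the smallest ID for a stable choice.
    return min(candidates)[1]


rows = [(-879362899139435332, "skymirrh"), (4659919, "nbusseneau")]
print(display_login(rows))  # nbusseneau
print(min(rows)[1])         # skymirrh: what a plain sort by ID would pick
```

The second print shows the suspected failure mode: when rows are ordered by ID alone, the negative entry comes first and its login ends up being the one displayed.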

lukaszgryglicki commented 1 year ago

OK, just note that the data we process is from archives, and events registered there are snapshots from a given point in time. Let's wait for the affiliations import, and if that does not help, I can try deleting the negative-ID entry and then regenerating the data (in one smaller project, for example, to see if that helps).

lukaszgryglicki commented 1 year ago

In the meantime, please send me a link to an example dashboard where the data is bad and tell me what you would expect to see there. I will check the dashboard SQL first and then figure out what data I should have in the tables to match your request. I can only work on this at low priority due to other tasks that I'm currently working on.

nbusseneau commented 1 year ago

Link was in previous message, here it is again -- name displayed is skymirrh and should be nbusseneau ;)

nbusseneau commented 1 year ago

FWIW, I'm not too hung up about it, it's fine if it stays like that, so no worries. I was mostly submitting this in case there's a genuine bug -- which I now think there might be due to the legacy migration data.

lukaszgryglicki commented 1 year ago

I will investigate anyway, I need to know what happens internally in an edge case like this. Thanks for posting the link again, I overlooked it somehow.

nbusseneau commented 1 year ago

Thanks!