cncf / devstats.archive

📈CNCF-created tool for analyzing and graphing developer contributions
https://devstats.cncf.io/
Apache License 2.0

[bug] User data duplication #385

Closed kerthcet closed 1 year ago

kerthcet commented 1 year ago
image

No. 43 is the user name I used before. I have already removed it from the gitdm repo, but the user still shows up here. How can I remove it? Thanks.

Link: https://k8s.devstats.cncf.io/d/55/company-prs-in-repository-groups?orgId=1&var-period_name=Last%20year&var-repogroups=All&var-repos=All&var-companies=All&var-countries=All

kerthcet commented 1 year ago

Also, the commit count doesn't look right: I have merged more than 80 PRs, but I'm only credited with 5 commits.

image

Link: https://k8s.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1&var-period_name=Last%20year&var-metric=commits&var-repogroup_name=Kubernetes&var-repo_name=kubernetes&var-country_name=All&var-companies=All

lukaszgryglicki commented 1 year ago

I will take a look, but first things first: removing an entry from gitdm only removes company affiliations, not the user. Also, DevStats processes historical data from GitHub archives. Every event that happened on GitHub is stored in the archives with its state from the hour when it happened, so historical events will carry your old user name from before the change. I'll investigate later and let you know.
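To illustrate the point about historical logins, here is a minimal sketch using an in-memory SQLite table loosely modeled on `gha_actors`. It is a toy, not the real DevStats schema: only the `id` and `login` columns are kept, and the `someone-else` row is hypothetical. It shows why a renamed account can appear under two names, and why grouping by the numeric actor id collapses them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy stand-in for gha_actors: only id and login (an assumption for illustration).
conn.execute("create table gha_actors (id integer, login text)")
conn.executemany(
    "insert into gha_actors values (?, ?)",
    [
        (18364341, "kerthcet"),     # current login
        (18364341, "yaphetsglhf"),  # old login, still present in archived events
        (1, "someone-else"),        # hypothetical unrelated account
    ],
)

# Counting by login treats a renamed account as two people...
by_login = conn.execute(
    "select count(distinct login) from gha_actors"
).fetchone()[0]
# ...while counting by the stable numeric id treats it as one.
by_id = conn.execute(
    "select count(distinct id) from gha_actors"
).fetchone()[0]

# All logins ever seen for one account id.
logins = [
    row[0]
    for row in conn.execute(
        "select login from gha_actors where id = ? order by login",
        (18364341,),
    )
]
print(by_login, by_id, logins)
```

Whether a given dashboard groups by login or by id depends on its query, which this sketch does not try to reproduce.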

lukaszgryglicki commented 1 year ago

Found a bug for your 1st case. I will be regenerating metrics, but I'm now researching the 2nd so I can update both at the same time.

lukaszgryglicki commented 1 year ago

For the 2nd, you are looking at the commits metric, and that is 5. When you change the metric drop-down to PRs you will see 79, which is almost the value you are asking about. It will be more up to date when I regenerate this dashboard: yearly dashboards are not regenerated every day because we don't have enough computing resources to do so. Anyway, I'm now checking your commit count of 5 against the bottom-level source data to see if it is correct.

kerthcet commented 1 year ago

Thanks @lukaszgryglicki for the quick response, now I understand why there are only 5 commits. A kind suggestion: maybe we should add a note about the update frequency so that users like me are not confused again. Yearly might also be too long for an online dashboard, but you know much more than me about this project, you're the boss. :)

lukaszgryglicki commented 1 year ago

So for the 2nd issue, you are author, actor or committer in 54 commits overall across all time & all Kubernetes repos:

gha=# select * from gha_actors where login in ('kerthcet', 'yaphetsglhf');
    id    |    login    | name | country_id | sex | sex_prob | tz | tz_offset | country_name | age 
----------+-------------+------+------------+-----+----------+----+-----------+--------------+-----
 18364341 | kerthcet    |      | cn         |     |          |    |           | China        |    
 18364341 | yaphetsglhf |      | cn         |     |          |    |           | China        |    
(2 rows)

gha=# select * from gha_actors_affiliations where actor_id = 18364341;
 actor_id |             company_name             |       dt_from       |        dt_to        |        original_company_name         | source 
----------+--------------------------------------+---------------------+---------------------+--------------------------------------+--------
 18364341 | DaoCloud Network Technology Co. Ltd. | 1900-01-01 00:00:00 | 2100-01-01 00:00:00 | DaoCloud Network Technology Co. Ltd. | user
(1 row)

gha=# select count(distinct sha) from gha_commits where dup_actor_id = 18364341 or author_id = 18364341 or committer_id = 18364341;
 count 
-------
    54
(1 row)

gha=# 

This covers all repos, all companies (you only have one, so this doesn't matter) and all time. Your selected dashboard is for the last year and the Kubernetes repository group, so we need to limit those commits to this scope. If we limit to the last year we get 52 commits:

gha=# select count(distinct sha) from gha_commits where (dup_actor_id = 18364341 or author_id = 18364341 or committer_id = 18364341) and dup_created_at >= now() - '1 year'::interval;
 count 
-------
    52
(1 row)

If we also limit to the Kubernetes repo group, we get just 6 commits (the Kubernetes repo group is just the k/k repo, including all its historical renames):

gha=# select count(distinct c.sha) from gha_commits c, gha_repos r where (c.dup_actor_id = 18364341 or c.author_id = 18364341 or c.committer_id = 18364341) and c.dup_created_at >= now() - '1 year'::interval and c.dup_repo_id = r.id and c.dup_repo_name = r.name and r.repo_group = 'Kubernetes';
 count 
-------
     6
(1 row)

The dashboard currently shows 5 because yearly dashboards aren't refreshed that often; we don't have the computing resources to do that daily or so. Anyway, I'll refresh that dashboard too, and you will get 6 instead of 5. If you want to see all your commits (not PRs as in your link) you can see them via: click - it shows 51 currently (and the most up-to-date value is 52).
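The narrowing from all time, to last year, to one repository group can be sketched on toy data. This is an illustration only: the tables are cut-down stand-ins for `gha_commits` and `gha_repos`, actor id 42, the second repo and its group name, and all rows are hypothetical, and a fixed cutoff date replaces `now() - '1 year'::interval` so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table gha_repos (id integer, name text, repo_group text);
create table gha_commits (
  sha text, author_id integer, dup_repo_id integer, dup_created_at text
);
insert into gha_repos values
  (1, 'kubernetes/kubernetes', 'Kubernetes'),
  (2, 'kubernetes-sigs/example', 'SIG Example');
insert into gha_commits values
  ('a1', 42, 1, '2023-03-01'),
  ('a2', 42, 2, '2023-04-01'),
  ('a3', 42, 2, '2023-05-01'),
  ('a4', 42, 1, '2020-01-01');
""")

ACTOR = 42            # hypothetical actor id
SINCE = "2022-07-01"  # fixed stand-in for "one year ago"

# Step 1: all repos, all time.
all_time = conn.execute(
    "select count(distinct sha) from gha_commits where author_id = ?",
    (ACTOR,),
).fetchone()[0]

# Step 2: add the time window (ISO dates compare correctly as text).
last_year = conn.execute(
    "select count(distinct sha) from gha_commits "
    "where author_id = ? and dup_created_at >= ?",
    (ACTOR, SINCE),
).fetchone()[0]

# Step 3: additionally keep only commits in the 'Kubernetes' repo group.
in_group = conn.execute(
    "select count(distinct c.sha) from gha_commits c "
    "join gha_repos r on c.dup_repo_id = r.id "
    "where c.author_id = ? and c.dup_created_at >= ? "
    "and r.repo_group = 'Kubernetes'",
    (ACTOR, SINCE),
).fetchone()[0]

print(all_time, last_year, in_group)
```

Each extra filter can only keep or shrink the count, which is why a dashboard scoped to one repository group and one year shows far fewer commits than the all-time total.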

Note that you can play with:

lukaszgryglicki commented 1 year ago

I'm now regenerating data for two mentioned dashboards so they will be up-to-date with today's data.

kerthcet commented 1 year ago

So for the 2nd issue, you are author, actor or committer in 54 commits overall across all time & all Kubernetes repos:

Then something is weird: I have merged 78 PRs up to now, so how can there be only 54 commits? See the PR stats here: https://github.com/kubernetes/kubernetes/pulls?q=is%3Apr+author%3Akerthcet+is%3Amerged

lukaszgryglicki commented 1 year ago

PR != commit. This is the data I have in the commits table that comes from GHA.

gha=# select count(distinct sha) from gha_commits where dup_actor_id = 18364341 or author_id = 18364341 or committer_id = 18364341;
 count 
-------
    54
(1 row)

There is nothing I can do with this data. If there is an error in the dashboard query etc., then I can do something, but I have no way to modify the source data in the commits table - this is what I get from the data source (GitHub archives).

lukaszgryglicki commented 1 year ago

BTW: for PRs I'm getting 190 PRs that were opened by you (maybe not all merged?):

gha=# select count(distinct id) from gha_pull_requests where user_id = 18364341;
 count 
-------
   190
(1 row)

Also I see 101 PRs opened by you in k/k repo (Kubernetes repo group):

gha=# select count(distinct pr.id) from gha_pull_requests pr, gha_repos r where pr.user_id = 18364341 and pr.dup_repo_id = r.id and pr.dup_repo_name = r.name and r.repo_group = 'Kubernetes';
 count 
-------
   101
(1 row)

But maybe PRs are then merged by a bot? Maybe the bot is the committer then, not you, and maybe you just have 6 direct commits that you made to the repo yourself, as opposed to the bot merging your PRs...
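A toy sketch of that hypothesis, not a confirmed description of how Kubernetes merges are recorded: if commits are attributed to a bot account, counting PRs by `user_id` and counting commits by actor, author or committer can legitimately disagree. The tables are cut-down stand-ins, and actor id 42, bot id 99 and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table gha_pull_requests (id integer, user_id integer);
create table gha_commits (
  sha text, dup_actor_id integer, author_id integer, committer_id integer
);
-- Actor 42 opened three PRs.
insert into gha_pull_requests values (1, 42), (2, 42), (3, 42);
-- Two commits are attributed entirely to a bot (id 99), one to actor 42.
insert into gha_commits values
  ('m1', 99, 99, 99),
  ('m2', 99, 99, 99),
  ('d1', 42, 42, 42);
""")

ACTOR = 42  # hypothetical actor id

# PRs opened by the actor.
prs = conn.execute(
    "select count(distinct id) from gha_pull_requests where user_id = ?",
    (ACTOR,),
).fetchone()[0]

# Commits where the actor appears in any of the three roles.
commits = conn.execute(
    "select count(distinct sha) from gha_commits "
    "where dup_actor_id = ? or author_id = ? or committer_id = ?",
    (ACTOR, ACTOR, ACTOR),
).fetchone()[0]

print(prs, commits)
```

Under these assumptions the actor has three PRs but only one commit, so the two metrics are not expected to match.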

lukaszgryglicki commented 1 year ago

Also your GitHub link & comment:

Then something is weird: I have merged 78 PRs up to now, so how can there be only 54 commits? See the PR stats here:

This dashboard is NOT showing how many PRs were merged by you - it shows opened PRs and commits, so you are comparing apples to bananas.

lukaszgryglicki commented 1 year ago

There is also a Merged PRs metric here and it shows you merged 125 PRs across all k8s repos and 63 in k/k (both for the last year).

Please play with: metric, repository group and range drop-downs.

lukaszgryglicki commented 1 year ago

Finally, note that GHA had a few outages in the past, including one major outage during which it didn't record GitHub events for a few days or maybe even a week. So the data here might sometimes be slightly off: not much, but a bit. There is nothing I can do, and GHA also cannot back-fill this data for the past.

kerthcet commented 1 year ago

Also I see 101 PRs opened by you in k/k repo (Kubernetes repo group):

Yes, I have opened 101 PRs so far; most of them are merged, some are closed, and some are still open.

But maybe PRs are then merged by a bot? Maybe the bot is the committer then, not you, and maybe you just have 6 direct commits that you made to the repo yourself, as opposed to the bot merging your PRs...

I don't know how it works, but I think something is wrong, because my colleague also works on the k/k repo and his statistics are right.

kerthcet commented 1 year ago

I'll take a look at what you pasted, thanks a lot.

kerthcet commented 1 year ago

I guess changing the username in .gitconfig wouldn't influence the result.

lukaszgryglicki commented 1 year ago

I guess changing the username in .gitconfig wouldn't influence the result.

Yes, this won't change anything AFAIK.

lukaszgryglicki commented 1 year ago

Dashboard data has been regenerated, so you can recheck. Note that I cannot have data other than the source data from GHA, which was already displayed here.