cncf / devstats.archive

📈CNCF-created tool for analyzing and graphing developer contributions
https://devstats.cncf.io/
Apache License 2.0
445 stars, 147 forks

[bug]Contributions data are not correct #379

Closed kundan2707 closed 1 year ago

kundan2707 commented 1 year ago

There is a discrepancy in the data reported on DevStats.

For example, my total number of contributions for the range from v1.25 to now is lower than my contributions for the last month alone, which cannot be correct.

[Screenshots: 1 25, last_month]

lukaszgryglicki commented 1 year ago

I will work on this on Friday or earlier if I find the time.

lukaszgryglicki commented 1 year ago

Fixed:

kundan2707 commented 1 year ago

@lukaszgryglicki one more issue detected: the number of contributions for 1.25 increased for an NEC Corporation member, but the number of contributions for NEC Corporation itself is unchanged in v1.25-now.

lukaszgryglicki commented 1 year ago

I don't quite get this. What exactly is wrong? Can you show an example? I've regenerated the data, so it should be up to date now. I'm not sure I can look into it further today, but I will on Friday at the latest.

kundan2707 commented 1 year ago

[Screenshots: 1 25_NEC, updated_1 25_member]

The sum over all NEC members does not match the total contributions of NEC Corporation.

lukaszgryglicki commented 1 year ago

OK, those are other dashboards - I'll check them on Friday. Also, these numbers don't need to match exactly: in all dashboards that show a list of contributors we truncate the list and don't show people with very small contribution counts. This avoids keeping very long tails for all possible combinations of filters. DevStats uses a snapshot approach to show data, so the queries that run on dashboards are just selects from snapshots (without any extra joins or logic), which makes them fast. Also, some dashboards can be synced at slightly different times. Anyway, I will check on Friday, as I said.
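To illustrate why a truncated contributor list cannot add up to the company total, here is a minimal sketch. The names, counts, and the `min_count` threshold are all hypothetical and are not taken from DevStats; the real truncation rule is not described in this thread.

```python
# Hypothetical per-contributor counts for one company (not real DevStats data).
contributions = {"alice": 120, "bob": 75, "carol": 4, "dave": 1}


def truncate(counts, min_count):
    """Keep only contributors at or above min_count, as a truncated list would."""
    return {name: n for name, n in counts.items() if n >= min_count}


# The company total counts everyone, including the long tail.
company_total = sum(contributions.values())

# The contributors dashboard only lists people above the (assumed) threshold.
listed = truncate(contributions, min_count=5)
listed_total = sum(listed.values())

print(company_total, listed_total, company_total - listed_total)
```

With any threshold above the smallest contributor's count, the listed sum is lower than the company total, even when both come from the same underlying data.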

lukaszgryglicki commented 1 year ago

Regenerated the data, and now the companies table shows 24705 for NEC Corporation: click. The developer activity counts dashboard (click) gives a sum of 24580, which is close but not strictly equal. Data exported as CSV and pasted here. I think this is good enough, considering that the data for those dashboards was generated a few days apart.
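For scale, the gap between the two numbers quoted above can be worked out directly. This is only arithmetic on the two figures from this comment; the variable names are made up for the example.

```python
company_table_total = 24705  # NEC Corporation in the companies table
developer_sum = 24580        # sum of the developer activity counts export

# Absolute and relative difference between the two dashboards.
gap = company_table_total - developer_sum
relative_gap = gap / company_table_total

print(f"gap={gap}, relative={relative_gap:.2%}")
```

The difference is 125 contributions, roughly half a percent of the company total.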

kundan2707 commented 1 year ago

@lukaszgryglicki but the contribution data for members is still not updated. Can you tell me how often the data for these dashboards gets updated?

lukaszgryglicki commented 1 year ago

It depends on the time period, but I just updated the data, so it should be up to date. If you see incorrect contribution data for any contributor, show me an example and I'll run some manual SQL queries to see what I have in the database.

kundan2707 commented 1 year ago

@lukaszgryglicki Can you regenerate the dashboards for contributor contribution data and company contribution data? They have not been updated for many days.

lukaszgryglicki commented 1 year ago

I will do it tomorrow.

lukaszgryglicki commented 1 year ago

Updating dashboards, will LYK when ready.

lukaszgryglicki commented 1 year ago

Still updating, K8s instance is huge, I'm regenerating all dashboards. Will let you know when ready.

lukaszgryglicki commented 1 year ago

Data regenerated.

kundan2707 commented 1 year ago

@lukaszgryglicki the data shown for 1.25 to 1.26 for me (kundan2707) is not correct. The value is much lower than my actual contributions.

lukaszgryglicki commented 1 year ago

Which exact dashboard? I'll check the source (GHA) and give you the number that I'm getting from it. We first need to check whether the source number is OK. Or maybe you can tell me what you expect to see (so I know how many contributions you are missing), or maybe some specific type of contribution is missing? I really don't know what contributions you had and which are missing; I'm generating reports from data that comes from GHA.

kundan2707 commented 1 year ago

[Screenshot 2022-12-12 134142]

I have made comments, reviews, PRs, commits and so on. I'm not sure exactly which one is missing, but the overall totals are lower than my actual contributions.

lukaszgryglicki commented 1 year ago

So that will be hard to detect. I'll check the exact GHA data and get back to you.

lukaszgryglicki commented 1 year ago

OK so first I'm getting date range between releases v1.25 and v1.26:

gha=# select * from sannotations where title in ('v1.25.0', 'v1.26.0');
        time         | period |             description             |  title  
---------------------+--------+-------------------------------------+---------
 2022-08-23 17:00:00 |        | Kubernetes official release v1.25.0 | v1.25.0
 2022-12-08 19:00:00 |        | Kubernetes official release v1.26.0 | v1.26.0

This gives: 2022-08-23 - 2022-12-08.

Then I'm getting your actor data:

gha=# select * from gha_actors where login = 'kundan2707';
    id    |   login    |     name     | country_id | sex | sex_prob | tz | tz_offset | country_name | age 
----------+------------+--------------+------------+-----+----------+----+-----------+--------------+-----
 56246024 | kundan2707 | Kundan Kumar | in         |     |          |    |           | India        |    
(1 row)

Then I'm getting your company affiliation date ranges:

gha=# select * from gha_actors_affiliations where actor_id = 56246024;
 actor_id |  company_name   |       dt_from       |        dt_to        | original_company_name | source 
----------+-----------------+---------------------+---------------------+-----------------------+--------
 56246024 | NEC Corporation | 1900-01-01 00:00:00 | 2100-01-01 00:00:00 | NEC Corporation       | user
(1 row)

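So we know: the date range (2022-08-23 - 2022-12-08), your actor ID (56246024), and your affiliation (NEC Corporation for the entire period).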

Now I'm checking the very bottom level source table - GHA (GitHub archives) events table:

gha=# select count(*) from gha_events where created_at >= '2022-08-23' and created_at <= '2022-12-08' and actor_id = 56246024;
 count 
-------
   185
(1 row)

There are 185 events for your user ID and the given date range. The screenshot shows 184. Dashboards are generated/updated periodically, so they aren't always 100% up to date and may lag a bit behind, but 184 out of 185 is a good match IMHO.

Anyway, no matter how I try, I cannot get more events than those from GHA. I would consider this a bug if GHA had, say, 300 events and the dashboard showed 184 - then I could dig into the given dashboard's SQL to find why I discarded something. But here the dashboard is OK.

If you are missing events, then LIST EXACTLY what you are missing and when, and I can file a bug against GHA for you and link it to this issue.
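One detail worth keeping in mind when comparing such counts: a date-only bound like '2022-12-08' is treated by PostgreSQL as midnight, so a range built from dates is not the same as a range built from the release timestamps in sannotations (17:00 and 19:00). The sketch below shows the effect with Python's built-in sqlite3 and simplified tables; the three events are invented for illustration, and whether the dashboard uses the exact release times is an assumption, not something confirmed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sannotations (time TEXT, title TEXT);
CREATE TABLE gha_events (id INTEGER, actor_id INTEGER, created_at TEXT);
INSERT INTO sannotations VALUES
  ('2022-08-23 17:00:00', 'v1.25.0'),
  ('2022-12-08 19:00:00', 'v1.26.0');
-- Hypothetical events, invented for illustration only.
INSERT INTO gha_events VALUES
  (1, 56246024, '2022-08-23 09:00:00'),  -- release day, before v1.25.0
  (2, 56246024, '2022-10-27 11:43:49'),
  (3, 56246024, '2022-12-08 10:00:00');  -- release day, before v1.26.0
""")


def ids_by_dates(conn, actor_id):
    """Event IDs selected with date-only bounds, as in the manual query."""
    rows = conn.execute(
        "SELECT id FROM gha_events "
        "WHERE created_at >= '2022-08-23' AND created_at <= '2022-12-08' "
        "AND actor_id = ? ORDER BY id",
        (actor_id,),
    ).fetchall()
    return [r[0] for r in rows]


def ids_by_annotations(conn, actor_id, start_title, end_title):
    """Event IDs selected between the exact release timestamps."""
    rows = conn.execute(
        """
        SELECT e.id FROM gha_events e
        WHERE e.actor_id = ?
          AND e.created_at >= (SELECT time FROM sannotations WHERE title = ?)
          AND e.created_at <  (SELECT time FROM sannotations WHERE title = ?)
        ORDER BY e.id
        """,
        (actor_id, start_title, end_title),
    ).fetchall()
    return [r[0] for r in rows]


print(ids_by_dates(conn, 56246024))
print(ids_by_annotations(conn, 56246024, "v1.25.0", "v1.26.0"))
```

The two ranges pick different events at both edges, so a difference of an event or two between a manual count and a dashboard can come from the range boundaries alone.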

lukaszgryglicki commented 1 year ago

Finally, I'm dumping your events for you; LMK what is missing:

lukaszgryglicki commented 1 year ago

Also checked commits authored/committer:

gha=# select count(*) from gha_commits where dup_created_at >= '2022-08-23' and dup_created_at <= '2022-12-08' and author_id = 56246024;
 count 
-------
     2
(1 row)

gha=# select count(*) from gha_commits where dup_created_at >= '2022-08-23' and dup_created_at <= '2022-12-08' and committer_id = 56246024;
 count 
-------
     2
(1 row)

kundan2707 commented 1 year ago

@lukaszgryglicki I ran a query on GitHub for just one repo over the same period, and the result is different: https://github.com/kubernetes/website/pulls?q=is%3Acommit+author%3Akundan2707+created%3A2022-08-23..2022-12-08+

lukaszgryglicki commented 1 year ago

I can see 7 PRs authored in this query. When I filter what I already provided here by PullRequestEvent (PRs authored), I'm getting:

gha=# select * from gha_events where created_at >= '2022-08-23' and created_at <= '2022-12-08' and actor_id = 56246024 and type = 'PullRequestEvent';
     id      |       type       | actor_id |  repo_id  | public |     created_at      |  org_id  | forkee_id | dup_actor_login |                  dup_repo_name                  
-------------+------------------+----------+-----------+--------+---------------------+----------+-----------+-----------------+-------------------------------------------------
 24865639745 | PullRequestEvent | 56246024 | 105967469 | t      | 2022-10-27 11:43:49 | 13629408 |           | kundan2707      | kubernetes/ingress-gce
 25372572276 | PullRequestEvent | 56246024 |  51478266 | t      | 2022-11-21 11:41:39 | 13629408 |           | kundan2707      | kubernetes/website
 25445425251 | PullRequestEvent | 56246024 |  51478266 | t      | 2022-11-24 05:18:04 | 13629408 |           | kundan2707      | kubernetes/website
 25429392575 | PullRequestEvent | 56246024 |  51478266 | t      | 2022-11-23 13:17:17 | 13629408 |           | kundan2707      | kubernetes/website
 25486239077 | PullRequestEvent | 56246024 |  51478266 | t      | 2022-11-26 12:41:27 | 13629408 |           | kundan2707      | kubernetes/website
 25557618556 | PullRequestEvent | 56246024 |  51478266 | t      | 2022-11-30 08:16:32 | 13629408 |           | kundan2707      | kubernetes/website
 25639144437 | PullRequestEvent | 56246024 |  64782662 | t      | 2022-12-04 06:24:59 | 36015203 |           | kundan2707      | kubernetes-sigs/cluster-proportional-autoscaler
 25711538627 | PullRequestEvent | 56246024 |  64782662 | t      | 2022-12-07 07:06:39 | 36015203 |           | kundan2707      | kubernetes-sigs/cluster-proportional-autoscaler
 25711553364 | PullRequestEvent | 56246024 |  64782662 | t      | 2022-12-07 07:07:22 | 36015203 |           | kundan2707      | kubernetes-sigs/cluster-proportional-autoscaler
(9 rows)

So there are more - but this is for all repos, not just website. If ANYTHING is missing in this data, then it is a GHA issue - GHA is the data source for DevStats, and I cannot have more than it provides.
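To narrow down which kind of contribution might be missing, the events can be grouped by type and repository and then compared against GitHub repo by repo. The sketch below uses Python's built-in sqlite3 with a simplified gha_events table loaded with the nine PullRequestEvent rows listed above; the `events_by_type` helper is hypothetical and is not part of DevStats.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gha_events "
    "(id INTEGER, type TEXT, actor_id INTEGER, created_at TEXT, dup_repo_name TEXT)"
)

# The PullRequestEvent rows from the query output above (subset of columns).
cpa = "kubernetes-sigs/cluster-proportional-autoscaler"
rows = [
    (24865639745, "2022-10-27 11:43:49", "kubernetes/ingress-gce"),
    (25372572276, "2022-11-21 11:41:39", "kubernetes/website"),
    (25445425251, "2022-11-24 05:18:04", "kubernetes/website"),
    (25429392575, "2022-11-23 13:17:17", "kubernetes/website"),
    (25486239077, "2022-11-26 12:41:27", "kubernetes/website"),
    (25557618556, "2022-11-30 08:16:32", "kubernetes/website"),
    (25639144437, "2022-12-04 06:24:59", cpa),
    (25711538627, "2022-12-07 07:06:39", cpa),
    (25711553364, "2022-12-07 07:07:22", cpa),
]
conn.executemany(
    "INSERT INTO gha_events VALUES (?, 'PullRequestEvent', 56246024, ?, ?)", rows
)


def events_by_type(conn, actor_id, dt_from, dt_to):
    """Count an actor's events per (type, repo) within a date range."""
    return conn.execute(
        """
        SELECT type, dup_repo_name, count(*)
        FROM gha_events
        WHERE actor_id = ? AND created_at >= ? AND created_at <= ?
        GROUP BY type, dup_repo_name
        ORDER BY type, dup_repo_name
        """,
        (actor_id, dt_from, dt_to),
    ).fetchall()


for event_type, repo, n in events_by_type(conn, 56246024, "2022-08-23", "2022-12-08"):
    print(event_type, repo, n)
```

For these rows the breakdown is 5 events in kubernetes/website, 3 in kubernetes-sigs/cluster-proportional-autoscaler, and 1 in kubernetes/ingress-gce, which makes a per-repo comparison with a GitHub search easier than comparing a single total.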