cncf / devstats

📈CNCF-created tool for analyzing and graphing developer contributions
https://devstats.cncf.io
Apache License 2.0
68 stars 25 forks source link

[feature request] Ability to find PR authors for date ranges #67

Closed aliok closed 3 months ago

aliok commented 4 months ago

I want to check out the list of PR authors for 2022, 2023 and 2024 separately.

I couldn't find a dashboard for that. There are some dashboards that show PR authors, but they don't allow selecting a date range (only allow selecting last N months/years).

I can get the data for last 3 years and last years, substract them to find the PR author data for [t-3 years, y-2 years] but I am curious if there's any better way.

My eventual goal is to find out these:

(original discussion https://github.com/cncf/devstats/discussions/66)

lukaszgryglicki commented 3 months ago

I'll work on this today, 1 and 2 points are clear, but I have problem with "passion" driven contributors. I need to know what that means exactly, so I will just provide data for 1st and 2nd points today unless I have a clear definition for 3rd, 4th and 5th.

lukaszgryglicki commented 3 months ago

Oner more thing - I guess you want this for knative project, right? I'm guessing by this sentence:

How many of them were employed by the main vendors of Knative (Red Hat, Google, VMware, IBM, etc)

I also don't know what company do you mean by "main vendor" - but I will just count top N by number of contributions and contributors.

lukaszgryglicki commented 3 months ago

So for 1st request How many PR authors were there in 2022, 2023 and 2024 - you can see results here:

knative=# select extract(year from created_at) as year, count(distinct dup_actor_login) as pr_authors, count(distinct dup_repo_name || number) as prs from gha_issues where dup_actor_login not ilike all(select pattern from gha_bot_logins) and is_pull_request = true group by year order by year asc;
 year | pr_authors |  prs  
------+------------+-------
 2018 |        197 |  2238
 2019 |        404 |  6381
 2020 |        380 | 11787
 2021 |        404 | 11787
 2022 |        299 |  7671
 2023 |        270 |  5541
 2024 |        143 |  2591
(7 rows)
lukaszgryglicki commented 3 months ago

2nd point:

with vendor_data as (
  select
    aa.company_name as vendor,
    count(distinct e.dup_actor_login) as contributors,
    count(distinct e.id) as contributions
  from
    gha_events e
  inner join
    gha_actors_affiliations aa
  on
    e.actor_id = aa.actor_id
    and e.created_at >= aa.dt_from
    and e.created_at < aa.dt_to
    and aa.company_name not in (
      '(Robots)', 'Independent', 'CNCF'
    )
  where
    e.type in (
      'IssuesEvent', 'PullRequestEvent', 'PushEvent', 'CommitCommentEvent',
      'IssueCommentEvent', 'PullRequestReviewCommentEvent', 'PullRequestReviewEvent'
    )
    and e.dup_actor_login not ilike all(
      select
        pattern
      from
        gha_bot_logins
    )
  group by
    aa.company_name
), top_vendors_by_contributions as (
  select
    vendor
  from
    vendor_data
  order by
    contributions desc
  limit
    10
), top_vendors_by_contributors as (
  select
    vendor
  from
    vendor_data
  order by
    contributors desc
  limit
    10
), top_vendors as (
  select
    vendor
  from
    top_vendors_by_contributions
  union
  select
    vendor
  from
    top_vendors_by_contributors
)
select
  extract(year from i.created_at) as year,
  v.vendor as vendor,
  count(distinct i.dup_actor_login) as pr_authors,
  count(distinct i.dup_repo_name || i.number) as prs
from
  gha_issues i
inner join
  gha_actors_affiliations aa
on
  i.dup_actor_id = aa.actor_id
  and i.created_at >= aa.dt_from
  and i.created_at < aa.dt_to
  and aa.company_name not in (
    '(Robots)', 'Independent', 'CNCF'
  )
inner join
  top_vendors v
on
  v.vendor = aa.company_name
where
  i.dup_actor_login not ilike all(
    select
      pattern
    from
      gha_bot_logins
  )
  and i.is_pull_request = true
group by
  year,
  vendor
order by
  year asc,
  pr_authors desc
;

Gives:

 year |                   vendor                    | pr_authors | prs  
------+---------------------------------------------+------------+------
 2018 | Google LLC                                  |         61 | 1723
 2018 | Pivotal                                     |         21 |  361
 2018 | Red Hat Inc.                                |         20 |  486
 2018 | International Business Machines Corporation |         17 |   67
 2018 | VMware Inc.                                 |          7 |   66
 2018 | SAP                                         |          4 |    4
 2018 | Microsoft Corporation                       |          4 |   37
 2018 | CloudBees Inc.                              |          2 |   13
 2018 | Cisco                                       |          2 |    7
 2018 | Amazon                                      |          1 |    1
 2018 | TriggerMesh Inc.                            |          1 |    2
 2019 | Google LLC                                  |         85 | 5166
 2019 | Red Hat Inc.                                |         49 | 1725
 2019 | International Business Machines Corporation |         33 |  705
 2019 | Pivotal                                     |         24 |  339
 2019 | SAP                                         |         11 |   75
 2019 | VMware Inc.                                 |          9 |  279
 2019 | Microsoft Corporation                       |          7 |    9
 2019 | CloudBees Inc.                              |          7 |   36
 2019 | TriggerMesh Inc.                            |          2 |   93
 2019 | Cisco                                       |          2 |   15
 2020 | Google LLC                                  |         76 | 5539
 2020 | Red Hat Inc.                                |         46 | 4713
 2020 | International Business Machines Corporation |         31 | 1860
 2020 | VMware Inc.                                 |         25 | 4599
 2020 | SAP                                         |         11 |  110
 2020 | Pivotal                                     |          9 |  227
 2020 | Microsoft Corporation                       |          4 | 1124
 2020 | TriggerMesh Inc.                            |          4 |  204
 2020 | Amazon                                      |          1 |    1
 2020 | Cisco                                       |          1 |    1
 2020 | Rackspace                                   |          1 |    1
 2020 | VMware Tanzu                                |          1 |    9
 2021 | Google LLC                                  |         53 | 2004
 2021 | Red Hat Inc.                                |         53 | 6694
 2021 | VMware Inc.                                 |         36 | 3867
 2021 | International Business Machines Corporation |         34 | 2787
 2021 | SAP                                         |          8 |  165
 2021 | Microsoft Corporation                       |          5 |  532
 2021 | TriggerMesh Inc.                            |          5 |  226
 2021 | Amazon                                      |          2 |    2
 2021 | Pivotal                                     |          2 |  255
 2021 | VMware Tanzu                                |          1 |  353
 2021 | Rackspace                                   |          1 |   15
 2021 | Cisco                                       |          1 |    1
 2022 | Red Hat Inc.                                |         37 | 3644
 2022 | VMware Inc.                                 |         26 | 2626
 2022 | Google LLC                                  |         25 |  642
 2022 | International Business Machines Corporation |         23 | 2694
 2022 | TriggerMesh Inc.                            |          5 |   44
 2022 | Microsoft Corporation                       |          4 |   15
 2022 | SAP                                         |          3 |   21
 2022 | Pivotal                                     |          3 |  114
 2022 | Amazon                                      |          3 |   34
 2022 | Rackspace                                   |          1 |  258
 2022 | Cisco                                       |          1 |    1
 2022 | CloudBees Inc.                              |          1 |    1
 2023 | Red Hat Inc.                                |         30 | 3571
 2023 | VMware Inc.                                 |         19 | 1909
 2023 | International Business Machines Corporation |         12 | 1184
 2023 | Cisco                                       |          4 |   86
 2023 | Google LLC                                  |          3 |    8
 2023 | Pivotal                                     |          3 |   56
 2023 | Amazon                                      |          3 |   11
 2023 | SAP                                         |          1 |    2
 2023 | Rackspace                                   |          1 |  102
 2023 | TriggerMesh Inc.                            |          1 |    1
 2024 | Red Hat Inc.                                |         21 | 1711
 2024 | VMware Inc.                                 |         12 |  875
 2024 | International Business Machines Corporation |          2 |  396
 2024 | Pivotal                                     |          2 |    7
 2024 | Cisco                                       |          2 |   34
 2024 | Microsoft Corporation                       |          1 |   20
 2024 | Google LLC                                  |          1 |    1
 2024 | SAP                                         |          1 |    1
(75 rows)

As CSV:

year,vendor,pr_authors,prs
2018,Google LLC,61,1723
2018,Pivotal,21,361
2018,Red Hat Inc.,20,486
2018,International Business Machines Corporation,17,67
2018,VMware Inc.,7,66
2018,SAP,4,4
2018,Microsoft Corporation,4,37
2018,CloudBees Inc.,2,13
2018,Cisco,2,7
2018,Amazon,1,1
2018,TriggerMesh Inc.,1,2
2019,Google LLC,85,5166
2019,Red Hat Inc.,49,1725
2019,International Business Machines Corporation,33,705
2019,Pivotal,24,339
2019,SAP,11,75
2019,VMware Inc.,9,279
2019,Microsoft Corporation,7,9
2019,CloudBees Inc.,7,36
2019,TriggerMesh Inc.,2,93
2019,Cisco,2,15
2020,Google LLC,76,5539
2020,Red Hat Inc.,46,4713
2020,International Business Machines Corporation,31,1860
2020,VMware Inc.,25,4599
2020,SAP,11,110
2020,Pivotal,9,227
2020,Microsoft Corporation,4,1124
2020,TriggerMesh Inc.,4,204
2020,Amazon,1,1
2020,Cisco,1,1
2020,Rackspace,1,1
2020,VMware Tanzu,1,9
2021,Google LLC,53,2004
2021,Red Hat Inc.,53,6694
2021,VMware Inc.,36,3867
2021,International Business Machines Corporation,34,2787
2021,SAP,8,165
2021,Microsoft Corporation,5,532
2021,TriggerMesh Inc.,5,226
2021,Amazon,2,2
2021,Pivotal,2,255
2021,VMware Tanzu,1,353
2021,Rackspace,1,15
2021,Cisco,1,1
2022,Red Hat Inc.,37,3644
2022,VMware Inc.,26,2626
2022,Google LLC,25,642
2022,International Business Machines Corporation,23,2694
2022,TriggerMesh Inc.,5,44
2022,Microsoft Corporation,4,15
2022,SAP,3,21
2022,Pivotal,3,114
2022,Amazon,3,34
2022,Rackspace,1,258
2022,Cisco,1,1
2022,CloudBees Inc.,1,1
2023,Red Hat Inc.,30,3571
2023,VMware Inc.,19,1909
2023,International Business Machines Corporation,12,1184
2023,Cisco,4,86
2023,Google LLC,3,8
2023,Pivotal,3,56
2023,Amazon,3,11
2023,SAP,1,2
2023,Rackspace,1,102
2023,TriggerMesh Inc.,1,1
2024,Red Hat Inc.,21,1711
2024,VMware Inc.,12,875
2024,International Business Machines Corporation,2,396
2024,Pivotal,2,7
2024,Cisco,2,34
2024,Microsoft Corporation,1,20
2024,Google LLC,1,1
2024,SAP,1,1

vendors_by_year.csv

lukaszgryglicki commented 3 months ago

For points 3-5 I need extra feedback as requested here.

aliok commented 3 months ago

Thanks @lukaszgryglicki . I was on PTO yesterday, hence replying today.

"passion" driven contributors

These are people who are not affiliated with big vendors. List of these vendors change every year though. So, you're right at finding this unclear. I can't give you a list of companies myself.

So, just like you did, I would download a CSV file and mark some PR authors as "passion driven" in my local copy, based on their affiliation.

I think we're progressing here, thanks a lot for your hard work.

Can you provide me a table of this kind please?

| year | username | vendor      | pr_count |
|------|----------|-------------|----------|
| 2018 | person1  | Google LLC  | 25       |
| 2018 | person2  | Red Hat     | 20       |
| ...  | ...      | ...         | ...      |
| 2024 | personN  | Independent | 5        |

Then I can add a column in my local copy to mark a row "passion driven".

And, how can I run the queries you've provided earlier?

lukaszgryglicki commented 3 months ago

I will provide data on Friday, you cannot run queries yourselves - the DB is not accessible from outside the K8s cluster. You can kinda-run them when you open the dashboard properties and there is a read-only access to modify dashboard's query - so you can preview and execute queries, but you cannot apply changes as grafana user there has RO access to the database and dashboards.

evankanderson commented 3 months ago

Interestingly, just to chime in, I know at least one person at a "contributing vendor" who was contributing as a "passion contributor" for a year or two, because their main job was on something else, but they wanted to be involved in the project.

I don't know how many other CNCF projects are facing the same sort of shift from larger groups of mostly-paid contributors to a more even mix of paid and unpaid contributors, but it feels to me like there are substantial cultural shifts between the two modes.

aliok commented 3 months ago

I know at least one person at a "contributing vendor" who was contributing as a "passion contributor" for a year or two, because their main job was on something else, but they wanted to be involved in the project.

I think we can manually mark a person in our local copy as passion contributor even though being employed as a vendor when we have a table like this one: https://github.com/cncf/devstats/issues/67#issuecomment-2159923819

I don't know how many other CNCF projects are facing the same sort of shift from larger groups of mostly-paid contributors to a more even mix of paid and unpaid contributors, but it feels to me like there are substantial cultural shifts between the two modes.

This might be a good blog post, with our observations. We can ask other CNCF projects what their impressions are, if we can show how they can also fetch the same data for their project.

lukaszgryglicki commented 3 months ago

I will now work on providing data as requested here.

lukaszgryglicki commented 3 months ago

Data generated: 1) Top yearly vendors, generated by this SQL query: top_vendors_prs_by_years.csv. 2) Yearly passion PR authors (passion defined as not being affiliated to one of top vendors), generated by this SQL query: contributors_prs_by_years.csv

aliok commented 3 months ago

@lukaszgryglicki

Thanks a lot.

Yearly passion PR authors (passion defined as not being affiliated to one of top vendors), generated by this SQL query: contributors_prs_by_years.csv

This is great! I was able to run this on Grafana myself too, and was able to download the CSV myself as well. I will use this in the future.

Screenshot 2024-06-14 at 12 53 15

Top yearly vendors, generated by this SQL query: top_vendors_prs_by_years.csv.

I couldn't run this myself though.

Screenshot 2024-06-14 at 12 55 52

Any clues?

lukaszgryglicki commented 3 months ago

Fixed, thanks for reporting.

aliok commented 3 months ago

I was able to prepare this report using the data here. Thanks a lot.

cc @evankanderson

https://docs.google.com/spreadsheets/d/1T_nYBQ_6XapI-ZTwpvIQbbLQk1fn7kZrfnBTM90yZWc/edit#gid=461965269 (need to join Knative users Google Group)

report.pdf

aliok commented 3 months ago

Fixed, thanks for reporting.

Just gave it a try. It works.

Thanks a lot for your support here @lukaszgryglicki

Not sure if the work here can be generalized. But from my side, we can close this ticket.