Closed aliok closed 3 months ago
I'll work on this today, 1 and 2 points are clear, but I have problem with "passion" driven contributors
. I need to know what that means exactly, so I will just provide data for 1st and 2nd points today unless I have a clear definition for 3rd, 4th and 5th.
Oner more thing - I guess you want this for knative
project, right?
I'm guessing by this sentence:
How many of them were employed by the main vendors of Knative (Red Hat, Google, VMware, IBM, etc)
I also don't know what company do you mean by "main vendor" - but I will just count top N by number of contributions and contributors.
So for 1st request How many PR authors were there in 2022, 2023 and 2024
- you can see results here:
knative=# select extract(year from created_at) as year, count(distinct dup_actor_login) as pr_authors, count(distinct dup_repo_name || number) as prs from gha_issues where dup_actor_login not ilike all(select pattern from gha_bot_logins) and is_pull_request = true group by year order by year asc;
year | pr_authors | prs
------+------------+-------
2018 | 197 | 2238
2019 | 404 | 6381
2020 | 380 | 11787
2021 | 404 | 11787
2022 | 299 | 7671
2023 | 270 | 5541
2024 | 143 | 2591
(7 rows)
2nd point:
with vendor_data as (
select
aa.company_name as vendor,
count(distinct e.dup_actor_login) as contributors,
count(distinct e.id) as contributions
from
gha_events e
inner join
gha_actors_affiliations aa
on
e.actor_id = aa.actor_id
and e.created_at >= aa.dt_from
and e.created_at < aa.dt_to
and aa.company_name not in (
'(Robots)', 'Independent', 'CNCF'
)
where
e.type in (
'IssuesEvent', 'PullRequestEvent', 'PushEvent', 'CommitCommentEvent',
'IssueCommentEvent', 'PullRequestReviewCommentEvent', 'PullRequestReviewEvent'
)
and e.dup_actor_login not ilike all(
select
pattern
from
gha_bot_logins
)
group by
aa.company_name
), top_vendors_by_contributions as (
select
vendor
from
vendor_data
order by
contributions desc
limit
10
), top_vendors_by_contributors as (
select
vendor
from
vendor_data
order by
contributors desc
limit
10
), top_vendors as (
select
vendor
from
top_vendors_by_contributions
union
select
vendor
from
top_vendors_by_contributors
)
select
extract(year from i.created_at) as year,
v.vendor as vendor,
count(distinct i.dup_actor_login) as pr_authors,
count(distinct i.dup_repo_name || i.number) as prs
from
gha_issues i
inner join
gha_actors_affiliations aa
on
i.dup_actor_id = aa.actor_id
and i.created_at >= aa.dt_from
and i.created_at < aa.dt_to
and aa.company_name not in (
'(Robots)', 'Independent', 'CNCF'
)
inner join
top_vendors v
on
v.vendor = aa.company_name
where
i.dup_actor_login not ilike all(
select
pattern
from
gha_bot_logins
)
and i.is_pull_request = true
group by
year,
vendor
order by
year asc,
pr_authors desc
;
Gives:
year | vendor | pr_authors | prs
------+---------------------------------------------+------------+------
2018 | Google LLC | 61 | 1723
2018 | Pivotal | 21 | 361
2018 | Red Hat Inc. | 20 | 486
2018 | International Business Machines Corporation | 17 | 67
2018 | VMware Inc. | 7 | 66
2018 | SAP | 4 | 4
2018 | Microsoft Corporation | 4 | 37
2018 | CloudBees Inc. | 2 | 13
2018 | Cisco | 2 | 7
2018 | Amazon | 1 | 1
2018 | TriggerMesh Inc. | 1 | 2
2019 | Google LLC | 85 | 5166
2019 | Red Hat Inc. | 49 | 1725
2019 | International Business Machines Corporation | 33 | 705
2019 | Pivotal | 24 | 339
2019 | SAP | 11 | 75
2019 | VMware Inc. | 9 | 279
2019 | Microsoft Corporation | 7 | 9
2019 | CloudBees Inc. | 7 | 36
2019 | TriggerMesh Inc. | 2 | 93
2019 | Cisco | 2 | 15
2020 | Google LLC | 76 | 5539
2020 | Red Hat Inc. | 46 | 4713
2020 | International Business Machines Corporation | 31 | 1860
2020 | VMware Inc. | 25 | 4599
2020 | SAP | 11 | 110
2020 | Pivotal | 9 | 227
2020 | Microsoft Corporation | 4 | 1124
2020 | TriggerMesh Inc. | 4 | 204
2020 | Amazon | 1 | 1
2020 | Cisco | 1 | 1
2020 | Rackspace | 1 | 1
2020 | VMware Tanzu | 1 | 9
2021 | Google LLC | 53 | 2004
2021 | Red Hat Inc. | 53 | 6694
2021 | VMware Inc. | 36 | 3867
2021 | International Business Machines Corporation | 34 | 2787
2021 | SAP | 8 | 165
2021 | Microsoft Corporation | 5 | 532
2021 | TriggerMesh Inc. | 5 | 226
2021 | Amazon | 2 | 2
2021 | Pivotal | 2 | 255
2021 | VMware Tanzu | 1 | 353
2021 | Rackspace | 1 | 15
2021 | Cisco | 1 | 1
2022 | Red Hat Inc. | 37 | 3644
2022 | VMware Inc. | 26 | 2626
2022 | Google LLC | 25 | 642
2022 | International Business Machines Corporation | 23 | 2694
2022 | TriggerMesh Inc. | 5 | 44
2022 | Microsoft Corporation | 4 | 15
2022 | SAP | 3 | 21
2022 | Pivotal | 3 | 114
2022 | Amazon | 3 | 34
2022 | Rackspace | 1 | 258
2022 | Cisco | 1 | 1
2022 | CloudBees Inc. | 1 | 1
2023 | Red Hat Inc. | 30 | 3571
2023 | VMware Inc. | 19 | 1909
2023 | International Business Machines Corporation | 12 | 1184
2023 | Cisco | 4 | 86
2023 | Google LLC | 3 | 8
2023 | Pivotal | 3 | 56
2023 | Amazon | 3 | 11
2023 | SAP | 1 | 2
2023 | Rackspace | 1 | 102
2023 | TriggerMesh Inc. | 1 | 1
2024 | Red Hat Inc. | 21 | 1711
2024 | VMware Inc. | 12 | 875
2024 | International Business Machines Corporation | 2 | 396
2024 | Pivotal | 2 | 7
2024 | Cisco | 2 | 34
2024 | Microsoft Corporation | 1 | 20
2024 | Google LLC | 1 | 1
2024 | SAP | 1 | 1
(75 rows)
As CSV:
year,vendor,pr_authors,prs
2018,Google LLC,61,1723
2018,Pivotal,21,361
2018,Red Hat Inc.,20,486
2018,International Business Machines Corporation,17,67
2018,VMware Inc.,7,66
2018,SAP,4,4
2018,Microsoft Corporation,4,37
2018,CloudBees Inc.,2,13
2018,Cisco,2,7
2018,Amazon,1,1
2018,TriggerMesh Inc.,1,2
2019,Google LLC,85,5166
2019,Red Hat Inc.,49,1725
2019,International Business Machines Corporation,33,705
2019,Pivotal,24,339
2019,SAP,11,75
2019,VMware Inc.,9,279
2019,Microsoft Corporation,7,9
2019,CloudBees Inc.,7,36
2019,TriggerMesh Inc.,2,93
2019,Cisco,2,15
2020,Google LLC,76,5539
2020,Red Hat Inc.,46,4713
2020,International Business Machines Corporation,31,1860
2020,VMware Inc.,25,4599
2020,SAP,11,110
2020,Pivotal,9,227
2020,Microsoft Corporation,4,1124
2020,TriggerMesh Inc.,4,204
2020,Amazon,1,1
2020,Cisco,1,1
2020,Rackspace,1,1
2020,VMware Tanzu,1,9
2021,Google LLC,53,2004
2021,Red Hat Inc.,53,6694
2021,VMware Inc.,36,3867
2021,International Business Machines Corporation,34,2787
2021,SAP,8,165
2021,Microsoft Corporation,5,532
2021,TriggerMesh Inc.,5,226
2021,Amazon,2,2
2021,Pivotal,2,255
2021,VMware Tanzu,1,353
2021,Rackspace,1,15
2021,Cisco,1,1
2022,Red Hat Inc.,37,3644
2022,VMware Inc.,26,2626
2022,Google LLC,25,642
2022,International Business Machines Corporation,23,2694
2022,TriggerMesh Inc.,5,44
2022,Microsoft Corporation,4,15
2022,SAP,3,21
2022,Pivotal,3,114
2022,Amazon,3,34
2022,Rackspace,1,258
2022,Cisco,1,1
2022,CloudBees Inc.,1,1
2023,Red Hat Inc.,30,3571
2023,VMware Inc.,19,1909
2023,International Business Machines Corporation,12,1184
2023,Cisco,4,86
2023,Google LLC,3,8
2023,Pivotal,3,56
2023,Amazon,3,11
2023,SAP,1,2
2023,Rackspace,1,102
2023,TriggerMesh Inc.,1,1
2024,Red Hat Inc.,21,1711
2024,VMware Inc.,12,875
2024,International Business Machines Corporation,2,396
2024,Pivotal,2,7
2024,Cisco,2,34
2024,Microsoft Corporation,1,20
2024,Google LLC,1,1
2024,SAP,1,1
For points 3-5 I need extra feedback as requested here.
Thanks @lukaszgryglicki . I was on PTO yesterday, hence replying today.
"passion" driven contributors
These are people who are not affiliated with big vendors. List of these vendors change every year though. So, you're right at finding this unclear. I can't give you a list of companies myself.
So, just like you did, I would download a CSV file and mark some PR authors as "passion driven" in my local copy, based on their affiliation.
I think we're progressing here, thanks a lot for your hard work.
Can you provide me a table of this kind please?
| year | username | vendor | pr_count |
|------|----------|-------------|----------|
| 2018 | person1 | Google LLC | 25 |
| 2018 | person2 | Red Hat | 20 |
| ... | ... | ... | ... |
| 2024 | personN | Independent | 5 |
Then I can add a column in my local copy to mark a row "passion driven".
And, how can I run the queries you've provided earlier?
I will provide data on Friday, you cannot run queries yourselves - the DB is not accessible from outside the K8s cluster. You can kinda-run them when you open the dashboard properties and there is a read-only access to modify dashboard's query - so you can preview and execute queries, but you cannot apply changes as grafana user there has RO access to the database and dashboards.
Interestingly, just to chime in, I know at least one person at a "contributing vendor" who was contributing as a "passion contributor" for a year or two, because their main job was on something else, but they wanted to be involved in the project.
I don't know how many other CNCF projects are facing the same sort of shift from larger groups of mostly-paid contributors to a more even mix of paid and unpaid contributors, but it feels to me like there are substantial cultural shifts between the two modes.
I know at least one person at a "contributing vendor" who was contributing as a "passion contributor" for a year or two, because their main job was on something else, but they wanted to be involved in the project.
I think we can manually mark a person in our local copy as passion contributor even though being employed as a vendor when we have a table like this one: https://github.com/cncf/devstats/issues/67#issuecomment-2159923819
I don't know how many other CNCF projects are facing the same sort of shift from larger groups of mostly-paid contributors to a more even mix of paid and unpaid contributors, but it feels to me like there are substantial cultural shifts between the two modes.
This might be a good blog post, with our observations. We can ask other CNCF projects what their impressions are, if we can show how they can also fetch the same data for their project.
I will now work on providing data as requested here.
Data generated: 1) Top yearly vendors, generated by this SQL query: top_vendors_prs_by_years.csv. 2) Yearly passion PR authors (passion defined as not being affiliated to one of top vendors), generated by this SQL query: contributors_prs_by_years.csv
@lukaszgryglicki
Thanks a lot.
Yearly passion PR authors (passion defined as not being affiliated to one of top vendors), generated by this SQL query: contributors_prs_by_years.csv
This is great! I was able to run this on Grafana myself too, and was able to download the CSV myself as well. I will use this in the future.
Top yearly vendors, generated by this SQL query: top_vendors_prs_by_years.csv.
I couldn't run this myself though.
Any clues?
Fixed, thanks for reporting.
I was able to prepare this report using the data here. Thanks a lot.
cc @evankanderson
https://docs.google.com/spreadsheets/d/1T_nYBQ_6XapI-ZTwpvIQbbLQk1fn7kZrfnBTM90yZWc/edit#gid=461965269 (need to join Knative users Google Group)
Fixed, thanks for reporting.
Just gave it a try. It works.
Thanks a lot for your support here @lukaszgryglicki
Not sure if the work here can be generalized. But from my side, we can close this ticket.
I want to check out the list of PR authors for 2022, 2023 and 2024 separately.
I couldn't find a dashboard for that. There are some dashboards that show PR authors, but they don't allow selecting a date range (only allow selecting last N months/years).
I can get the data for last 3 years and last years, substract them to find the PR author data for [t-3 years, y-2 years] but I am curious if there's any better way.
My eventual goal is to find out these:
(original discussion https://github.com/cncf/devstats/discussions/66)