Closed jberkus closed 2 months ago
It's not about handling nulls.
The 1st chart lists developers by contributions or other metrics and displays whatever company affiliation those contributions were made for (or no company). It is a "developer based" report, so I would say (in SQL jargon) it left joins with company, so every contribution will be listed.
The 2nd chart is based on company (it starts from company), so it doesn't consider unknown affiliations (contributions made by developers while their affiliations were unknown).
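The difference between the two report styles can be sketched with a toy query. This is only an illustration using Python's built-in `sqlite3`; the table and column names (`contributions`, `companies`, `developer`, `company`, `name`) are hypothetical and are not the actual DevStats schema.

```python
import sqlite3

# Toy schema (hypothetical, not the real DevStats tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies (name TEXT);
CREATE TABLE contributions (developer TEXT, company TEXT);
INSERT INTO companies VALUES ('Acme'), ('Independent');
-- bob's affiliation is unknown, so his company is NULL
INSERT INTO contributions VALUES
  ('alice', 'Acme'), ('alice', 'Acme'), ('bob', NULL), ('carol', 'Independent');
""")

# Developer-based report: start from contributions and LEFT JOIN company,
# so contributions with an unknown affiliation are still listed.
developer_rows = conn.execute("""
    SELECT c.developer, co.name, COUNT(*)
    FROM contributions c
    LEFT JOIN companies co ON co.name = c.company
    GROUP BY c.developer, co.name
    ORDER BY c.developer
""").fetchall()

# Company-based report: start from companies, so rows with a NULL
# company never match and silently drop out of the result.
company_rows = conn.execute("""
    SELECT co.name, COUNT(c.developer)
    FROM companies co
    JOIN contributions c ON c.company = co.name
    GROUP BY co.name
    ORDER BY co.name
""").fetchall()

print(developer_rows)
print(company_rows)
```

In this toy data the developer-based query accounts for all four contributions, while the company-based query only accounts for three, because bob's contribution has no company to start from.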
We cannot just assume (Unknown) = Independent. Independent is a separate type of "company-like" affiliation that says we know that those contributions are independent (not made on account of any company), which is different from not knowing.
Those dashboards are like this "by design" and none of them is wrong IMHO. We just need to have more data. You can LMK which developers I should research and I will ask Justyna to check them to fill data gaps, but we will never have 100% of affiliations data researched. This is always being researched. @jberkus
Because we know that our affiliations data has major gaps, it's not acceptable to have charts that invisibly leave out people with unknown affiliation (which, at this point, is more than 50% of all CNCF contributors). As such, the design needs to be changed in some way. I think the easiest and clearest way would be to populate "Unknown".
Well, this would need a discussion and approval from @caniszczyk. Once we come to some conclusion, we can update dashboards (which ones should be updated?). I still think that we should only update documentation (at the bottom of dashboards). They were originally created this way: some start from company and display information from the companies' POV, and others start from the contributor and display information from the contributors' POV. But I'm open to any changes that are needed, as long as they are discussed and approved.
Again, it would be a different story if our data on company affiliations was anywhere near complete. But it's not, and we can't make it better than 70% accurate (and right now it's about 40%).
Can you specify (by priority) what should be checked next (like: check unknown contributors for all projects combined in the last 3 months, etc.)? In the meantime we can decide which dashboard updates are needed. Again, I'm OK with any decision that is made.
Also I don't think we are at 40%. I can check that on Friday, but please note:
contributors: for all projects, all time, this is possible, because the long tail of 1-contribution people alone can be > 60% of all contributors.
contributions: we are (I think, I can check) at least 85%-90% good, because we research affiliations starting from the most-contributing contributors.
Let's take it up in the new year. Wanted to file the bug today so that I wouldn't forget about it for weeks.
Also: the contributors/contributions thing could be right; I first noticed this issue when specifically looking at one-time contributors.
It is like this. I will provide an exact % of unknown contributors and contributions for all time and across all CNCF projects on Friday.
@jberkus I've generated the data PTAL at this: https://github.com/cncf/devstats-reports/commit/c2e01c85c00ffc2ef5427eb12b9c2da34df9a7ee
Also copied here the final findings:
We know affiliations for about 17.95% - 17.96% of all contributors (note that this number will be much higher for contributions; for contributors there is a very long tail of contributors making 1-3 contributions that were not checked yet).
We know affiliations for about 87.66% - 88.3% of all contributions across all CNCF projects and across all time.
So as of 12/22/2023 we know company affiliations for about 88% of all contributions and for about 18% of all contributors contributing to all projects. Those slightly fewer than 20% of contributors made almost 90% of all contributions across all time and all projects.
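To see how both percentages can hold at once, here is a small sketch with made-up numbers (the developer names and counts are hypothetical, chosen only to resemble the long-tail shape described above, and are not taken from the report):

```python
# Hypothetical contribution counts: two heavy contributors plus a long tail
# of people with 1-2 contributions each.
contribution_counts = {
    "top_1": 50, "top_2": 38,
    "tail_1": 1, "tail_2": 1, "tail_3": 1, "tail_4": 1,
    "tail_5": 2, "tail_6": 2, "tail_7": 2, "tail_8": 2,
}

# Assumption: only the top contributors have been researched so far.
known_affiliation = {"top_1", "top_2"}


def coverage(counts, known):
    """Return (share of contributors known, share of contributions known)."""
    total_contributors = len(counts)
    total_contributions = sum(counts.values())
    known_contributors = sum(1 for dev in counts if dev in known)
    known_contributions = sum(n for dev, n in counts.items() if dev in known)
    return (known_contributors / total_contributors,
            known_contributions / total_contributions)


contributor_share, contribution_share = coverage(contribution_counts, known_affiliation)
print(f"known contributors:  {contributor_share:.0%}")
print(f"known contributions: {contribution_share:.0%}")
```

With this toy data, 2 of 10 contributors are known but they account for 88 of 100 contributions, which is why a chart counting people and a chart counting contributions can give very different impressions of data completeness.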
The same numbers (known contributors/contributions) for Kubernetes are 23% and 91%, and for OpenTelemetry 26% and 94%.
I can check for any project and any date range if needed.
We can also decide what should be checked next, like re-checking top known contributors across all projects and all time to see whether their affiliations changed. This may be the reason that the most recent affiliations are off, since we usually only look for top unknown contributors. @jberkus
I'm closing this, please reopen if needed.
Currently, if we don't have a listed employer for a contributor, they're getting a null in the employer field. Not all reports handle the null correctly.
For example, compare these two charts:
Developer Activity Counts (Handles nulls)
Company PRs (does not handle them)
This can lead to very misleading assumptions about the data.
While the charts that don't handle nulls could be patched to handle and display them, I think it would be a better fix to replace all those nulls, both in a backfill and at import time, with "Independent" or "Unknown". Let me know if you agree and I can work on a fix.
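A rough sketch of what that replacement could look like, again using Python's built-in `sqlite3` with a hypothetical `contributions` table (the real import code and schema are not shown in this issue, so every name here is an assumption). It uses "Unknown" as the placeholder and leaves existing "Independent" rows untouched:

```python
import sqlite3

UNKNOWN = "Unknown"  # hypothetical placeholder label for a missing employer


def normalize_company(company):
    """Import-time rule: map a missing or empty employer to the placeholder."""
    return company if company else UNKNOWN


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contributions (developer TEXT, company TEXT)")

# Rows imported before the fix, one of them with a NULL employer.
conn.executemany(
    "INSERT INTO contributions VALUES (?, ?)",
    [("alice", "Acme"), ("bob", None), ("carol", "Independent")],
)

# Backfill: replace the NULLs that are already stored.
conn.execute(
    "UPDATE contributions SET company = ? WHERE company IS NULL", (UNKNOWN,)
)

# Import time: new rows pass through normalize_company before insertion.
conn.execute(
    "INSERT INTO contributions VALUES (?, ?)", ("dave", normalize_company(None))
)

rows = conn.execute(
    "SELECT company, COUNT(*) FROM contributions GROUP BY company ORDER BY company"
).fetchall()
null_count = conn.execute(
    "SELECT COUNT(*) FROM contributions WHERE company IS NULL"
).fetchone()[0]
print(rows, null_count)
```

After both steps no NULL employers remain, so a report that groups or joins on the employer field would show an explicit "Unknown" bucket instead of dropping those rows.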