cncf / devstats

📈CNCF-created tool for analyzing and graphing developer contributions
https://devstats.cncf.io
Apache License 2.0

[bug] Issues with Null employers in reports #47

Closed jberkus closed 2 months ago

jberkus commented 7 months ago

Currently, if we don't have a listed employer for a contributor, they're getting a null in the employer field. Not all reports handle the null correctly.

For example, compare these two charts:

Developer Activity Counts (Handles nulls)

Company PRs (does not handle them)

This can lead to very misleading assumptions about the data.

While the charts that don't handle Nulls could be patched to handle and display them, I think it would be a better fix to replace all those nulls, both backfill and at import time, with "Independent" or "Unknown". Let me know if you agree and I can work on a fix.
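
As a rough illustration of the replacement idea, here is a minimal sketch, assuming contributions arrive as `(author, employer)` pairs with `None` for a missing employer. The function name, the sample rows, and the `"(Unknown)"` label are all hypothetical; none of this is taken from the devstats codebase.

```python
from collections import Counter

# Hypothetical placeholder label; devstats may use a different one.
UNKNOWN = "(Unknown)"


def prs_by_company(rows):
    """Count PRs per employer, mapping missing employers to UNKNOWN.

    `rows` is an iterable of (author, employer) pairs. A falsy employer
    (None or an empty string) is counted under UNKNOWN instead of being
    dropped from the report.
    """
    counts = Counter()
    for _author, employer in rows:
        counts[employer if employer else UNKNOWN] += 1
    return counts


# Made-up sample data, for illustration only.
rows = [
    ("alice", "Acme"),
    ("bob", None),
    ("carol", "Acme"),
    ("dave", None),
    ("erin", "Independent"),
]
counts = prs_by_company(rows)
```

With the placeholder in place, the per-company totals add up to the total number of PRs, so contributors with no listed employer stay visible in the chart.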

lukaszgryglicki commented 6 months ago

It's not about handling nulls.

We cannot just assume (Unknown) = Independent - Independent is a separate type of "company-like" affiliation that says we know that those contributions are independent (not made on behalf of any company) - which is different from saying that we don't know.
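
To make the distinction concrete, here is a small sketch that keeps the three cases apart. It assumes a missing employer is represented as `None` and that independent contributors carry the literal label `"Independent"`; the enum and function names are hypothetical, not devstats identifiers.

```python
from enum import Enum
from typing import Optional


class AffiliationKind(Enum):
    COMPANY = "company"          # researched: tied to a named company
    INDEPENDENT = "independent"  # researched: known not to be tied to a company
    UNKNOWN = "unknown"          # not researched yet: nothing is known


def classify(employer: Optional[str]) -> AffiliationKind:
    """Map a raw employer value to one of the three affiliation kinds."""
    if employer is None:
        return AffiliationKind.UNKNOWN
    if employer == "Independent":
        return AffiliationKind.INDEPENDENT
    return AffiliationKind.COMPANY
```

Folding `UNKNOWN` into `INDEPENDENT` would turn "not researched" into a positive claim about the contributor, which is the assumption being objected to here.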

Those dashboards are like this "by design" and none of them is wrong IMHO - we just need to have more data. You can LMK which developers I should research and I will ask Justyna to check them to fill the data gaps, but we will never have 100% of the affiliations data researched. This is always being researched. @jberkus

jberkus commented 6 months ago

Because we know that our affiliations data has major gaps, it's not acceptable to have charts that invisibly leave out people with unknown affiliation (which, at this point, is more than 50% of all CNCF contributors). As such, the design needs to be changed in some way. I think the easiest and clearest way would be to populate "Unknown".

lukaszgryglicki commented 6 months ago

Well, this would need a discussion and approval from @caniszczyk - once we come to a conclusion, we can update the dashboards (which ones should be updated?). I still think that we should only update the documentation (at the bottom of the dashboards) - they were originally created this way: some start from the company and display information from the companies' POV, and others start from the contributor and display information from the contributors' POV - but I'm open to any changes that are needed, as long as they are discussed and approved.

jberkus commented 6 months ago

Again, it would be a different story if our data on company affiliations was anywhere near complete. But it's not, and we can't make it better than 70% accurate (and right now it's about 40%).

lukaszgryglicki commented 6 months ago

Can you specify (by priority) what should be checked next (like: check unknown contributors for all projects combined in the last 3 months, etc.)? In the meantime we can decide which dashboard updates are needed - again, I'm OK with any decision that is made.

Also I don't think we are at 40% - I can check that on Friday, but please note:

jberkus commented 6 months ago

Let's take it up in the new year. Wanted to file the bug today so that I wouldn't forget about it for weeks.

jberkus commented 6 months ago

Also: the contributors/contributions thing could be right, I first noticed this issue when specifically looking at one-time contributors.

lukaszgryglicki commented 6 months ago

It is like this. I will provide an exact % of unknown contributors and contributions for all time and across all CNCF projects on Friday.

lukaszgryglicki commented 6 months ago

@jberkus I've generated the data, PTAL at this: https://github.com/cncf/devstats-reports/commit/c2e01c85c00ffc2ef5427eb12b9c2da34df9a7ee

Also copied here the final findings:

We know affiliations for about 17.95% - 17.96% of all contributors (note that this number will be much higher for contributions; for contributors there is a very long tail of people making 1-3 contributions who have not been checked yet).

We know affiliations for about 87.66% - 88.3% of all contributions across all CNCF projects and across all time.

So as of 12/22/2023 we know company affiliations for about 88% of all contributions and for about 18% of all contributors contributing to all projects - and those slightly fewer than 20% of contributors made almost 90% of all contributions across all time and all projects.
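
The gap between the two percentages follows from the long tail. A minimal sketch of the arithmetic, using made-up data rather than the figures from the linked report (the function name and the input shape are assumptions for illustration):

```python
def coverage(contributors):
    """Return (share of contributors known, share of contributions known).

    `contributors` is a list of (contribution_count, affiliation_known)
    pairs, one pair per contributor.
    """
    total_people = len(contributors)
    total_work = sum(n for n, _ in contributors)
    known_people = sum(1 for _, known in contributors if known)
    known_work = sum(n for n, known in contributors if known)
    return known_people / total_people, known_work / total_work


# Made-up example: 2 heavy contributors with known affiliations and
# 8 one-time contributors whose affiliations were never researched.
sample = [(36, True), (36, True)] + [(1, False)] * 8
people_share, work_share = coverage(sample)
print(f"contributors known:  {people_share:.0%}")
print(f"contributions known: {work_share:.0%}")
```

In this toy data set only 2 of 10 contributors are known, yet they account for 72 of 80 contributions, so a chart counting contributors and a chart counting contributions give very different coverage figures.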

lukaszgryglicki commented 6 months ago

The same numbers (known contributors / known contributions) for Kubernetes are 23% and 91%. For OpenTelemetry they are 26% and 94%.

I can check for any project and any date range if needed. We can also decide what should be checked next - like re-checking the top known contributors across all projects and all time to see if their affiliations have changed - this may be the reason that the most recent affiliations are off - we usually only look at the top unknown contributors @jberkus

lukaszgryglicki commented 2 months ago

I'm closing this, please reopen if needed.