cncf / devstats

📈CNCF-created tool for analyzing and graphing developer contributions
https://devstats.cncf.io
Apache License 2.0

[bug] Companies table and developers table have different data for the same period #51

Open craigbox opened 5 months ago

craigbox commented 5 months ago

@lukaszgryglicki has regenerated our database as of ~15 minutes ago, so this data is as fresh as it comes.

The companies table reports the top 5 contributors to Istio in the last 12 months as:

| Rank | Company | Contributions |
| --- | --- | --- |
| 1 | Google LLC | 12738 |
| 2 | Solo.io | 8402 |
| 3 | DaoCloud Network Technology Co. Ltd. | 8229 |
| 4 | International Business Machines Corporation | 7461 |
| 5 | Huawei Technologies Co. Ltd | 6593 |

However, if one exports the data from the Developer activity counts by company view for the same period, the summation is this:

| Rank | Company | Contributions |
| --- | --- | --- |
| 1 | Google LLC | 12615 |
| 2 | Solo.io | 8605 |
| 3 | DaoCloud Network Technology Co. Ltd. | 7936 |
| 4 | International Business Machines Corporation | 7509 |
| 5 | Huawei Technologies Co. Ltd | 6605 |

Note how some companies show fewer contributions in the second list, and some have more.
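To make the size and direction of the gap concrete, here is a minimal Python sketch that compares the two lists above. The numbers are copied from the tables in this comment; the variable and function names are only illustrative.

```python
# Counts copied from the two tables above (last 12 months, Istio).
companies_table = {
    "Google LLC": 12738,
    "Solo.io": 8402,
    "DaoCloud Network Technology Co. Ltd.": 8229,
    "International Business Machines Corporation": 7461,
    "Huawei Technologies Co. Ltd": 6593,
}
developers_sum = {
    "Google LLC": 12615,
    "Solo.io": 8605,
    "DaoCloud Network Technology Co. Ltd.": 7936,
    "International Business Machines Corporation": 7509,
    "Huawei Technologies Co. Ltd": 6605,
}


def compare(reference, other):
    """Return (company, absolute delta, percent delta) relative to `reference`."""
    rows = []
    for company, ref_count in reference.items():
        delta = other[company] - ref_count
        rows.append((company, delta, 100.0 * delta / ref_count))
    return rows


for company, delta, pct in compare(companies_table, developers_sum):
    print(f"{company}: {delta:+d} ({pct:+.2f}%)")
```

The deltas go in both directions, which is what makes a single systematic offset an unlikely explanation.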

Istio uses this data as part of its governance process, and last week, the order of the top 5 results shown here actually differed depending on which metric you used.

Can you help us understand why these values are different?

lukaszgryglicki commented 5 months ago

This could be due to HLL (HyperLogLog). I can change the metrics (for Istio only) to use exact distinct counts instead of the approximate counts that HLL gives, but this will require creating custom SQL just for Istio. It can be done in a day or two, but not sooner than a week or two from now.
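For readers unfamiliar with HLL, the following is a minimal, self-contained Python sketch of why an approximate distinct count can differ from an exact one. It is a textbook-style HyperLogLog estimator written for illustration only; it is not the DevStats implementation, and the function name, hash choice, and register count (`p=12`) are assumptions.

```python
import hashlib
import math


def hll_estimate(items, p=12):
    """Approximate the number of distinct items with a basic HyperLogLog.

    Illustrative only: uses m = 2**p registers and a 64-bit hash prefix.
    """
    m = 1 << p
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        idx = h >> (64 - p)                        # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)           # remaining 64 - p bits
        # Position of the leftmost 1-bit in `rest` (1-based).
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)               # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


# Hypothetical event stream: 150,000 events from 50,000 distinct ids.
stream = [f"user-{i % 50000}" for i in range(150000)]
exact = len(set(stream))
approx = hll_estimate(stream)
print(f"exact={exact} approx={approx:.0f} error={100 * (approx - exact) / exact:+.2f}%")
```

With 4096 registers the expected relative error is around 1.6%, and it can be positive or negative, so two metrics where only one uses HLL will typically disagree by a small amount in either direction.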

lukaszgryglicki commented 2 months ago

Can you recheck and LMK if this is still needed? I've optimised some metrics recently and they no longer use HLL, so this might be OK already. If not, LMK and I'll iterate on this when I can.

craigbox commented 2 months ago

Companies table:

| Company | Contributions |
| --- | --- |
| Solo.io | 12905 |
| Google LLC | 9679 |
| DaoCloud Network Technology Co. Ltd. | 7427 |
| International Business Machines Corporation | 5980 |
| Huawei Technologies Co. Ltd | 5757 |
| Microsoft Corporation | 3182 |
| Tetrate.io | 2917 |
| Ericsson | 1448 |
| Salesforce.com Inc. | 1303 |
| Red Hat Inc. | 1141 |

Sum of developers table:

| Company | Sum of Contributions |
| --- | --- |
| Solo.io | 12513 |
| Google LLC | 9518 |
| DaoCloud Network Technology Co. Ltd. | 7442 |
| International Business Machines Corporation | 5822 |
| Huawei Technologies Co. Ltd | 5375 |
| Microsoft Corporation | 3168 |
| Tetrate.io | 2981 |
| Ericsson | 1408 |
| Salesforce.com Inc. | 1309 |
| Red Hat Inc. | 1124 |

Closer, and currently in the same order/ballpark.

(Edit: initial miscalculation around Google was my error.)

lukaszgryglicki commented 2 months ago

I will TAL on Friday or Monday.

craigbox commented 2 months ago

No rush, Istio doesn't need this until January!

lukaszgryglicki commented 2 months ago

Hmm, the first link (this one) gives the sum of all contributions, while the other gives values per developer, and you are summing them manually, right?

I'll check whether both use HLL or both don't. Actually, I will also update the metrics to use exact counts for Istio: HLL was used to save cycles, but that makes more sense in the All CNCF instance, which has a lot of data. Here we can just use the exact counts approach (as Istio isn't as huge as the All CNCF instance). Let me dive into it. Maybe the query conditions are slightly different on those two dashboards too?

lukaszgryglicki commented 2 months ago

One was using HLL while the other was not. I will sync them now and regenerate the data, then I'll let you know when it's finished.

Also, pls note that statistics across DevStats are not calculated "on the fly" but are synced at a given point in time and saved in tables (so the Grafana UI later does just a simple select from those "calculated" tables). If the calculation for "last year" happened at different times for the two metrics, they can be slightly out of sync, but the difference shouldn't be high. After the manual sync that I'll do now, they should be as close to each other as possible.
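The effect of computing a trailing "last year" window at two different times can be shown with a small Python sketch. The event timestamps and sync times below are made up for illustration, and `count_last_year` is a hypothetical helper, not a DevStats function.

```python
from datetime import datetime, timedelta


def count_last_year(event_times, computed_at):
    """Count events in the 365 days ending at `computed_at`."""
    start = computed_at - timedelta(days=365)
    return sum(1 for t in event_times if start < t <= computed_at)


# Hypothetical activity: one event per day at noon for 300 days.
first_event = datetime(2023, 1, 1, 12, 0)
events = [first_event + timedelta(days=i) for i in range(300)]

# Two metrics whose precomputed tables were refreshed a week apart.
sync_a = datetime(2024, 1, 10)
sync_b = datetime(2024, 1, 17)

count_a = count_last_year(events, sync_a)
count_b = count_last_year(events, sync_b)
print(count_a, count_b, count_a - count_b)
```

The later refresh drops the oldest week of events from its window, so the two stored values differ even though both are exact counts over the same underlying data.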

lukaszgryglicki commented 2 months ago

I've regenerated the data. I don't have a script to sum all developers to check those values, so PTAL again, pls. Hope this is OK now.
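A script for that per-company summation could look roughly like the sketch below. It assumes the developers view is exported as CSV with `company` and `contributions` columns; those column names, the function name, and the sample rows are hypothetical and would need adjusting to the real export.

```python
import csv
import io
from collections import defaultdict


def sum_by_company(csv_file, company_col="company", count_col="contributions"):
    """Sum per-developer counts by company, largest total first.

    Column names are assumptions about the export format.
    """
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row[company_col].strip()] += int(row[count_col])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder rows standing in for a real export.
sample = io.StringIO(
    "developer,company,contributions\n"
    "dev-a,Example Corp,10\n"
    "dev-b,Sample Inc,7\n"
    "dev-c,Example Corp,5\n"
)

result = sum_by_company(sample)
for company, total in result:
    print(f"{company}\t{total}")
```

For a real file, the same function can be called with `open("export.csv", newline="")` in place of the in-memory sample.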