Stats: Community package lists can collapse package groups to provide better visibility for projects

Issue

The package stats page https://www.nuget.org/stats/packages attempts to show the top 100 community packages. However, many packages are distributed in groups: installing one of the popular root packages brings the rest of the group with it. This biases the list towards popular packages that are broken down into individual components.

See this thread for more details: https://twitter.com/socketnorm/status/1203480289076375552

To Reproduce

Browse to: https://www.nuget.org/stats/packages

Expected behavior

I expect packages like xunit, serilog, swashbuckle, mongodb, and google.api to show up once per root package (so, as Brad Wilson says, likely more than just one entry for xunit). I also expect to see a few more stats about the other packages in each group, and to be able to expand an entry to see all of its dependencies from the same group.

Why?

Because the current stat is unfair: a single popular package takes up a share of the top 100 list based on how granularly it was broken down, pushing other similarly popular packages far down the list and stunting their exposure. I'm aware of my own bias as an AWS person, but note that that specific package is already in the top 10, while packages like automapper and swashbuckle are pushed lower, as are a few others that would otherwise move above the fold.

Screenshots

Current

What it should look like

Excuse any mistakes in how I grouped some of the packages; I'm making assumptions, but the idea should come through regardless.

| Current | New | Name | Count |
| --- | --- | --- | --- |
| 1 | 1 | newtonsoft.json | 29,148,117 |
| 2 | 2 | serilog | 7,984,291 |
| 3 | 3 | castle.core | 7,886,842 |
| 4 | 4 | moq | 6,175,176 |
| 5 | 5 | xunit ~~xunit.extensibility.core~~ | 5,245,389 |
| 6 | X | ~~xunit.abstractions~~ | 5,208,127 |
| 7 | X | ~~xunit.extensibility.execution~~ | 4,956,922 |
| 8 | X | ~~xunit.core~~ | 4,850,106 |
| 9 | 6 | awssdk.core | 4,844,711 |
| 10 | X | ~~xunit.assert~~ | 4,800,113 |
| 11 | 7 | automapper | 4,791,484 |
| 12 | X | ~~xunit~~ | 4,717,355 |
| 13 | X | ~~xunit.runner.visualstudio~~ | 4,529,696 |
| 14 | X | ~~xunit.analyzers~~ | 4,106,426 |
| 15 | 8 | swashbuckle ~~swashbuckle.aspnetcore.swagger~~ | 3,896,177 |
| 16 | X | ~~swashbuckle.aspnetcore.swaggergen~~ | 3,840,631 |
| 17 | 9 | nunit | 3,775,539 |
| 18 | X | ~~swashbuckle.aspnetcore.swaggerui~~ | 3,686,688 |
| 19 | X | ~~swashbuckle.aspnetcore~~ | 3,390,697 |
| 20 | 10 | polly | 3,169,254 |
| 21 | X | ~~serilog.sinks.file~~ | 3,109,215 |
| 22 | 11 | fluentassertions | 3,060,040 |
| 23 | 12 | nlog | 3,059,236 |
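
To make the re-ranking above concrete, here is a minimal Python sketch of the collapse step. It is only an illustration, not how the stats pipeline works: the `GROUP_ROOT` mapping and the `collapse` function are hypothetical names, the mapping is hand-written and incomplete, and each group is assumed to take the count of its most-downloaded member (which is what the table does for xunit and swashbuckle).

```python
from typing import Dict, List, Tuple

# Hypothetical, hand-curated mapping from member package to group root.
# Only a few rows from the table are listed; any package that is
# missing from the mapping is treated as its own root.
GROUP_ROOT: Dict[str, str] = {
    "xunit.extensibility.core": "xunit",
    "xunit.abstractions": "xunit",
    "xunit.assert": "xunit",
    "swashbuckle.aspnetcore.swagger": "swashbuckle",
    "serilog.sinks.file": "serilog",
}


def collapse(downloads: List[Tuple[str, int]]) -> List[Tuple[int, str, int]]:
    """Return (new_rank, root, count), keeping the highest count per group."""
    best: Dict[str, int] = {}
    for package_id, count in downloads:
        root = GROUP_ROOT.get(package_id, package_id)
        # Assumption: a group is represented by its most-downloaded member.
        best[root] = max(best.get(root, 0), count)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(rank, root, count) for rank, (root, count) in enumerate(ranked, start=1)]


# A slice of the counts from the table above.
sample = [
    ("moq", 6175176),
    ("xunit.extensibility.core", 5245389),
    ("xunit.abstractions", 5208127),
    ("awssdk.core", 4844711),
    ("xunit.assert", 4800113),
    ("automapper", 4791484),
    ("xunit", 4717355),
]

for rank, root, count in collapse(sample):
    print(rank, root, f"{count:,}")
```

Summing the members instead of taking the maximum would be a different design choice; it would inflate groups with many components, which is the bias this issue is about.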

How can it be done

I have some ideas about how to curate it, but I think the NuGet team is best equipped to come up with the best strategy in this case. You guys rock!

Thanks for the detailed write-up @yishaigalatzer! I agree with the sentiment here.

Trying to think of a possible quick-win here...

We already have a Dimension_PackageSet construct in the stats database. We mainly use it to apply the reverse filter: filter out non-community packages.

I'm wondering if we could use the same construct to group other packages into 'known sets' and enhance the DownloadReportRecentCommunityPopularity sproc that way...

The downside is that this grouping would also be manually curated (though it would only apply to the most popular packages), just as we do for the non-community packages that would otherwise appear in the list.

If we can build a query that generates the desired resultset, we'd need to update the report's JSON format, and the gallery's view that consumes this JSON.
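
As a rough sketch of what a grouped report could look like, the Python below builds one entry per root with its members nested underneath, so the gallery view could expand an entry. This is an assumption for discussion only: `build_report` and the field names `PackageId`, `Downloads`, and `Members` are hypothetical and are not taken from the existing report format.

```python
import json
from typing import Dict, List


def build_report(groups: Dict[str, Dict[str, int]]) -> List[dict]:
    """Build grouped report entries.

    `groups` maps a root name to {member package id: download count}.
    All field names in the output are hypothetical.
    """
    entries = []
    for root, members in groups.items():
        entries.append({
            "PackageId": root,
            # Assumption: the group is ranked by its most-downloaded member.
            "Downloads": max(members.values()),
            # Members sorted by downloads, for an expandable row in the view.
            "Members": [
                {"PackageId": package_id, "Downloads": count}
                for package_id, count in sorted(
                    members.items(), key=lambda kv: kv[1], reverse=True
                )
            ],
        })
    entries.sort(key=lambda entry: entry["Downloads"], reverse=True)
    return entries


report = build_report({
    "nunit": {"nunit": 3775539},
    "swashbuckle": {
        "swashbuckle.aspnetcore.swaggergen": 3840631,
        "swashbuckle.aspnetcore.swagger": 3896177,
    },
})
print(json.dumps(report, indent=2))
```

A view that ignores `Members` would still render a flat top list, so a shape like this could be consumed incrementally.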

Longer term, I think we should definitely take this feedback into account in a future redesign of the stats pipeline. Big question there would be: what data points are we missing (if any) to automate the grouping of these package sets? (thinking about telemetry difference between downloads and installs, direct installs versus transitive, and the like)
