NuGet / NuGetGallery

NuGet Gallery is a package repository that powers https://www.nuget.org. Use this repo for reporting NuGet.org issues.
https://www.nuget.org/
Apache License 2.0

[Feature]: Advanced statistics for each package release #9353

Open tonyqus opened 1 year ago

tonyqus commented 1 year ago

Related Problem

#2656 Investigate additional dimension for stats warehouse: location

The Elevator Pitch

I've been a NuGet package owner for many years. I often want to know more about each release/version in order to understand the market better. I usually rely on https://www.nuget.org/stats/packages/NPOI?groupby=Version, but frankly speaking, this is not enough.

I also tried nugettrend.com, but eventually I figured out that their data is not very good (daily based). Moreover, development has been stopped for a few years. They implemented just one chart (a monthly total download comparison between different projects). I guess it's just a showcase project.

I'd like to have the following statistics to have a deeper analysis of my package

Additional Context and Details

No response

agr commented 1 year ago

More related issues: https://github.com/NuGet/NuGetGallery/issues/8821, https://github.com/NuGet/NuGetGallery/issues/6303.

Hey, thanks for the suggestions. Do you have in mind how you'd want the data to be shared? Would regularly updated static (JSON?) blobs be enough? A web service allowing queries?

To provide some background: at the moment we process download statistics data with an Apache Spark-based system. The daily ingestion is dozens of GB of compressed raw log files. The processed data that our statistics generation is based on has daily downloads for the last ~year, split by the client used, and currently occupies ~20 GB in compressed Parquet files.

I imagine, the community would not be very happy if we just dump several [dozens of] GB of parquet files with data somewhere.

Regarding the additional dimensions:

Also, we need to keep in mind that adding another dimension to exported data would explode the amount of data roughly by the cardinality of the dimension.
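To make the size concern concrete, here is a rough back-of-the-envelope sketch. The row count and the number of countries are made-up placeholders, not figures from the actual statistics pipeline, and the result is only an upper bound: real data is sparse, since most packages are not downloaded from every location every day.

```python
def estimate_rows(base_rows: int, cardinalities: list[int]) -> int:
    """Upper bound on the row count after adding new dimensions.

    Each added dimension can multiply the number of rows by up to
    its cardinality (the number of distinct values it can take).
    """
    total = base_rows
    for cardinality in cardinalities:
        total *= cardinality
    return total


# Hypothetical numbers, for illustration only.
base_rows = 1_000_000   # assumed rows for (id, version, day, client)
countries = 200         # assumed cardinality of a "location" dimension

print(estimate_rows(base_rows, [countries]))
```
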

Realistically, I think we should start with just exporting the data that we have (say, daily downloads, by id and version, for the last 6 weeks) for each package ID in a separate blob and make it available through our CDN for download. It can later be incorporated into Gallery or if nuget-trends or anyone else would want to process it, they are welcome, too.
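A minimal sketch of what such a per-package export could look like, assuming the processed data can be read as (package id, version, day, downloads) rows. The function name `build_package_blobs`, the JSON layout, and the 42-day window are all hypothetical choices for illustration and do not describe the actual NuGet pipeline or blob format.

```python
import json
from collections import defaultdict
from datetime import date, timedelta


def build_package_blobs(rows, today, window_days=42):
    """Group daily download rows into one JSON document per package ID.

    rows: iterable of (package_id, version, day, downloads) tuples, where
    day is a datetime.date. Rows for the same id/version/day (for example,
    one per client) are summed. Only days within the window are kept.
    """
    cutoff = today - timedelta(days=window_days)
    per_package = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for package_id, version, day, downloads in rows:
        if cutoff < day <= today:
            per_package[package_id][version][day.isoformat()] += downloads
    # One serialized blob per package ID, ready to upload as a separate file.
    return {
        package_id: json.dumps({"id": package_id, "versions": versions}, sort_keys=True)
        for package_id, versions in per_package.items()
    }


# Made-up sample rows, not real download numbers.
sample = [
    ("NPOI", "2.6.0", date(2023, 1, 10), 5),
    ("NPOI", "2.6.0", date(2023, 1, 10), 3),
    ("NPOI", "2.5.6", date(2022, 10, 1), 7),  # outside the 6-week window
]
blobs = build_package_blobs(sample, today=date(2023, 1, 15))
print(blobs["NPOI"])
```

Keeping one blob per package ID means a consumer interested in a single package only has to download a small file, at the cost of producing many files on each update.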

tonyqus commented 1 year ago

Thank you for clarifying the raw data information. It's definitely helpful.

According to what you mentioned,

Scenario of Geo data Request

Geo data is more important to me than .NET version data because I'm assessing the market size of .NET in each country, especially China. The analysis may not be based only on my package. I'd prefer to analyze the most popular packages, such as JSON.NET, because they cover more users. I did get some basic statistics from the Visual Studio team; they told me last year that China is the second-largest market for .NET. This confused me for a while, because my impression is that the .NET market in China has been shrinking quickly in recent years as .NET jobs become rarer and rarer. However, as a developer, I need some trustworthy data to figure out what's going on and to convince myself and the local community.

After geo data is available, it's easy to figure out

I imagine, the community would not be very happy if we just dump several [dozens of] GB of parquet files with data somewhere.

I'm fine with raw data, and Parquet is easy to import. Let's see if other developers have a different opinion.

Realistically, I think we should start with just exporting the data that we have (say, daily downloads, by id and version, for the last 6 weeks) for each package ID in a separate blob and make it available through our CDN for download. It can later be incorporated into Gallery or if nuget-trends or anyone else would want to process it, they are welcome, too.

Yeah, I'd appreciate it if you could export the data per package in Parquet or JSON format. Having some official data would be a good start toward enhancing the package statistics.

agr commented 1 year ago

Just to set expectations straight: the daily downloads data (without geo information at this point) looks like an easy way for us to start providing more information. It would allow us to set up a framework for this kind of work in the future as well as establishing a precedent for such work.

I will write a proposal and present it to the team. Then, if it flies, it can be expanded.

tonyqus commented 1 year ago

Is there any update about this feature? I'd like to know when it can be ready.

tonyqus commented 1 year ago

#2810

tonyqus commented 3 weeks ago

Any update on this? Geo data for NuGet downloads is still important to me.