NuGet / NuGetGallery

NuGet Gallery is a package repository that powers https://www.nuget.org. Use this repo for reporting NuGet.org issues.

Is there an API into the NuGet.org datawarehouse? #2810

Open yishaigalatzer opened 8 years ago

yishaigalatzer commented 8 years ago

From @dkackman

I am trying to get at the data from the NuGet gallery that would show package usage statistics (downloads over time, the NuGet action per download, etc.).

There is the NuGet data warehouse project, but I do not see any API to access the nuget.org instance, so I'm assuming that it is purely there to drive what is shown on the gallery website.

Is there an API other than https://api-v3search-0.nuget.org/query or the older https://nuget.org/api/v2 OData feed that can be used to get at NuGet.org package stats?

yishaigalatzer commented 8 years ago

As far as I know we don't have anything exposed for that. I'm re-opening this in the gallery for discussion.

One way to look at it: We could make the database (or a backup of it) available for download/readonly access.

dkackman commented 8 years ago

Using the v3 search endpoint I created a simple ETL program and PowerPivot OLAP model (attached: NuGet.xlsx). Since the search endpoint is more about finding packages than about stats, it is somewhat limited in scope. Ultimately I think it would be a value add to provide package authors access to analytics similar to what the various app stores provide to their authors, perhaps via a Power BI service content pack.
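Roughly, that kind of pull looks like the following; a minimal sketch against the public search endpoint, assuming the response shape returned at the time (a `data` array whose entries carry `id`, `version`, and `totalDownloads`) — verify against a live call before relying on it:

```csharp
// Minimal sketch: read current download counts from the v3 search endpoint.
// The fields used here ("data", "id", "version", "totalDownloads") are assumptions
// based on the public search responses, not a documented contract.
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class SearchStatsSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();
        var json = await http.GetStringAsync(
            "https://api-v3search-0.nuget.org/query?q=Newtonsoft.Json&take=5");

        using var doc = JsonDocument.Parse(json);
        foreach (var pkg in doc.RootElement.GetProperty("data").EnumerateArray())
        {
            // Only the *current* total is available; downloads over time have to be
            // built up by sampling this endpoint repeatedly (the ETL approach above).
            Console.WriteLine(
                $"{pkg.GetProperty("id").GetString()} " +
                $"{pkg.GetProperty("version").GetString()}: " +
                $"{pkg.GetProperty("totalDownloads").GetInt64()} downloads");
        }
    }
}
```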

A downloadable snapshot would certainly be a place to start. I bet people could come up with some pretty interesting things, like @daveaglick's dependency graph.

daveaglick commented 8 years ago

I would love the ability to download a dump of the database. About a year ago I messed around with doing some data analytics on the gallery and ended up writing a script to perform incremental pulls of new data via the OData protocol. It was a pain. Being able to grab the entire thing whenever needed would greatly simplify this sort of scenario.

BTW: here's a dependency graph for JSON.NET from a year ago - you could zoom around and stuff on the site. It was cool. I ended up giving up on it because of this exact problem... the challenge of getting and keeping up-to-date data.

maartenba commented 8 years ago

@daveaglick You can use the catalog for this type of traversal. It is an append-only log with a tree structure, which has one blob page per "Add package" or "Edit package" action that happens in the Gallery. All of a package's metadata can be accessed from there. We use this catalog ourselves to populate the search index, among other things.

The root-level index page of the catalog can be seen @ http://api.nuget.org/v3/catalog0/index.json. It has links to the individual pages (the second-level nodes in the tree). The pages in turn have links to the individual package blobs (the leaf nodes of the tree). Each leaf node indicates a specific version of a package.

To traverse the catalog, you read the index, go to the individual pages from there, and then follow the links to the individual packages. Each "Page" (second level) and "Package" (leaf level) has a time stamp indicating when it was created. If you are doing continuous replication, you can maintain a time stamp marker on your end (something like "LastRead"). Whenever your replication job starts, it gets the "LastRead" marker, goes to the "Pages" created after the "LastRead" time stamp, and then on to the individual Packages.
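A minimal sketch of that replication loop, assuming the property names visible in the public catalog JSON ("items", "@id", "commitTimeStamp", "nuget:id", "nuget:version"); check them against a live response before depending on them:

```csharp
// Sketch of a catalog reader: index.json -> pages -> package leaves,
// filtered by a "LastRead" cursor so only new commits are processed.
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class CatalogReaderSketch
{
    static readonly HttpClient Http = new HttpClient();

    static async Task Main()
    {
        // "LastRead" cursor: only process catalog entries committed after this point.
        DateTimeOffset lastRead = DateTimeOffset.UtcNow.AddDays(-1);

        using var index = JsonDocument.Parse(
            await Http.GetStringAsync("https://api.nuget.org/v3/catalog0/index.json"));

        foreach (var page in index.RootElement.GetProperty("items").EnumerateArray())
        {
            // Skip pages whose latest commit is not newer than the cursor.
            var pageStamp = page.GetProperty("commitTimeStamp").GetDateTimeOffset();
            if (pageStamp <= lastRead) continue;

            using var pageDoc = JsonDocument.Parse(
                await Http.GetStringAsync(page.GetProperty("@id").GetString()));

            foreach (var leaf in pageDoc.RootElement.GetProperty("items").EnumerateArray())
            {
                var leafStamp = leaf.GetProperty("commitTimeStamp").GetDateTimeOffset();
                if (leafStamp <= lastRead) continue;

                // Each leaf identifies one version of one package.
                Console.WriteLine(
                    $"{leafStamp:o} {leaf.GetProperty("nuget:id").GetString()} " +
                    $"{leaf.GetProperty("nuget:version").GetString()}");
            }
        }

        // Persist the new cursor (e.g. to a cursor.json file) so the next run
        // only reads pages committed after this point.
    }
}
```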

Sample code, you say? You can clone the repo @ https://github.com/NuGet/NuGet.Services.Metadata. https://github.com/NuGet/NuGet.Services.Metadata/blob/master/src/Ng/Catalog2Registration.cs is an example of a "CatalogReader" which reads from the catalog and creates registration blobs. It maintains a "LastRead" marker which is stored in "cursor.json". Along similar lines, https://github.com/NuGet/NuGet.Services.Metadata/blob/master/src/Ng/Catalog2Lucene.cs is another "CatalogReader" which continuously reads from the catalog and updates our search index.

maartenba commented 8 years ago

@dkackman Regarding statistics, we currently have no public API for this. We will discuss this internally and see what we can do here.

In an ideal world, what information would you like to see exposed?

daveaglick commented 8 years ago

@maartenba That's awesome, thanks! I had no idea that was available. It looks like exactly what's needed to pull down incremental package info for local replication.

dkackman commented 8 years ago

@maartenba In an ideal world I'd like to get at the record for each package download event.

- Timestamp
- PackageId
- PackageVersion
- Operation
- ClientName
- ClientVersion
- Status info, if that exists (success, failure, failure message)
- Any data from the HTTP header/user agent of the request (geography of the source IP, OS, OS version) would be a bonus

Those data points, coupled with the package metadata available via the search endpoint, would enable time-based usage reporting and trending.
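Purely as an illustration of the shape being asked for, and not any actual NuGet.org schema, such a record might look like:

```csharp
using System;

// Hypothetical download-event record covering the fields listed above.
// Names and example values are illustrative only.
public sealed record PackageDownloadEvent(
    DateTimeOffset Timestamp,
    string PackageId,
    string PackageVersion,
    string Operation,        // e.g. Install, Restore, Update
    string ClientName,       // e.g. NuGet Command Line, Visual Studio
    string ClientVersion,
    string Status,           // success / failure (+ failure message)
    string? ClientOs,        // bonus fields derived from the user agent / source IP
    string? ClientOsVersion,
    string? CountryCode);
```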

yishaigalatzer commented 8 years ago

We discussed releasing this kind of data to the public, but at this point we decided to keep operational data like this internal.

1st, we aggregate it very early and don't keep all of it around in an accessible fashion. 2nd, there is significant effort involved in making it publicly available, and that is lower on our priority list. 3rd, some of the requested data, like client and IP address information, might expose PII, and we are not ready to take responsibility for that at the moment.

We could reconsider in the future, after we handle the bigger-ticket items we have lined up for nuget.org.

dkackman commented 8 years ago

Yeah, I wouldn't want to get the actual IP address, but knowing what sorts of client machines and/or what countries are using a package would be useful. Aggregated versions of the event data would also be useful.

Any chance of getting hold of a copy of the warehouse database, like you referenced earlier in the thread?

michael-hawker commented 5 years ago

It'd be nice to be able to run a general query of downloads for specific time ranges, just like the npm API does (downloads within the last X amount of time; npm stores 18 months of historical data, for example).

This would make it much easier to see trends in package releases and growth over time than only being able to get the current total download count.
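For comparison, a sketch of the kind of call npm exposes through its public downloads API (api.npmjs.org, daily counts over a range); nothing equivalent exists on NuGet.org, and the response fields shown are assumptions about npm's service, not NuGet's:

```csharp
// Sketch of the npm-style query being requested: daily download counts for one
// package over a date range, via npm's public downloads API.
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class NpmDownloadsSketch
{
    static async Task Main()
    {
        using var http = new HttpClient();
        var json = await http.GetStringAsync(
            "https://api.npmjs.org/downloads/range/last-month/react");

        using var doc = JsonDocument.Parse(json);
        foreach (var day in doc.RootElement.GetProperty("downloads").EnumerateArray())
        {
            // One entry per day: { "day": "YYYY-MM-DD", "downloads": n }
            Console.WriteLine(
                $"{day.GetProperty("day").GetString()}: " +
                $"{day.GetProperty("downloads").GetInt64()}");
        }
    }
}
```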