Analyze NuGet.org packages 📦 using Azure Functions ⚡.
This project enables you to write a bit of code that is executed for each package on NuGet.org in parallel. The results are collected into CSV files stored in Azure Blob Storage. These CSV files can be imported into any query system you want for easy analysis. This project is about building those CSV blobs in a fast, scalable, and reproducible way, as well as keeping those files up to date.
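As an illustration of the downstream analysis step, a downloaded CSV blob can be loaded with any tabular tooling. The sketch below uses only Python's standard library; the column names are hypothetical stand-ins, not the actual NuGet Insights schema (the real schemas are documented in docs/tables/README.md):

```python
import csv
import io

# Hypothetical CSV content standing in for a downloaded NuGet Insights blob.
# The real column names are defined in docs/tables/README.md.
csv_text = """Id,Version,Created
Newtonsoft.Json,13.0.1,2021-03-22T00:00:00Z
Serilog,2.10.0,2020-11-09T00:00:00Z
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
ids = [row["Id"] for row in rows]
print(ids)  # ['Newtonsoft.Json', 'Serilog']
```

In practice you would point the reader at a blob downloaded from the storage account, or import the CSV directly into your query system of choice.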
The data sets are great for:
The data sets currently produced by NuGet Insights are listed in docs/tables/README.md.
We follow a three-step process to go from nothing to a completely deployed Azure solution.
Check that the .NET SDK is installed:

```shell
dotnet --info
```

Clone the repository:

```shell
git clone https://github.com/NuGet/Insights.git
```

Run `dotnet publish` on the website and worker projects. This produces compiled directories that can be deployed to Azure later.

```shell
cd Insights
dotnet publish src/Worker -c Release
dotnet publish src/Website -c Release
```
Read about how to run the automated tests in TESTING.md.
PowerShell is used for the following steps. Windows PowerShell 5.1 and PowerShell 7.3.7 (on both Windows and Linux) have been tested.
```powershell
Connect-AzAccount
bicep --version
Set-AzContext -Subscription $mySubscriptionId
./deploy/deploy.ps1 -ConfigName dev -StampName Joel -AllowDeployUser
```

If you run into trouble, try adding the `-Debug` option to get more diagnostic information.
This will create a new resource group named `NuGet.Insights-{StampName}` and deploy several resources into it.

When the deployment completes successfully, a website URL will be reported in the console as part of a warm-up. You can use this URL to access the admin panel. The end of the output looks like this:
```
...
Warming up the website and workers...
https://nugetinsights-joel.azurewebsites.net/ - 200 OK
https://nugetinsights-joel-worker-0.azurewebsites.net/ - 200 OK
Deployment is complete. Go to here for the admin panel:
https://nugetinsights-joel.azurewebsites.net/Admin
```
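The warm-up step shown in that output amounts to polling each deployed endpoint until it responds with 200 OK. A minimal sketch of that idea follows; the `fetch_status` callback is injected so the sketch runs without network access, and this is an illustration, not the deploy script's actual logic:

```python
def warm_up(urls, fetch_status):
    """Print each endpoint's HTTP status; return True only if all returned 200."""
    all_ok = True
    for url in urls:
        status = fetch_status(url)
        print(f"{url} - {status} {'OK' if status == 200 else 'FAILED'}")
        all_ok = all_ok and status == 200
    return all_ok

# Fake fetcher standing in for an HTTP GET, so the sketch is self-contained.
ok = warm_up(
    [
        "https://nugetinsights-joel.azurewebsites.net/",
        "https://nugetinsights-joel-worker-0.azurewebsites.net/",
    ],
    fetch_status=lambda url: 200,
)
```

In a real check, `fetch_status` would perform an HTTP GET and return the response status code.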
You can go to the first URL (the website URL) in your web browser and click on the Admin link in the nav bar. Then you can start a short run from the "All catalog scans" section: check the "Use custom cursor" checkbox and click the "Start all" button.
For more information about running catalog scans, see Starting a catalog scan.
Use one of the following approaches to run NuGet Insights locally. Project Tye is the easiest if you have Docker installed; otherwise, use a standalone Azure Storage emulator.
From Project Tye's GitHub page:
> Tye is a developer tool that makes developing, testing, and deploying microservices and distributed applications easier. Project Tye includes a local orchestrator to make developing microservices easier and the ability to deploy microservices to Kubernetes with minimal configuration.
It's a great way to run the Insights website, worker, and the Azurite storage emulator all at once with a single command.
Run the following in the root of the repository:

```shell
tye run
```

The output includes the dashboard address:

```
Dashboard running on http://127.0.0.1:8000
```
Proceed to the Starting a catalog scan section.
Run the worker from the root of the repository:

```shell
dotnet run --project src/Worker
```

In a second terminal, run the website, also from the root of the repository:

```shell
dotnet run --project src/Website
```

The output includes the website address:

```
Now listening on: http://localhost:60491
```
Proceed to the Starting a catalog scan section.
A catalog scan is a unit of work for NuGet Insights which runs analysis against all of the packages published during some time range. The time range for a catalog scan is bounded by the cursor from the previous catalog scan (as an exclusive minimum) and an arbitrary timestamp to process up to (as an inclusive maximum). For more information, see the architecture section.
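As a sketch of that range rule (an illustration, not code from the project): a catalog commit is in scope when it is strictly after the previous cursor and at or before the chosen maximum.

```python
from datetime import datetime, timezone

def in_scan_range(commit_timestamp, previous_cursor, max_timestamp):
    # Exclusive minimum, inclusive maximum: previous_cursor < t <= max_timestamp.
    return previous_cursor < commit_timestamp <= max_timestamp

prev_cursor = datetime(2015, 2, 1, 6, 22, 45, tzinfo=timezone.utc)
max_ts = datetime(2015, 3, 1, tzinfo=timezone.utc)

print(in_scan_range(prev_cursor, prev_cursor, max_ts))  # False: the minimum is exclusive
print(in_scan_range(max_ts, prev_cursor, max_ts))       # True: the maximum is inclusive
```

This is why consecutive scans never process the same catalog commit twice: each scan's maximum becomes the next scan's exclusive minimum.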
Once you have opened the localhost website URL mentioned in the section above, follow these steps to start your first catalog scan from the Insights admin panel.
A good custom cursor value is `2015-02-01T06:22:45.8488496Z`, which is the very first commit timestamp in the NuGet V3 catalog.

If you ran a driver like Load package archive, data will be populated into the `packagearchives` table in your Azure Table Storage emulator. If you ran a driver like Package asset to CSV, CSV files will be populated into the `packageassets` container in your Azure Blob Storage emulator. For more information on what each driver does, see the drivers list.
You can use the Azure Storage Explorer to interact with your Azure Storage endpoints (either the storage emulator running locally or in Azure).
When running locally, you can check the application logs shown in the Tye dashboard or in terminal stdout. When running in Azure, you can use Application Insights (note that the default logging level is Warning or higher to reduce cost). You can also look at the Azure Queue Storage queues to understand how much work the Worker has left.
Read about the project's architecture in ARCHITECTURE.md.
These are what the resources look like in Azure after deployment.
This is what the Azure Function looks like running locally, for the Package Manifest to CSV driver.
This is what the results look like in Azure Table Storage. Each row is a package .nuspec stored as compressed MessagePack bytes.
This is what the admin panel looks like to start catalog scans.
This is the driver that reads the file list and package signature from all NuGet packages on NuGet.org and loads them into Azure Table Storage. It took about 35 minutes to do this and cost about $3.37.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.