NuGet / Insights

Gather insights about public NuGet.org package data
Apache License 2.0

Implement recurring scan mechanism for package data that does not affect catalog #67

Closed: joelverhagen closed this 9 months ago

joelverhagen commented 2 years ago

Some data on NuGet.org can be updated without a corresponding catalog leaf getting added. For example:

  1. Adding, updating, or deleting symbols (.snupkg)
  2. Adding, updating, or deleting legacy readmes

Perhaps every package should be checked for these kinds of updates every week or two.

joelverhagen commented 9 months ago

Step 1: partition packages into indexable buckets. This commit does this: every package on nuget.org is assigned to 1 of 1000 buckets in table storage. https://github.com/NuGet/Insights/commit/98cd6b72959760f934371e30435915c57048a206
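For illustration only, here is a minimal Python sketch of how a package could be mapped to a stable bucket index. The hash function (SHA-256), the lowercased `id/version` identity string, and the name `get_bucket` are all assumptions made for this example; the linked commit is the source of truth for how Insights actually computes buckets.

```python
import hashlib

BUCKET_COUNT = 1000  # from the comment above: 1 of 1000 buckets


def get_bucket(package_id: str, version: str, bucket_count: int = BUCKET_COUNT) -> int:
    """Map a package identity to a stable bucket index in [0, bucket_count)."""
    # Assumption: the identity is the lowercased "id/version" string.
    # The real normalization rules may differ.
    identity = f"{package_id.lower()}/{version.lower()}"
    digest = hashlib.sha256(identity.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and reduce it
    # modulo the bucket count so the result is deterministic across runs.
    return int.from_bytes(digest[:8], "big") % bucket_count
```

Because the bucket depends only on the package identity, the same package always lands in the same bucket, so a full pass over buckets 0 through 999 visits every package exactly once.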

joelverhagen commented 9 months ago

Step 2: enable catalog scans on a range of bucket indexes instead of on timestamps. This commit does this: a table scan on a given range of bucket indexes yields catalog leaf scan entities, which drivers can process in much the same way as those from a catalog scan on a commit timestamp range. https://github.com/NuGet/Insights/commit/48f117469cb009b773ec276516d4b8acdd187aec
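A rough sketch of the idea, in Python rather than the project's own code: rows that carry a bucket index are filtered by an inclusive bucket range and turned into leaf scan records for a driver to consume. The types `BucketedPackage` and `LeafScan`, the function `scan_bucket_range`, and the in-memory list standing in for table storage are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class BucketedPackage:
    """Hypothetical row shape: one package version plus its bucket index."""
    bucket: int
    package_id: str
    version: str


@dataclass(frozen=True)
class LeafScan:
    """Hypothetical unit of work handed to a driver."""
    package_id: str
    version: str


def scan_bucket_range(
    rows: Iterable[BucketedPackage], min_bucket: int, max_bucket: int
) -> Iterator[LeafScan]:
    """Yield a leaf scan for every row whose bucket is in [min_bucket, max_bucket]."""
    for row in rows:
        if min_bucket <= row.bucket <= max_bucket:
            yield LeafScan(row.package_id, row.version)


# Usage: an in-memory list stands in for the table storage query.
rows = [
    BucketedPackage(3, "PackageA", "1.0.0"),
    BucketedPackage(42, "PackageB", "2.1.0"),
    BucketedPackage(999, "PackageC", "0.9.0"),
]
work = list(scan_bucket_range(rows, 0, 99))
```

The point of the design is that the driver only sees leaf scan records, so it does not need to know whether they came from a commit timestamp range or a bucket range.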

joelverhagen commented 9 months ago

Step 3: automatically reprocess ranges of bucket indexes over a given time window. This commit does this. https://github.com/NuGet/Insights/commit/f13701b03af7d11bf17307f5cbb7a25d756e14bc
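One way to spread the buckets evenly over a time window can be sketched as follows. This is not taken from the linked commit; the function `buckets_due`, its parameters, and the linear pacing are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def buckets_due(
    window_start: datetime,
    now: datetime,
    window: timedelta,
    bucket_count: int = 1000,
    already_done: int = 0,
) -> range:
    """Return the bucket indexes that should have been processed by `now`
    but have not been yet, assuming buckets are paced linearly over the window."""
    elapsed = now - window_start
    if elapsed <= timedelta(0):
        target = 0
    elif elapsed >= window:
        target = bucket_count
    else:
        # timedelta * int and timedelta // timedelta are exact, so this
        # avoids floating-point rounding when computing the target bucket.
        target = (elapsed * bucket_count) // window
    return range(already_done, max(target, already_done))


# Usage: a hypothetical 10-day window, checked one day in.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
due = buckets_due(start, start + timedelta(days=1), timedelta(days=10))
```

Each time a timer fires, the caller would process the returned range and record the new high-water mark, so all buckets are covered once per window without a single large burst of work.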