joelverhagen closed this 9 months ago
Step 1: partition packages into indexable buckets. All packages on NuGet.org will be partitioned into 1 of 1000 buckets in table storage. This commit does this: https://github.com/NuGet/Insights/commit/98cd6b72959760f934371e30435915c57048a206
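To illustrate the bucketing idea, here is a minimal Python sketch. It is not the code from the commit: the function name `bucket_for`, the use of SHA-256, and the choice of a lowercased `id/version` string as the key are all assumptions made for illustration. The only detail taken from the comment above is the bucket count of 1000.

```python
import hashlib

BUCKET_COUNT = 1000  # from the comment above: 1 of 1000 buckets


def bucket_for(package_id: str, version: str) -> int:
    """Map a package identity to a stable bucket index in [0, BUCKET_COUNT)."""
    # Assumption: the key is the lowercased "id/version" string, so that
    # casing differences do not move a package to a different bucket.
    identity = f"{package_id.lower()}/{version.lower()}"
    digest = hashlib.sha256(identity.encode("utf-8")).digest()
    # Assumption: interpret the first 8 bytes as an unsigned integer and
    # reduce it modulo the bucket count.
    return int.from_bytes(digest[:8], "big") % BUCKET_COUNT
```

Because the index depends only on the package identity, a package stays in the same bucket no matter when it was last touched in the catalog, which is what makes the buckets usable as a stable index.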
Step 2: enable catalog scans on ranges of bucket indexes instead of on timestamps. A table scan on a given range of bucket indexes yields catalog leaf scan entities, and drivers can process them in much the same way as those from a catalog scan on a commit timestamp range. This commit does this: https://github.com/NuGet/Insights/commit/48f117469cb009b773ec276516d4b8acdd187aec
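A rough sketch of a bucket-range scan, using an in-memory list in place of table storage. The types `BucketedPackage` and `LeafScan` and the function `scan_bucket_range` are hypothetical names for illustration; the real entities in NuGet Insights carry more fields and are read from a table query rather than a Python list.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class BucketedPackage:
    """Hypothetical row: one package stored under its bucket index."""
    bucket: int
    package_id: str
    version: str


@dataclass(frozen=True)
class LeafScan:
    """Hypothetical unit of work handed to a driver."""
    package_id: str
    version: str


def scan_bucket_range(
    rows: Iterable[BucketedPackage], min_bucket: int, max_bucket: int
) -> Iterator[LeafScan]:
    """Yield one leaf scan per row whose bucket is in [min_bucket, max_bucket]."""
    for row in rows:
        if min_bucket <= row.bucket <= max_bucket:
            yield LeafScan(row.package_id, row.version)
```

The point of the shape is that the driver only sees `LeafScan` items, so it does not need to know whether they came from a timestamp range or a bucket range.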
Step 3: automatically reprocess ranges of bucket indexes over a given time window. This commit does this: https://github.com/NuGet/Insights/commit/f13701b03af7d11bf17307f5cbb7a25d756e14bc
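One way to spread the reprocessing evenly over a time window is sketched below. This is an assumption about how such a schedule could work, not a description of the commit: the function `buckets_due`, the fixed `EPOCH`, and the 14-day window with hourly ticks used in the usage example are all illustrative choices.

```python
from datetime import datetime, timedelta, timezone

BUCKET_COUNT = 1000  # bucket count from step 1
EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)  # arbitrary fixed origin


def buckets_due(now: datetime, window: timedelta, interval: timedelta) -> range:
    """Return the bucket indexes to reprocess on the tick containing `now`.

    The window is split into whole ticks of length `interval`; each tick
    gets a contiguous slice of buckets, and the slices of one window
    together cover every bucket exactly once.
    """
    ticks_per_window = window // interval  # timedelta // timedelta is an int
    if ticks_per_window < 1:
        raise ValueError("window must be at least one interval long")
    tick = ((now - EPOCH) // interval) % ticks_per_window
    start = tick * BUCKET_COUNT // ticks_per_window
    end = (tick + 1) * BUCKET_COUNT // ticks_per_window
    return range(start, end)
```

For example, `buckets_due(datetime.now(timezone.utc), timedelta(days=14), timedelta(hours=1))` returns a slice of two or three buckets, so each hourly run does a small amount of work and all 1000 buckets are revisited once per 14 days.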
Some data on NuGet.org can be updated without a corresponding catalog leaf getting added. For example:
Perhaps every package should be checked for these kinds of updates every week or two.