NuGet / NuGetGallery

NuGet Gallery is a package repository that powers https://www.nuget.org. Use this repo for reporting NuGet.org issues.
https://www.nuget.org/
Apache License 2.0

Reading through package audit blobs is inefficient #3981

Open scottbommarito opened 7 years ago

scottbommarito commented 7 years ago

Currently, our package audit blobs are named using the following pattern:

package/<id>/<version>/<guid>.<operation>.json

Because blob storage can only be queried by prefix, querying the package audit blobs by anything other than the package being modified is incredibly slow. For example, to access the package audit blobs that were added in the last 15 minutes, one must first list every blob and then filter and sort the results oneself, which is horribly inefficient.

Unfortunately this is a required function of Feed2Catalog, both of which must access the deleted packages up to a timestamp.

We should change the way we store package audit logs to a form that we can query more efficiently.

This could be a new format...

<timestamp in ticks>/<id>/<version>/<operation>.json

...or we could move it to another service, such as table storage.
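To illustrate why a timestamp-first name helps, here is a minimal Python sketch. It assumes ticks are .NET-style (100 ns intervals since 0001-01-01) and zero-padded to 19 digits so that string order matches time order; the helper names (`to_ticks`, `audit_blob_name`, `blobs_since`) are hypothetical, and the in-memory list only stands in for a store that supports prefix listing.

```python
from datetime import datetime, timedelta, timezone
from os.path import commonprefix

# Assumption: .NET-style ticks, i.e. 100 ns intervals since 0001-01-01T00:00:00.
_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)
# The largest .NET tick value has 19 digits; padding to 19 keeps
# lexicographic order identical to chronological order.
_TICK_DIGITS = 19


def to_ticks(moment: datetime) -> int:
    """Convert a timezone-aware datetime to ticks using integer arithmetic."""
    delta = moment - _EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10


def tick_key(moment: datetime) -> str:
    """Zero-padded tick string used as the leading path segment."""
    return f"{to_ticks(moment):0{_TICK_DIGITS}d}"


def audit_blob_name(moment: datetime, package_id: str, version: str, operation: str) -> str:
    """Build a name in the proposed <ticks>/<id>/<version>/<operation>.json form."""
    return f"{tick_key(moment)}/{package_id}/{version}/{operation}.json"


def blobs_since(all_names, cutoff: datetime, now: datetime):
    """Return names stamped at or after `cutoff`, listing only by prefix.

    The prefix is the run of leading digits shared by the cutoff and the
    current time, so only blobs in that time window are listed at all.
    """
    cutoff_key = tick_key(cutoff)
    prefix = commonprefix([cutoff_key, tick_key(now)])
    # Stand-in for a prefix listing call against the real store.
    listed = [name for name in all_names if name.startswith(prefix)]
    return sorted(name for name in listed if name[:_TICK_DIGITS] >= cutoff_key)


# Example: one old audit record and one from five minutes ago.
now = datetime(2017, 6, 1, 12, 0, tzinfo=timezone.utc)
names = [
    audit_blob_name(now - timedelta(days=30), "Some.Package", "1.0.0", "delete"),
    audit_blob_name(now - timedelta(minutes=5), "Other.Package", "2.0.0", "delete"),
]
recent = blobs_since(names, now - timedelta(minutes=15), now)
```

The shared-prefix trick narrows the listing to a time window rather than the whole container; the final filter then trims that window to the exact cutoff.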

xavierdecoster commented 7 years ago

I'd even suggest sorting by default using a reverse timestamp, so the latest blobs are listed first.
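A reverse timestamp could be sketched as follows in Python. This assumes .NET-style ticks and subtracts them from the largest .NET tick value, so that newer records get smaller keys and sort first in a lexicographic listing; `reverse_tick_key` is a hypothetical helper name, not something from the gallery code.

```python
from datetime import datetime, timedelta, timezone

# Assumption: .NET-style ticks, i.e. 100 ns intervals since 0001-01-01T00:00:00.
_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)
# Largest .NET tick value (9999-12-31T23:59:59.9999999).
MAX_TICKS = 3_155_378_975_999_999_999


def to_ticks(moment: datetime) -> int:
    """Convert a timezone-aware datetime to ticks using integer arithmetic."""
    delta = moment - _EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10


def reverse_tick_key(moment: datetime) -> str:
    """Zero-padded (MAX_TICKS - ticks): later moments yield smaller keys."""
    return f"{MAX_TICKS - to_ticks(moment):019d}"


# Example: keys for two moments an hour apart.
earlier = datetime(2017, 6, 1, 11, 0, tzinfo=timezone.utc)
later = earlier + timedelta(hours=1)
keys_in_listing_order = sorted([reverse_tick_key(earlier), reverse_tick_key(later)])
```

With keys like these, a consumer that only needs the most recent audit records can stop reading the listing as soon as it reaches a key older than its cutoff.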

cristinamanum commented 6 years ago

@dtivel - seems to be related to the work you are doing?

dtivel commented 6 years ago

Yup