NuGet / NuGetGallery

NuGet Gallery is a package repository that powers https://www.nuget.org. Use this repo for reporting NuGet.org issues.
https://www.nuget.org/
Apache License 2.0
1.54k stars 644 forks source link

Define order for packageid: query results for FindPackagesById hijack #5432

Closed joelverhagen closed 4 years ago

joelverhagen commented 6 years ago

Today, the order is defined by downloads then doc ID.

Related: https://github.com/NuGet/NuGetGallery/issues/5431

joelverhagen commented 6 years ago

The root cause is that gallery specifies a document ordering for FindPackagesById hijacks that has an unpredictable order for documents with the same ID. The hijack passes sortBy=relevance which is an unmapped sort by mode. Therefore, search service falls back to the default sort by mode, which is by Lucene query score (which is essentially our notion of relevance).

For packages with the same ID, it is very likely that many of the versions have similar metadata. Therefore, Lucene assigns the same score. In the NLog case, there were only two distinct scores assigned to the 27 versions we tested. Therefore, within the two score groupings, the order of the documents was defined by their Lucene document ID.

The Lucene document ID is an incrementing integer generated by Lucene when a document is added to the index. When a package's unlist status changes or if the package is reflowed, catalog2lucene performs an update on the document, which in Lucene is actually a delete then add. Therefore, the document ID for that package changes due to user or admin action.

Additionally, when an index is rebuilt, many if not all of the document IDs will change, since the order in which they were added to the index is based off of database (db2lucene) instead of catalog.

In other words, this problem happens any time we do an index rebuild. The degree of impact is based off of how recently we have done an index rebuild and how many old packages have been reflowed, relisted, or unlisted. If a package ID has had no new versions published since the last index rebuild and no catalog items have been produced, the relative ordering of the versions will remain constant.

joelverhagen commented 6 years ago

My proposed fix is to change the sortBy that gallery passes to search service during a hijack. The sort should have the following principles:

  1. Defined by some document field, not document ID. This means the sort will remain consistent from one index rebuild to the next.
  2. Monotonically increasing. Having new documents appear in the middle of the order means that paging performed as that document is added will miss the new document. New documents should appear at the end of the results.
  3. Immutable per version. An existing package should not be able to move to the edit of the list due to some user or ops action.

There are four supported sortBy modes today (excluded the default relevance sort). https://github.com/NuGet/NuGet.Services.Metadata/blob/25b24c93fc6bfd88511cd8d6cc65a41cf8a32d9d/src/NuGet.Indexing/GalleryServiceImpl.cs#L124-L129

published is not appropriate since unlisted packages all have the same published date: year 1900 (except for https://github.com/NuGet/NuGetGallery/issues/5435).

title-asc and title-desc are not appropriate since a lot of packages all have the same title.

lastEdited is not appropriate since it is not immutable per version. Also, last edited resolution is day so order is defined by document ID if two packages were edited on the same the day.

Therefore, I propose introducing a new sortBy which is created and based off of the packages created date. Note that this isn't perfectly monotonically increasing but it should be Close Enough™. Note that package key has the same following caveats as created and therefore doesn't help us.

  1. Hard delete then recreate will cause a change, breaking principle number 3.
  2. Async validation means packages come in out of order, breaking principle number 2.
  3. DB commits of new packages do not necessarily get commits in the same order they are started, also breaking principle number 2.

tl;dr: the best I think we can do is by making FindPackagesById() ask for results sorted by package created date, which requires a new document field CreatedDate and a new sort field.

joelverhagen commented 6 years ago

The performance difference between USSC and USNC hijack could be analyzed in the gallery (latency).