apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.67k stars 1.04k forks source link

fix SortedDocValues to no longer extend BinaryDocValues [LUCENE-9796] #10835

Closed asfimport closed 3 years ago

asfimport commented 3 years ago

SortedDocValues give ordinals and a way to derefence ordinal as a byte[]

But currently they extend BinaryDocValues, which allows directly calling binaryValue().

This allows them to act as a "slow" BinaryDocValues, but it is a performance trap, especially now that terms bytes may be block-compressed (#10702).

I think this should be detangled to prevent performance traps like #10834: SortedDocValues shouldn't have the trappy inherited binaryValue() method that implicitly derefs the ord for the doc, then the term bytes for the ord.


Migrated from LUCENE-9796 by Robert Muir (@rmuir), resolved Mar 15 2021 Attachments: LUCENE-9796_prototype.patch, LUCENE-9796.patch Linked issues:

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

There are quite a few abusers in tests, grouping, etc here, so the issue isn't particularly easy.

But in order to properly support compression for the SortedDocValues, I think we should fix it, so that there aren't traps in the API. Even if we have to deprecate stuff and do it iteratively.

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I attached a prototype of what I had in mind: it at least compiles :)

Now it becomes explicit in the source code where the term dict lookups are happening as you see clearly the lookupOrd() call.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Updated patch, fixing tests. I didn't run nightly or anything crazy like that though.

asfimport commented 3 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

+1

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Whew, finally got through those nightly tests (see LUCENE-9840).

I opened #10880 as a followup here to fix some of the actual performance traps (this issue is simply an API change).

asfimport commented 3 years ago

ASF subversion and git services (migrated from JIRA)

Commit f3a284ad830b9473467b5cb364408db5e4cc607f in lucene's branch refs/heads/main from Robert Muir https://gitbox.apache.org/repos/asf?p=lucene.git;h=f3a284a

LUCENE-9796: Fix SortedDocValues to no longer extend BinaryDocValues

SortedDocValues do not have a per-document binary value, they have a per-document numeric ordValue(). The ordinal can then be dereferenced to its binary form with lookupOrd(), but it was a performance trap to implement a binaryValue() on the SortedDocValues api that does this behind-the-scenes on every document.

You can replace calls of binaryValue() with lookupOrd(ordValue()) as a "quick fix", but it is better to use the ordinal alone (integer-based datastructures) for per-document access, and only call lookupOrd() a few times at the end (e.g. for the hits you want to display). Otherwise, if you really don't want per-document ordinals, but instead a per-document byte[], use a BinaryDocValues field.

This change only addresses the API (slow binaryValue() trap), but doesn't yet fix any slow algorithms that were discovered in the process, so it doesn't yield any performance improvements.

asfimport commented 3 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

Solr main branch build broke after this change. I worked on SOLR-15261 to fix that. We'll have this kind of issues until Solr uses a Lucene snapshot.

While fixing I noticed that maybe o.a.l.search.FieldComparator$TermValComparator.getLeafComparator() needs to be adapted to not create only BinaryDocValues but also SortedDocValues?

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Bruno can we make a followup issue for that? I really don't think we should do this. For SortedDocValues only TermOrdValComparator should be used.

asfimport commented 3 years ago

Bruno Roustant (@bruno-roustant) (migrated from JIRA)

Ok, I'll try to update Solr to use TermOrdValComparator, reusing the same issue.

asfimport commented 3 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It may be calling the wrong type of SortField (STRING vs STRING_VAL) IIRC. Sorry you are having trouble with solr, and thank you for working on it.

I am test-driven about such refactorings, and it is difficult/impossible for me to help with my tiny 2-core laptop (the solr tests run forever). As a side note, I did try Mark Miller's branch and the tests there are quite usable and run as fast as lucene tests, so it might be a way to help make life easier.

asfimport commented 2 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.0.0 release