Closed asfimport closed 3 years ago
Robert Muir (@rmuir) (migrated from JIRA)
There are quite a few abusers in tests, grouping, etc here, so the issue isn't particularly easy.
But in order to properly support compression for the SortedDocValues, I think we should fix it, so that there aren't traps in the API. Even if we have to deprecate stuff and do it iteratively.
Robert Muir (@rmuir) (migrated from JIRA)
I attached a prototype of what I had in mind: it at least compiles :)
Now it becomes explicit in the source code where the term dict lookups are happening as you see clearly the lookupOrd()
call.
Adrien Grand (@jpountz) (migrated from JIRA)
+1
Robert Muir (@rmuir) (migrated from JIRA)
Updated patch, fixing tests. I didn't run nightly or anything crazy like that though.
Bruno Roustant (@bruno-roustant) (migrated from JIRA)
+1
Robert Muir (@rmuir) (migrated from JIRA)
Whew, finally got through those nightly tests (see LUCENE-9840).
I opened #10880 as a followup here to fix some of the actual performance traps (this issue is simply an API change).
ASF subversion and git services (migrated from JIRA)
Commit f3a284ad830b9473467b5cb364408db5e4cc607f in lucene's branch refs/heads/main from Robert Muir https://gitbox.apache.org/repos/asf?p=lucene.git;h=f3a284a
LUCENE-9796: Fix SortedDocValues to no longer extend BinaryDocValues
SortedDocValues do not have a per-document binary value, they have a
per-document numeric ordValue()
. The ordinal can then be dereferenced
to its binary form with lookupOrd()
, but it was a performance trap to
implement a binaryValue()
on the SortedDocValues api that does this
behind-the-scenes on every document.
You can replace calls of binaryValue()
with lookupOrd(ordValue())
as a "quick fix", but it is better to use the ordinal alone
(integer-based datastructures) for per-document access, and only call
lookupOrd() a few times at the end (e.g. for the hits you want to display).
Otherwise, if you really don't want per-document ordinals, but instead a
per-document byte[]
, use a BinaryDocValues field.
This change only addresses the API (slow binaryValue()
trap), but
doesn't yet fix any slow algorithms that were discovered in the process,
so it doesn't yield any performance improvements.
Bruno Roustant (@bruno-roustant) (migrated from JIRA)
Solr main branch build broke after this change. I worked on SOLR-15261 to fix that. We'll have this kind of issues until Solr uses a Lucene snapshot.
While fixing I noticed that maybe o.a.l.search.FieldComparator$TermValComparator.getLeafComparator() needs to be adapted to not create only BinaryDocValues but also SortedDocValues?
Robert Muir (@rmuir) (migrated from JIRA)
Bruno can we make a followup issue for that? I really don't think we should do this. For SortedDocValues only TermOrdValComparator should be used.
Bruno Roustant (@bruno-roustant) (migrated from JIRA)
Ok, I'll try to update Solr to use TermOrdValComparator, reusing the same issue.
Robert Muir (@rmuir) (migrated from JIRA)
It may be calling the wrong type of SortField (STRING vs STRING_VAL) IIRC. Sorry you are having trouble with solr, and thank you for working on it.
I am test-driven about such refactorings, and it is difficult/impossible for me to help with my tiny 2-core laptop (the solr tests run forever). As a side note, I did try Mark Miller's branch and the tests there are quite usable and run as fast as lucene tests, so it might be a way to help make life easier.
Adrien Grand (@jpountz) (migrated from JIRA)
Closing after the 9.0.0 release
SortedDocValues give ordinals and a way to derefence ordinal as a byte[]
But currently they extend BinaryDocValues, which allows directly calling
binaryValue()
.This allows them to act as a "slow" BinaryDocValues, but it is a performance trap, especially now that terms bytes may be block-compressed (#10702).
I think this should be detangled to prevent performance traps like #10834: SortedDocValues shouldn't have the trappy inherited
binaryValue()
method that implicitly derefs the ord for the doc, then the term bytes for the ord.Migrated from LUCENE-9796 by Robert Muir (@rmuir), resolved Mar 15 2021 Attachments: LUCENE-9796_prototype.patch, LUCENE-9796.patch Linked issues: