apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Port index sorter to trunk APIs [LUCENE-3918] #4991

Closed asfimport closed 11 years ago

asfimport commented 12 years ago

3556 added an IndexSorter to 3.x, but we need to port this functionality to the 4.0 APIs.


Migrated from LUCENE-3918 by Robert Muir (@rmuir), 2 votes, resolved Mar 10 2013

Attachments: LUCENE-3918.patch (versions: 16)

Linked issues:

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Patch fixing tests to not suppress whole codecs.

Instead, testSortedSet() has an assume (and is ignored for ancient codecs).

In the case of offsets, ancient codecs just index and test docs/freqs/positions without offsets.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

> I use two parallel arrays to sort the documents (docs and values)

I updated the patch to use doc IDs as ords so that values are never swapped (only doc IDs) and the numeric doc values don't need to be all loaded in memory.
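A minimal sketch of the idea described above: sort an array of doc IDs and read each value on demand through a lookup, so the values themselves are never copied or swapped. This is not the code from the patch. The class name `DocIdSortSketch`, the method `sortByValue`, and the `IntToLongFunction` lookup are hypothetical stand-ins for whatever the patch uses to read numeric doc values.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.IntToLongFunction;

/** Hypothetical sketch; not the code from the LUCENE-3918 patch. */
final class DocIdSortSketch {

  /**
   * Returns old2new, where old2new[oldDocId] is the document's position after sorting by value.
   * Only doc IDs are moved; values are read on demand through the lookup.
   */
  static int[] sortByValue(int maxDoc, IntToLongFunction valueOf) {
    // Boxed IDs so a comparator can be used; a real implementation
    // would more likely sort a primitive int[] in place.
    Integer[] new2old = new Integer[maxDoc];
    for (int i = 0; i < maxDoc; i++) {
      new2old[i] = i;
    }

    // The object sort is stable, so documents with equal values keep their original order.
    Arrays.sort(new2old, Comparator.comparingLong((Integer doc) -> valueOf.applyAsLong(doc)));

    // Invert new2old to obtain the old-to-new mapping.
    int[] old2new = new int[maxDoc];
    for (int newId = 0; newId < maxDoc; newId++) {
      old2new[new2old[newId]] = newId;
    }
    return old2new;
  }

  public static void main(String[] args) {
    long[] values = {30L, 10L, 20L}; // stand-in for a per-document numeric value
    int[] old2new = sortByValue(values.length, doc -> values[doc]);
    System.out.println(Arrays.toString(old2new)); // prints [2, 0, 1]
  }
}
```

Because the comparator only calls the lookup, the values can live wherever the lookup reads them from; the sort itself holds nothing but doc IDs.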

> So one option is to remove the class, but still keep a test around which does the addIndexes to make sure it works.

+1

> I don't want however to add a main that is limited to NumericDV ... and I do think that stored fields / payload value are viable options.

I still don't get why someone would use stored fields rather than doc values (either binary, sorted or numeric) to sort his index. I think it's important to make users understand that stored fields are only useful to display results?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thanks Rob - I didn't know we can check these things :). Certainly better than suppressing the entire Codec.

Adrien, thanks for the update as well. So if someone loads NumericDV (default), indeed there's no need to copy the values again into an array. If someone uses DiskDVFormat though, list.get(i) will access the disk on every call ... but I guess that's fine since if someone wanted to save RAM, he should be ready to pay the price, and we should respect him.

> I still don't get why someone would use stored fields rather than doc values (either binary, sorted or numeric) to sort his index. I think it's important to make users understand that stored fields are only useful to display results?

Someone might have an existing index without DV. Also, who said that a stored field used for display cannot be used to sort the index? But, since it's quite trivial to implement, I'll remove both Payload and StoredFields. I'll also make Reverse and Numeric sorters inner classes (though public) of Sorter.

I added a check in the SortingAtomicReader ctor that old2new.length == reader.maxDoc(), to ensure that sorters provide a mapping for every document in the index. I'll get rid of IndexSorter, but keep a test around and add a code example to the SortingAR javadocs showing how to use it for addIndexes.
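A minimal sketch of that kind of check, written as a standalone helper rather than as the actual constructor code. Only the length comparison comes from the comment above. The class name `DocMapCheckSketch`, the method `checkDocMap`, and the extra permutation check (each new ID in range and used once) are assumptions added for illustration.

```java
import java.util.BitSet;

/** Hypothetical sketch; not the SortingAtomicReader constructor. */
final class DocMapCheckSketch {

  /** Throws IllegalArgumentException if old2new does not map every document. */
  static void checkDocMap(int[] old2new, int maxDoc) {
    // The check described in the comment: one mapping entry per document.
    if (old2new.length != maxDoc) {
      throw new IllegalArgumentException(
          "old2new.length=" + old2new.length + " but maxDoc=" + maxDoc);
    }

    // Assumption beyond the comment: also require a valid permutation.
    BitSet seen = new BitSet(maxDoc);
    for (int oldId = 0; oldId < maxDoc; oldId++) {
      int newId = old2new[oldId];
      if (newId < 0 || newId >= maxDoc || seen.get(newId)) {
        throw new IllegalArgumentException(
            "invalid or duplicate new doc ID " + newId + " for old doc ID " + oldId);
      }
      seen.set(newId);
    }
  }

  public static void main(String[] args) {
    checkDocMap(new int[] {2, 0, 1}, 3); // accepted: a permutation of 0..2
    System.out.println("mapping accepted");
  }
}
```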

Will upload a new patch later.

asfimport commented 11 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

> I still don't get why someone would use stored fields rather than doc values (either binary, sorted or numeric) to sort his index. I think it's important to make users understand that stored fields are only useful to display results?

This is a legacy of the original usage of this tool in Nutch - indexes would use a PageRank value as a document boost, and that was the value to be used for sorting - but since the doc boost is not recoverable from an existing index the value itself was stored in a stored field.

And definitely DV didn't exist yet at that time :)

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch removes IndexSorter (but keeps IndexSortingTest). I also:

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I think it's ready. If there are no objections, I'd like to commit it later today.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch avoids encoding offsets in memory if offsets are not indexed. This saves 10 bytes per position in most cases (since offsets are not indexed by default, even for positions-enabled fields, e.g. TextField).
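A rough sketch of the general technique, assuming an in-memory buffer that stores each position as a variable-length int and appends the start offset and offset length only when offsets are indexed. The class `PositionBufferSketch`, its encoding, and the byte counts it produces are illustrative assumptions. They are not the patch's actual layout and do not reproduce the 10-byte figure quoted above.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch; the patch's real in-memory encoding may differ. */
final class PositionBufferSketch {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final boolean storeOffsets;

  PositionBufferSketch(boolean storeOffsets) {
    this.storeOffsets = storeOffsets;
  }

  /** Buffers one position; offsets are written only if they are indexed. */
  void add(int position, int startOffset, int endOffset) {
    writeVInt(position);
    if (storeOffsets) {
      writeVInt(startOffset);
      writeVInt(endOffset - startOffset); // store the length rather than the end
    }
  }

  int sizeInBytes() {
    return buffer.size();
  }

  /** Writes a non-negative int using 7 bits per byte, low bits first. */
  private void writeVInt(int value) {
    while ((value & ~0x7F) != 0) {
      buffer.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    buffer.write(value);
  }

  public static void main(String[] args) {
    PositionBufferSketch withOffsets = new PositionBufferSketch(true);
    PositionBufferSketch withoutOffsets = new PositionBufferSketch(false);
    withOffsets.add(5, 100, 104);
    withoutOffsets.add(5, 100, 104);
    System.out.println(withOffsets.sizeInBytes() + " vs " + withoutOffsets.sizeInBytes());
  }
}
```

The flag is fixed at construction time, so readers of the buffer know up front whether offset bytes follow each position.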

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch avoids encoding freqs in memory if freqs were not indexed (even if they are requested in flags).

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[trunk commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1454801

LUCENE-3918: port IndexSorter to trunk API

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Committed to trunk and 4x. Thanks Anat, your work has re-ignited this issue!

asfimport commented 11 years ago

Commit Tag Bot (migrated from JIRA)

[branch_4x commit] Shai Erera http://svn.apache.org/viewvc?view=revision&revision=1454804

LUCENE-3918: port IndexSorter to trunk API

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Closed after release.