apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Port index sorter to trunk APIs [LUCENE-3918] #4991

Closed asfimport closed 11 years ago

asfimport commented 12 years ago

3556 added an IndexSorter to 3.x, but we need to port this functionality to the 4.0 APIs.


Migrated from LUCENE-3918 by Robert Muir (@rmuir), 2 votes, resolved Mar 10 2013
Attachments: LUCENE-3918.patch (versions: 16)
Linked issues:

asfimport commented 11 years ago

Anat (migrated from JIRA)

We have been working on porting IndexSorter to the trunk API and have added some more features to it. The IndexSorter sorts documents in the index according to IDs, DocValues or payloads; each of these is a different implementation of the Sorter interface. In addition, we provide a way to correctly access and iterate over documents in a sorted index, such as SortingIndexReader, SortingFields, SortingTerms, etc.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Wow, big patch! It looks good! Some comments:

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

the 3.6 impl seemed somewhat more RAM efficient (left things on disk and seek'd on each lookup)

From what I remember (and I looked at the code again now), the 3.6 impl read the posting into a RAMOutputStream:

    private static final String TEMP_FILE = "temp";
    private final RAMDirectory tempDir = new RAMDirectory();
    private RAMOutputStream out;

You can look at how seek(TermsEnum) iterates over the original enum, writing everything to out; calls to next() are then served from this in-memory file. That concept has not changed in this version.

I agree that List&lt;Integer&gt; seems like overkill here, and a growable int[] would be better. Since the array is sorted, perhaps we can grow the array (using ArrayUtil.grow) but keep an 'upto' and fill the extra values in the array with Integer.MAX_VALUE, so that they always sort last (instead of trimming the array). Then nextDoc() will just iterate up to 'upto'.
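Shai's growable-array idea can be sketched in plain Java. ArrayUtil.grow is Lucene's growth helper; Arrays.copyOf stands in for it here, and the class name is made up for illustration:

```java
import java.util.Arrays;

/**
 * Conceptual sketch (not the patch's actual code) of the growable doc-id
 * buffer described above: grow the array, track 'upto', and pad the tail
 * with Integer.MAX_VALUE so padding always sorts last and never needs trimming.
 */
class GrowableDocIdBuffer {
  private int[] docs = new int[4];
  private int upto = 0;

  GrowableDocIdBuffer() {
    Arrays.fill(docs, Integer.MAX_VALUE); // padding sorts last
  }

  void add(int doc) {
    if (upto == docs.length) {
      // stand-in for Lucene's ArrayUtil.grow(int[])
      int old = docs.length;
      docs = Arrays.copyOf(docs, old + (old >> 1) + 1);
      Arrays.fill(docs, old, docs.length, Integer.MAX_VALUE);
    }
    docs[upto++] = doc;
  }

  int size() { return upto; }

  /** nextDoc()-style iteration just walks up to 'upto'. */
  int get(int i) { return i < upto ? docs[i] : Integer.MAX_VALUE; }
}
```

The padding trick means the array can be sorted or binary-searched at any time without trimming it to size first.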

Some places seem to open a DirectoryReader but not close it, e.g. DocumentSorter.oldToNew.

I think that, first, the tests need to extend LuceneTestCase (from what I can tell, they don't now) and then use newDirectory(). That should catch all these places, I think.

Also, from a quick look at the tests:

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

From what I remember (and I looked at the code again now), the 3.6 impl read the posting into a RAMOutputStream:

OK but that approach is quite memory efficient (the postings are stored compressed in RAM) ... vs creating java objects per-doc and per-position (as SortingD&PEnum does in the patch).

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Yeah, I see what you mean. Instead of Map<Integer,Position>, we could encode the positions (including payload) in a stream more efficiently and then sort a Position class which contains the docId + offset in the encoded stream. I'm guessing we could do that using a growable DataOutput (instead of RAMOutputStream which uses arrays of 1024 bytes) or something like that.

I think that will indeed be more efficient. Also, the Map is a bit of overkill.
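A rough sketch of this encoding idea, with java.io streams standing in for Lucene's DataOutput/DataInput (class and method names here are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/**
 * Sketch: positions (and payloads, omitted here) go into one growable byte
 * stream; per doc we only keep the docId plus its offset into the stream,
 * instead of a java object per position.
 */
class PositionEncoder {
  private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
  private final DataOutputStream out = new DataOutputStream(bytes);

  /** Appends one doc's positions; returns the offset where they start. */
  int encode(int[] positions) {
    try {
      int offset = bytes.size();
      out.writeInt(positions.length);
      for (int p : positions) out.writeInt(p);
      return offset;
    } catch (IOException e) { // cannot happen for in-memory streams
      throw new UncheckedIOException(e);
    }
  }

  int[] decode(int offset) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
      in.skipBytes(offset);
      int[] positions = new int[in.readInt()];
      for (int i = 0; i < positions.length; i++) positions[i] = in.readInt();
      return positions;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The point is the data layout, not the exact stream class: one flat byte buffer plus (docId, offset) pairs replaces a Map of per-position objects.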

asfimport commented 11 years ago

Anat (migrated from JIRA)

Thanks for the feedback, I'll look into it and post back when I have any updates.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

I was thinking today that it would be nice if a document's stored fields file could be sorted by some field (e.g. a geohash, or time, or ...). Wouldn't that be cool? Would such a feature obsolete the use of the index sorter that this issue speaks of?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

One of the use cases for IndexSorter is when you want to do quick early termination for a query. If you sort the index (say by date), and you're interested in documents sorted by date, then you can do a fast early termination after visiting exactly N docs.

I'm not sure that I understand your proposal though and whether it'll help such usecases?
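The early-termination use case Shai describes can be illustrated with a plain loop. In real Lucene a Collector would signal termination (e.g. by throwing CollectionTerminatedException); everything below is a made-up conceptual sketch:

```java
/**
 * Conceptual illustration of early termination: when the index is stored
 * in sorted order (say by date, newest first), the first N matching docs
 * are already the top-N, so collection can stop after exactly N matches.
 */
class EarlyTermination {
  /** Docs arrive in index order, which equals sort order; returns the first n matches. */
  static int[] topN(int[] sortedDocs, java.util.function.IntPredicate matches, int n) {
    int[] hits = new int[n];
    int found = 0;
    for (int doc : sortedDocs) {
      if (matches.test(doc)) {
        hits[found++] = doc;
        if (found == n) break; // early terminate: no later doc can rank higher
      }
    }
    return java.util.Arrays.copyOf(hits, found);
  }
}
```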

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Oh, that's interesting. My proposal doesn't help your early termination query use-case. It does help cache-locality for spatial oriented queries, or queries in the vicinity of a time "recent" or whatever.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I didn't know about IndexSorter. It would be nice to have a codec wrapper that would override all *Writer/Consumer.merge methods to make the merged segments sorted. This could help make the I/O cache happy with stored fields/term vectors, and early-terminate queries after visiting min(maxBufferedDocs, numHits) documents, without requiring the index to be read-only or storing a doc ID mapping in memory (but at a higher indexing/merging cost)?

asfimport commented 11 years ago

Eks Dev (migrated from JIRA)

This is the right way to give some really good meaning to the venerable optimize call :)

We were, and are, sorting our data before indexing just to achieve exactly this: an improvement in locality of reference. Depending on the data (it has to be somehow sortable, e.g. a hierarchical structure, on URL ...), the speedup gains (and likely the compression gains Adrien made) are sometimes unbelievable...

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Then let's do this! (The right way: in memory during indexing, before it hits the disk, not re-ordering existing on-disk segments after the fact.) I created #5817 for this. I won't be the one to implement it, at least not now.

asfimport commented 11 years ago

Anat (migrated from JIRA)

Added an updated version of the patch containing changes relating to the comments.

The major changes are:

- Tests now extend LuceneTestCase.
- List&lt;Integer&gt; was replaced with a growable int array.
- Instead of using a Java Map object, we now encode the positions information in a DataOutput stream.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

This patch seems to have been created against an old 4.x branch (4.0 or 4.1 maybe?). We usually commit new features to trunk before backporting them to branch_4x, could you update the patch so that it can apply on top of trunk? Some comments on this new patch:

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch addresses the following:

Also,

...shouldn't this interface directly take an IndexReader as an argument?

You're right, fixed in this patch too!

SorterUtil.sort uses the stored fields API...

Fixed in the patch!

SortingIndexReader constructor expects a CompositeIndexReader...

Fixed!

Why does SortingIndexReader.getLiveDocs always return null?

I don't remember at the moment why, but this utility currently doesn't handle source indexes with deleted documents ... again, not sure why; maybe the old sorter didn't support them either? Anyway, I made SortingIndexReader yell at you if you pass such a reader, and we can separately (or still in this issue) enhance it to support such indexes. I will think about it.

I wonder if perhaps instead of To we should have an extension point on each of the sorters, to parse the relevant information, returning a Comparable?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Then lets do this! (the right way, in memory during indexing before it hits the disk, not re-ordering existing on-disk segments after the fact)

This issue is about porting the previous IndexSorter implementation to the trunk API. The previous one offered a one-time sorting of an index, and so does this one. While that doesn't mean we shouldn't explore alternatives, I find it a much lower-hanging fruit than #5817, especially as no one has yet assigned that issue to himself, nor does it look like any progress was made. If #5817 eventually sees the light of day, I don't mind if we nuke IndexSorter completely (in favor of a SortingCodec, I guess?), but until then, I think that offering users A way to sort their index is valuable too.

Also, it's not clear to me at the moment (but I admit I haven't thought about it much) how you can sort documents during indexing, while the values to sort by may still be unknown. I.e., what if your sort-by key is a NumericDocValues field which the Codec hasn't seen yet? How should it write posting lists, stored fields etc.? Does this mean the Codec must cache the entire to-be-written segment in RAM? That would consume much more RAM than the approach in this issue ...

I think that online sorting is much more powerful than a one-time sort, but there's work to do to make it happen, and efficiently. Therefore until then, I think that we should proceed with this offline sorting strategy, which is better than nothing.

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I would rename SortingIndexReader to SortingAtomicReader for consistency. We did this with all 3.x IndexReaders.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thanks Uwe, good idea! I also made it take a Sorter and not int[] old2New, which is too low-level IMO.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Got rid of To. StoredDocumentSorter and PayloadSorter are now abstract classes with an abstract parse() method, taking the relevant input and returning a Comparable.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Also, it's not clear to me at the moment (but I admit I haven't thought about it much) how you can sort documents during indexing, while the values to sort by may still be unknown. I.e., what if your sort-by key is a NumericDocValues field which the Codec hasn't seen yet? How should it write posting lists, stored fields etc.? Does this mean the Codec must cache the entire to-be-written segment in RAM? That would consume much more RAM than the approach in this issue ...

I totally agree with Shai.

If someone opened an issue to provide an option where IndexWriter constantly optimized in the background, I would be totally against that. Same goes for it doing sorting in the background: unless someone comes up with some serious magic, I don't see this happening with acceptable runtime.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I noticed that package.html had gotten seriously outdated, so I shortened it and removed class references from it. From my experience, it's not easy to maintain these lists of classes in package.html, especially for an experimental feature.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Patch with the other docvalues types too.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Well, in theory, one could do the following:

But that's all in theory, none of the API exist today. Also, even if you did that, your index would still be unsorted, because a sorted segment says nothing about the total sort order of the index...

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

patch with the other docvalues types too

Thanks Rob!

Indeed, I didn't separate the ability to sort by other DocValues types from the ability to read them from an index, given some permutation on the docs. So now we have full support for reading such values from the index, given a permutation on the docs.

For sorting by such values, we offer for now only NumericDVSorter. Others can be added later on demand.

I think that in terms of the API and functionality, this is quite ready. However, we should beef up the tests .. for instance:

We don't need multi-threaded tests, as IndexSorter is not written for use in such scenarios.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

In my mind, the most important thing when testing such tools is to ensure that they won't lead to IndexWriter.addIndexes making a corrupt index.

Previously you could only really check this by calling IW.addIndexes and CheckIndex on the resulting index.

But now we also have _TestUtil.checkReader(IndexReader), so you can verify the consistency of the reader itself, which makes it easier to debug.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Ok, I'm making some changes to the tests now, compact them a little bit and add some randomness. I'll factor in checkReader too!

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The slow-wrapping in the SortingAtomicReader must be removed before committing as well.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

The slow-wrapping in the SortingAtomicReader must be removed before committing as well.

How? This utility sorts an entire index ... it doesn't work per-segment. The old2New mapping is a global mapping. I thought about this, but I didn't see any real choice. I.e., even if we sort each segment separately, we'd still need to sort the index globally (see my comment about the online sorting).

If you have an idea how it can be per-segment, I'll be happy to hear it.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Please make it only take an atomic reader in its ctor. The user can slow-wrap themselves.

The current api is bogus.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Ok I will. That way, one can use it to achieve per-segment sorting .. if it means something for anyone (perhaps for better compression).

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Shai, why do you think index sorting is done after the index has been written (this patch) vs. before writing (the LUCENE-4752 proposal)? Is it harder and more complicated to write it sorted in the first place? Or maybe it would slow writing down, when this is the kind of thing that could be scheduled as a nightly job? Just guessing. The lack of support for deleted documents is surprising; is this API typically used in a batch job that first deletes, then sorts?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

The approach taken by this issue (well, originally by the 3x IndexSorter) is that you do an offline sorting of an entire index. So if you e.g. have a 10-segments index, you end up with a single segment, totally sorted across all documents.

At least from my understanding of how the online sorting would work (#5817), the Codec would need to determine the permutation on the documents beforehand, or build an in-memory segment and then, when it's done, sort it and write it sorted, right? Otherwise, I don't understand how it can handle this series of addDocuments (assume the value denotes the location of the document in the sorted index): doc(2), doc(1), doc(7), doc(0)...? The stored fields and term vectors are not cached in memory today. The location of the document in the sorted index is unknown until all keys (by which you sort) are encountered, which may be too late for the Codec?

And even if you get past that hurdle (say you're willing to cache everything in memory and then flush to disk sorted), how will you handle merges? Say you have an index with segments 1, 2, 3 (each sorted). How do you merge-sort them? Today you don't have the API for it, so let's say we add it (plugging in your own SegmentMerger). Now the MP selects segments 1, 2 for merge, so you end up with segments 3, 4, which are again each sorted separately, but the index is not globally sorted, right? In a sorted index, the segments need to have a consistent > (or <) relationship between them ... otherwise you're just traversing documents in random order.
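The merge-sort step Shai says is missing can be sketched abstractly: given two segments, each sorted by its own key array, interleave their docs into one globally sorted order. The names here are illustrative, not Lucene API:

```java
/**
 * Sketch of merging two sorted segments into one globally sorted order.
 * keysA/keysB are each ascending; the result holds "global" doc IDs,
 * where segment B's docs are offset by keysA.length.
 */
class SortedSegmentMerge {
  static int[] merge(long[] keysA, long[] keysB) {
    int[] order = new int[keysA.length + keysB.length];
    int i = 0, j = 0, k = 0;
    while (i < keysA.length && j < keysB.length) {
      if (keysA[i] <= keysB[j]) order[k++] = i++;
      else order[k++] = keysA.length + (j++);
    }
    while (i < keysA.length) order[k++] = i++;
    while (j < keysB.length) order[k++] = keysA.length + (j++);
    return order;
  }
}
```

This is exactly the step that today's SegmentMerger, which simply concatenates segments, does not perform; hence Shai's point that per-segment sorting alone cannot keep the whole index sorted across merges.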

In short, if you do come up with a reasonable way to do online index sorting (on LUCENE-4752), I'll be all for it. And if it will make sense, we can even drop the offline index sorter too. But I think that there are many challenges in getting it right, and efficiently. It's not a mere Codec trick IMO.

Also, note that as far as memory consumption for offline sorting goes, we only cache in memory the current posting list that's being sorted (the rest relies on the pre-existing random-access API).

But, I could be totally missing your idea for online sorting, in which case I'd appreciate if you elaborate how you think it can be done. But I prefer that we discuss that on #5817.

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Thanks so much for your insights Shai. I responded on #5817 and I added you to the watcher list there.

asfimport commented 11 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

Shai,

That way, one can use it to achieve per-segment sorting .. if it means something for anyone (perhaps for better compression).

Global ordering over a multi-segment index seems less useful than the local per-segment ordering - after all, the main use cases are compression improvements and early termination strategies, both of which operate on a per-segment basis.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch:

In SortingAtomicReaderTest I had to call _TestUtil.checkReader with crossCheckTermVectors=false. I put a comment why, but I'll repeat it here. Let's say that your permutation is [2,0,1] which means doc 0 goes to 2, 1 goes to 0 and 2 goes to 1. You then ask for termVectors(0), and receive, correctly, those of document 2. Now, when crossCheck validation occurs, it iterates on the terms of the TV, expecting to find each term associated with document 0 (since CheckIndex doesn't know about sorting reader, and that in fact it works now w/ doc #2). However, the terms of doc #2 are mapped, again - correctly, to document #1 (as the permutation specifies). Therefore CheckIndex fails since it expects doc 0 but receives doc 1.

Unless I misunderstand how this cross-checking should work, I think that it's ok to disable cross-checking on the reader, in this case. In SorterUtilTest, where I addIndexes, I do checkIndex with cross-checking, and there it passes (as it should).

All that's missing is CHANGES, but I'll let others review ...

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I agree that per-segment sorting may improve compression, but I don't see how it can help with early termination. If Lucene allowed you to terminate on a per-segment basis then maybe, but today (as far as I know), the only way to early terminate a query is by throwing an exception and catching it outside, right?

At any rate, per-segment sorting doesn't help much when segments are merged, because then you lose sorting again. Unless, we add some API to IW / SegmentMerger for you to control how the segments are merged...

asfimport commented 11 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Global ordering over a multi-segment index seems less useful than the local per-segment ordering - after all, the main use cases are compression improvements and early termination strategies, both of which operate on a per-segment basis.

Those are two "main use-cases", but there is a third that (to me) is far more important than either: locality of reference. Retrieving documents near each other (according to the configured sort order) should ideally have their stored fields data very close together. The early-termination use case is clearly important to you but is not the goal of #5817.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch adds support for indexes with deleted documents too. While doing that, I noticed that the implementation was slightly buggy: it only used an old2new mapping, which is important for remapping the posting lists, but it also had to use a new2old mapping for "random access" methods such as document(), termVectors() etc...

I enhanced the tests to cover these cases and now there's a:

I put a comment on Sorter.old2new that implementations must return a mapping for deleted documents too. It is important for SortingAtomicReader to operate correctly; however, these documents are dropped when the index is actually sorted.
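The relationship between the two mappings is a simple inverse permutation, sketched here (not the patch's actual code):

```java
/**
 * Sketch of the two mappings described above: old2new remaps doc IDs in
 * posting lists, while its inverse new2old serves "random access" methods
 * such as document() and termVectors(). The inverse is a single pass.
 */
class DocMaps {
  /** old2new[oldDoc] = newDoc; returns new2old such that new2old[newDoc] = oldDoc. */
  static int[] invert(int[] old2new) {
    int[] new2old = new int[old2new.length];
    for (int oldDoc = 0; oldDoc < old2new.length; oldDoc++) {
      new2old[old2new[oldDoc]] = oldDoc;
    }
    return new2old;
  }
}
```

Inverting twice gives back the original mapping, which makes it easy to test which direction a given method needs.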

I would like to give the tests some more runs (it seems that every time I thought I was done, a new seed uncovered another bug).

I already know that the tests cannot work w/ Lucene40 and Lucene41 codecs (because of SortedDV), but also with SepPostingsWriter (since it doesn't support indexing offsets, and I test that too). How can I disable it for the tests? I didn't see a SepCodec or something that I can add to SuppressCodecs...

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I already know that the tests cannot work w/ Lucene40 and Lucene41 codecs (because of SortedDV), but also with SepPostingsWriter (since it doesn't support indexing offsets, and I test that too). How can I disable it for the tests? I didn't see a SepCodec or something that I can add to SuppressCodecs...

Can you just add "MockSep", "MockFixedIntBlock", "MockVariableIntBlock" to the @SuppressCodecs (it can contain posting format names too...). Hmm maybe also "MockRandom".

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

I added to SuppressCodecs the ones Mike mentioned plus MockRandom since it also picks randomly SepWriter.

Improved some javadocs, more cleanups. Added CHANGES.

I think it's ready, would appreciate if someone can take another look.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Thanks for your work Shai. Indeed it looks really good now! Here are a few suggestions/questions:

+      // we cannot reuse the given DocsAndPositionsEnum because we return our
+      // own wrapper, and not all Codecs like it.

Maybe we could check if the docs enum to reuse is an instance of SortingDocsEnum and reuse its wrapped DocEnum?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Are there actual use-cases for sorting by stored fields or payloads? If not I think we should remove StoredFieldsSorter and PayloadSorter?

The original 3x implementation allowed sorting only by stored fields, so we thought if it was useful to someone then, it might be useful now. And sort by payload value was useful in the past, before DV days, and e.g. we still have code that stores sort-by values in payloads. I don't think it's harmful to keep PayloadSorter, but I don't mind removing it as well.

Remove IndexSorter.java

IndexSorter is a convenient utility for sorting a Directory end-to-end. Why remove it? On the contrary, I was thinking of making it more of an instance utility, i.e. you'd do something like new IndexSorter().sort(), and perhaps allow sorting a Directory in place (by opening an IW, deleteAll() and addIndexes(SortingReader)...).

make SortDoc package-private

SortDoc, while not visible in the API today, might be useful for Sorter implementations as a convenient way to create the old2new[]. Otherwise, if you need to sort both 'docs' and 'values', you need to work with something similar to SorterTemplate. I can remove it and implement sorting docs[] and values[] in each Sorter impl, or make Sorter an abstract class (parameterized with T extends Comparable&lt;T&gt;) which does the sorting for you?

Maybe we could check if the docs enum to reuse is an instance of SortingDocsEnum and reuse its wrapped DocEnum?

Yeah, I've been thinking about it while writing the comment. Since FilterDocsEnum doesn't expose 'in' in a public way, I was wondering if I should expose the wrapped 'in' in SortingXYZ just for that purpose. How important is it to reuse DocsEnum/DocsAndPositionsEnum?

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

or make Sorter an abstract class (parameterized with T extends Comparable&lt;T&gt;) which does the sorting for you?

Hmm, Sorter should not be parameterized since it only returns an old2new[] mapping. The interface doesn't care about the values at all.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Patch does:

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The current tests suppress a long list of codecs because some methods test full functionality including some newer index features.

I think if we are going to test this way, then the sorting reader itself should throw UnsupportedOperationException (refuse to sort segments from 3.x, 4.0, 4.1 codecs).

Because it's really unsupported/untested. Otherwise, we should pull these newer methods out.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Good idea. Is there a quick way to identify such indexes? Basically I need to iterate over leaves() and check version/Codec of each? Since we get an AtomicReader, what will leaves() return for a SlowWrapper?

asfimport commented 11 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Since we get an AtomicReader, what will leaves() return for a SlowWrapper?

Itself, because it's atomic :-) With a slow wrapper you have no chance to get the underlying codec information. With atomic readers implemented by SegmentReader there might be solutions.

asfimport commented 11 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Regarding PayloadSorter and StoredFieldsSorter I'm just afraid that the fact that they exist might make users think these are viable options...

IndexSorter is a convenient utility for sorting a Directory end-to-end. Why remove it?

I think taking an AtomicReader as an argument (instead of a Directory) and feeding an IndexWriter (instead of another Directory) would be much more flexible, but then it would just be a call to IndexWriter.addIndexes... If we want a utility to sort indexes, maybe it should rather be something callable from the command line? (java oal.index.sorter.IndexSorter fromDir toDir sortField)

Get rid of SortDoc. Sorter is now abstract class with a helper int[] compute(int[] docs, T[] values)

I think it's better! Maybe a List instead of an array would be even better, so that NumericDocValuesSorter could use a view over the doc values instead of loading all of them into memory?
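Adrien's List-view idea could look roughly like this, with an IntToLongFunction standing in for a NumericDocValues source (a hypothetical sketch, not the patch's code):

```java
import java.util.AbstractList;
import java.util.function.IntToLongFunction;

/**
 * Sketch of a read-only List view over per-document values: instead of
 * materializing all values into an array, each get(doc) pulls the value
 * on demand from the doc-values source.
 */
class DocValuesListView extends AbstractList<Long> {
  private final IntToLongFunction source; // stand-in for a NumericDocValues lookup
  private final int size;

  DocValuesListView(IntToLongFunction source, int size) {
    this.source = source;
    this.size = size;
  }

  @Override public Long get(int doc) { return source.applyAsLong(doc); }
  @Override public int size() { return size; }
}
```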

Reuse of DocsEnum looks great!

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Robert, now that I think about it again, I'm not sure SortingAtomicReader should do anything to prevent usage over such Codecs. The list of codecs that do not support indexing offsets is there because the test indexes offsets. But SortingAtomicReader itself doesn't care whether or not the Codec supports offsets: it will get -1 for offsets (that's what happened before I set IndexOptions correctly!). I think that's fine?

As for the new DV features, again, the reader itself is immune to this? I.e. you would get same exception if you opened a 4.0 index and called getSortedSetDV(i)?

I think it's fine that the reader doesn't yell at you, and I wish we could test those codecs somehow, ignoring the unsupported methods, but I also think that these tests are not ones that must test all Codecs. It's a generic filter reader, and it will yell at you just like any other filter reader will if you call unsupported API.

asfimport commented 11 years ago

Shai Erera (@shaie) (migrated from JIRA)

Maybe a List instead of an array would be even better

The thing is, I use two parallel arrays to sort the documents (docs and values), and SorterTemplate swaps entries in both arrays, so I think that kinda means we need to load everything into memory (certainly we cannot swap entries on disk)? But this is just the initial loading used to determine the doc order ... and also, someone could implement Sorter.oldToNew without loading anything into RAM, though I doubt anyone will.
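For illustration, here is one way to compute the old2new mapping from a values array. Unlike SorterTemplate, which swaps entries in both parallel arrays in place, this sketch sorts an index permutation (all names are made up):

```java
import java.util.Arrays;
import java.util.Comparator;

/**
 * Sketch: compute old2new from the sort-by values, by sorting an index
 * permutation rather than swapping the parallel arrays themselves.
 */
class ParallelSort {
  /** Returns old2new: oldDoc -> its position in value-sorted order. */
  static int[] old2new(long[] values) {
    Integer[] order = new Integer[values.length]; // order[newPos] = oldDoc
    for (int i = 0; i < order.length; i++) order[i] = i;
    Arrays.sort(order, Comparator.comparingLong(i -> values[i]));
    int[] old2new = new int[values.length];
    for (int newPos = 0; newPos < order.length; newPos++) {
      old2new[order[newPos]] = newPos;
    }
    return old2new;
  }
}
```

Either way the values have to be readable during the sort, which is Shai's point: the comparator needs random access to them, so in practice they end up in RAM.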

As for the IndexSorter utility, I thought about adding a main(), but then sortField is not enough (it only works for NumericDV). On the other hand, I do get your point that there's not much use for a Java API that takes Directory-ies ... and I don't want to add API that takes a reader/writer and only calls .addIndexes. So one option is to remove the class, but still keep a test around which does the addIndexes to make sure it works.

I don't want, however, to add a main that is limited to NumericDV ... and I do think that stored fields / payload values are viable options. At least stored field is; I can let go of the payload, if that really bothers anyone.