apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.59k stars 1.01k forks source link

Re-add SorterTemplate and use it to provide fast ArraySorting and replace BytesRefHash sorting [LUCENE-2719] #3793

Closed asfimport closed 13 years ago

asfimport commented 13 years ago

This patch adds back an optimized and rewritten SorterTemplate back to Lucene (removed after release of 3.0). It is of use for several components:

SorterTemplate provides quickSort and mergeSort algorithms. Internally for short arrays, it automatically chooses insertionSort (like JDK's Arrays). The quickSort algorith was copied modified from old BytesRefHash. This new class only shares MergeSort with the original CGLIB SorterTemplate, which is no longer maintained.


Migrated from LUCENE-2719 by Uwe Schindler (@uschindler), resolved Oct 27 2010 Attachments: LUCENE-2719.patch (versions: 5), LUCENE-2719-CollSupport.patch (versions: 3), LUCENE-2719-final-3x.patch, LUCENE-2719-final-trunk.patch Linked issues:

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Attached you find the patch. Robert offered to do benchmarks with Automaton.

The patch can be applied to a clean checkout, you no longer need to svn copy old SorterTemplate, as this is a almost complete rewrite.

This patch removes the CHANGES.txt entry for 3.x, as it readds the class. If we don't merge this to 3.x, the CHANGES should be reverted. As Lucene uses Arrays.sort(Object[]) which is slow at other places, I recommend to add it also to 3.x.

Please test the stuff with large -Dtests.multiplier! Maybe also verify my modified quickSort!

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Using this class we can look for more useless quickSort code duplication. One is e.g. in DocFieldProcessorPerThread. Maybe more of them.

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

This also applies to Solr (Yonik?).

Yep - Solr also has it's own version of quicksort - PrimUtils.sort() to deal with sorting indexes (an int[]) instead of objects (parallel array sorting).

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The quickSort in DocFieldProcessorPerThread also converted to ArrayUtil.quickSort().

Yonik: I will take care of PrimUtils!

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

New patch:

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is great Uwe!

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Modified all sorts to use >>>1 instead /2 (have no real opinion about that).

Yep - Solr also has it's own version of quicksort - PrimUtils.sort() to deal with sorting indexes (an int[]) instead of objects (parallel array sorting).

According to java.util.Arrays javadoc: The sorting algorithm is a tuned quicksort, adapted from Jon L. Bentley and M. Douglas McIlroy's "Engineering a Sort Function", Software-Practice and Experience, Vol. 23(11) P. 1249-1265 (November 1993). This algorithm offers n*log(n) performance on many data sets that cause other quicksorts to degrade to quadratic performance.

If I change to SorterTemplate, we will degrade to good old quicksort. We could also upgrade SorterTemplate to use this algo, but I am not sure if thats easy because SorterTemplate only allows swap(index, index) and compare(index, index). But we cannot retrieve e.g. the pivot value. This was also one problem in porting BytesRefHash's quicksort, as it used the value of the pivot (this is why there is an additional check below the commented out assert in the current patch's algo).

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I tested indexing first 10M wikipedia 1KB docs and running various queries and found no real perf change, or at least less than the noise in the test...

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Latest updates.

There is also an additional patch which provides CollectionUtil, now also supporting in-place collection sorts which is much more perfromant for smaller collections. Collections.sort() in JDK does copy the List into array and calls Arrays.sort() which itsself does clone the array and after that copies the arraycontents using a ListIterator back to the List.

Before committing this, can somebody look into the o.a.l.index package, because I replaced some sorts for field names there. For commit points i used MergeSort, as I am not sure if it should be stable.

So just a confirmation, if we need stable sort for Indexer and Commit points would be fine.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Robert tested yesterday and found out that SorterTemplate.quickSort is not as efficient as it could be. The general problem is:

I changed SorterTemplate to look more like FieldComparator known from search. It now has not only swap(index1,index2) and compare(index1,index2), it also gets setPivot(index) [stores index' value as pivot] and comparePivot(index) [compares given index' value with previously stored pivot value]. Now the quicksort algorithm is identical to the one seen everywhere in Lucene before. We can now also implement the optimized one from harmony also seen in Solr's PrimUtil. I will look into this, if it makes sense (it makes not always sense as comparing and swapping is more intensive for non-native values!).

This has also some improvements to BytesRefHash, as there are less de-references of BytesRefs, because the main quickSort loop only compares an index with the in setPivot dereferenced BytesRefs. Before it did this on every compare step!

Robert: Can you supply your benchmark? Or test again :-)

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

After spending the evening with performance tests on BytesRefHash and Fuzzy automatons I cam to the following conclusion, finalized in this hopefully last patch:

public void testSortPerf() {
  long start = System.currentTimeMillis();
  BytesRef ref = new BytesRef();
  for (int j = 0; j < 200 * RANDOM_MULTIPLIER; j++) {
    for (int i = 0; i < 1797; i++) {
      String str;
      do {
        str = _TestUtil.randomRealisticUnicodeString(random, 1000);
      } while (str.length() == 0);
      ref.copy(str);
      hash.add(ref);
    }
    hash.sort(BytesRef.getUTF8SortedAsUTF16Comparator());
    hash.clear();
    hash.reinit();
  }
  System.out.println("time: "+(System.currentTimeMillis()-start));
}

I will commit this patch, which now also makes insertionSort public in SorterTemplate, ArrayUtil and CollectionUtil tomorrow. I tend to also commit this to 3.x (merged to BytesRefHash-similar class from 3.x). This is why the CHANGES.txt removes the SorterTemplate removal message (may need to be modified, because SorterTemplate changed API). If we will only commit to trunk, CHANGES would keep unchanged.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

After Robert mentioned the strange comparator in the above benchmark: It is just a leftover from the original testSort() test which needed that special order, because it compared the sorted BytesRefHash using a TreeSet of UTF16 strings. For the benchmark the comparator has no real effect.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Final patch, will get committed... now. It adds some contrib changes and changes.txt/notice.txt and javadocs.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed trunk revision: 1027998

Now working on 3.x

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

filterdiff'ed patch for 3.x branch - we need that for commit mails, too. The changes in BytesRefHash are merged over to TermsHashPerField. This patch also removes useless synchronization!

After this also 3.x gets the imporved terms sorting and reduced code duplication.

I will commit soon.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed 3.x branch revision: 1028042

Thanks to all for performance testing!

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Bulk close for 3.1