apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.59k stars 1.01k forks source link

Replace our TimSort with JDK TimSort [LUCENE-9632] #10671

Open asfimport opened 3 years ago

asfimport commented 3 years ago

JDK 11 has a TimSort with very similar lineage to our TimSort (written by Josh Bloch, based on Tim Peter's python list sort), we should consider replacing our fork with a version that we don't have to maintain.


Migrated from LUCENE-9632 by Mike Drob (@madrob), updated Jan 23 2021

asfimport commented 3 years ago

Mike Drob (@madrob) (migrated from JIRA)

@uschindler - is there some historical subtlety that I am not aware of that should prevent us from doing this?

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Not sure why you ask me, but the Timsort implementation is not only used for standard sorting of lists and arrays. Lucene has its own sorting infrastructure to allow sorting multiple arrays and lists in parallel (the first array is sorted and the parallel ones are getting same element swaps like the first one). With plain Java apis that's not possible.

In addition: Arrays/Collections.sort() makes clones of the objects instead of sorting in place.

asfimport commented 3 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 to drop the ArrayUtil/CollectionUtil tim sorts, but like Uwe said we still need TimSorter.

In addition: Arrays/Collections.sort() makes clones of the objects instead of sorting in place.

I know it used to be true for Collections.sort() but I don't think this is the case anymore, at least not with ArrayList.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Actually the ArrayUtils and CollectionUtils timSort methods are not special, they just use the Sorter interface implementation TimSorter to sort the array. So removing them is not removing any real code, just some getters/setters for array elements :-) If we remove them, we should also remove all other sorting methods like introSort or others!? We won't do this because every sort algorithm is used for different purposes in Lucene!

So I tend to keep those utility methods, as we exactly know what sort algorithm is used and the code is tested. -1 to remove.

asfimport commented 3 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Actually the availability of these methods are good to test the TimSorter implementation of the Sorter interface in Lucene for correctness and compare them to JDK.

asfimport commented 3 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I just found a new use for IntroSorter - please don't remove! As for the util methods, they're not doing any harm? I'd be -0 to migrate JDK impl for those cases where we can; I mean sure, why not, but on the other hand if it's not broke, why fix it? The test coverage is a benefit, although we could always find another way, but like Uwe points out, we wouldn't really be saving any maintenance costs, since we still need to maintain the core algorithm for the parallel sorting case.

asfimport commented 3 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

We should at least run luceneutil to check the performance impact here before switching?  Sorting is such a heavily used and finely tuned part of Lucene that it could have a needle moving impact, even if the core big-oh is the same since it is the same fundamental algorithm.

asfimport commented 3 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Last I saw, we have these sort implementations but not enough javadocs on the circumstances in which they are best used (or not used), and furthermore why we even have them at all (relative to JDK impls).