apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.65k stars 1.03k forks source link

Remove deprecated BytesRef#getUTF8SortedAsUTF16Comparator(); remove natural comparator in favour of Java 8 one [LUCENE-7053] #8109

Closed asfimport closed 8 years ago

asfimport commented 8 years ago

Followup from #8108: This removes the legacy, deprecated getUTF8SortedAsUTF16Comparator() in the BytesRef class. I know originally we added the different comparators to be able to allow the index term dict to be sorted in different order. This never proved to be useful, as many Lucene queries rely on the default order. The only codec that used another byte order internally was the Lucene 3 one (but it used the unicode spaghetti algorithm to reorder its term enums at runtime).

This patch also removes the BytesRef-Comparator completely and just implements compareTo. So all code can rely on natural ordering.

This patch also cleans up other usages of natural order comparators, e.g. in ArrayUtil, because Java 8 natively provides a comparator.


Migrated from LUCENE-7053 by Uwe Schindler (@uschindler), resolved Feb 29 2016 Attachments: LUCENE-7053.patch (versions: 4) Linked issues:

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Patch (copy from LUCENE-7052).

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

There is a bit code duplication in both tests (sorting Strings in code point order), should we maybe move the new comparator to TestUtil?

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 to the patch.

You can also fix TestUnicodeUtil's custom String -> int[] code points logic maybe?

There is a bit code duplication in both tests (sorting Strings in code point order), should we maybe move the new comparator to TestUtil?

+1

We can take this further, e.g. I grep'd for places calling BytesRef.getUTF8SortedAsUnicodeComparator and it turns up silliness in BlockTermsReader that should just be invoking BytesRef.compareTo directly instead, I think?

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We can take this further, e.g. I grep'd for places calling BytesRef.getUTF8SortedAsUnicodeComparator and it turns up silliness in BlockTermsReader that should just be invoking BytesRef.compareTo directly instead, I think?

Yeah. As said, we may not remove the comparator completely, but we should only use it at places where we can't use Comparable<BytesRef> interface that BytesRef implements.

You can also fix TestUnicodeUtil's custom String -> int[] code points logic maybe?

Will check this, too. I am currently investigating it Java 8 already has some Comparator interface somewhere ready-to use. But does not look like that.

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

New patch with Mike's suggestions. I added a comparator static final constant to TestUtil with a warning that it is not as effective as it could be (@dweiss comment on LUCENE-7052).

I did not yet look into further cleaning up and removing comparator usage.

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

New patch:

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

As we implemented compareTo we could remove the comparator completely. One could use Comparator.naturalOrder() (see https://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html#naturalOrder--) instead (naturalOrder is defined to use compareTo. At places like Collections.sort() we could remove the comparator argument completely.

Any comments on this?

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yes, please, and remove BytesRef.COMPARATOR which just duplicates that: naturalOrder() already returns a singleton.

Also in cases like TreeSet creation in the join tests, we should just make new TreeSet<>() and not pass any comparator in at all.

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here is the comparator cleanup. I removed usage of the BytesRef comparator completely (where possible). Otherwise I used Comparator.naturalOrder().

I also removed the ArrayUtil.naturalComparator() method, because it is obsolete with Java 8.

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

All tests pass.

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit f48d23cd1448f20fb1b97ec986ded76a04a7075c in lucene-solr's branch refs/heads/master from @uschindler https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=f48d23c

LUCENE-7053: Remove custom comparators from BytesRef class and solely use natural byte[] comparator throughout codebase. It also replaces the natural comparator in ArrayUtil by Java 8's Comparator#naturalOrder().

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 8ffa436f00d24cb45af49160739f71b3654349ce in lucene-solr's branch refs/heads/master from @uschindler https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8ffa436

LUCENE-7053: Move comparator to better place in code; generalize to use CharSequence instead of String

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @uschindler!

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 3c27980c4ae7777716ba74b3a0e2c70b3dd1c1d4 in lucene-solr's branch refs/heads/master from @uschindler https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3c27980

LUCENE-7053: Simplify code to work around Java 8u25 compiler bug