apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.65k stars 1.03k forks source link

BytesRefHash.sort should always sort in unicode code point order [LUCENE-7052] #8108

Closed asfimport closed 8 years ago

asfimport commented 8 years ago

Today BytesRefHash.sort takes a custom Comparator but we always pass it BytesRef.getUTF8SortedAsUnicodeComparator().


Migrated from LUCENE-7052 by Michael McCandless (@mikemccand), resolved Feb 27 2016 Attachments: LUCENE-7052.patch, LUCENE-7052-cleanup1.patch (versions: 2) Linked issues:

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Simple patch.

asfimport commented 8 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Are you for simplifying the code? Because I doubt it'll be any speed improvement.

asfimport commented 8 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 for the simplification

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Yeah, just shrinking the API.

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Mike, why did you add an implementation of codePoints() instead of using the CharSequence version (returning IntStream) + toArray()? The test passes for me with this patch:

diff --git a/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java b/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
index 50d921b..c3a58ff 100644
--- a/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
+++ b/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
`@@` -168,15 +168,6 `@@` public class TestBytesRefHash extends LuceneTestCase {
     }
   }

-  private static int[] codePoints(String input) {
-    int length = Character.codePointCount(input, 0, input.length());
-    int word[] = new int[length];
-    for (int i = 0, j = 0, cp = 0; i < input.length(); i += Character.charCount(cp)) {
-      word[j++] = cp = input.codePointAt(i);
-    }
-    return word;
-  }
-
   /**
    * Test method for
    * {`@link` org.apache.lucene.util.BytesRefHash#sort()}.
`@@` -191,8 +182,8 `@@` public class TestBytesRefHash extends LuceneTestCase {
       SortedSet<String> strings = new TreeSet<>(new Comparator<String>() {
           `@Override`
           public int compare(String a, String b) {
-            int[] aCodePoints = codePoints(a);
-            int[] bCodePoints = codePoints(b);
+            int[] aCodePoints = a.codePoints().toArray();
+            int[] bCodePoints = b.codePoints().toArray();
             for(int i=0;i<Math.min(aCodePoints.length, bCodePoints.length);i++) {
               if (aCodePoints[i] < bCodePoints[i]) {
                 return -1;
asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Neither is actually pretty as the treeset invokes a comparator multiple times for the same string, causing multiple identical string-int[] conversions along the way. This is test-method only though, so it doesn't matter much.

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Mike, I know originally we added the different comparators to be able to allow the index term dict to be sorted in different order. This never prooved to be useful, as many Lucene queries rely on the default order. The only codec that used another byte order internally was the Lucene 3 one (but it used the unicode spaghetti algorithm to reorder its term enums at runtime). As this is now all gone, I'd suggest to also remove the utf8AsUtf16 comparator. Mabye remove the comparators at all and just implement BytesRef.compareTo() and use that one for sorting?

I checked the code: utf8SortedAsUTF16SortOrder is only used in TSTLookup nowhere else anymore (except some test that check alternative sorts - those can be removed).

As a first step I changed the BytesRef code to no longer use inner classes and instead use a lambda to define the comparators. But I'd suggest to remove at least the UTF-16 one completely and move it as private impl detail to TSTLookup (as only used there).

FYI: The lambda has no speed impact because it is called only once and internally compiles to a class file that implements Comparator. It just looks nicer than the horrible comparator classes

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Oh this was already committed. Maybe open another issue to remove the obsolete comparator and convert to lambda.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Mike, why did you add an implementation of codePoints() instead of using the CharSequence version (returning IntStream) + toArray()?

Oh, because I didn't even know about CharSequence.codePoints!

+1 to your patch, thanks.

Neither is actually pretty as the treeset invokes a comparator multiple times for the same string, causing multiple identical string-int[] conversions along the way. This is test-method only though, so it doesn't matter much.

It's definitely inefficient (converting to a sortable key on every comparison), but it keeps the code simple, which I think is usually the right tradeoff for a test case?

As this is now all gone, I'd suggest to also remove the utf8AsUtf16 comparator. Mabye remove the comparators at all and just implement BytesRef.compareTo() and use that one for sorting?

+1, that sounds awesome!

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

+1 to @sarowe change. We are on Java 8, so use Java 8 methods.

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Mike, should I open a new issue for the comparator cleanups. I have the patch almost ready?

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

TestUnicodeUtil.testUTF8toUTF32 (and I'm sure other places) can also cutover to CharSequence.codePoints.

asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Mike, should I open a new issue for the comparator cleanups. I have the patch almost ready?

Yes please!!

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Mike, here is a patch, moving the UTF-16 comparator away and clean up code.

I also changed the TreeSet comparator to a lambda using CharSequence#codePoints() because I hit the same in another test :-)

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I wil open now other issue with the attached patch. As the @sarowe suggestion is implemented there, we can leave THIS issue closed.

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

New issue: #8109

asfimport commented 8 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I'm confused ... this issue is marked resolved, and Uwe said "Oh this was already committed." but no commits are linked to it, not does it have any subtasks.

can someone clarify when/how this was resolved?

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

AFAICT Mike forgot to include the JIRA in his log message for this commit (from the email sent to commits@l.a.o):

Repository: lucene-solr
Updated Branches:
 refs/heads/master 70440bbbd -> 126ac9a5f

BytesRefHash.sort always sorts in unicode order

Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/126ac9a5
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/126ac9a5
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/126ac9a5

Branch: refs/heads/master
Commit: 126ac9a5fe00fbbc6870ef25ae3fc6af6cd7c557
Parents: 70440bb
Author: Mike McCandless <mikemccand@apache.org>
Authored: Sat Feb 27 10:26:20 2016 -0500
Committer: Mike McCandless <mikemccand@apache.org>
Committed: Sat Feb 27 10:26:20 2016 -0500
asfimport commented 8 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Sorry, yeah I forgot the issue number! Thanks for linking it @sarowe!