Closed asfimport closed 8 years ago
Michael McCandless (@mikemccand) (migrated from JIRA)
Simple patch.
Robert Muir (@rmuir) (migrated from JIRA)
+1
Dawid Weiss (@dweiss) (migrated from JIRA)
Are you for simplifying the code? Because I doubt it'll be any speed improvement.
Adrien Grand (@jpountz) (migrated from JIRA)
+1 for the simplification
Michael McCandless (@mikemccand) (migrated from JIRA)
Yeah, just shrinking the API.
Steven Rowe (@sarowe) (migrated from JIRA)
Mike, why did you add an implementation of codePoints() instead of using the CharSequence version (returning IntStream) + toArray()? The test passes for me with this patch:
diff --git a/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java b/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
index 50d921b..c3a58ff 100644
--- a/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
+++ b/lucene/core/src/test/org/apache/lucene/util/TestBytesRefHash.java
`@@` -168,15 +168,6 `@@` public class TestBytesRefHash extends LuceneTestCase {
}
}
- private static int[] codePoints(String input) {
- int length = Character.codePointCount(input, 0, input.length());
- int word[] = new int[length];
- for (int i = 0, j = 0, cp = 0; i < input.length(); i += Character.charCount(cp)) {
- word[j++] = cp = input.codePointAt(i);
- }
- return word;
- }
-
/**
* Test method for
* {`@link` org.apache.lucene.util.BytesRefHash#sort()}.
`@@` -191,8 +182,8 `@@` public class TestBytesRefHash extends LuceneTestCase {
SortedSet<String> strings = new TreeSet<>(new Comparator<String>() {
`@Override`
public int compare(String a, String b) {
- int[] aCodePoints = codePoints(a);
- int[] bCodePoints = codePoints(b);
+ int[] aCodePoints = a.codePoints().toArray();
+ int[] bCodePoints = b.codePoints().toArray();
for(int i=0;i<Math.min(aCodePoints.length, bCodePoints.length);i++) {
if (aCodePoints[i] < bCodePoints[i]) {
return -1;
Dawid Weiss (@dweiss) (migrated from JIRA)
Neither is actually pretty as the treeset invokes a comparator multiple times for the same string, causing multiple identical string-int[] conversions along the way. This is test-method only though, so it doesn't matter much.
Uwe Schindler (@uschindler) (migrated from JIRA)
Hi Mike, I know originally we added the different comparators to be able to allow the index term dict to be sorted in different order. This never prooved to be useful, as many Lucene queries rely on the default order. The only codec that used another byte order internally was the Lucene 3 one (but it used the unicode spaghetti algorithm to reorder its term enums at runtime). As this is now all gone, I'd suggest to also remove the utf8AsUtf16 comparator. Mabye remove the comparators at all and just implement BytesRef.compareTo() and use that one for sorting?
I checked the code: utf8SortedAsUTF16SortOrder is only used in TSTLookup nowhere else anymore (except some test that check alternative sorts - those can be removed).
As a first step I changed the BytesRef code to no longer use inner classes and instead use a lambda to define the comparators. But I'd suggest to remove at least the UTF-16 one completely and move it as private impl detail to TSTLookup (as only used there).
FYI: The lambda has no speed impact because it is called only once and internally compiles to a class file that implements Comparator. It just looks nicer than the horrible comparator classes
Uwe Schindler (@uschindler) (migrated from JIRA)
Oh this was already committed. Maybe open another issue to remove the obsolete comparator and convert to lambda.
Michael McCandless (@mikemccand) (migrated from JIRA)
Mike, why did you add an implementation of codePoints() instead of using the CharSequence version (returning IntStream) + toArray()?
Oh, because I didn't even know about CharSequence.codePoints
!
+1 to your patch, thanks.
Neither is actually pretty as the treeset invokes a comparator multiple times for the same string, causing multiple identical string-int[] conversions along the way. This is test-method only though, so it doesn't matter much.
It's definitely inefficient (converting to a sortable key on every comparison), but it keeps the code simple, which I think is usually the right tradeoff for a test case?
As this is now all gone, I'd suggest to also remove the utf8AsUtf16 comparator. Mabye remove the comparators at all and just implement BytesRef.compareTo() and use that one for sorting?
+1, that sounds awesome!
Uwe Schindler (@uschindler) (migrated from JIRA)
+1 to @sarowe change. We are on Java 8, so use Java 8 methods.
Uwe Schindler (@uschindler) (migrated from JIRA)
Mike, should I open a new issue for the comparator cleanups. I have the patch almost ready?
Michael McCandless (@mikemccand) (migrated from JIRA)
TestUnicodeUtil.testUTF8toUTF32
(and I'm sure other places) can also cutover to CharSequence.codePoints
.
Michael McCandless (@mikemccand) (migrated from JIRA)
Mike, should I open a new issue for the comparator cleanups. I have the patch almost ready?
Yes please!!
Uwe Schindler (@uschindler) (migrated from JIRA)
Hi Mike, here is a patch, moving the UTF-16 comparator away and clean up code.
I also changed the TreeSet comparator to a lambda using CharSequence#codePoints()
because I hit the same in another test :-)
Uwe Schindler (@uschindler) (migrated from JIRA)
I wil open now other issue with the attached patch. As the @sarowe suggestion is implemented there, we can leave THIS issue closed.
Uwe Schindler (@uschindler) (migrated from JIRA)
New issue: #8109
Chris M. Hostetter (@hossman) (migrated from JIRA)
I'm confused ... this issue is marked resolved, and Uwe said "Oh this was already committed." but no commits are linked to it, not does it have any subtasks.
can someone clarify when/how this was resolved?
Steven Rowe (@sarowe) (migrated from JIRA)
AFAICT Mike forgot to include the JIRA in his log message for this commit (from the email sent to commits@l.a.o):
Repository: lucene-solr
Updated Branches:
refs/heads/master 70440bbbd -> 126ac9a5f
BytesRefHash.sort always sorts in unicode order
Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/126ac9a5
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/126ac9a5
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/126ac9a5
Branch: refs/heads/master
Commit: 126ac9a5fe00fbbc6870ef25ae3fc6af6cd7c557
Parents: 70440bb
Author: Mike McCandless <mikemccand@apache.org>
Authored: Sat Feb 27 10:26:20 2016 -0500
Committer: Mike McCandless <mikemccand@apache.org>
Committed: Sat Feb 27 10:26:20 2016 -0500
Michael McCandless (@mikemccand) (migrated from JIRA)
Sorry, yeah I forgot the issue number! Thanks for linking it @sarowe!
Today
BytesRefHash.sort
takes a customComparator
but we always pass itBytesRef.getUTF8SortedAsUnicodeComparator()
.Migrated from LUCENE-7052 by Michael McCandless (@mikemccand), resolved Feb 27 2016 Attachments: LUCENE-7052.patch, LUCENE-7052-cleanup1.patch (versions: 2) Linked issues:
8109