apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.61k stars 1.02k forks source link

Change Term to use bytes [LUCENE-2514] #3588

Closed asfimport closed 13 years ago

asfimport commented 14 years ago

in #3500, the sort order was changed to codepoint order.

unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16]. So MultiTermQuery, etc (especially its priority queues) are currently wrong.

By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using strange string encodings.


Migrated from LUCENE-2514 by Robert Muir (@rmuir), resolved Feb 28 2011 Attachments: LUCENE-2514_collatedrange.patch (versions: 3), LUCENE-2514_qp.patch, LUCENE-2514.patch (versions: 15), LUCENE-2514-MTQPagedBytes.patch (versions: 3), LUCENE-2514-surrogates-dance.patch Linked issues:

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

attached is one option: use bytesref behind the scenes but also support String ctors like we do today.

i tried the 'hard cutover' mike suggested, but this is a massive change and I think typically users will just be using String.

personally I don't see the harm in supporting Strings this way, any perf-sensitive stuff creating a lot of Term objects (e.g. MultiTermQuery) should be using BytesRef anyway.

one test fails: the preflex TestSurrogates... does this test create terms with unpaired surrogates? If so this would explain the failures I think.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

one test fails: the preflex TestSurrogates... does this test create terms with unpaired surrogates? If so this would explain the failures I think.

Oh man not that one!!

It is not supposed to create unpaired surrogates (it uses _TestUtil.makeRandomUnicodeString). I'll dig...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

its my fault. preflex codec still uses Term.compareTo (and this is expected) in several places.

so, i need to add back UTF8SortedAsUTF16Comparator to bytesref, and add a utf-16 compare to Term that uses it for the old behavior (or accomplish the equiv logic somewhere else)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

i tried to fix the preflex problem like this, but it didnt work... maybe something (sort or other) is still relying on the old utf-16 compareTo() that i didnt find????

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ahh sorry I do indeed seek to an unpaired high surrogate – easy to fix (pair it up w/ minimum low surrogate).

But, there's another problem, which is that the pre-flex codec uses Term.compareTo, and it needs that to be based on UTF16. We can just fix the preflex codec to use its own (UTF8inUTF16order) comparator. I think once we fix those two the test should pass again (crossing fingers...).

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch to fix (I think) the unpaired surrogate passed to Term... not yet tested.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

ok i combined my patch with yours, and fixed a typo (thats what i get for copy/paste)...

i think it might be ok

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Looks good! Might want to just add a doc on the Term constructor that the provided BytesRef is not copied, and should not be modified after construction (i.e. if you create a Term w/ the BytesRef provided by a TermsEnum or something, you're in trouble).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Might want to just add a doc on the Term constructor that the provided BytesRef is not copied, and should not be modified after construction (i.e. if you create a Term w/ the BytesRef provided by a TermsEnum or something, you're in trouble).

Yeah, I agree. I'd also like to improve some of the perf sensitive spots to never create strings at all before committing (e.g. MultiTermQuery rewrites).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

here's an updated patch with yonik's suggested note (might need wording changes).

Also i started converting some various code to use the bytes(), queries and such...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mike, can you review some of the preflex changes to bytes() in the patch? I wonder if in some of these places we even need 'scratchBytesRef' at all now...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

preflex changes look good.

I think with this we can eliminate the scratchBytesRef entirely in PreFlexFields and instead just use termEnum.term().bytes()!

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Robert: I can take MTQ tomorrow. I think we can remove the whole backwards stuff from MTQ and change completely to BytesRef (internally). This makes the steps TermsEnum (bytes) -> TermCollector -> TermQuery which converts all the time simplier. The collector abstract class in the MTQ rewrites will be much nicer. I can also remove the rest of pre-BoostAttribute stuff from TopTermsRewrite.

I will go to sleep now, tomorrow more...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe thanks, I would prefer if you did MTQ too.

I agree it should completely use BytesRef... i think it should also not create new Terms even until rewriting.

For example, currently the priority queue in TopTerms does BytesRef -> String conversion and creates a new Term for each add, but this might be entirely useless as it could fall off the pq, so i think its ScoreTerm or whatever should not hold term at all but just bytesref.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

For example, currently the priority queue in TopTerms does BytesRef -> String conversion and creates a new Term for each add, but this might be entirely useless as it could fall off the pq, so i think its ScoreTerm or whatever should not hold term at all but just bytesref

Exactly! We removed support for TermEnum (without s), so field name is never null. You can always take the field from the MTQ when building TermQueries. And for that we create the Term using new Term(field, BytesRef) or with the non-interning placeholder (see also below). This makes MTQ much simplier, I started to do it...

By the way: we could remove all String interning for field names now? We don't compare fields anymore?

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

By the way: we could remove all String interning for field names now? We don't compare fields anymore?

Yeah, I noticed that too... I think so.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

We also need to fix FieldCache/TermRangeQuery, since they now take separate String upper/lower. We could just add corresponding BytesRef methods?

By the way: we could remove all String interning for field names now? We don't compare fields anymore?

We should be careful with this!

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We also need to fix FieldCache/TermRangeQuery, since they now take separate String upper/lower. We could just add corresponding BytesRef methods?

I'll fix these. I didnt notice them since they don't use Term, but String directly. The patch is also generally incomplete in other ways... mainly I am searching on uses of Term.text() and such to ensure I do not introduce performance regressions.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here robert's patch with MTQ changed.

It currently still uses placeholderTerms to not need to intern every time. If we remove string interning from Term, we can replace this by simple new Term() in MTQ.

I delayed cloning of BytesRef until the BytesRef is put into a TermQuery or PQ or whenever it is set aside. But it no longer clones it e.g. if the term is never accepted by the PQ. Also the PQ reuses its ScoreTerm instances and so, the term bytes are simply copied over :-)

I also removed a Java 1.6 interface override - the Generics Policeman gives a ticket! I don't understand where those come from, Java 1.6 should also fail to compile as the ant build uses -source 1.5...?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I also removed a Java 1.6 interface override - the Generics Policeman gives a ticket! I don't understand where those come from, Java 1.6 should also fail to compile as the ant build uses -source 1.5...?

Sorry, i am on a mac right now and i dont think i configured it correctly... (though ant test never complained... this is wierd). Normallly my IDE does not generate this... but at the same time it is something we should fix the build for, as i think Eclipse will generate these by default if configured for Java 6, which solr uses.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

mainly I am searching on uses of Term.text() and such to ensure I do not introduce performance regressions.

I would temporary remove all String methods from Term and try to compile core. If this works you should find all perf regressions in both directions.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

attached is an updated patch, with uwe's changes, plus some additional conversions like TermsFilter and FieldCacheTermsFilter

The range ones are a bit tricky, mainly because they work with collators with makes no sense with byte[]. but if collator is null then byte[] makes sense.

the collator stuff is silly in a way, if we switch collation to byte[] it will use less ram than even the original String in lucene 3.x, and sort much faster.

one option might be to split the collating range stuff into its own classes or something, i think its a bit confusing how collation is mixed in with 'binary' order... it tricks you into thinking the 'default' is UCA or default locale or something, but is neither.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I would temporary remove all String methods from Term and try to compile core. If this works you should find all perf regressions in both directions.

i did this, but there is a lot of unrelated stuff that uses Term string methods and is perfectly fine. especially many tests and things like queryparser.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I would temporary remove all String methods from Term and try to compile core. If this works you should find all perf regressions in both directions.

i did this, but there is a lot of unrelated stuff that uses Term string methods and is perfectly fine. especially many tests and things like queryparser.

Yeah tests can use string methods, i meant only core classes should compile without Term's String methods.

one option might be to split the collating range stuff into its own classes or something, i think its a bit confusing how collation is mixed in with 'binary' order... it tricks you into thinking the 'default' is UCA or default locale or something, but is neither.

I was always thinking about factoring out collation stuff from TermRangeQuery. I would like to have a pure TermRangeQuery without any collations things and maybe a CollationTermRangeQuery or whatever...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yeah tests can use string methods, i meant only core classes should compile without Term's String methods.

Ok, i will have another look at this (ignoring tests and queryparser and such). Maybe i will find something interesting.

I was always thinking about factoring out collation stuff from TermRangeQuery. I would like to have a pure TermRangeQuery without any collations things and maybe a CollationTermRangeQuery or whatever...

yeah, i think this would be better in the future too. but for now, i can work with what we have. it just means lots of things like String lowerBound; /* only used when collator != null */ and such

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here an improvement of the MTQ only patch:

The auto MTQ rewrite mode now collects all terms into a PagedBytes until cutoff. This maybe better memory-wise, not sure if this is really needed. For me it was just some usage training :-) and the number of objects is lower, especially for large cutoff numbers.

Mike?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

edit: i added newlines and reformatted so the issue is readable. sorry.

We also need to fix FieldCache/TermRangeQuery, since they now take separate String upper/lower. We could just add corresponding BytesRef methods?

This actually makes the api ugly and problematic for ctors, because of open-ended values (null). it would be nice to avoid requiring users to cast here...


reference to TermRangeQuery is ambiguous, both method 
TermRangeQuery(java.lang.String,java.lang.String,java.lang.String,boolean,boolean) 
in org.apache.lucene.search.TermRangeQuery and method 
TermRangeQuery(java.lang.String,org.apache.lucene.util.BytesRef,
org.apache.lucene.util.BytesRef,boolean,boolean) 
in org.apache.lucene.search.TermRangeQuery match
    [javac]     TermRangeQuery query = new TermRangeQuery("content", null, null, true, true);

attached is my updates to TermRangeQuery etc before realizing this. it also includes updated FieldCacheRangeQuery complete with generics violations.

maybe when the policeman wakes up he will have ideas. i don't want the api to be ugly.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Improved version of PagedBytes MTQ cut off collector. It adds a method to PagedBytes.Reader to sequentially read all BytesRefs without a separate offset array.

Mike, if you are fine with that, we should add this to the global patch for this issue.

We should maybe also fix PagedBytes.freeze(boolean), as the parameter is currently unused. For the use case here, reallocating the last block is not really needed, it can stay as is. Maybe we should readd support for this parameter.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

This patch only adds a correct trim to PagesBytes.freeze(boolean). The impl was missing, but is important here for performance.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, thanks for the MTQ updates. i will merge them with my previous patch (not my latest broken one!)

After looking at RangeQuery/Filter, i think i would prefer to make a subtask to refactor the collation part out. For example, we could make CollatedRangeQuery.

In my opinion this is really very different from an ordinary binary-ordered range query and is confusing to both users and the code to be mixed in. I think we should consider fixing the API by splitting these and adding documentation to the migrate.txt

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

+1

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

updated patch:

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

ok i added a couple more optimizations to preflex, to avoid creating so many strings.

i still don't like all the conversions for dealing with terms: is the problem just the shared prefixes and surrogate dance? I wonder if we can do something tricky to avoid this.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Converted the rest of TestSurrogates (removed FieldAnText class, now direktly uses Term, as Term now supports BytesRef directly). Test respects verbose param.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Uwe. I added some additional changes:

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

the term vectors api should be in sync with Term, but still uses String. this patch switches them to bytesref too.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Updated patch with recent trunk commits (TestTV, TestSurrogates)

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We should commit this soon and solve the rest separately?

Robert? Mike?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't have a real computer for a few days, so take it if you want!

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I take it and will commit it tomorrow.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed this patch revision: 960484

I keep this open, as more improvements may be added (e.g. TermRangeQuery)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

With Term as byte, and tokenstreams can encode terms to byte however they want with TermToBytesRefAttribute, it makes sense for queryparsers to consume bytes like the indexer, and build terms without an intermediate String.

This way non-unicode terms (e.g. collation) work as expected.

This patch updates the queryparsers, except for contrib/queryparser (which will be more serious and cause API changes), and the range query building AnalyzingQueryParser (we need to fix TermRangeQuery first).

All tests pass.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'd like to commit this queryparser patch tomorrow if no one objects. Then I think we should look at range query, etc.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 to commit

This would also mean the BOCU-1 encoding could be used drop-in w/ QueryParser for basic (Term, Phrase) queries right?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This would also mean the BOCU-1 encoding could be used drop-in w/ QueryParser for basic (Term, Phrase) queries right?

Yes, they should then work (or there is a bug!)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Committed LUCENE-2514_qp.patch revision 966254

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

In order to move forward with collation-keys-as-byte and other improvements, we need to fix TermRangeQuery. But this is difficult when the String-only Collation support exists mixed with the byte-order TermRangeQuery...

As discussed previously on this issue, here is a patch that splits this into a separate CollatedTermRangeQuery/Filter

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene alltogether or at least into contrib/queries.

we could make things even better by removing queryparser's get/setRangeCollator method. instead in its place, it could have something like a boolean 'analyzeRangeQueries' ? it could then analyze the endpoints (producing byte collation keys) and use a regular fast term range query.

I think its good to support collation order for people who want it, but we should make it easy to do things the fast way, right now we make it easy to do things the slow way and hard to do it fast.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene alltogether or at least into contrib/queries.

+1

I agree we have it backwards now. The "obvious" approach should be the performant one.