JoinUtil support for NUMERIC docValues fields [LUCENE-5868]

asfimport commented 10 years ago

While polishing SOLR-6234, I found that JoinUtil can't join on int docValues fields, at least. I plan to provide a test and a patch. It might be important because Solr's join can do that. Please vote if you care!

Migrated from LUCENE-5868 by Mikhail Khludnev (@mkhludnev), 1 vote, resolved Dec 10 2015

Attachments: LUCENE-5868.patch (versions: 6), LUCENE-5868-5x.patch (versions: 2), LUCENE-5868-lambdarefactoring.patch (versions: 2), qtj.diff

Linked issues:

- SOLR-8395
- SOLR-6234

asfimport commented 9 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

It seems nobody cares. Closing for now.

asfimport commented 9 years ago

marc schipperheyn (migrated from JIRA)

I'd vote for it.

asfimport commented 9 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

marc schipperheyn, what's your use case? Why can't you use SORTED? Do you need to join across cores? Have you had a look at the OrdinalMap join?

asfimport commented 8 years ago

Alexey Zelin (migrated from JIRA)

Added numeric type processing to the join query build for the single-valued case with int values. See the attached qtj.diff.

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

Alexey Zelin, the patch makes sense. Please pay attention to the following points:

http://wiki.apache.org/lucene-java/HowToContribute

and save them into the LUCENE-NNNN.patch file.

> Read the patch file. Make sure it includes ONLY the modifications required to fix a single issue.

- I suppose we need to cover all existing cases, i.e. the scope of the issue should include: TermsCollector.MV, TermsCollector.SV, TermsWithScoreCollector.MV, TermsWithScoreCollector.MV.Avg, TermsWithScoreCollector.SV, TermsWithScoreCollector.SV.Avg, ... yep, too many, I see.
- As an idea to avoid copy-paste, we can bridge the different DV types: NumericDocValues can be adapted to BinaryDocValues.
  - Such an adapter can reuse a BytesRefBuilder (given that BytesRefHash copies bytes).
  - The same approach can be used to adapt SortedNumericDocValues to SortedSetDocValues.
- I suppose it's OK to keep it lenient: silently allow users to shoot themselves in the foot by having different DV types across segments.
- As I understand it, TestJoinUtilInt is just a first draft. I suppose it's worth carefully expanding the existing tests:
  - In TestJoinUtil.testSimple() and testSimpleWithScoring(), you can add from_num and to_num fields to the sample docs and randomly switch between these fields when passing them into createJoinQuery().
  - In TestJoinUtilInt you are trying to create numeric DV with setDocValuesType(DocValuesType.NUMERIC); I don't believe it works, and it's handled by UnInvertingReader at run time. So I suggest adding NumericDocValuesField and SortedNumericDocValuesField (for the MV case) explicitly. But let's randomly switch to the existing approach (just add an indexed field and rely on UnInvertingReader) for smoke testing.
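
The adapter idea from the points above (expose numeric values as bytes through one reused buffer, and rely on the collecting side to copy) can be sketched roughly as follows. This is a toy model, not Lucene code: `NumericSource`, `BinarySource` and `collect` are hypothetical stand-ins, and plain big-endian encoding is used here in place of whatever term encoding the real patch needs.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy stand-ins for illustration only; these are NOT the Lucene classes. */
class NumericAsBinarySketch {

  /** Per-document long values (plays the role of a numeric doc-values source). */
  interface NumericSource { long get(int docID); }

  /** Per-document byte values (plays the role of a binary doc-values source). */
  interface BinarySource { byte[] get(int docID); }

  /** Adapts a numeric source to a binary one, reusing a single scratch buffer. */
  static BinarySource numericAsBinary(NumericSource numeric) {
    final byte[] scratch = new byte[Long.BYTES];
    return docID -> {
      // Assumption: big-endian bytes; this overwrites the shared buffer on every call.
      ByteBuffer.wrap(scratch).putLong(numeric.get(docID));
      return scratch;
    };
  }

  /** Collects distinct terms; copies the bytes because the adapter reuses its buffer. */
  static Set<ByteBuffer> collect(BinarySource source, int maxDoc) {
    Set<ByteBuffer> terms = new HashSet<>();
    for (int doc = 0; doc < maxDoc; doc++) {
      byte[] copy = Arrays.copyOf(source.get(doc), Long.BYTES);
      terms.add(ByteBuffer.wrap(copy));
    }
    return terms;
  }

  public static void main(String[] args) {
    long[] values = {7L, 42L, 7L};
    Set<ByteBuffer> terms = collect(numericAsBinary(d -> values[d]), values.length);
    System.out.println(terms.size() + " distinct terms");
  }
}
```

Reusing the scratch buffer is only safe because the collecting side copies the bytes before storing them; if it kept the returned array itself, every stored term would end up holding the last value seen.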

I'll handle the following as separate issues:

- extending TestScoreJoinQPScore.testSimpleWithScoring() for coverage
- extending TestJoinUtil.test*ValueRandomJoin() for coverage

Besides the patch, for further consideration: if we could pass field types into JoinUtil through something like Solr's Schema/FieldTypes, such an issue would be resolved automatically.

asfimport commented 8 years ago

Alexey Zelin (migrated from JIRA)

Added support for long values as well as multi-valued fields.

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

I decided to attach a wild lambda refactoring first. There is no functional change. @martijnvg and all lambda fans, you are kindly invited to have a look at LUCENE-5868-lambdarefactoring.patch.

LUCENE-5868-lambdarefactoring.patch

```diff Index: lucene/join/src/java/org/apache/lucene/search/join/AbstractTermsCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/AbstractTermsCollector.java (revision 0) +++ lucene/join/src/java/org/apache/lucene/search/join/AbstractTermsCollector.java (working copy) @@ -0,0 +1,133 @@ +package org.apache.lucene.search.join; + +import java.io.IOException; +import java.util.function.Consumer; +import java.util.function.LongConsumer; + +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.index.LeafReader; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.NumericDocValues; +import org.apache.lucene.index.SortedNumericDocValues; +import org.apache.lucene.index.SortedSetDocValues; +import org.apache.lucene.search.SimpleCollector; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; +import org.apache.lucene.util.NumericUtils; + +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +abstract class AbstractTermsCollector extends SimpleCollector { + + @FunctionalInterface + static interface Function { + R apply(LeafReader t) throws IOException ; + } + + protected DV docValues; + private final Function docValuesCall; + + public AbstractTermsCollector(Function docValuesCall) { + this.docValuesCall = docValuesCall; + } + + @Override + protected final void doSetNextReader(LeafReaderContext context) throws IOException { + docValues = docValuesCall.apply(context.reader()); + } + + static Function binaryDocValues(String field) { + return (ctx) -> DocValues.getBinary(ctx, field); + } + static Function sortedSetDocValues(String field) { + return (ctx) -> DocValues.getSortedSet(ctx, field); + } + + static Function numericAsBinaryDocValues(String field, NumericType numTyp) { + return (ctx) -> { + final NumericDocValues numeric = DocValues.getNumeric(ctx, field); + final BytesRefBuilder bytes = new BytesRefBuilder(); + + final LongConsumer coder = coder(bytes, numTyp); + + return new BinaryDocValues() { + @Override + public BytesRef get(int docID) { + final long lVal = numeric.get(docID); + coder.accept(lVal); + return bytes.get(); + } + }; + }; + } + + static LongConsumer coder(BytesRefBuilder bytes, NumericType type){ + switch(type){ + case INT: + return (l) -> NumericUtils.intToPrefixCoded((int)l, 0, bytes); + case LONG: + return (l) -> NumericUtils.longToPrefixCoded(l, 0, bytes); + default: + throw new IllegalArgumentException("Unsupported "+type+ + ". Only "+NumericType.INT+" and "+NumericType.LONG+" are supported." + // sadly can't report field name + "Field "+field + ); + } + } + + /** this adapter is quite weird. 
ords are per doc index, don't use ords across different docs*/ + static Function sortedNumericAsSortedSetDocValues(String field, NumericType numTyp) { + return (ctx) -> { + final SortedNumericDocValues numerics = DocValues.getSortedNumeric(ctx, field); + final BytesRefBuilder bytes = new BytesRefBuilder(); + + final LongConsumer coder = coder(bytes, numTyp); + + return new SortedSetDocValues() { + + private int index = -1; + + @Override + public long nextOrd() { + return index < numerics.count() ? index++ : NO_MORE_ORDS; + } + + @Override + public void setDocument(int docID) { + numerics.setDocument(docID); + index=0; + } + + @Override + public BytesRef lookupOrd(long ord) { + assert ord>=0; + final long value = numerics.valueAt((int)ord); + coder.accept(value); + return bytes.get(); + } + + @Override + public long getValueCount() { + return numerics.count(); + } + }; + }; + } +} Index: lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (revision 1715738) +++ lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (working copy) @@ -1,5 +1,8 @@ package org.apache.lucene.search.join; +import java.io.IOException; +import java.util.Locale; + /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with @@ -25,9 +28,6 @@ import org.apache.lucene.search.MatchNoDocsQuery; import org.apache.lucene.search.Query; -import java.io.IOException; -import java.util.Locale; - /** * Utility for query time joining. 
* @@ -72,18 +72,19 @@ Query fromQuery, IndexSearcher fromSearcher, ScoreMode scoreMode) throws IOException { + + TermsCollectorWithScoreInterface termsWithScoreCollector = + TermsWithScoreCollector.createCollector(fromField, multipleValuesPerDocument, scoreMode); + + fromSearcher.search(fromQuery, termsWithScoreCollector); + switch (scoreMode) { case None: - TermsCollector termsCollector = TermsCollector.create(fromField, multipleValuesPerDocument); - fromSearcher.search(fromQuery, termsCollector); - return new TermsQuery(toField, fromQuery, termsCollector.getCollectorTerms()); + return new TermsQuery(toField, fromQuery, termsWithScoreCollector.getCollectedTerms()); case Total: case Max: case Min: case Avg: - TermsWithScoreCollector termsWithScoreCollector = - TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode); - fromSearcher.search(fromQuery, termsWithScoreCollector); return new TermsIncludingScoreQuery( toField, multipleValuesPerDocument, @@ -95,7 +96,7 @@ throw new IllegalArgumentException(String.format(Locale.ROOT, "Score mode %s isn't supported.", scoreMode)); } } - + /** * Delegates to {@link #createJoinQuery(String, Query, Query, IndexSearcher, ScoreMode, MultiDocValues.OrdinalMap, int, int)}, * but disables the min and max filtering. 
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (revision 1715738) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (working copy) @@ -19,11 +19,10 @@ import java.io.IOException; +import org.apache.lucene.index.BinaryDocValues; import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.DocValues; import org.apache.lucene.index.SortedSetDocValues; -import org.apache.lucene.search.SimpleCollector; +import org.apache.lucene.search.LeafCollector; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefHash; @@ -32,19 +31,46 @@ * * @lucene.experimental */ -abstract class TermsCollector extends SimpleCollector { +abstract class TermsCollector extends AbstractTermsCollector { - final String field; + TermsCollector(Function docValuesCall) { + super(docValuesCall); + } + final BytesRefHash collectorTerms = new BytesRefHash(); - TermsCollector(String field) { - this.field = field; - } - public BytesRefHash getCollectorTerms() { return collectorTerms; } + static TermsCollectorWithScoreInterface createAsWithScore(String field, boolean multipleValuesPerDocument) { + return new TermsCollectorWithScoreInterface(){ + + final TermsCollector collector = TermsCollector.create(field, multipleValuesPerDocument); + + @Override + public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException { + return collector.getLeafCollector(context); + } + + @Override + public boolean needsScores() { + return collector.needsScores(); + } + + @Override + public BytesRefHash getCollectedTerms() { + return collector.getCollectorTerms(); + } + + @Override + public float[] getScoresPerTerm() { + throw new UnsupportedOperationException("scores are not available for "+collector + + " build for 
field:"+field); + } + }; + } + /** * Chooses the right {@link TermsCollector} implementation. * @@ -52,55 +78,42 @@ * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document. * @return a {@link TermsCollector} instance */ - static TermsCollector create(String field, boolean multipleValuesPerDocument) { - return multipleValuesPerDocument ? new MV(field) : new SV(field); + static TermsCollector create(String field, boolean multipleValuesPerDocument) { + return multipleValuesPerDocument + ? new MV(sortedSetDocValues(field)) + : new SV(binaryDocValues(field)); } // impl that works with multiple values per document - static class MV extends TermsCollector { - final BytesRef scratch = new BytesRef(); - private SortedSetDocValues docTermOrds; - - MV(String field) { - super(field); + static class MV extends TermsCollector { + + MV(Function docValuesCall) { + super(docValuesCall); } @Override public void collect(int doc) throws IOException { - docTermOrds.setDocument(doc); long ord; - while ((ord = docTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - final BytesRef term = docTermOrds.lookupOrd(ord); + docValues.setDocument(doc); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + final BytesRef term = docValues.lookupOrd(ord); collectorTerms.add(term); } } - - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - docTermOrds = DocValues.getSortedSet(context.reader(), field); - } } // impl that works with single value per document - static class SV extends TermsCollector { + static class SV extends TermsCollector { - final BytesRef spare = new BytesRef(); - private BinaryDocValues fromDocTerms; - - SV(String field) { - super(field); + SV(Function docValuesCall) { + super(docValuesCall); } @Override public void collect(int doc) throws IOException { - final BytesRef term = fromDocTerms.get(doc); + final BytesRef term = docValues.get(doc); 
collectorTerms.add(term); } - - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTerms = DocValues.getBinary(context.reader(), field); - } } @Override Index: lucene/join/src/java/org/apache/lucene/search/join/TermsCollectorInterface.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsCollectorInterface.java (revision 0) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsCollectorInterface.java (working copy) @@ -0,0 +1,29 @@ +package org.apache.lucene.search.join; + +import org.apache.lucene.search.Collector; +import org.apache.lucene.util.BytesRefHash; + +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +interface TermsCollectorWithScoreInterface extends Collector { + + BytesRefHash getCollectedTerms() ; + + float[] getScoresPerTerm(); + +} Index: lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (revision 1715738) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (working copy) @@ -1,5 +1,8 @@ package org.apache.lucene.search.join; +import java.io.IOException; +import java.util.Arrays; + /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with @@ -18,22 +21,18 @@ */ import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.DocValues; import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.index.SortedSetDocValues; +import org.apache.lucene.search.LeafCollector; import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.SimpleCollector; import org.apache.lucene.util.ArrayUtil; import org.apache.lucene.util.BytesRefHash; -import java.io.IOException; -import java.util.Arrays; +abstract class TermsWithScoreCollector extends AbstractTermsCollector + implements TermsCollectorWithScoreInterface { -abstract class TermsWithScoreCollector extends SimpleCollector { - private final static int INITIAL_ARRAY_SIZE = 0; - final String field; final BytesRefHash collectedTerms = new BytesRefHash(); final ScoreMode scoreMode; @@ -40,8 +39,8 @@ Scorer scorer; float[] scoreSums = new float[INITIAL_ARRAY_SIZE]; - TermsWithScoreCollector(String field, ScoreMode scoreMode) { - this.field = field; + TermsWithScoreCollector(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall); this.scoreMode = scoreMode; if (scoreMode == ScoreMode.Min) { Arrays.fill(scoreSums, Float.POSITIVE_INFINITY); @@ -50,10 +49,12 @@ } } + 
@Override public BytesRefHash getCollectedTerms() { return collectedTerms; } - + + @Override public float[] getScoresPerTerm() { return scoreSums; } @@ -70,36 +71,42 @@ * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document. * @return a {@link TermsWithScoreCollector} instance */ - static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { + static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { if (multipleValuesPerDocument) { switch (scoreMode) { case Avg: - return new MV.Avg(field); + return new MV.Avg(sortedSetDocValues(field)); default: - return new MV(field, scoreMode); + return new MV(sortedSetDocValues(field), scoreMode); } } else { switch (scoreMode) { case Avg: - return new SV.Avg(field); + return new SV.Avg(binaryDocValues(field)); default: - return new SV(field, scoreMode); + return new SV(binaryDocValues(field), scoreMode); } } } + + static TermsCollectorWithScoreInterface createCollector(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { + if(scoreMode == ScoreMode.None){ + return TermsCollector.createAsWithScore(field, multipleValuesPerDocument); + }else{ + return TermsWithScoreCollector.create(field, multipleValuesPerDocument, scoreMode); + } + } // impl that works with single value per document - static class SV extends TermsWithScoreCollector { + static class SV extends TermsWithScoreCollector { - BinaryDocValues fromDocTerms; - - SV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + SV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ -133,26 +140,23 @@ scoreSums[ord] = current; } break; + default: + throw new 
AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTerms = DocValues.getBinary(context.reader(), field); - } - static class Avg extends SV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ -187,20 +191,18 @@ } // impl that works with multiple values per document - static class MV extends TermsWithScoreCollector { + static class MV extends TermsWithScoreCollector { - SortedSetDocValues fromDocTermOrds; - - MV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + MV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { @@ -225,29 +227,26 @@ case Max: scoreSums[termID] = Math.max(scoreSums[termID], scorer.score()); break; + default: + throw new AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTermOrds = DocValues.getSortedSet(context.reader(), field); - } - static class Avg extends MV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + 
super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { ```
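
For readers skimming the patch, the core of the refactoring appears to be this: the per-segment doc-values lookup is passed into the collector as a function, so each concrete collector only implements `collect()`. A minimal sketch of that shape, assuming simplified stand-in types (`IOFunction`, `Segment`, `ValuesCollector` and `ListCollector` are hypothetical names, not the classes from the patch):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the refactoring; the names here are hypothetical, not Lucene's. */
class SegmentAccessorSketch {

  /** Like java.util.function.Function, but allowed to throw IOException. */
  @FunctionalInterface
  interface IOFunction<T, R> { R apply(T t) throws IOException; }

  /** Stand-in for one index segment: a field's per-document values. */
  static final class Segment {
    final String[] values;
    Segment(String... values) { this.values = values; }
  }

  /** Base collector: subclasses only implement collect(); the accessor is injected. */
  abstract static class ValuesCollector<DV> {
    protected DV docValues;
    private final IOFunction<Segment, DV> accessor;

    ValuesCollector(IOFunction<Segment, DV> accessor) { this.accessor = accessor; }

    /** Called once per segment, before any collect() call for that segment. */
    final void setNextSegment(Segment segment) throws IOException {
      docValues = accessor.apply(segment);
    }

    abstract void collect(int doc);
  }

  /** Concrete collector that gathers every value it is shown. */
  static final class ListCollector extends ValuesCollector<String[]> {
    final List<String> collected = new ArrayList<>();
    ListCollector(IOFunction<Segment, String[]> accessor) { super(accessor); }
    @Override void collect(int doc) { collected.add(docValues[doc]); }
  }

  /** Drives the collector over all segments, the way a searcher would. */
  static List<String> run(Segment... segments) {
    ListCollector collector = new ListCollector(seg -> seg.values);
    try {
      for (Segment seg : segments) {
        collector.setNextSegment(seg);
        for (int doc = 0; doc < seg.values.length; doc++) collector.collect(doc);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return collector.collected;
  }

  public static void main(String[] args) {
    System.out.println(run(new Segment("a", "b"), new Segment("c")));
  }
}
```

A custom functional interface is needed because `java.util.function.Function` cannot throw the checked `IOException` that a doc-values lookup may raise.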

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

Attaching LUCENE-5868.patch, which glues everything together. The significant change is a new signature in JoinUtil, you know why:

  public static Query createJoinQuery(String fromField,
      boolean multipleValuesPerDocument,
      String toField,
      NumericType numericType,
      Query fromQuery,
      IndexSearcher fromSearcher,
      ScoreMode scoreMode) throws IOException

I added the existing test to the patch. Test coverage needs to be improved.

Opinions?
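
As a conceptual aid, a query-time join on a numeric key can be thought of as two phases: collect the key values of the from-side documents that match the query, then select the to-side documents whose key is among them. The sketch below models only that idea with plain arrays and ignores scoring; `NumericJoinSketch.join` and its parameters are hypothetical and do not correspond to the JoinUtil implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

/** Conceptual model of a query-time join on a numeric key; not Lucene code. */
class NumericJoinSketch {

  /**
   * @param fromKeys  fromKeys[doc] = join-key values of a "from" document (may be empty)
   * @param fromMatch which "from" documents match the from-side query
   * @param toKeys    toKeys[doc] = join key of a "to" document
   * @return ids of "to" documents whose key was collected on the from side
   */
  static List<Integer> join(long[][] fromKeys, IntPredicate fromMatch, long[] toKeys) {
    // Phase 1: collect the distinct keys of all matching from-side documents.
    Set<Long> collected = new HashSet<>();
    for (int doc = 0; doc < fromKeys.length; doc++) {
      if (!fromMatch.test(doc)) continue;
      for (long key : fromKeys[doc]) collected.add(key);
    }
    // Phase 2: keep the to-side documents whose key is in the collected set.
    List<Integer> hits = new ArrayList<>();
    for (int doc = 0; doc < toKeys.length; doc++) {
      if (collected.contains(toKeys[doc])) hits.add(doc);
    }
    return hits;
  }

  public static void main(String[] args) {
    long[][] fromKeys = {{1L, 2L}, {3L}, {}};
    long[] toKeys = {2L, 3L, 4L, 1L};
    // Only from-document 0 matches, so keys 1 and 2 are collected.
    System.out.println(join(fromKeys, doc -> doc == 0, toKeys)); // prints [0, 3]
  }
}
```

In this model the multi-valued case is simply a from-side document with more than one key, which loosely mirrors the role of the multipleValuesPerDocument flag in the signature above.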

LUCENE-5868.patch

```diff Index: lucene/CHANGES.txt =================================================================== --- lucene/CHANGES.txt (revision 1718426) +++ lucene/CHANGES.txt (working copy) @@ -108,6 +108,12 @@ ======================= Lucene 5.5.0 ======================= +New Features + +* LUCENE-5868: JoinUtil.createJoinQuery(..,NumericType,..) query-time join + for LONG and INT fields with NUMERIC and SORTED_NUMERIC doc values. + (Alexey Zelin via Mikhail Khludnev) + API Changes * #7958: Grouping sortWithinGroup variables used to allow null to mean Index: lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (revision 0) +++ lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (working copy) @@ -0,0 +1,136 @@ +package org.apache.lucene.search.join; + +import java.io.IOException; +import java.util.function.LongConsumer; + +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.index.LeafReader; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.NumericDocValues; +import org.apache.lucene.index.SortedNumericDocValues; +import org.apache.lucene.index.SortedSetDocValues; +import org.apache.lucene.search.SimpleCollector; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; +import org.apache.lucene.util.NumericUtils; + +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +abstract class DocValuesTermsCollector extends SimpleCollector { + + @FunctionalInterface + static interface Function { + R apply(LeafReader t) throws IOException ; + } + + protected DV docValues; + private final Function docValuesCall; + + public DocValuesTermsCollector(Function docValuesCall) { + this.docValuesCall = docValuesCall; + } + + @Override + protected final void doSetNextReader(LeafReaderContext context) throws IOException { + docValues = docValuesCall.apply(context.reader()); + } + + static Function binaryDocValues(String field) { + return (ctx) -> DocValues.getBinary(ctx, field); + } + static Function sortedSetDocValues(String field) { + return (ctx) -> DocValues.getSortedSet(ctx, field); + } + + static Function numericAsBinaryDocValues(String field, NumericType numTyp) { + return (ctx) -> { + final NumericDocValues numeric = DocValues.getNumeric(ctx, field); + final BytesRefBuilder bytes = new BytesRefBuilder(); + + final LongConsumer coder = coder(bytes, numTyp, field); + + return new BinaryDocValues() { + @Override + public BytesRef get(int docID) { + final long lVal = numeric.get(docID); + coder.accept(lVal); + return bytes.get(); + } + }; + }; + } + + static LongConsumer coder(BytesRefBuilder bytes, NumericType type, String fieldName){ + switch(type){ + case INT: + return (l) -> NumericUtils.intToPrefixCoded((int)l, 0, bytes); + case LONG: + return (l) -> 
NumericUtils.longToPrefixCoded(l, 0, bytes); + default: + throw new IllegalArgumentException("Unsupported "+type+ + ". Only "+NumericType.INT+" and "+NumericType.LONG+" are supported." + + "Field "+fieldName ); + } + } + + /** this adapter is quite weird. ords are per doc index, don't use ords across different docs*/ + static Function sortedNumericAsSortedSetDocValues(String field, NumericType numTyp) { + return (ctx) -> { + final SortedNumericDocValues numerics = DocValues.getSortedNumeric(ctx, field); + final BytesRefBuilder bytes = new BytesRefBuilder(); + + final LongConsumer coder = coder(bytes, numTyp, field); + + return new SortedSetDocValues() { + + private int index = Integer.MIN_VALUE; + + @Override + public long nextOrd() { + return index < numerics.count()-1 ? ++index : NO_MORE_ORDS; + } + + @Override + public void setDocument(int docID) { + numerics.setDocument(docID); + index=-1; + } + + @Override + public BytesRef lookupOrd(long ord) { + assert ord>=0 && ord mvFunction, + ScoreMode mode) { + + switch (mode) { + case None: + return wrap(new TermsCollector.MV(mvFunction)); + case Avg: + return new MV.Avg(mvFunction); + default: + return new MV(mvFunction, mode); + } + } + + static Function verbose(PrintStream out, Function mvFunction){ + return (ctx) -> { + final SortedSetDocValues target = mvFunction.apply(ctx); + return new SortedSetDocValues() { + + @Override + public void setDocument(int docID) { + target.setDocument(docID); + out.println("\ndoc# "+docID); + } + + @Override + public long nextOrd() { + return target.nextOrd(); + } + + @Override + public BytesRef lookupOrd(long ord) { + final BytesRef val = target.lookupOrd(ord); + out.println(val.toString()+", "); + return val; + } + + @Override + public long getValueCount() { + return target.getValueCount(); + } + }; + + }; + } + + static GenericTermsCollector createCollectorSV(Function svFunction, + ScoreMode mode) { + + switch (mode) { + case None: + return wrap(new 
TermsCollector.SV(svFunction)); + case Avg: + return new SV.Avg(svFunction); + default: + return new SV(svFunction, mode); + } + } + + static GenericTermsCollector wrap(final TermsCollector collector) { + return new GenericTermsCollector() { + + + @Override + public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException { + return collector.getLeafCollector(context); + } + + @Override + public boolean needsScores() { + return collector.needsScores(); + } + + @Override + public BytesRefHash getCollectedTerms() { + return collector.getCollectorTerms(); + } + + @Override + public float[] getScoresPerTerm() { + throw new UnsupportedOperationException("scores are not available for "+collector); + } + }; + } +} Property changes on: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Index: lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (working copy) @@ -1,5 +1,14 @@ package org.apache.lucene.search.join; +import java.io.IOException; +import java.util.Locale; + +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.document.IntField; +import org.apache.lucene.document.LongField; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValuesType; + /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. 
See the NOTICE file distributed with @@ -21,13 +30,12 @@ import org.apache.lucene.index.LeafReader; import org.apache.lucene.index.MultiDocValues; import org.apache.lucene.index.SortedDocValues; +import org.apache.lucene.index.SortedSetDocValues; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.MatchNoDocsQuery; import org.apache.lucene.search.Query; +import org.apache.lucene.search.join.DocValuesTermsCollector.Function; -import java.io.IOException; -import java.util.Locale; - /** * Utility for query time joining. * @@ -67,28 +75,87 @@ * @throws IOException If I/O related errors occur */ public static Query createJoinQuery(String fromField, - boolean multipleValuesPerDocument, - String toField, - Query fromQuery, - IndexSearcher fromSearcher, - ScoreMode scoreMode) throws IOException { + boolean multipleValuesPerDocument, + String toField, + Query fromQuery, + IndexSearcher fromSearcher, + ScoreMode scoreMode) throws IOException { + + final GenericTermsCollector termsWithScoreCollector; + + if (multipleValuesPerDocument) { + Function mvFunction = DocValuesTermsCollector.sortedSetDocValues(fromField); + termsWithScoreCollector = GenericTermsCollector.createCollectorMV(mvFunction, scoreMode); + } else { + Function svFunction = DocValuesTermsCollector.binaryDocValues(fromField); + termsWithScoreCollector = GenericTermsCollector.createCollectorSV(svFunction, scoreMode); + } + + return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode, + termsWithScoreCollector); + + } + + /** + * Method for query time joining for numeric fields. It supports multi- and single- values longs and ints. + * All considerations from {@link JoinUtil#createJoinQuery(String, boolean, String, Query, IndexSearcher, ScoreMode)} are applicable here too, + * though memory consumption might be higher. + *

+ * + * @param fromField The from field to join from + * @param multipleValuesPerDocument Whether the from field has multiple terms per document + * when true fromField might be {@link DocValuesType#SORTED_NUMERIC}, + * otherwise fromField should be {@link DocValuesType#NUMERIC} + * @param toField The to field to join to, should be {@link IntField} or {@link LongField} + * @param numericType either {@link NumericType#INT} or {@link NumericType#LONG}, it should correspond to fromField and toField types + * @param fromQuery The query to match documents on the from side + * @param fromSearcher The searcher that executed the specified fromQuery + * @param scoreMode Instructs how scores from the fromQuery are mapped to the returned query + * @return a {@link Query} instance that can be used to join documents based on the + * terms in the from and to field + * @throws IOException If I/O related errors occur + */ + + public static Query createJoinQuery(String fromField, + boolean multipleValuesPerDocument, + String toField, NumericType numericType, + Query fromQuery, + IndexSearcher fromSearcher, + ScoreMode scoreMode) throws IOException { + + final GenericTermsCollector termsCollector; + + if (multipleValuesPerDocument) { + Function mvFunction = DocValuesTermsCollector.sortedNumericAsSortedSetDocValues(fromField,numericType); + termsCollector = GenericTermsCollector.createCollectorMV(mvFunction, scoreMode); + } else { + Function svFunction = DocValuesTermsCollector.numericAsBinaryDocValues(fromField,numericType); + termsCollector = GenericTermsCollector.createCollectorSV(svFunction, scoreMode); + } + + return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode, + termsCollector); + + } + + private static Query createJoinQuery(boolean multipleValuesPerDocument, String toField, Query fromQuery, + IndexSearcher fromSearcher, ScoreMode scoreMode, final GenericTermsCollector collector) + throws IOException { + + 
fromSearcher.search(fromQuery, collector); + switch (scoreMode) { case None: - TermsCollector termsCollector = TermsCollector.create(fromField, multipleValuesPerDocument); - fromSearcher.search(fromQuery, termsCollector); - return new TermsQuery(toField, fromQuery, termsCollector.getCollectorTerms()); + return new TermsQuery(toField, fromQuery, collector.getCollectedTerms()); case Total: case Max: case Min: case Avg: - TermsWithScoreCollector termsWithScoreCollector = - TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode); - fromSearcher.search(fromQuery, termsWithScoreCollector); return new TermsIncludingScoreQuery( toField, multipleValuesPerDocument, - termsWithScoreCollector.getCollectedTerms(), - termsWithScoreCollector.getScoresPerTerm(), + collector.getCollectedTerms(), + collector.getScoresPerTerm(), fromQuery ); default: @@ -96,6 +163,7 @@ } } + /** * Delegates to {@link #createJoinQuery(String, Query, Query, IndexSearcher, ScoreMode, MultiDocValues.OrdinalMap, int, int)}, * but disables the min and max filtering. 
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (working copy) @@ -19,11 +19,8 @@ import java.io.IOException; -import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.DocValues; import org.apache.lucene.index.SortedSetDocValues; -import org.apache.lucene.search.SimpleCollector; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefHash; @@ -32,19 +29,19 @@ * * @lucene.experimental */ -abstract class TermsCollector extends SimpleCollector { +abstract class TermsCollector extends DocValuesTermsCollector { - final String field; + TermsCollector(Function docValuesCall) { + super(docValuesCall); + } + final BytesRefHash collectorTerms = new BytesRefHash(); - TermsCollector(String field) { - this.field = field; - } - public BytesRefHash getCollectorTerms() { return collectorTerms; } + /** * Chooses the right {@link TermsCollector} implementation. * @@ -52,55 +49,42 @@ * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document. * @return a {@link TermsCollector} instance */ - static TermsCollector create(String field, boolean multipleValuesPerDocument) { - return multipleValuesPerDocument ? new MV(field) : new SV(field); + static TermsCollector create(String field, boolean multipleValuesPerDocument) { + return multipleValuesPerDocument + ? 
new MV(sortedSetDocValues(field)) + : new SV(binaryDocValues(field)); } - + // impl that works with multiple values per document - static class MV extends TermsCollector { - final BytesRef scratch = new BytesRef(); - private SortedSetDocValues docTermOrds; - - MV(String field) { - super(field); + static class MV extends TermsCollector { + + MV(Function docValuesCall) { + super(docValuesCall); } @Override public void collect(int doc) throws IOException { - docTermOrds.setDocument(doc); long ord; - while ((ord = docTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - final BytesRef term = docTermOrds.lookupOrd(ord); + docValues.setDocument(doc); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + final BytesRef term = docValues.lookupOrd(ord); collectorTerms.add(term); } } - - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - docTermOrds = DocValues.getSortedSet(context.reader(), field); - } } // impl that works with single value per document - static class SV extends TermsCollector { + static class SV extends TermsCollector { - final BytesRef spare = new BytesRef(); - private BinaryDocValues fromDocTerms; - - SV(String field) { - super(field); + SV(Function docValuesCall) { + super(docValuesCall); } @Override public void collect(int doc) throws IOException { - final BytesRef term = fromDocTerms.get(doc); + final BytesRef term = docValues.get(doc); collectorTerms.add(term); } - - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTerms = DocValues.getBinary(context.reader(), field); - } } @Override Index: lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (working 
copy) @@ -18,6 +18,7 @@ */ import java.io.IOException; +import java.io.PrintStream; import java.util.Locale; import java.util.Set; @@ -37,6 +38,7 @@ import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefHash; import org.apache.lucene.util.FixedBitSet; +import org.apache.lucene.util.NumericUtils; class TermsIncludingScoreQuery extends Query { @@ -268,5 +270,23 @@ } } } - + + void dump(PrintStream out){ + out.println(field+":"); + final BytesRef ref = new BytesRef(); + for (int i = 0; i < terms.size(); i++) { + terms.get(ords[i], ref); + out.print(ref+" "+ref.utf8ToString()+" "); + try { + out.print(Long.toHexString(NumericUtils.prefixCodedToLong(ref))+"L"); + } catch (Exception e) { + try { + out.print(Integer.toHexString(NumericUtils.prefixCodedToInt(ref))+"i"); + } catch (Exception ee) { + } + } + out.println(" score="+scores[ords[i]]); + out.println(""); + } + } } Index: lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (working copy) @@ -1,5 +1,8 @@ package org.apache.lucene.search.join; +import java.io.IOException; +import java.util.Arrays; + /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. 
See the NOTICE file distributed with @@ -18,22 +21,16 @@ */ import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.DocValues; -import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.index.SortedSetDocValues; import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.SimpleCollector; import org.apache.lucene.util.ArrayUtil; import org.apache.lucene.util.BytesRefHash; -import java.io.IOException; -import java.util.Arrays; +abstract class TermsWithScoreCollector extends DocValuesTermsCollector + implements GenericTermsCollector { -abstract class TermsWithScoreCollector extends SimpleCollector { - private final static int INITIAL_ARRAY_SIZE = 0; - final String field; final BytesRefHash collectedTerms = new BytesRefHash(); final ScoreMode scoreMode; @@ -40,8 +37,8 @@ Scorer scorer; float[] scoreSums = new float[INITIAL_ARRAY_SIZE]; - TermsWithScoreCollector(String field, ScoreMode scoreMode) { - this.field = field; + TermsWithScoreCollector(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall); this.scoreMode = scoreMode; if (scoreMode == ScoreMode.Min) { Arrays.fill(scoreSums, Float.POSITIVE_INFINITY); @@ -50,10 +47,12 @@ } } + @Override public BytesRefHash getCollectedTerms() { return collectedTerms; } - + + @Override public float[] getScoresPerTerm() { return scoreSums; } @@ -70,36 +69,34 @@ * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document. 
* @return a {@link TermsWithScoreCollector} instance */ - static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { + static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { if (multipleValuesPerDocument) { switch (scoreMode) { case Avg: - return new MV.Avg(field); + return new MV.Avg(sortedSetDocValues(field)); default: - return new MV(field, scoreMode); + return new MV(sortedSetDocValues(field), scoreMode); } } else { switch (scoreMode) { case Avg: - return new SV.Avg(field); + return new SV.Avg(binaryDocValues(field)); default: - return new SV(field, scoreMode); + return new SV(binaryDocValues(field), scoreMode); } } } - + // impl that works with single value per document - static class SV extends TermsWithScoreCollector { + static class SV extends TermsWithScoreCollector { - BinaryDocValues fromDocTerms; - - SV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + SV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ -133,26 +130,23 @@ scoreSums[ord] = current; } break; + default: + throw new AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTerms = DocValues.getBinary(context.reader(), field); - } - static class Avg extends SV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ 
-187,20 +181,18 @@ } // impl that works with multiple values per document - static class MV extends TermsWithScoreCollector { + static class MV extends TermsWithScoreCollector { - SortedSetDocValues fromDocTermOrds; - - MV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + MV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { @@ -225,29 +217,26 @@ case Max: scoreSums[termID] = Math.max(scoreSums[termID], scorer.score()); break; + default: + throw new AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTermOrds = DocValues.getSortedSet(context.reader(), field); - } - static class Avg extends MV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { Index: lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java 
=================================================================== --- lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (revision 1718426) +++ lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (working copy) @@ -1,31 +1,30 @@ package org.apache.lucene.search.join; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.Comparator; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.Random; +import java.util.Set; +import java.util.SortedSet; +import java.util.TreeSet; -import com.carrotsearch.randomizedtesting.generators.RandomInts; -import com.carrotsearch.randomizedtesting.generators.RandomPicks; - import org.apache.lucene.analysis.MockAnalyzer; import org.apache.lucene.analysis.MockTokenizer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.document.IntField; +import org.apache.lucene.document.LongField; import org.apache.lucene.document.NumericDocValuesField; import org.apache.lucene.document.SortedDocValuesField; +import org.apache.lucene.document.SortedNumericDocValuesField; import org.apache.lucene.document.SortedSetDocValuesField; import org.apache.lucene.document.StringField; import org.apache.lucene.document.TextField; @@ -78,19 +77,26 @@ import org.apache.lucene.util.packed.PackedInts; import org.junit.Test; -import java.io.IOException; -import java.util.ArrayList; -import java.util.Collections; -import java.util.Comparator; -import java.util.HashMap; -import java.util.HashSet; -import java.util.List; -import java.util.Locale; -import java.util.Map; -import java.util.Set; -import java.util.SortedSet; -import java.util.TreeSet; +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +import com.carrotsearch.randomizedtesting.generators.RandomInts; +import com.carrotsearch.randomizedtesting.generators.RandomPicks; + public class TestJoinUtil extends LuceneTestCase { public void testSimple() throws Exception { @@ -850,10 +856,18 @@ } final Query joinQuery; - if (from) { - joinQuery = JoinUtil.createJoinQuery("from", multipleValuesPerDocument, "to", actualQuery, indexSearcher, scoreMode); - } else { - joinQuery = JoinUtil.createJoinQuery("to", multipleValuesPerDocument, "from", actualQuery, indexSearcher, scoreMode); + { + // single val can be handled by multiple-vals + final boolean muliValsQuery = multipleValuesPerDocument || random().nextBoolean(); + final String fromField = from ? "from":"to"; + final String toField = from ? "to":"from"; + + if (random().nextBoolean()) { // numbers + final NumericType numType = random().nextBoolean() ? 
NumericType.INT: NumericType.LONG ; + joinQuery = JoinUtil.createJoinQuery(fromField+numType, muliValsQuery, toField+numType, numType, actualQuery, indexSearcher, scoreMode); + } else { + joinQuery = JoinUtil.createJoinQuery(fromField, muliValsQuery, toField, actualQuery, indexSearcher, scoreMode); + } } if (VERBOSE) { System.out.println("joinQuery=" + joinQuery); @@ -897,7 +911,6 @@ return; } - assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f); if (VERBOSE) { for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) { System.out.printf(Locale.ENGLISH, "Expected doc: %d | Actual doc: %d\n", expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc); @@ -904,6 +917,7 @@ System.out.printf(Locale.ENGLISH, "Expected score: %f | Actual score: %f\n", expectedTopDocs.scoreDocs[i].score, actualTopDocs.scoreDocs[i].score); } } + assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f); for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) { assertEquals(expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc); @@ -919,14 +933,15 @@ } Directory dir = newDirectory(); + final Random random = random(); RandomIndexWriter w = new RandomIndexWriter( - random(), + random, dir, - newIndexWriterConfig(new MockAnalyzer(random(), MockTokenizer.KEYWORD, false)) + newIndexWriterConfig(new MockAnalyzer(random, MockTokenizer.KEYWORD, false)) ); IndexIterationContext context = new IndexIterationContext(); - int numRandomValues = nDocs / RandomInts.randomIntBetween(random(), 2, 10); + int numRandomValues = nDocs / RandomInts.randomIntBetween(random, 1, 4); context.randomUniqueValues = new String[numRandomValues]; Set trackSet = new HashSet<>(); context.randomFrom = new boolean[numRandomValues]; @@ -933,32 +948,46 @@ for (int i = 0; i < numRandomValues; i++) { String uniqueRandomValue; do { -// uniqueRandomValue = TestUtil.randomRealisticUnicodeString(random()); - uniqueRandomValue = 
TestUtil.randomSimpleString(random()); + // the trick is to generate values which will be ordered similarly for string, ints&longs, positive nums makes it easier + final int nextInt = random.nextInt(Integer.MAX_VALUE); + uniqueRandomValue = String.format(Locale.ROOT, "%08x", nextInt); + assert nextInt == Integer.parseUnsignedInt(uniqueRandomValue,16); } while ("".equals(uniqueRandomValue) || trackSet.contains(uniqueRandomValue)); + // Generate unique values and empty strings aren't allowed. trackSet.add(uniqueRandomValue); - context.randomFrom[i] = random().nextBoolean(); + + context.randomFrom[i] = random.nextBoolean(); context.randomUniqueValues[i] = uniqueRandomValue; + } + List randomUniqueValuesReplica = new ArrayList<>(Arrays.asList(context.randomUniqueValues)); + RandomDoc[] docs = new RandomDoc[nDocs]; for (int i = 0; i < nDocs; i++) { String id = Integer.toString(i); - int randomI = random().nextInt(context.randomUniqueValues.length); + int randomI = random.nextInt(context.randomUniqueValues.length); String value = context.randomUniqueValues[randomI]; Document document = new Document(); - document.add(newTextField(random(), "id", id, Field.Store.YES)); - document.add(newTextField(random(), "value", value, Field.Store.NO)); + document.add(newTextField(random, "id", id, Field.Store.YES)); + document.add(newTextField(random, "value", value, Field.Store.NO)); boolean from = context.randomFrom[randomI]; - int numberOfLinkValues = multipleValuesPerDocument ? 2 + random().nextInt(10) : 1; + int numberOfLinkValues = multipleValuesPerDocument ? Math.min(2 + random.nextInt(10), context.randomUniqueValues.length) : 1; docs[i] = new RandomDoc(id, numberOfLinkValues, value, from); if (globalOrdinalJoin) { document.add(newStringField("type", from ? 
"from" : "to", Field.Store.NO)); } - for (int j = 0; j < numberOfLinkValues; j++) { - String linkValue = context.randomUniqueValues[random().nextInt(context.randomUniqueValues.length)]; + final List subValues; + { + int start = randomUniqueValuesReplica.size()==numberOfLinkValues? 0 : random.nextInt(randomUniqueValuesReplica.size()-numberOfLinkValues); + subValues = randomUniqueValuesReplica.subList(start, start+numberOfLinkValues); + Collections.shuffle(subValues, random); + } + for (String linkValue : subValues) { + + assert !docs[i].linkValues.contains(linkValue); docs[i].linkValues.add(linkValue); if (from) { if (!context.fromDocuments.containsKey(linkValue)) { @@ -970,15 +999,8 @@ context.fromDocuments.get(linkValue).add(docs[i]); context.randomValueFromDocs.get(value).add(docs[i]); - document.add(newTextField(random(), "from", linkValue, Field.Store.NO)); - if (multipleValuesPerDocument) { - document.add(new SortedSetDocValuesField("from", new BytesRef(linkValue))); - } else { - document.add(new SortedDocValuesField("from", new BytesRef(linkValue))); - } - if (globalOrdinalJoin) { - document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); - } + addLinkFields(random, document, "from", linkValue, multipleValuesPerDocument, globalOrdinalJoin); + } else { if (!context.toDocuments.containsKey(linkValue)) { context.toDocuments.put(linkValue, new ArrayList<>()); @@ -989,20 +1011,12 @@ context.toDocuments.get(linkValue).add(docs[i]); context.randomValueToDocs.get(value).add(docs[i]); - document.add(newTextField(random(), "to", linkValue, Field.Store.NO)); - if (multipleValuesPerDocument) { - document.add(new SortedSetDocValuesField("to", new BytesRef(linkValue))); - } else { - document.add(new SortedDocValuesField("to", new BytesRef(linkValue))); - } - if (globalOrdinalJoin) { - document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); - } + addLinkFields(random, document, "to", linkValue, multipleValuesPerDocument, 
globalOrdinalJoin); } } w.addDocument(document); - if (random().nextInt(10) == 4) { + if (random.nextInt(10) == 4) { w.commit(); } if (VERBOSE) { @@ -1010,7 +1024,7 @@ } } - if (random().nextBoolean()) { + if (random.nextBoolean()) { w.forceMerge(1); } w.close(); @@ -1185,6 +1199,30 @@ return context; } + private void addLinkFields(final Random random, Document document, final String fieldName, String linkValue, + boolean multipleValuesPerDocument, boolean globalOrdinalJoin) { + document.add(newTextField(random, fieldName, linkValue, Field.Store.NO)); + + final int linkInt = Integer.parseUnsignedInt(linkValue,16); + document.add(new IntField(fieldName+NumericType.INT, linkInt, Field.Store.NO)); + + final long linkLong = linkInt<<32 | linkInt; + document.add(new LongField(fieldName+NumericType.LONG, linkLong, Field.Store.NO)); + + if (multipleValuesPerDocument) { + document.add(new SortedSetDocValuesField(fieldName, new BytesRef(linkValue))); + document.add(new SortedNumericDocValuesField(fieldName+NumericType.INT, linkInt)); + document.add(new SortedNumericDocValuesField(fieldName+NumericType.LONG, linkLong)); + } else { + document.add(new SortedDocValuesField(fieldName, new BytesRef(linkValue))); + document.add(new NumericDocValuesField(fieldName+NumericType.INT, linkInt)); + document.add(new NumericDocValuesField(fieldName+NumericType.LONG, linkLong)); + } + if (globalOrdinalJoin) { + document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); + } + } + private TopDocs createExpectedTopDocs(String queryValue, final boolean from, final ScoreMode scoreMode, ```
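
A note on the addLinkFields test helper in the patch above: `linkInt<<32 | linkInt` is evaluated entirely in int arithmetic. Java masks an int shift distance to its low five bits, so a shift by 32 is a shift by 0 and the expression reduces to linkInt itself. That looks harmless for the join test as long as both sides index the same value, but the long field never holds a value wider than 32 bits. A minimal JDK-only sketch of the difference a widening cast makes (the class and method names are hypothetical, used only for illustration):

```java
public class ShiftSketch {

    /** The expression as written in the test helper: int arithmetic throughout. */
    static long packAsWritten(int linkInt) {
        // For an int operand the shift distance 32 is masked to 0,
        // so this is linkInt | linkInt, widened to long on return.
        return linkInt << 32 | linkInt;
    }

    /** Widening before the shift places the value in both 32-bit halves. */
    static long packWidened(int linkInt) {
        // The mask keeps the low half free of sign extension.
        return ((long) linkInt << 32) | (linkInt & 0xFFFFFFFFL);
    }
}
```

If values that really exercise the upper 32 bits are wanted, the widened form is one way to produce them.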

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

Nice work Mikhail! I love the lambdas.

Some random comments:

- coder() could take the field name so that the IllegalArgumentException can report which field is in error.
- Please put spaces after if and around else. This is the dominant style in our codebase, and I prefer it too, FWIW.
- createCollectorSV() could have one switch instead of an if and then a switch, no?
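
The first and third points can be sketched roughly as follows. This is a minimal sketch, not the patch's code: NumType and Mode are hypothetical stand-ins for Lucene's NumericType and ScoreMode, the fixed-width hex encoding merely stands in for the prefix coding done by NumericUtils, and collectorFor returns only a label for the collector that would be chosen.

```java
import java.util.Locale;
import java.util.function.LongConsumer;

public class ReviewSketch {

    // Hypothetical stand-ins for Lucene's NumericType and ScoreMode.
    enum NumType { INT, LONG, FLOAT, DOUBLE }
    enum Mode { None, Avg, Max, Min, Total }

    /**
     * Returns an encoder that overwrites 'out' with a fixed-width hex form of
     * each value. The field name is passed in only so that the error message
     * can say which field was configured with an unsupported type.
     */
    static LongConsumer coder(StringBuilder out, NumType type, String fieldName) {
        switch (type) {
            case INT:
                return l -> {
                    out.setLength(0);
                    out.append(String.format(Locale.ROOT, "%08x", (int) l));
                };
            case LONG:
                return l -> {
                    out.setLength(0);
                    out.append(String.format(Locale.ROOT, "%016x", l));
                };
            default:
                throw new IllegalArgumentException("Unsupported " + type
                    + ". Only " + NumType.INT + " and " + NumType.LONG
                    + " are supported. Field " + fieldName);
        }
    }

    /** A single switch over the score mode, with no enclosing if. */
    static String collectorFor(Mode mode) {
        switch (mode) {
            case None:
                return "wrap(TermsCollector.SV)";
            case Avg:
                return "SV.Avg";
            default:
                return "SV(" + mode + ")";
        }
    }
}
```

With the field name in the message, a misconfigured join on, say, a float field fails with an error that points straight at the offending field.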

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

Thanks, @dsmiley! Do you really think this surgery makes sense? I addressed your points above. The curious fact is that joining by numbers takes more heap than joining by strings. I tried to provide testRandom..() coverage; so far it fails when comparing score values. See the latest LUCENE-5868.patch.

LUCENE-5868.patch

```diff Index: lucene/CHANGES.txt =================================================================== --- lucene/CHANGES.txt (revision 1718426) +++ lucene/CHANGES.txt (working copy) @@ -108,6 +108,12 @@ ======================= Lucene 5.5.0 ======================= +New Features + +* LUCENE-5868: JoinUtil.createJoinQuery(..,NumericType,..) query-time join + for LONG and INT fields with NUMERIC and SORTED_NUMERIC doc values. + (Alexey Zelin via Mikhail Khludnev) + API Changes * #7958: Grouping sortWithinGroup variables used to allow null to mean Index: lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (revision 0) +++ lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (working copy) @@ -0,0 +1,136 @@ +package org.apache.lucene.search.join; + +import java.io.IOException; +import java.util.function.LongConsumer; + +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.index.LeafReader; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.NumericDocValues; +import org.apache.lucene.index.SortedNumericDocValues; +import org.apache.lucene.index.SortedSetDocValues; +import org.apache.lucene.search.SimpleCollector; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; +import org.apache.lucene.util.NumericUtils; + +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +abstract class DocValuesTermsCollector extends SimpleCollector { + + @FunctionalInterface + static interface Function { + R apply(LeafReader t) throws IOException ; + } + + protected DV docValues; + private final Function docValuesCall; + + public DocValuesTermsCollector(Function docValuesCall) { + this.docValuesCall = docValuesCall; + } + + @Override + protected final void doSetNextReader(LeafReaderContext context) throws IOException { + docValues = docValuesCall.apply(context.reader()); + } + + static Function binaryDocValues(String field) { + return (ctx) -> DocValues.getBinary(ctx, field); + } + static Function sortedSetDocValues(String field) { + return (ctx) -> DocValues.getSortedSet(ctx, field); + } + + static Function numericAsBinaryDocValues(String field, NumericType numTyp) { + return (ctx) -> { + final NumericDocValues numeric = DocValues.getNumeric(ctx, field); + final BytesRefBuilder bytes = new BytesRefBuilder(); + + final LongConsumer coder = coder(bytes, numTyp, field); + + return new BinaryDocValues() { + @Override + public BytesRef get(int docID) { + final long lVal = numeric.get(docID); + coder.accept(lVal); + return bytes.get(); + } + }; + }; + } + + static LongConsumer coder(BytesRefBuilder bytes, NumericType type, String fieldName){ + switch(type){ + case INT: + return (l) -> NumericUtils.intToPrefixCoded((int)l, 0, bytes); + case LONG: + return (l) -> 
NumericUtils.longToPrefixCoded(l, 0, bytes); + default: + throw new IllegalArgumentException("Unsupported "+type+ + ". Only "+NumericType.INT+" and "+NumericType.LONG+" are supported." + + "Field "+fieldName ); + } + } + + /** this adapter is quite weird. ords are per doc index, don't use ords across different docs*/ + static Function sortedNumericAsSortedSetDocValues(String field, NumericType numTyp) { + return (ctx) -> { + final SortedNumericDocValues numerics = DocValues.getSortedNumeric(ctx, field); + final BytesRefBuilder bytes = new BytesRefBuilder(); + + final LongConsumer coder = coder(bytes, numTyp, field); + + return new SortedSetDocValues() { + + private int index = Integer.MIN_VALUE; + + @Override + public long nextOrd() { + return index < numerics.count()-1 ? ++index : NO_MORE_ORDS; + } + + @Override + public void setDocument(int docID) { + numerics.setDocument(docID); + index=-1; + } + + @Override + public BytesRef lookupOrd(long ord) { + assert ord>=0 && ord mvFunction, + ScoreMode mode) { + + switch (mode) { + case None: + return wrap(new TermsCollector.MV(mvFunction)); + case Avg: + return new MV.Avg(mvFunction); + default: + return new MV(mvFunction, mode); + } + } + + static Function verbose(PrintStream out, Function mvFunction){ + return (ctx) -> { + final SortedSetDocValues target = mvFunction.apply(ctx); + return new SortedSetDocValues() { + + @Override + public void setDocument(int docID) { + target.setDocument(docID); + out.println("\ndoc# "+docID); + } + + @Override + public long nextOrd() { + return target.nextOrd(); + } + + @Override + public BytesRef lookupOrd(long ord) { + final BytesRef val = target.lookupOrd(ord); + out.println(val.toString()+", "); + return val; + } + + @Override + public long getValueCount() { + return target.getValueCount(); + } + }; + + }; + } + + static GenericTermsCollector createCollectorSV(Function svFunction, + ScoreMode mode) { + + switch (mode) { + case None: + return wrap(new 
TermsCollector.SV(svFunction)); + case Avg: + return new SV.Avg(svFunction); + default: + return new SV(svFunction, mode); + } + } + + static GenericTermsCollector wrap(final TermsCollector collector) { + return new GenericTermsCollector() { + + + @Override + public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException { + return collector.getLeafCollector(context); + } + + @Override + public boolean needsScores() { + return collector.needsScores(); + } + + @Override + public BytesRefHash getCollectedTerms() { + return collector.getCollectorTerms(); + } + + @Override + public float[] getScoresPerTerm() { + throw new UnsupportedOperationException("scores are not available for "+collector); + } + }; + } +} Property changes on: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Index: lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (working copy) @@ -1,5 +1,14 @@ package org.apache.lucene.search.join; +import java.io.IOException; +import java.util.Locale; + +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.document.IntField; +import org.apache.lucene.document.LongField; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValuesType; + /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. 
See the NOTICE file distributed with @@ -21,13 +30,12 @@ import org.apache.lucene.index.LeafReader; import org.apache.lucene.index.MultiDocValues; import org.apache.lucene.index.SortedDocValues; +import org.apache.lucene.index.SortedSetDocValues; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.MatchNoDocsQuery; import org.apache.lucene.search.Query; +import org.apache.lucene.search.join.DocValuesTermsCollector.Function; -import java.io.IOException; -import java.util.Locale; - /** * Utility for query time joining. * @@ -67,28 +75,87 @@ * @throws IOException If I/O related errors occur */ public static Query createJoinQuery(String fromField, - boolean multipleValuesPerDocument, - String toField, - Query fromQuery, - IndexSearcher fromSearcher, - ScoreMode scoreMode) throws IOException { + boolean multipleValuesPerDocument, + String toField, + Query fromQuery, + IndexSearcher fromSearcher, + ScoreMode scoreMode) throws IOException { + + final GenericTermsCollector termsWithScoreCollector; + + if (multipleValuesPerDocument) { + Function mvFunction = DocValuesTermsCollector.sortedSetDocValues(fromField); + termsWithScoreCollector = GenericTermsCollector.createCollectorMV(mvFunction, scoreMode); + } else { + Function svFunction = DocValuesTermsCollector.binaryDocValues(fromField); + termsWithScoreCollector = GenericTermsCollector.createCollectorSV(svFunction, scoreMode); + } + + return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode, + termsWithScoreCollector); + + } + + /** + * Method for query time joining for numeric fields. It supports multi- and single- values longs and ints. + * All considerations from {@link JoinUtil#createJoinQuery(String, boolean, String, Query, IndexSearcher, ScoreMode)} are applicable here too, + * though memory consumption might be higher. + *

+ * + * @param fromField The from field to join from + * @param multipleValuesPerDocument Whether the from field has multiple terms per document + * when true fromField might be {@link DocValuesType#SORTED_NUMERIC}, + * otherwise fromField should be {@link DocValuesType#NUMERIC} + * @param toField The to field to join to, should be {@link IntField} or {@link LongField} + * @param numericType either {@link NumericType#INT} or {@link NumericType#LONG}, it should correspond to fromField and toField types + * @param fromQuery The query to match documents on the from side + * @param fromSearcher The searcher that executed the specified fromQuery + * @param scoreMode Instructs how scores from the fromQuery are mapped to the returned query + * @return a {@link Query} instance that can be used to join documents based on the + * terms in the from and to field + * @throws IOException If I/O related errors occur + */ + + public static Query createJoinQuery(String fromField, + boolean multipleValuesPerDocument, + String toField, NumericType numericType, + Query fromQuery, + IndexSearcher fromSearcher, + ScoreMode scoreMode) throws IOException { + + final GenericTermsCollector termsCollector; + + if (multipleValuesPerDocument) { + Function mvFunction = DocValuesTermsCollector.sortedNumericAsSortedSetDocValues(fromField,numericType); + termsCollector = GenericTermsCollector.createCollectorMV(mvFunction, scoreMode); + } else { + Function svFunction = DocValuesTermsCollector.numericAsBinaryDocValues(fromField,numericType); + termsCollector = GenericTermsCollector.createCollectorSV(svFunction, scoreMode); + } + + return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode, + termsCollector); + + } + + private static Query createJoinQuery(boolean multipleValuesPerDocument, String toField, Query fromQuery, + IndexSearcher fromSearcher, ScoreMode scoreMode, final GenericTermsCollector collector) + throws IOException { + + 
fromSearcher.search(fromQuery, collector); + switch (scoreMode) { case None: - TermsCollector termsCollector = TermsCollector.create(fromField, multipleValuesPerDocument); - fromSearcher.search(fromQuery, termsCollector); - return new TermsQuery(toField, fromQuery, termsCollector.getCollectorTerms()); + return new TermsQuery(toField, fromQuery, collector.getCollectedTerms()); case Total: case Max: case Min: case Avg: - TermsWithScoreCollector termsWithScoreCollector = - TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode); - fromSearcher.search(fromQuery, termsWithScoreCollector); return new TermsIncludingScoreQuery( toField, multipleValuesPerDocument, - termsWithScoreCollector.getCollectedTerms(), - termsWithScoreCollector.getScoresPerTerm(), + collector.getCollectedTerms(), + collector.getScoresPerTerm(), fromQuery ); default: @@ -96,6 +163,7 @@ } } + /** * Delegates to {@link #createJoinQuery(String, Query, Query, IndexSearcher, ScoreMode, MultiDocValues.OrdinalMap, int, int)}, * but disables the min and max filtering. 
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (working copy) @@ -19,11 +19,8 @@ import java.io.IOException; -import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.DocValues; import org.apache.lucene.index.SortedSetDocValues; -import org.apache.lucene.search.SimpleCollector; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefHash; @@ -32,19 +29,19 @@ * * @lucene.experimental */ -abstract class TermsCollector extends SimpleCollector { +abstract class TermsCollector extends DocValuesTermsCollector { - final String field; + TermsCollector(Function docValuesCall) { + super(docValuesCall); + } + final BytesRefHash collectorTerms = new BytesRefHash(); - TermsCollector(String field) { - this.field = field; - } - public BytesRefHash getCollectorTerms() { return collectorTerms; } + /** * Chooses the right {@link TermsCollector} implementation. * @@ -52,55 +49,42 @@ * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document. * @return a {@link TermsCollector} instance */ - static TermsCollector create(String field, boolean multipleValuesPerDocument) { - return multipleValuesPerDocument ? new MV(field) : new SV(field); + static TermsCollector create(String field, boolean multipleValuesPerDocument) { + return multipleValuesPerDocument + ? 
new MV(sortedSetDocValues(field)) + : new SV(binaryDocValues(field)); } - + // impl that works with multiple values per document - static class MV extends TermsCollector { - final BytesRef scratch = new BytesRef(); - private SortedSetDocValues docTermOrds; - - MV(String field) { - super(field); + static class MV extends TermsCollector { + + MV(Function docValuesCall) { + super(docValuesCall); } @Override public void collect(int doc) throws IOException { - docTermOrds.setDocument(doc); long ord; - while ((ord = docTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - final BytesRef term = docTermOrds.lookupOrd(ord); + docValues.setDocument(doc); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + final BytesRef term = docValues.lookupOrd(ord); collectorTerms.add(term); } } - - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - docTermOrds = DocValues.getSortedSet(context.reader(), field); - } } // impl that works with single value per document - static class SV extends TermsCollector { + static class SV extends TermsCollector { - final BytesRef spare = new BytesRef(); - private BinaryDocValues fromDocTerms; - - SV(String field) { - super(field); + SV(Function docValuesCall) { + super(docValuesCall); } @Override public void collect(int doc) throws IOException { - final BytesRef term = fromDocTerms.get(doc); + final BytesRef term = docValues.get(doc); collectorTerms.add(term); } - - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTerms = DocValues.getBinary(context.reader(), field); - } } @Override Index: lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (working 
copy) @@ -18,6 +18,7 @@ */ import java.io.IOException; +import java.io.PrintStream; import java.util.Locale; import java.util.Set; @@ -37,6 +38,7 @@ import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefHash; import org.apache.lucene.util.FixedBitSet; +import org.apache.lucene.util.NumericUtils; class TermsIncludingScoreQuery extends Query { @@ -268,5 +270,23 @@ } } } - + + void dump(PrintStream out){ + out.println(field+":"); + final BytesRef ref = new BytesRef(); + for (int i = 0; i < terms.size(); i++) { + terms.get(ords[i], ref); + out.print(ref+" "+ref.utf8ToString()+" "); + try { + out.print(Long.toHexString(NumericUtils.prefixCodedToLong(ref))+"L"); + } catch (Exception e) { + try { + out.print(Integer.toHexString(NumericUtils.prefixCodedToInt(ref))+"i"); + } catch (Exception ee) { + } + } + out.println(" score="+scores[ords[i]]); + out.println(""); + } + } } Index: lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (working copy) @@ -1,5 +1,8 @@ package org.apache.lucene.search.join; +import java.io.IOException; +import java.util.Arrays; + /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. 
See the NOTICE file distributed with @@ -18,22 +21,16 @@ */ import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.DocValues; -import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.index.SortedSetDocValues; import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.SimpleCollector; import org.apache.lucene.util.ArrayUtil; import org.apache.lucene.util.BytesRefHash; -import java.io.IOException; -import java.util.Arrays; +abstract class TermsWithScoreCollector extends DocValuesTermsCollector + implements GenericTermsCollector { -abstract class TermsWithScoreCollector extends SimpleCollector { - private final static int INITIAL_ARRAY_SIZE = 0; - final String field; final BytesRefHash collectedTerms = new BytesRefHash(); final ScoreMode scoreMode; @@ -40,8 +37,8 @@ Scorer scorer; float[] scoreSums = new float[INITIAL_ARRAY_SIZE]; - TermsWithScoreCollector(String field, ScoreMode scoreMode) { - this.field = field; + TermsWithScoreCollector(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall); this.scoreMode = scoreMode; if (scoreMode == ScoreMode.Min) { Arrays.fill(scoreSums, Float.POSITIVE_INFINITY); @@ -50,10 +47,12 @@ } } + @Override public BytesRefHash getCollectedTerms() { return collectedTerms; } - + + @Override public float[] getScoresPerTerm() { return scoreSums; } @@ -70,36 +69,34 @@ * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document. 
* @return a {@link TermsWithScoreCollector} instance */ - static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { + static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { if (multipleValuesPerDocument) { switch (scoreMode) { case Avg: - return new MV.Avg(field); + return new MV.Avg(sortedSetDocValues(field)); default: - return new MV(field, scoreMode); + return new MV(sortedSetDocValues(field), scoreMode); } } else { switch (scoreMode) { case Avg: - return new SV.Avg(field); + return new SV.Avg(binaryDocValues(field)); default: - return new SV(field, scoreMode); + return new SV(binaryDocValues(field), scoreMode); } } } - + // impl that works with single value per document - static class SV extends TermsWithScoreCollector { + static class SV extends TermsWithScoreCollector { - BinaryDocValues fromDocTerms; - - SV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + SV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ -133,26 +130,23 @@ scoreSums[ord] = current; } break; + default: + throw new AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTerms = DocValues.getBinary(context.reader(), field); - } - static class Avg extends SV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ 
-187,20 +181,18 @@ } // impl that works with multiple values per document - static class MV extends TermsWithScoreCollector { + static class MV extends TermsWithScoreCollector { - SortedSetDocValues fromDocTermOrds; - - MV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + MV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { @@ -225,29 +217,26 @@ case Max: scoreSums[termID] = Math.max(scoreSums[termID], scorer.score()); break; + default: + throw new AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTermOrds = DocValues.getSortedSet(context.reader(), field); - } - static class Avg extends MV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { Index: lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java 
=================================================================== --- lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (revision 1718426) +++ lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (working copy) @@ -1,31 +1,30 @@ package org.apache.lucene.search.join; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.Comparator; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.Random; +import java.util.Set; +import java.util.SortedSet; +import java.util.TreeSet; -import com.carrotsearch.randomizedtesting.generators.RandomInts; -import com.carrotsearch.randomizedtesting.generators.RandomPicks; - import org.apache.lucene.analysis.MockAnalyzer; import org.apache.lucene.analysis.MockTokenizer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.document.IntField; +import org.apache.lucene.document.LongField; import org.apache.lucene.document.NumericDocValuesField; import org.apache.lucene.document.SortedDocValuesField; +import org.apache.lucene.document.SortedNumericDocValuesField; import org.apache.lucene.document.SortedSetDocValuesField; import org.apache.lucene.document.StringField; import org.apache.lucene.document.TextField; @@ -78,19 +77,26 @@ import org.apache.lucene.util.packed.PackedInts; import org.junit.Test; -import java.io.IOException; -import java.util.ArrayList; -import java.util.Collections; -import java.util.Comparator; -import java.util.HashMap; -import java.util.HashSet; -import java.util.List; -import java.util.Locale; -import java.util.Map; -import java.util.Set; -import java.util.SortedSet; -import java.util.TreeSet; +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +import com.carrotsearch.randomizedtesting.generators.RandomInts; +import com.carrotsearch.randomizedtesting.generators.RandomPicks; + public class TestJoinUtil extends LuceneTestCase { public void testSimple() throws Exception { @@ -850,10 +856,18 @@ } final Query joinQuery; - if (from) { - joinQuery = JoinUtil.createJoinQuery("from", multipleValuesPerDocument, "to", actualQuery, indexSearcher, scoreMode); - } else { - joinQuery = JoinUtil.createJoinQuery("to", multipleValuesPerDocument, "from", actualQuery, indexSearcher, scoreMode); + { + // single val can be handled by multiple-vals + final boolean muliValsQuery = multipleValuesPerDocument || random().nextBoolean(); + final String fromField = from ? "from":"to"; + final String toField = from ? "to":"from"; + + if (random().nextBoolean()) { // numbers + final NumericType numType = random().nextBoolean() ? 
NumericType.INT: NumericType.LONG ; + joinQuery = JoinUtil.createJoinQuery(fromField+numType, muliValsQuery, toField+numType, numType, actualQuery, indexSearcher, scoreMode); + } else { + joinQuery = JoinUtil.createJoinQuery(fromField, muliValsQuery, toField, actualQuery, indexSearcher, scoreMode); + } } if (VERBOSE) { System.out.println("joinQuery=" + joinQuery); @@ -897,7 +911,6 @@ return; } - assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f); if (VERBOSE) { for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) { System.out.printf(Locale.ENGLISH, "Expected doc: %d | Actual doc: %d\n", expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc); @@ -904,6 +917,7 @@ System.out.printf(Locale.ENGLISH, "Expected score: %f | Actual score: %f\n", expectedTopDocs.scoreDocs[i].score, actualTopDocs.scoreDocs[i].score); } } + assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f); for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) { assertEquals(expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc); @@ -919,14 +933,15 @@ } Directory dir = newDirectory(); + final Random random = random(); RandomIndexWriter w = new RandomIndexWriter( - random(), + random, dir, - newIndexWriterConfig(new MockAnalyzer(random(), MockTokenizer.KEYWORD, false)) + newIndexWriterConfig(new MockAnalyzer(random, MockTokenizer.KEYWORD, false)) ); IndexIterationContext context = new IndexIterationContext(); - int numRandomValues = nDocs / RandomInts.randomIntBetween(random(), 2, 10); + int numRandomValues = nDocs / RandomInts.randomIntBetween(random, 1, 4); context.randomUniqueValues = new String[numRandomValues]; Set trackSet = new HashSet<>(); context.randomFrom = new boolean[numRandomValues]; @@ -933,32 +948,46 @@ for (int i = 0; i < numRandomValues; i++) { String uniqueRandomValue; do { -// uniqueRandomValue = TestUtil.randomRealisticUnicodeString(random()); - uniqueRandomValue = 
TestUtil.randomSimpleString(random()); + // the trick is to generate values which will be ordered similarly for string, ints&longs, positive nums makes it easier + final int nextInt = random.nextInt(Integer.MAX_VALUE); + uniqueRandomValue = String.format(Locale.ROOT, "%08x", nextInt); + assert nextInt == Integer.parseUnsignedInt(uniqueRandomValue,16); } while ("".equals(uniqueRandomValue) || trackSet.contains(uniqueRandomValue)); + // Generate unique values and empty strings aren't allowed. trackSet.add(uniqueRandomValue); - context.randomFrom[i] = random().nextBoolean(); + + context.randomFrom[i] = random.nextBoolean(); context.randomUniqueValues[i] = uniqueRandomValue; + } + List randomUniqueValuesReplica = new ArrayList<>(Arrays.asList(context.randomUniqueValues)); + RandomDoc[] docs = new RandomDoc[nDocs]; for (int i = 0; i < nDocs; i++) { String id = Integer.toString(i); - int randomI = random().nextInt(context.randomUniqueValues.length); + int randomI = random.nextInt(context.randomUniqueValues.length); String value = context.randomUniqueValues[randomI]; Document document = new Document(); - document.add(newTextField(random(), "id", id, Field.Store.YES)); - document.add(newTextField(random(), "value", value, Field.Store.NO)); + document.add(newTextField(random, "id", id, Field.Store.YES)); + document.add(newTextField(random, "value", value, Field.Store.NO)); boolean from = context.randomFrom[randomI]; - int numberOfLinkValues = multipleValuesPerDocument ? 2 + random().nextInt(10) : 1; + int numberOfLinkValues = multipleValuesPerDocument ? Math.min(2 + random.nextInt(10), context.randomUniqueValues.length) : 1; docs[i] = new RandomDoc(id, numberOfLinkValues, value, from); if (globalOrdinalJoin) { document.add(newStringField("type", from ? 
"from" : "to", Field.Store.NO)); } - for (int j = 0; j < numberOfLinkValues; j++) { - String linkValue = context.randomUniqueValues[random().nextInt(context.randomUniqueValues.length)]; + final List subValues; + { + int start = randomUniqueValuesReplica.size()==numberOfLinkValues? 0 : random.nextInt(randomUniqueValuesReplica.size()-numberOfLinkValues); + subValues = randomUniqueValuesReplica.subList(start, start+numberOfLinkValues); + Collections.shuffle(subValues, random); + } + for (String linkValue : subValues) { + + assert !docs[i].linkValues.contains(linkValue); docs[i].linkValues.add(linkValue); if (from) { if (!context.fromDocuments.containsKey(linkValue)) { @@ -970,15 +999,8 @@ context.fromDocuments.get(linkValue).add(docs[i]); context.randomValueFromDocs.get(value).add(docs[i]); - document.add(newTextField(random(), "from", linkValue, Field.Store.NO)); - if (multipleValuesPerDocument) { - document.add(new SortedSetDocValuesField("from", new BytesRef(linkValue))); - } else { - document.add(new SortedDocValuesField("from", new BytesRef(linkValue))); - } - if (globalOrdinalJoin) { - document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); - } + addLinkFields(random, document, "from", linkValue, multipleValuesPerDocument, globalOrdinalJoin); + } else { if (!context.toDocuments.containsKey(linkValue)) { context.toDocuments.put(linkValue, new ArrayList<>()); @@ -989,20 +1011,12 @@ context.toDocuments.get(linkValue).add(docs[i]); context.randomValueToDocs.get(value).add(docs[i]); - document.add(newTextField(random(), "to", linkValue, Field.Store.NO)); - if (multipleValuesPerDocument) { - document.add(new SortedSetDocValuesField("to", new BytesRef(linkValue))); - } else { - document.add(new SortedDocValuesField("to", new BytesRef(linkValue))); - } - if (globalOrdinalJoin) { - document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); - } + addLinkFields(random, document, "to", linkValue, multipleValuesPerDocument, 
globalOrdinalJoin); } } w.addDocument(document); - if (random().nextInt(10) == 4) { + if (random.nextInt(10) == 4) { w.commit(); } if (VERBOSE) { @@ -1010,7 +1024,7 @@ } } - if (random().nextBoolean()) { + if (random.nextBoolean()) { w.forceMerge(1); } w.close(); @@ -1185,6 +1199,30 @@ return context; } + private void addLinkFields(final Random random, Document document, final String fieldName, String linkValue, + boolean multipleValuesPerDocument, boolean globalOrdinalJoin) { + document.add(newTextField(random, fieldName, linkValue, Field.Store.NO)); + + final int linkInt = Integer.parseUnsignedInt(linkValue,16); + document.add(new IntField(fieldName+NumericType.INT, linkInt, Field.Store.NO)); + + final long linkLong = linkInt<<32 | linkInt; + document.add(new LongField(fieldName+NumericType.LONG, linkLong, Field.Store.NO)); + + if (multipleValuesPerDocument) { + document.add(new SortedSetDocValuesField(fieldName, new BytesRef(linkValue))); + document.add(new SortedNumericDocValuesField(fieldName+NumericType.INT, linkInt)); + document.add(new SortedNumericDocValuesField(fieldName+NumericType.LONG, linkLong)); + } else { + document.add(new SortedDocValuesField(fieldName, new BytesRef(linkValue))); + document.add(new NumericDocValuesField(fieldName+NumericType.INT, linkInt)); + document.add(new NumericDocValuesField(fieldName+NumericType.LONG, linkLong)); + } + if (globalOrdinalJoin) { + document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); + } + } + private TopDocs createExpectedTopDocs(String queryValue, final boolean from, final ScoreMode scoreMode, ```
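A side note on the `addLinkFields` test helper in the patch above: the expression `linkInt<<32 | linkInt` is evaluated in `int` arithmetic, and Java masks the shift distance of an `int` operand to its low five bits, so the shift by 32 is a no-op and `linkLong` ends up equal to `linkInt`. This should not affect the join itself, since the indexed field and the doc values are built from the same `linkLong`, but the LONG variant never exercises the upper 32 bits. The sketch below is plain Java, not part of the patch; the class and method names are hypothetical and only contrast the two expressions.

```java
public class ShiftWidthSketch {

    // Mirrors the expression used in the test helper: the shift happens on an int.
    static long intShift(int linkInt) {
        return linkInt << 32 | linkInt;
    }

    // Widening to long before shifting also fills the upper 32 bits.
    // Assumes linkInt is non-negative, as in the test.
    static long longShift(int linkInt) {
        return ((long) linkInt << 32) | linkInt;
    }

    public static void main(String[] args) {
        int linkInt = 0x2a;
        System.out.println(Long.toHexString(intShift(linkInt)));
        System.out.println(Long.toHexString(longShift(linkInt)));
    }
}
```

If the intent was to test long values wider than an int, the widened form is presumably what was meant.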

asfimport commented 8 years ago

David Smiley (@dsmiley) (migrated from JIRA)

(I edited your comment to correct your reference to me, not a similarly named person)

I think the surgery makes sense. It adds a useful feature. The approach leverages the existing underlying BytesRefHash which may not be as optimal as some sort of LongHash or similar but whatever – progress not perfection. Should someone care, such improvements could be made later.

I admit I didn't look at the details of your tests; that would take much more time. I was mostly curious about the implementation side and about the lambdas you referred to.

asfimport commented 8 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

+1 this looks good. One small thing: maybe rename the parameter in the protected createJoinQuery(...) method from termsWithScoreCollector to just collector?

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

@martijnvg, I renamed the parameter, though only locally so far.

It turns out that the random test fails because SortedSetDV omits duplicate values but SortedNumericDV doesn't, which leads to a discrepancy in the scores. I changed TestJoinUtil.createContext(int, boolean, boolean) to deduplicate link values. Score asserts still fail.
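To illustrate why duplicates matter for scoring, here is a minimal sketch in plain Java, not Lucene code. It assumes a Total-style aggregation in which every value a document carries adds that document's score to the value's running sum; `sumScores` and the sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class DuplicateScoreSketch {

    // Adds docScore to the running total of every value the document carries.
    static Map<String, Float> sumScores(Collection<String> docValues, float docScore) {
        Map<String, Float> totals = new HashMap<>();
        for (String value : docValues) {
            totals.merge(value, docScore, Float::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // A multi-valued document that carries the same link value twice.
        List<String> withDuplicates = Arrays.asList("0000002a", "0000002a", "00000007");
        // A set-like view drops the duplicate, a list-like view keeps it.
        Set<String> deduplicated = new TreeSet<>(withDuplicates);

        System.out.println(sumScores(deduplicated, 1.5f));
        System.out.println(sumScores(withDuplicates, 1.5f));
    }
}
```

In the deduplicated view the repeated value receives the document's score once, while in the list view it receives it twice, which is the kind of score discrepancy described above.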

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

LUCENE-5868.patch It's done, after all. I had to tweak TestJoinUtil for the random tests. It now generates values that are ordered the same way as strings and as numbers. It also dedupes values for the from and to fields, because numeric docvalues store duplicates and that affects scoring in the tests. For now, there is no "simple" test coverage for the numeric join. I don't think it's necessary; perhaps I'll cover it in a simple Solr case. I want to commit it next week into trunk and 5.x so that it ships in 5.5. Please let me know if you wish to veto it. Reviews and ideas are welcome, as usual!
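The ordering trick can be checked in isolation. The sketch below is plain Java with hypothetical names (`toKey`, `sameOrder`); it assumes, as the test does, that values are non-negative ints formatted with `%08x`. Fixed-width, zero-padded, lowercase hex sorts lexicographically in the same order as the numbers themselves.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Random;

public class HexOrderSketch {

    // Formats a non-negative int as a fixed-width, zero-padded hex string.
    static String toKey(int value) {
        return String.format(Locale.ROOT, "%08x", value);
    }

    // Returns true if sorting the hex keys as strings yields the same order
    // as sorting the original values as numbers.
    static boolean sameOrder(int[] values) {
        int[] numeric = values.clone();
        Arrays.sort(numeric);

        String[] keys = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            keys[i] = toKey(values[i]);
        }
        Arrays.sort(keys);

        for (int i = 0; i < values.length; i++) {
            if (Integer.parseUnsignedInt(keys[i], 16) != numeric[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        int[] values = new int[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = random.nextInt(Integer.MAX_VALUE);
        }
        System.out.println(sameOrder(values));
    }
}
```

Negative ints would break this property, because their hex form starts with `8` to `f` and would sort after all positive values; that is presumably why the test sticks to positive numbers.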

LUCENE-5868.patch

```diff Index: lucene/CHANGES.txt =================================================================== --- lucene/CHANGES.txt (revision 1718426) +++ lucene/CHANGES.txt (working copy) @@ -108,6 +108,12 @@ ======================= Lucene 5.5.0 ======================= +New Features + +* LUCENE-5868: JoinUtil.createJoinQuery(..,NumericType,..) query-time join + for LONG and INT fields with NUMERIC and SORTED_NUMERIC doc values. + (Alexey Zelin via Mikhail Khludnev) + API Changes * #7958: Grouping sortWithinGroup variables used to allow null to mean Index: lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (revision 0) +++ lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (working copy) @@ -0,0 +1,136 @@ +package org.apache.lucene.search.join; + +import java.io.IOException; +import java.util.function.LongConsumer; + +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.index.LeafReader; +import org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.NumericDocValues; +import org.apache.lucene.index.SortedNumericDocValues; +import org.apache.lucene.index.SortedSetDocValues; +import org.apache.lucene.search.SimpleCollector; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.BytesRefBuilder; +import org.apache.lucene.util.NumericUtils; + +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+abstract class DocValuesTermsCollector<DV> extends SimpleCollector {
+
+  @FunctionalInterface
+  static interface Function<R> {
+    R apply(LeafReader t) throws IOException ;
+  }
+
+  protected DV docValues;
+  private final Function<DV> docValuesCall;
+
+  public DocValuesTermsCollector(Function<DV> docValuesCall) {
+    this.docValuesCall = docValuesCall;
+  }
+
+  @Override
+  protected final void doSetNextReader(LeafReaderContext context) throws IOException {
+    docValues = docValuesCall.apply(context.reader());
+  }
+
+  static Function<BinaryDocValues> binaryDocValues(String field) {
+    return (ctx) -> DocValues.getBinary(ctx, field);
+  }
+  static Function<SortedSetDocValues> sortedSetDocValues(String field) {
+    return (ctx) -> DocValues.getSortedSet(ctx, field);
+  }
+
+  static Function<BinaryDocValues> numericAsBinaryDocValues(String field, NumericType numTyp) {
+    return (ctx) -> {
+      final NumericDocValues numeric = DocValues.getNumeric(ctx, field);
+      final BytesRefBuilder bytes = new BytesRefBuilder();
+
+      final LongConsumer coder = coder(bytes, numTyp, field);
+
+      return new BinaryDocValues() {
+        @Override
+        public BytesRef get(int docID) {
+          final long lVal = numeric.get(docID);
+          coder.accept(lVal);
+          return bytes.get();
+        }
+      };
+    };
+  }
+
+  static LongConsumer coder(BytesRefBuilder bytes, NumericType type, String fieldName){
+    switch(type){
+      case INT:
+        return (l) -> NumericUtils.intToPrefixCoded((int)l, 0, bytes);
+      case LONG:
+        return (l) -> NumericUtils.longToPrefixCoded(l, 0, bytes);
+      default:
+        throw new IllegalArgumentException("Unsupported "+type+
+            ". Only "+NumericType.INT+" and "+NumericType.LONG+" are supported."
+            + "Field "+fieldName );
+    }
+  }
+
+  /** this adapter is quite weird. ords are per doc index, don't use ords across different docs*/
+  static Function<SortedSetDocValues> sortedNumericAsSortedSetDocValues(String field, NumericType numTyp) {
+    return (ctx) -> {
+      final SortedNumericDocValues numerics = DocValues.getSortedNumeric(ctx, field);
+      final BytesRefBuilder bytes = new BytesRefBuilder();
+
+      final LongConsumer coder = coder(bytes, numTyp, field);
+
+      return new SortedSetDocValues() {
+
+        private int index = Integer.MIN_VALUE;
+
+        @Override
+        public long nextOrd() {
+          return index < numerics.count()-1 ? ++index : NO_MORE_ORDS;
+        }
+
+        @Override
+        public void setDocument(int docID) {
+          numerics.setDocument(docID);
+          index=-1;
+        }
+
+        @Override
+        public BytesRef lookupOrd(long ord) {
+          assert ord>=0 && ord mvFunction,
+      ScoreMode mode) {
+
+    switch (mode) {
+      case None:
+        return wrap(new TermsCollector.MV(mvFunction));
+      case Avg:
+        return new MV.Avg(mvFunction);
+      default:
+        return new MV(mvFunction, mode);
+    }
+  }
+
+  static Function<SortedSetDocValues> verbose(PrintStream out, Function<SortedSetDocValues> mvFunction){
+    return (ctx) -> {
+      final SortedSetDocValues target = mvFunction.apply(ctx);
+      return new SortedSetDocValues() {
+
+        @Override
+        public void setDocument(int docID) {
+          target.setDocument(docID);
+          out.println("\ndoc# "+docID);
+        }
+
+        @Override
+        public long nextOrd() {
+          return target.nextOrd();
+        }
+
+        @Override
+        public BytesRef lookupOrd(long ord) {
+          final BytesRef val = target.lookupOrd(ord);
+          out.println(val.toString()+", ");
+          return val;
+        }
+
+        @Override
+        public long getValueCount() {
+          return target.getValueCount();
+        }
+      };
+
+    };
+  }
+
+  static GenericTermsCollector createCollectorSV(Function<BinaryDocValues> svFunction,
+      ScoreMode mode) {
+
+    switch (mode) {
+      case None:
+        return wrap(new TermsCollector.SV(svFunction));
+      case Avg:
+        return new SV.Avg(svFunction);
+      default:
+        return new SV(svFunction, mode);
+    }
+  }
+
+  static GenericTermsCollector wrap(final TermsCollector<?> collector) {
+    return new GenericTermsCollector() {
+
+
+      @Override
+      public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
+        return collector.getLeafCollector(context);
+      }
+
+      @Override
+      public boolean needsScores() {
+        return collector.needsScores();
+      }
+
+      @Override
+      public BytesRefHash getCollectedTerms() {
+        return collector.getCollectorTerms();
+      }
+
+      @Override
+      public float[] getScoresPerTerm() {
+        throw new UnsupportedOperationException("scores are not available for "+collector);
+      }
+    };
+  }
+}

Property changes on: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Index: lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (revision 1718426)
+++ lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (working copy)
@@ -1,5 +1,14 @@
 package org.apache.lucene.search.join;

+import java.io.IOException;
+import java.util.Locale;
+
+import org.apache.lucene.document.FieldType.NumericType;
+import org.apache.lucene.document.IntField;
+import org.apache.lucene.document.LongField;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValuesType;
+
 /*
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -21,13 +30,12 @@
 import org.apache.lucene.index.LeafReader;
 import org.apache.lucene.index.MultiDocValues;
 import org.apache.lucene.index.SortedDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.MatchNoDocsQuery;
 import org.apache.lucene.search.Query;
+import org.apache.lucene.search.join.DocValuesTermsCollector.Function;

-import java.io.IOException;
-import java.util.Locale;
-
 /**
  * Utility for query time joining.
  *
@@ -67,28 +75,87 @@
    * @throws IOException If I/O related errors occur
    */
   public static Query createJoinQuery(String fromField,
-                                      boolean multipleValuesPerDocument,
-                                      String toField,
-                                      Query fromQuery,
-                                      IndexSearcher fromSearcher,
-                                      ScoreMode scoreMode) throws IOException {
+      boolean multipleValuesPerDocument,
+      String toField,
+      Query fromQuery,
+      IndexSearcher fromSearcher,
+      ScoreMode scoreMode) throws IOException {
+
+    final GenericTermsCollector termsWithScoreCollector;
+
+    if (multipleValuesPerDocument) {
+      Function<SortedSetDocValues> mvFunction = DocValuesTermsCollector.sortedSetDocValues(fromField);
+      termsWithScoreCollector = GenericTermsCollector.createCollectorMV(mvFunction, scoreMode);
+    } else {
+      Function<BinaryDocValues> svFunction = DocValuesTermsCollector.binaryDocValues(fromField);
+      termsWithScoreCollector = GenericTermsCollector.createCollectorSV(svFunction, scoreMode);
+    }
+
+    return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode,
+        termsWithScoreCollector);
+
+  }
+
+  /**
+   * Method for query time joining for numeric fields. It supports multi- and single- values longs and ints.
+   * All considerations from {@link JoinUtil#createJoinQuery(String, boolean, String, Query, IndexSearcher, ScoreMode)} are applicable here too,
+   * though memory consumption might be higher.
+   *
+   *
+   * @param fromField The from field to join from
+   * @param multipleValuesPerDocument Whether the from field has multiple terms per document
+   *   when true fromField might be {@link DocValuesType#SORTED_NUMERIC},
+   *   otherwise fromField should be {@link DocValuesType#NUMERIC}
+   * @param toField The to field to join to, should be {@link IntField} or {@link LongField}
+   * @param numericType either {@link NumericType#INT} or {@link NumericType#LONG}, it should correspond to fromField and toField types
+   * @param fromQuery The query to match documents on the from side
+   * @param fromSearcher The searcher that executed the specified fromQuery
+   * @param scoreMode Instructs how scores from the fromQuery are mapped to the returned query
+   * @return a {@link Query} instance that can be used to join documents based on the
+   *   terms in the from and to field
+   * @throws IOException If I/O related errors occur
+   */
+
+  public static Query createJoinQuery(String fromField,
+      boolean multipleValuesPerDocument,
+      String toField, NumericType numericType,
+      Query fromQuery,
+      IndexSearcher fromSearcher,
+      ScoreMode scoreMode) throws IOException {
+
+    final GenericTermsCollector termsCollector;
+
+    if (multipleValuesPerDocument) {
+      Function<SortedSetDocValues> mvFunction = DocValuesTermsCollector.sortedNumericAsSortedSetDocValues(fromField,numericType);
+      termsCollector = GenericTermsCollector.createCollectorMV(mvFunction, scoreMode);
+    } else {
+      Function<BinaryDocValues> svFunction = DocValuesTermsCollector.numericAsBinaryDocValues(fromField,numericType);
+      termsCollector = GenericTermsCollector.createCollectorSV(svFunction, scoreMode);
+    }
+
+    return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode,
+        termsCollector);
+
+  }
+
+  private static Query createJoinQuery(boolean multipleValuesPerDocument, String toField, Query fromQuery,
+      IndexSearcher fromSearcher, ScoreMode scoreMode, final GenericTermsCollector collector)
+      throws IOException {
+
+    fromSearcher.search(fromQuery, collector);
+
     switch (scoreMode) {
       case None:
-        TermsCollector termsCollector = TermsCollector.create(fromField, multipleValuesPerDocument);
-        fromSearcher.search(fromQuery, termsCollector);
-        return new TermsQuery(toField, fromQuery, termsCollector.getCollectorTerms());
+        return new TermsQuery(toField, fromQuery, collector.getCollectedTerms());
       case Total:
       case Max:
       case Min:
       case Avg:
-        TermsWithScoreCollector termsWithScoreCollector =
-            TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode);
-        fromSearcher.search(fromQuery, termsWithScoreCollector);
         return new TermsIncludingScoreQuery(
             toField,
             multipleValuesPerDocument,
-            termsWithScoreCollector.getCollectedTerms(),
-            termsWithScoreCollector.getScoresPerTerm(),
+            collector.getCollectedTerms(),
+            collector.getScoresPerTerm(),
             fromQuery
         );
       default:
@@ -96,6 +163,7 @@
     }
   }

+
   /**
    * Delegates to {@link #createJoinQuery(String, Query, Query, IndexSearcher, ScoreMode, MultiDocValues.OrdinalMap, int, int)},
    * but disables the min and max filtering.
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (revision 1718426)
+++ lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (working copy)
@@ -19,11 +19,8 @@

 import java.io.IOException;

-import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.index.BinaryDocValues;
-import org.apache.lucene.index.DocValues;
 import org.apache.lucene.index.SortedSetDocValues;
-import org.apache.lucene.search.SimpleCollector;
 import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.BytesRefHash;

@@ -32,19 +29,19 @@
  *
  * @lucene.experimental
  */
-abstract class TermsCollector extends SimpleCollector {
+abstract class TermsCollector<DV> extends DocValuesTermsCollector<DV> {

-  final String field;
+  TermsCollector(Function<DV> docValuesCall) {
+    super(docValuesCall);
+  }
+
   final BytesRefHash collectorTerms = new BytesRefHash();

-  TermsCollector(String field) {
-    this.field = field;
-  }
-
   public BytesRefHash getCollectorTerms() {
     return collectorTerms;
   }

+
   /**
    * Chooses the right {@link TermsCollector} implementation.
    *
@@ -52,55 +49,42 @@
    * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document.
    * @return a {@link TermsCollector} instance
    */
-  static TermsCollector create(String field, boolean multipleValuesPerDocument) {
-    return multipleValuesPerDocument ? new MV(field) : new SV(field);
+  static TermsCollector<?> create(String field, boolean multipleValuesPerDocument) {
+    return multipleValuesPerDocument
+        ? new MV(sortedSetDocValues(field))
+        : new SV(binaryDocValues(field));
   }
-
+
   // impl that works with multiple values per document
-  static class MV extends TermsCollector {
-    final BytesRef scratch = new BytesRef();
-    private SortedSetDocValues docTermOrds;
-
-    MV(String field) {
-      super(field);
+  static class MV extends TermsCollector<SortedSetDocValues> {
+
+    MV(Function<SortedSetDocValues> docValuesCall) {
+      super(docValuesCall);
     }

     @Override
     public void collect(int doc) throws IOException {
-      docTermOrds.setDocument(doc);
       long ord;
-      while ((ord = docTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
-        final BytesRef term = docTermOrds.lookupOrd(ord);
+      docValues.setDocument(doc);
+      while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
+        final BytesRef term = docValues.lookupOrd(ord);
         collectorTerms.add(term);
       }
     }
-
-    @Override
-    protected void doSetNextReader(LeafReaderContext context) throws IOException {
-      docTermOrds = DocValues.getSortedSet(context.reader(), field);
-    }
   }

   // impl that works with single value per document
-  static class SV extends TermsCollector {
+  static class SV extends TermsCollector<BinaryDocValues> {

-    final BytesRef spare = new BytesRef();
-    private BinaryDocValues fromDocTerms;
-
-    SV(String field) {
-      super(field);
+    SV(Function<BinaryDocValues> docValuesCall) {
+      super(docValuesCall);
     }

     @Override
     public void collect(int doc) throws IOException {
-      final BytesRef term = fromDocTerms.get(doc);
+      final BytesRef term = docValues.get(doc);
       collectorTerms.add(term);
     }
-
-    @Override
-    protected void doSetNextReader(LeafReaderContext context) throws IOException {
-      fromDocTerms = DocValues.getBinary(context.reader(), field);
-    }
   }

   @Override
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (revision 1718426)
+++ lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (working copy)
@@ -18,6 +18,7 @@
  */

 import java.io.IOException;
+import java.io.PrintStream;
 import java.util.Locale;
 import java.util.Set;

@@ -37,6 +38,7 @@
 import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.BytesRefHash;
 import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.NumericUtils;

 class TermsIncludingScoreQuery extends Query {

@@ -268,5 +270,23 @@
       }
     }
   }
-
+
+  void dump(PrintStream out){
+    out.println(field+":");
+    final BytesRef ref = new BytesRef();
+    for (int i = 0; i < terms.size(); i++) {
+      terms.get(ords[i], ref);
+      out.print(ref+" "+ref.utf8ToString()+" ");
+      try {
+        out.print(Long.toHexString(NumericUtils.prefixCodedToLong(ref))+"L");
+      } catch (Exception e) {
+        try {
+          out.print(Integer.toHexString(NumericUtils.prefixCodedToInt(ref))+"i");
+        } catch (Exception ee) {
+        }
+      }
+      out.println(" score="+scores[ords[i]]);
+      out.println("");
+    }
+  }
 }
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (revision 1718426)
+++ lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (working copy)
@@ -1,5 +1,8 @@
 package org.apache.lucene.search.join;

+import java.io.IOException;
+import java.util.Arrays;
+
 /*
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements. See the NOTICE file distributed with
@@ -18,22 +21,16 @@
  */

 import org.apache.lucene.index.BinaryDocValues;
-import org.apache.lucene.index.DocValues;
-import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.index.SortedSetDocValues;
 import org.apache.lucene.search.Scorer;
-import org.apache.lucene.search.SimpleCollector;
 import org.apache.lucene.util.ArrayUtil;
 import org.apache.lucene.util.BytesRefHash;

-import java.io.IOException;
-import java.util.Arrays;
+abstract class TermsWithScoreCollector<DV> extends DocValuesTermsCollector<DV>
+    implements GenericTermsCollector {

-abstract class TermsWithScoreCollector extends SimpleCollector {
-
   private final static int INITIAL_ARRAY_SIZE = 0;

-  final String field;
   final BytesRefHash collectedTerms = new BytesRefHash();
   final ScoreMode scoreMode;

@@ -40,8 +37,8 @@
   Scorer scorer;
   float[] scoreSums = new float[INITIAL_ARRAY_SIZE];

-  TermsWithScoreCollector(String field, ScoreMode scoreMode) {
-    this.field = field;
+  TermsWithScoreCollector(Function<DV> docValuesCall, ScoreMode scoreMode) {
+    super(docValuesCall);
     this.scoreMode = scoreMode;
     if (scoreMode == ScoreMode.Min) {
       Arrays.fill(scoreSums, Float.POSITIVE_INFINITY);
@@ -50,10 +47,12 @@
     }
   }

+  @Override
   public BytesRefHash getCollectedTerms() {
     return collectedTerms;
   }
-
+
+  @Override
   public float[] getScoresPerTerm() {
     return scoreSums;
   }
@@ -70,36 +69,34 @@
    * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document.
    * @return a {@link TermsWithScoreCollector} instance
    */
-  static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) {
+  static TermsWithScoreCollector<?> create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) {
     if (multipleValuesPerDocument) {
       switch (scoreMode) {
         case Avg:
-          return new MV.Avg(field);
+          return new MV.Avg(sortedSetDocValues(field));
         default:
-          return new MV(field, scoreMode);
+          return new MV(sortedSetDocValues(field), scoreMode);
       }
     } else {
       switch (scoreMode) {
         case Avg:
-          return new SV.Avg(field);
+          return new SV.Avg(binaryDocValues(field));
         default:
-          return new SV(field, scoreMode);
+          return new SV(binaryDocValues(field), scoreMode);
       }
     }
   }
-
+
   // impl that works with single value per document
-  static class SV extends TermsWithScoreCollector {
+  static class SV extends TermsWithScoreCollector<BinaryDocValues> {

-    BinaryDocValues fromDocTerms;
-
-    SV(String field, ScoreMode scoreMode) {
-      super(field, scoreMode);
+    SV(Function<BinaryDocValues> docValuesCall, ScoreMode scoreMode) {
+      super(docValuesCall, scoreMode);
     }

     @Override
     public void collect(int doc) throws IOException {
-      int ord = collectedTerms.add(fromDocTerms.get(doc));
+      int ord = collectedTerms.add(docValues.get(doc));
       if (ord < 0) {
         ord = -ord - 1;
       } else {
@@ -133,26 +130,23 @@
               scoreSums[ord] = current;
             }
             break;
+          default:
+            throw new AssertionError("unexpected: " + scoreMode);
         }
       }
     }

-    @Override
-    protected void doSetNextReader(LeafReaderContext context) throws IOException {
-      fromDocTerms = DocValues.getBinary(context.reader(), field);
-    }
-
     static class Avg extends SV {

       int[] scoreCounts = new int[INITIAL_ARRAY_SIZE];

-      Avg(String field) {
-        super(field, ScoreMode.Avg);
+      Avg(Function<BinaryDocValues> docValuesCall) {
+        super(docValuesCall, ScoreMode.Avg);
       }

       @Override
       public void collect(int doc) throws IOException {
-        int ord = collectedTerms.add(fromDocTerms.get(doc));
+        int ord = collectedTerms.add(docValues.get(doc));
         if (ord < 0) {
           ord = -ord - 1;
         } else {
@@ -187,20 +181,18 @@
   }

   // impl that works with multiple values per document
-  static class MV extends TermsWithScoreCollector {
+  static class MV extends TermsWithScoreCollector<SortedSetDocValues> {

-    SortedSetDocValues fromDocTermOrds;
-
-    MV(String field, ScoreMode scoreMode) {
-      super(field, scoreMode);
+    MV(Function<SortedSetDocValues> docValuesCall, ScoreMode scoreMode) {
+      super(docValuesCall, scoreMode);
     }

     @Override
     public void collect(int doc) throws IOException {
-      fromDocTermOrds.setDocument(doc);
+      docValues.setDocument(doc);
       long ord;
-      while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
-        int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord));
+      while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
+        int termID = collectedTerms.add(docValues.lookupOrd(ord));
         if (termID < 0) {
           termID = -termID - 1;
         } else {
@@ -225,29 +217,26 @@
           case Max:
             scoreSums[termID] = Math.max(scoreSums[termID], scorer.score());
             break;
+          default:
+            throw new AssertionError("unexpected: " + scoreMode);
         }
       }
     }

-    @Override
-    protected void doSetNextReader(LeafReaderContext context) throws IOException {
-      fromDocTermOrds = DocValues.getSortedSet(context.reader(), field);
-    }
-
     static class Avg extends MV {

       int[] scoreCounts = new int[INITIAL_ARRAY_SIZE];

-      Avg(String field) {
-        super(field, ScoreMode.Avg);
+      Avg(Function<SortedSetDocValues> docValuesCall) {
+        super(docValuesCall, ScoreMode.Avg);
       }

       @Override
       public void collect(int doc) throws IOException {
-        fromDocTermOrds.setDocument(doc);
+        docValues.setDocument(doc);
         long ord;
-        while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
-          int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord));
+        while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
+          int termID = collectedTerms.add(docValues.lookupOrd(ord));
           if (termID < 0) {
             termID = -termID - 1;
           } else {
Index: lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java
===================================================================
--- lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (revision 1718426)
+++ lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (working copy)
@@ -1,31 +1,30 @@
 package org.apache.lucene.search.join;

-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Random;
+import java.util.Set;
+import java.util.SortedSet;
+import java.util.TreeSet;

-import com.carrotsearch.randomizedtesting.generators.RandomInts;
-import com.carrotsearch.randomizedtesting.generators.RandomPicks;
-
 import org.apache.lucene.analysis.MockAnalyzer;
 import org.apache.lucene.analysis.MockTokenizer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
+import org.apache.lucene.document.FieldType.NumericType;
+import org.apache.lucene.document.IntField;
+import org.apache.lucene.document.LongField;
 import org.apache.lucene.document.NumericDocValuesField;
 import org.apache.lucene.document.SortedDocValuesField;
+import org.apache.lucene.document.SortedNumericDocValuesField;
 import org.apache.lucene.document.SortedSetDocValuesField;
 import org.apache.lucene.document.StringField;
 import org.apache.lucene.document.TextField;
@@ -78,19 +77,26 @@
 import org.apache.lucene.util.packed.PackedInts;
 import org.junit.Test;

-import java.io.IOException;
-import java.util.ArrayList;
-import java.util.Collections;
-import java.util.Comparator;
-import java.util.HashMap;
-import java.util.HashSet;
-import java.util.List;
-import java.util.Locale;
-import java.util.Map;
-import java.util.Set;
-import java.util.SortedSet;
-import java.util.TreeSet;
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */

+import com.carrotsearch.randomizedtesting.generators.RandomInts;
+import com.carrotsearch.randomizedtesting.generators.RandomPicks;
+
 public class TestJoinUtil extends LuceneTestCase {

   public void testSimple() throws Exception {
@@ -850,10 +856,18 @@
     }

     final Query joinQuery;
-    if (from) {
-      joinQuery = JoinUtil.createJoinQuery("from", multipleValuesPerDocument, "to", actualQuery, indexSearcher, scoreMode);
-    } else {
-      joinQuery = JoinUtil.createJoinQuery("to", multipleValuesPerDocument, "from", actualQuery, indexSearcher, scoreMode);
+    {
+      // single val can be handled by multiple-vals
+      final boolean muliValsQuery = multipleValuesPerDocument || random().nextBoolean();
+      final String fromField = from ? "from":"to";
+      final String toField = from ? "to":"from";
+
+      if (random().nextBoolean()) { // numbers
+        final NumericType numType = random().nextBoolean() ? NumericType.INT: NumericType.LONG ;
+        joinQuery = JoinUtil.createJoinQuery(fromField+numType, muliValsQuery, toField+numType, numType, actualQuery, indexSearcher, scoreMode);
+      } else {
+        joinQuery = JoinUtil.createJoinQuery(fromField, muliValsQuery, toField, actualQuery, indexSearcher, scoreMode);
+      }
     }
     if (VERBOSE) {
       System.out.println("joinQuery=" + joinQuery);
@@ -897,7 +911,6 @@
       return;
     }

-    assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f);
     if (VERBOSE) {
       for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) {
         System.out.printf(Locale.ENGLISH, "Expected doc: %d | Actual doc: %d\n", expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc);
@@ -904,6 +917,7 @@
         System.out.printf(Locale.ENGLISH, "Expected score: %f | Actual score: %f\n", expectedTopDocs.scoreDocs[i].score, actualTopDocs.scoreDocs[i].score);
       }
     }
+    assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f);

     for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) {
       assertEquals(expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc);
@@ -919,14 +933,15 @@
     }

     Directory dir = newDirectory();
+    final Random random = random();
     RandomIndexWriter w = new RandomIndexWriter(
-        random(),
+        random,
         dir,
-        newIndexWriterConfig(new MockAnalyzer(random(), MockTokenizer.KEYWORD, false))
+        newIndexWriterConfig(new MockAnalyzer(random, MockTokenizer.KEYWORD, false))
     );

     IndexIterationContext context = new IndexIterationContext();
-    int numRandomValues = nDocs / RandomInts.randomIntBetween(random(), 2, 10);
+    int numRandomValues = nDocs / RandomInts.randomIntBetween(random, 1, 4);
     context.randomUniqueValues = new String[numRandomValues];
     Set<String> trackSet = new HashSet<>();
     context.randomFrom = new boolean[numRandomValues];
@@ -933,32 +948,46 @@
     for (int i = 0; i < numRandomValues; i++) {
       String uniqueRandomValue;
       do {
-//        uniqueRandomValue = TestUtil.randomRealisticUnicodeString(random());
-        uniqueRandomValue = TestUtil.randomSimpleString(random());
+        // the trick is to generate values which will be ordered similarly for string, ints&longs, positive nums makes it easier
+        final int nextInt = random.nextInt(Integer.MAX_VALUE);
+        uniqueRandomValue = String.format(Locale.ROOT, "%08x", nextInt);
+        assert nextInt == Integer.parseUnsignedInt(uniqueRandomValue,16);
       } while ("".equals(uniqueRandomValue) || trackSet.contains(uniqueRandomValue));
+
       // Generate unique values and empty strings aren't allowed.
       trackSet.add(uniqueRandomValue);
-      context.randomFrom[i] = random().nextBoolean();
+
+      context.randomFrom[i] = random.nextBoolean();
       context.randomUniqueValues[i] = uniqueRandomValue;
+
     }

+    List<String> randomUniqueValuesReplica = new ArrayList<>(Arrays.asList(context.randomUniqueValues));
+
     RandomDoc[] docs = new RandomDoc[nDocs];
     for (int i = 0; i < nDocs; i++) {
       String id = Integer.toString(i);
-      int randomI = random().nextInt(context.randomUniqueValues.length);
+      int randomI = random.nextInt(context.randomUniqueValues.length);
       String value = context.randomUniqueValues[randomI];
       Document document = new Document();
-      document.add(newTextField(random(), "id", id, Field.Store.YES));
-      document.add(newTextField(random(), "value", value, Field.Store.NO));
+      document.add(newTextField(random, "id", id, Field.Store.YES));
+      document.add(newTextField(random, "value", value, Field.Store.NO));

       boolean from = context.randomFrom[randomI];
-      int numberOfLinkValues = multipleValuesPerDocument ? 2 + random().nextInt(10) : 1;
+      int numberOfLinkValues = multipleValuesPerDocument ? Math.min(2 + random.nextInt(10), context.randomUniqueValues.length) : 1;
       docs[i] = new RandomDoc(id, numberOfLinkValues, value, from);
       if (globalOrdinalJoin) {
         document.add(newStringField("type", from ? "from" : "to", Field.Store.NO));
       }
-      for (int j = 0; j < numberOfLinkValues; j++) {
-        String linkValue = context.randomUniqueValues[random().nextInt(context.randomUniqueValues.length)];
+      final List<String> subValues;
+      {
+        int start = randomUniqueValuesReplica.size()==numberOfLinkValues? 0 : random.nextInt(randomUniqueValuesReplica.size()-numberOfLinkValues);
+        subValues = randomUniqueValuesReplica.subList(start, start+numberOfLinkValues);
+        Collections.shuffle(subValues, random);
+      }
+      for (String linkValue : subValues) {
+
+        assert !docs[i].linkValues.contains(linkValue);
         docs[i].linkValues.add(linkValue);
         if (from) {
           if (!context.fromDocuments.containsKey(linkValue)) {
@@ -970,15 +999,8 @@

           context.fromDocuments.get(linkValue).add(docs[i]);
           context.randomValueFromDocs.get(value).add(docs[i]);
-          document.add(newTextField(random(), "from", linkValue, Field.Store.NO));
-          if (multipleValuesPerDocument) {
-            document.add(new SortedSetDocValuesField("from", new BytesRef(linkValue)));
-          } else {
-            document.add(new SortedDocValuesField("from", new BytesRef(linkValue)));
-          }
-          if (globalOrdinalJoin) {
-            document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue)));
-          }
+          addLinkFields(random, document, "from", linkValue, multipleValuesPerDocument, globalOrdinalJoin);
+
         } else {
           if (!context.toDocuments.containsKey(linkValue)) {
             context.toDocuments.put(linkValue, new ArrayList<>());
@@ -989,20 +1011,12 @@

           context.toDocuments.get(linkValue).add(docs[i]);
           context.randomValueToDocs.get(value).add(docs[i]);
-          document.add(newTextField(random(), "to", linkValue, Field.Store.NO));
-          if (multipleValuesPerDocument) {
-            document.add(new SortedSetDocValuesField("to", new BytesRef(linkValue)));
-          } else {
-            document.add(new SortedDocValuesField("to", new BytesRef(linkValue)));
-          }
-          if (globalOrdinalJoin) {
-            document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue)));
-          }
+          addLinkFields(random, document, "to", linkValue, multipleValuesPerDocument, globalOrdinalJoin);
         }
       }

       w.addDocument(document);
-      if (random().nextInt(10) == 4) {
+      if (random.nextInt(10) == 4) {
         w.commit();
       }
       if (VERBOSE) {
@@ -1010,7 +1024,7 @@
       }
     }

-    if (random().nextBoolean()) {
+    if (random.nextBoolean()) {
       w.forceMerge(1);
     }
     w.close();
@@ -1185,6 +1199,30 @@
     return context;
   }

+  private void addLinkFields(final Random random, Document document, final String fieldName, String linkValue,
+      boolean multipleValuesPerDocument, boolean globalOrdinalJoin) {
+    document.add(newTextField(random, fieldName, linkValue, Field.Store.NO));
+
+    final int linkInt = Integer.parseUnsignedInt(linkValue,16);
+    document.add(new IntField(fieldName+NumericType.INT, linkInt, Field.Store.NO));
+
+    final long linkLong = linkInt<<32 | linkInt;
+    document.add(new LongField(fieldName+NumericType.LONG, linkLong, Field.Store.NO));
+
+    if (multipleValuesPerDocument) {
+      document.add(new SortedSetDocValuesField(fieldName, new BytesRef(linkValue)));
+      document.add(new SortedNumericDocValuesField(fieldName+NumericType.INT, linkInt));
+      document.add(new SortedNumericDocValuesField(fieldName+NumericType.LONG, linkLong));
+    } else {
+      document.add(new SortedDocValuesField(fieldName, new BytesRef(linkValue)));
+      document.add(new NumericDocValuesField(fieldName+NumericType.INT, linkInt));
+      document.add(new NumericDocValuesField(fieldName+NumericType.LONG, linkLong));
+    }
+    if (globalOrdinalJoin) {
+      document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue)));
+    }
+  }
+
   private TopDocs createExpectedTopDocs(String queryValue,
                                         final boolean from,
                                         final ScoreMode scoreMode,
```
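
The "weird" adapter in the patch above, `sortedNumericAsSortedSetDocValues`, hands out ords that are only positions inside the current document, so an ord has to be resolved to its term before moving to another document. Below is a minimal, Lucene-free sketch of that idea. `MultiNumeric`, `PerDocOrds` and `collectTerms` are hypothetical stand-ins, not Lucene types, and the fixed-width hex encoding is an assumption standing in for the `NumericUtils` prefix coding used in the patch.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class PerDocOrdsSketch {

  static final long NO_MORE_ORDS = -1L;

  /** Hypothetical stand-in for a multi-valued numeric source (not a Lucene type). */
  static final class MultiNumeric {
    private final long[][] valuesPerDoc;
    private long[] current = new long[0];

    MultiNumeric(long[][] valuesPerDoc) {
      this.valuesPerDoc = valuesPerDoc;
    }

    void setDocument(int docID) {
      current = valuesPerDoc[docID];
    }

    int count() {
      return current.length;
    }

    long valueAt(int index) {
      return current[index];
    }
  }

  /** Ord-style view: an ord is just a position inside the current document. */
  static final class PerDocOrds {
    private final MultiNumeric numerics;
    private int index = -1;

    PerDocOrds(MultiNumeric numerics) {
      this.numerics = numerics;
    }

    void setDocument(int docID) {
      numerics.setDocument(docID);
      index = -1; // restart the per-document iteration
    }

    long nextOrd() {
      return index < numerics.count() - 1 ? ++index : NO_MORE_ORDS;
    }

    /** Stand-in encoding (fixed-width hex); the patch uses NumericUtils prefix coding. */
    String lookupOrd(long ord) {
      return String.format(Locale.ROOT, "%016x", numerics.valueAt((int) ord));
    }
  }

  /** Collects the distinct encoded terms over the given documents. */
  static Set<String> collectTerms(PerDocOrds ords, int... docIDs) {
    Set<String> terms = new LinkedHashSet<>();
    for (int doc : docIDs) {
      ords.setDocument(doc);
      long ord;
      while ((ord = ords.nextOrd()) != NO_MORE_ORDS) {
        terms.add(ords.lookupOrd(ord)); // resolve the ord before switching documents
      }
    }
    return terms;
  }

  public static void main(String[] args) {
    PerDocOrds ords = new PerDocOrds(new MultiNumeric(new long[][] {{5, 7}, {7, 9}}));
    System.out.println(collectTerms(ords, 0, 1));
  }
}
```

In this sketch ord 0 means value 5 in the first document and value 7 in the second, which is why a collector has to copy the looked-up term (as `BytesRefHash` does) rather than remember the ord.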

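One detail of `addLinkFields` in the patch above may be worth double-checking: `linkInt<<32 | linkInt` is evaluated in `int` arithmetic, and Java masks an `int` shift distance to its low five bits, so the shift by 32 is a no-op and `linkLong` ends up equal to `linkInt`. This probably does not affect the join test, since both sides index the same value, but the LONG fields then never hold values wider than 32 bits. The sketch below compares the expression as written with a widened variant; `asWritten` and `widened` are hypothetical names, and the widened form is only an assumption about the intent.

```java
public class ShiftWidthSketch {

  /** Mirrors the expression in the patch: evaluated entirely in int arithmetic. */
  static long asWritten(int linkInt) {
    return linkInt << 32 | linkInt; // int shift distance 32 is masked to 0
  }

  /** One possible alternative (an assumption about intent): widen before shifting. */
  static long widened(int linkInt) {
    return ((long) linkInt) << 32 | linkInt;
  }

  public static void main(String[] args) {
    int linkInt = 0x12345678;
    System.out.println(Long.toHexString(asWritten(linkInt))); // prints 12345678
    System.out.println(Long.toHexString(widened(linkInt)));   // prints 1234567812345678
  }
}
```

The widened form relies on `linkInt` being non-negative, which holds in the test because the values come from `random.nextInt(Integer.MAX_VALUE)`.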
asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

Even more efficient shuffling in the test: LUCENE-5868.patch

LUCENE-5868.patch

```diff
Index: lucene/CHANGES.txt
===================================================================
--- lucene/CHANGES.txt (revision 1718426)
+++ lucene/CHANGES.txt (working copy)
@@ -108,6 +108,12 @@

 ======================= Lucene 5.5.0 =======================

+New Features
+
+* LUCENE-5868: JoinUtil.createJoinQuery(..,NumericType,..) query-time join
+  for LONG and INT fields with NUMERIC and SORTED_NUMERIC doc values.
+  (Alexey Zelin via Mikhail Khludnev)
+
 API Changes

 * #7958: Grouping sortWithinGroup variables used to allow null to mean
Index: lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (revision 0)
+++ lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (working copy)
@@ -0,0 +1,136 @@
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import java.util.function.LongConsumer;
+
+import org.apache.lucene.document.FieldType.NumericType;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.SimpleCollector;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+import org.apache.lucene.util.NumericUtils;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+abstract class DocValuesTermsCollector<DV> extends SimpleCollector {
+
+  @FunctionalInterface
+  static interface Function<R> {
+    R apply(LeafReader t) throws IOException ;
+  }
+
+  protected DV docValues;
+  private final Function<DV> docValuesCall;
+
+  public DocValuesTermsCollector(Function<DV> docValuesCall) {
+    this.docValuesCall = docValuesCall;
+  }
+
+  @Override
+  protected final void doSetNextReader(LeafReaderContext context) throws IOException {
+    docValues = docValuesCall.apply(context.reader());
+  }
+
+  static Function<BinaryDocValues> binaryDocValues(String field) {
+    return (ctx) -> DocValues.getBinary(ctx, field);
+  }
+  static Function<SortedSetDocValues> sortedSetDocValues(String field) {
+    return (ctx) -> DocValues.getSortedSet(ctx, field);
+  }
+
+  static Function<BinaryDocValues> numericAsBinaryDocValues(String field, NumericType numTyp) {
+    return (ctx) -> {
+      final NumericDocValues numeric = DocValues.getNumeric(ctx, field);
+      final BytesRefBuilder bytes = new BytesRefBuilder();
+
+      final LongConsumer coder = coder(bytes, numTyp, field);
+
+      return new BinaryDocValues() {
+        @Override
+        public BytesRef get(int docID) {
+          final long lVal = numeric.get(docID);
+          coder.accept(lVal);
+          return bytes.get();
+        }
+      };
+    };
+  }
+
+  static LongConsumer coder(BytesRefBuilder bytes, NumericType type, String fieldName){
+    switch(type){
+      case INT:
+        return (l) -> NumericUtils.intToPrefixCoded((int)l, 0, bytes);
+      case LONG:
+        return (l) -> NumericUtils.longToPrefixCoded(l, 0, bytes);
+      default:
+        throw new IllegalArgumentException("Unsupported "+type+
+            ". Only "+NumericType.INT+" and "+NumericType.LONG+" are supported."
+            + "Field "+fieldName );
+    }
+  }
+
+  /** this adapter is quite weird. ords are per doc index, don't use ords across different docs*/
+  static Function<SortedSetDocValues> sortedNumericAsSortedSetDocValues(String field, NumericType numTyp) {
+    return (ctx) -> {
+      final SortedNumericDocValues numerics = DocValues.getSortedNumeric(ctx, field);
+      final BytesRefBuilder bytes = new BytesRefBuilder();
+
+      final LongConsumer coder = coder(bytes, numTyp, field);
+
+      return new SortedSetDocValues() {
+
+        private int index = Integer.MIN_VALUE;
+
+        @Override
+        public long nextOrd() {
+          return index < numerics.count()-1 ? ++index : NO_MORE_ORDS;
+        }
+
+        @Override
+        public void setDocument(int docID) {
+          numerics.setDocument(docID);
+          index=-1;
+        }
+
+        @Override
+        public BytesRef lookupOrd(long ord) {
+          assert ord>=0 && ord mvFunction,
+      ScoreMode mode) {
+
+    switch (mode) {
+      case None:
+        return wrap(new TermsCollector.MV(mvFunction));
+      case Avg:
+        return new MV.Avg(mvFunction);
+      default:
+        return new MV(mvFunction, mode);
+    }
+  }
+
+  static Function<SortedSetDocValues> verbose(PrintStream out, Function<SortedSetDocValues> mvFunction){
+    return (ctx) -> {
+      final SortedSetDocValues target = mvFunction.apply(ctx);
+      return new SortedSetDocValues() {
+
+        @Override
+        public void setDocument(int docID) {
+          target.setDocument(docID);
+          out.println("\ndoc# "+docID);
+        }
+
+        @Override
+        public long nextOrd() {
+          return target.nextOrd();
+        }
+
+        @Override
+        public BytesRef lookupOrd(long ord) {
+          final BytesRef val = target.lookupOrd(ord);
+          out.println(val.toString()+", ");
+          return val;
+        }
+
+        @Override
+        public long getValueCount() {
+          return target.getValueCount();
+        }
+      };
+
+    };
+  }
+
+  static GenericTermsCollector createCollectorSV(Function<BinaryDocValues> svFunction,
+      ScoreMode mode) {
+
+    switch (mode) {
+      case None:
+        return wrap(new
TermsCollector.SV(svFunction)); + case Avg: + return new SV.Avg(svFunction); + default: + return new SV(svFunction, mode); + } + } + + static GenericTermsCollector wrap(final TermsCollector collector) { + return new GenericTermsCollector() { + + + @Override + public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException { + return collector.getLeafCollector(context); + } + + @Override + public boolean needsScores() { + return collector.needsScores(); + } + + @Override + public BytesRefHash getCollectedTerms() { + return collector.getCollectorTerms(); + } + + @Override + public float[] getScoresPerTerm() { + throw new UnsupportedOperationException("scores are not available for "+collector); + } + }; + } +} Property changes on: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Index: lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (working copy) @@ -1,5 +1,14 @@ package org.apache.lucene.search.join; +import java.io.IOException; +import java.util.Locale; + +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.document.IntField; +import org.apache.lucene.document.LongField; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValuesType; + /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. 
See the NOTICE file distributed with @@ -21,13 +30,12 @@ import org.apache.lucene.index.LeafReader; import org.apache.lucene.index.MultiDocValues; import org.apache.lucene.index.SortedDocValues; +import org.apache.lucene.index.SortedSetDocValues; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.MatchNoDocsQuery; import org.apache.lucene.search.Query; +import org.apache.lucene.search.join.DocValuesTermsCollector.Function; -import java.io.IOException; -import java.util.Locale; - /** * Utility for query time joining. * @@ -67,28 +75,87 @@ * @throws IOException If I/O related errors occur */ public static Query createJoinQuery(String fromField, - boolean multipleValuesPerDocument, - String toField, - Query fromQuery, - IndexSearcher fromSearcher, - ScoreMode scoreMode) throws IOException { + boolean multipleValuesPerDocument, + String toField, + Query fromQuery, + IndexSearcher fromSearcher, + ScoreMode scoreMode) throws IOException { + + final GenericTermsCollector termsWithScoreCollector; + + if (multipleValuesPerDocument) { + Function mvFunction = DocValuesTermsCollector.sortedSetDocValues(fromField); + termsWithScoreCollector = GenericTermsCollector.createCollectorMV(mvFunction, scoreMode); + } else { + Function svFunction = DocValuesTermsCollector.binaryDocValues(fromField); + termsWithScoreCollector = GenericTermsCollector.createCollectorSV(svFunction, scoreMode); + } + + return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode, + termsWithScoreCollector); + + } + + /** + * Method for query time joining for numeric fields. It supports multi- and single- values longs and ints. + * All considerations from {@link JoinUtil#createJoinQuery(String, boolean, String, Query, IndexSearcher, ScoreMode)} are applicable here too, + * though memory consumption might be higher. + *

+ * + * @param fromField The from field to join from + * @param multipleValuesPerDocument Whether the from field has multiple terms per document + * when true fromField might be {@link DocValuesType#SORTED_NUMERIC}, + * otherwise fromField should be {@link DocValuesType#NUMERIC} + * @param toField The to field to join to, should be {@link IntField} or {@link LongField} + * @param numericType either {@link NumericType#INT} or {@link NumericType#LONG}, it should correspond to fromField and toField types + * @param fromQuery The query to match documents on the from side + * @param fromSearcher The searcher that executed the specified fromQuery + * @param scoreMode Instructs how scores from the fromQuery are mapped to the returned query + * @return a {@link Query} instance that can be used to join documents based on the + * terms in the from and to field + * @throws IOException If I/O related errors occur + */ + + public static Query createJoinQuery(String fromField, + boolean multipleValuesPerDocument, + String toField, NumericType numericType, + Query fromQuery, + IndexSearcher fromSearcher, + ScoreMode scoreMode) throws IOException { + + final GenericTermsCollector termsCollector; + + if (multipleValuesPerDocument) { + Function mvFunction = DocValuesTermsCollector.sortedNumericAsSortedSetDocValues(fromField,numericType); + termsCollector = GenericTermsCollector.createCollectorMV(mvFunction, scoreMode); + } else { + Function svFunction = DocValuesTermsCollector.numericAsBinaryDocValues(fromField,numericType); + termsCollector = GenericTermsCollector.createCollectorSV(svFunction, scoreMode); + } + + return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode, + termsCollector); + + } + + private static Query createJoinQuery(boolean multipleValuesPerDocument, String toField, Query fromQuery, + IndexSearcher fromSearcher, ScoreMode scoreMode, final GenericTermsCollector collector) + throws IOException { + + 
fromSearcher.search(fromQuery, collector); + switch (scoreMode) { case None: - TermsCollector termsCollector = TermsCollector.create(fromField, multipleValuesPerDocument); - fromSearcher.search(fromQuery, termsCollector); - return new TermsQuery(toField, fromQuery, termsCollector.getCollectorTerms()); + return new TermsQuery(toField, fromQuery, collector.getCollectedTerms()); case Total: case Max: case Min: case Avg: - TermsWithScoreCollector termsWithScoreCollector = - TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode); - fromSearcher.search(fromQuery, termsWithScoreCollector); return new TermsIncludingScoreQuery( toField, multipleValuesPerDocument, - termsWithScoreCollector.getCollectedTerms(), - termsWithScoreCollector.getScoresPerTerm(), + collector.getCollectedTerms(), + collector.getScoresPerTerm(), fromQuery ); default: @@ -96,6 +163,7 @@ } } + /** * Delegates to {@link #createJoinQuery(String, Query, Query, IndexSearcher, ScoreMode, MultiDocValues.OrdinalMap, int, int)}, * but disables the min and max filtering. 
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (working copy) @@ -19,11 +19,8 @@ import java.io.IOException; -import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.DocValues; import org.apache.lucene.index.SortedSetDocValues; -import org.apache.lucene.search.SimpleCollector; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefHash; @@ -32,19 +29,19 @@ * * @lucene.experimental */ -abstract class TermsCollector extends SimpleCollector { +abstract class TermsCollector extends DocValuesTermsCollector { - final String field; + TermsCollector(Function docValuesCall) { + super(docValuesCall); + } + final BytesRefHash collectorTerms = new BytesRefHash(); - TermsCollector(String field) { - this.field = field; - } - public BytesRefHash getCollectorTerms() { return collectorTerms; } + /** * Chooses the right {@link TermsCollector} implementation. * @@ -52,55 +49,42 @@ * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document. * @return a {@link TermsCollector} instance */ - static TermsCollector create(String field, boolean multipleValuesPerDocument) { - return multipleValuesPerDocument ? new MV(field) : new SV(field); + static TermsCollector create(String field, boolean multipleValuesPerDocument) { + return multipleValuesPerDocument + ? 
new MV(sortedSetDocValues(field)) + : new SV(binaryDocValues(field)); } - + // impl that works with multiple values per document - static class MV extends TermsCollector { - final BytesRef scratch = new BytesRef(); - private SortedSetDocValues docTermOrds; - - MV(String field) { - super(field); + static class MV extends TermsCollector { + + MV(Function docValuesCall) { + super(docValuesCall); } @Override public void collect(int doc) throws IOException { - docTermOrds.setDocument(doc); long ord; - while ((ord = docTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - final BytesRef term = docTermOrds.lookupOrd(ord); + docValues.setDocument(doc); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + final BytesRef term = docValues.lookupOrd(ord); collectorTerms.add(term); } } - - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - docTermOrds = DocValues.getSortedSet(context.reader(), field); - } } // impl that works with single value per document - static class SV extends TermsCollector { + static class SV extends TermsCollector { - final BytesRef spare = new BytesRef(); - private BinaryDocValues fromDocTerms; - - SV(String field) { - super(field); + SV(Function docValuesCall) { + super(docValuesCall); } @Override public void collect(int doc) throws IOException { - final BytesRef term = fromDocTerms.get(doc); + final BytesRef term = docValues.get(doc); collectorTerms.add(term); } - - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTerms = DocValues.getBinary(context.reader(), field); - } } @Override Index: lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (working 
copy) @@ -18,6 +18,7 @@ */ import java.io.IOException; +import java.io.PrintStream; import java.util.Locale; import java.util.Set; @@ -37,6 +38,7 @@ import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefHash; import org.apache.lucene.util.FixedBitSet; +import org.apache.lucene.util.NumericUtils; class TermsIncludingScoreQuery extends Query { @@ -268,5 +270,23 @@ } } } - + + void dump(PrintStream out){ + out.println(field+":"); + final BytesRef ref = new BytesRef(); + for (int i = 0; i < terms.size(); i++) { + terms.get(ords[i], ref); + out.print(ref+" "+ref.utf8ToString()+" "); + try { + out.print(Long.toHexString(NumericUtils.prefixCodedToLong(ref))+"L"); + } catch (Exception e) { + try { + out.print(Integer.toHexString(NumericUtils.prefixCodedToInt(ref))+"i"); + } catch (Exception ee) { + } + } + out.println(" score="+scores[ords[i]]); + out.println(""); + } + } } Index: lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (revision 1718426) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (working copy) @@ -1,5 +1,8 @@ package org.apache.lucene.search.join; +import java.io.IOException; +import java.util.Arrays; + /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. 
See the NOTICE file distributed with @@ -18,22 +21,16 @@ */ import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.DocValues; -import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.index.SortedSetDocValues; import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.SimpleCollector; import org.apache.lucene.util.ArrayUtil; import org.apache.lucene.util.BytesRefHash; -import java.io.IOException; -import java.util.Arrays; +abstract class TermsWithScoreCollector extends DocValuesTermsCollector + implements GenericTermsCollector { -abstract class TermsWithScoreCollector extends SimpleCollector { - private final static int INITIAL_ARRAY_SIZE = 0; - final String field; final BytesRefHash collectedTerms = new BytesRefHash(); final ScoreMode scoreMode; @@ -40,8 +37,8 @@ Scorer scorer; float[] scoreSums = new float[INITIAL_ARRAY_SIZE]; - TermsWithScoreCollector(String field, ScoreMode scoreMode) { - this.field = field; + TermsWithScoreCollector(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall); this.scoreMode = scoreMode; if (scoreMode == ScoreMode.Min) { Arrays.fill(scoreSums, Float.POSITIVE_INFINITY); @@ -50,10 +47,12 @@ } } + @Override public BytesRefHash getCollectedTerms() { return collectedTerms; } - + + @Override public float[] getScoresPerTerm() { return scoreSums; } @@ -70,36 +69,34 @@ * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document. 
* @return a {@link TermsWithScoreCollector} instance */ - static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { + static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { if (multipleValuesPerDocument) { switch (scoreMode) { case Avg: - return new MV.Avg(field); + return new MV.Avg(sortedSetDocValues(field)); default: - return new MV(field, scoreMode); + return new MV(sortedSetDocValues(field), scoreMode); } } else { switch (scoreMode) { case Avg: - return new SV.Avg(field); + return new SV.Avg(binaryDocValues(field)); default: - return new SV(field, scoreMode); + return new SV(binaryDocValues(field), scoreMode); } } } - + // impl that works with single value per document - static class SV extends TermsWithScoreCollector { + static class SV extends TermsWithScoreCollector { - BinaryDocValues fromDocTerms; - - SV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + SV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ -133,26 +130,23 @@ scoreSums[ord] = current; } break; + default: + throw new AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTerms = DocValues.getBinary(context.reader(), field); - } - static class Avg extends SV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ 
-187,20 +181,18 @@ } // impl that works with multiple values per document - static class MV extends TermsWithScoreCollector { + static class MV extends TermsWithScoreCollector { - SortedSetDocValues fromDocTermOrds; - - MV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + MV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { @@ -225,29 +217,26 @@ case Max: scoreSums[termID] = Math.max(scoreSums[termID], scorer.score()); break; + default: + throw new AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTermOrds = DocValues.getSortedSet(context.reader(), field); - } - static class Avg extends MV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { Index: lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java 
=================================================================== --- lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (revision 1718426) +++ lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (working copy) @@ -1,31 +1,30 @@ package org.apache.lucene.search.join; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.Comparator; +import java.util.HashMap; +import java.util.HashSet; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.Random; +import java.util.Set; +import java.util.SortedSet; +import java.util.TreeSet; -import com.carrotsearch.randomizedtesting.generators.RandomInts; -import com.carrotsearch.randomizedtesting.generators.RandomPicks; - import org.apache.lucene.analysis.MockAnalyzer; import org.apache.lucene.analysis.MockTokenizer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.document.IntField; +import org.apache.lucene.document.LongField; import org.apache.lucene.document.NumericDocValuesField; import org.apache.lucene.document.SortedDocValuesField; +import org.apache.lucene.document.SortedNumericDocValuesField; import org.apache.lucene.document.SortedSetDocValuesField; import org.apache.lucene.document.StringField; import org.apache.lucene.document.TextField; @@ -78,19 +77,26 @@ import org.apache.lucene.util.packed.PackedInts; import org.junit.Test; -import java.io.IOException; -import java.util.ArrayList; -import java.util.Collections; -import java.util.Comparator; -import java.util.HashMap; -import java.util.HashSet; -import java.util.List; -import java.util.Locale; -import java.util.Map; -import java.util.Set; -import java.util.SortedSet; -import java.util.TreeSet; +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +import com.carrotsearch.randomizedtesting.generators.RandomInts; +import com.carrotsearch.randomizedtesting.generators.RandomPicks; + public class TestJoinUtil extends LuceneTestCase { public void testSimple() throws Exception { @@ -850,10 +856,18 @@ } final Query joinQuery; - if (from) { - joinQuery = JoinUtil.createJoinQuery("from", multipleValuesPerDocument, "to", actualQuery, indexSearcher, scoreMode); - } else { - joinQuery = JoinUtil.createJoinQuery("to", multipleValuesPerDocument, "from", actualQuery, indexSearcher, scoreMode); + { + // single val can be handled by multiple-vals + final boolean muliValsQuery = multipleValuesPerDocument || random().nextBoolean(); + final String fromField = from ? "from":"to"; + final String toField = from ? "to":"from"; + + if (random().nextBoolean()) { // numbers + final NumericType numType = random().nextBoolean() ? 
NumericType.INT: NumericType.LONG ; + joinQuery = JoinUtil.createJoinQuery(fromField+numType, muliValsQuery, toField+numType, numType, actualQuery, indexSearcher, scoreMode); + } else { + joinQuery = JoinUtil.createJoinQuery(fromField, muliValsQuery, toField, actualQuery, indexSearcher, scoreMode); + } } if (VERBOSE) { System.out.println("joinQuery=" + joinQuery); @@ -897,7 +911,6 @@ return; } - assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f); if (VERBOSE) { for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) { System.out.printf(Locale.ENGLISH, "Expected doc: %d | Actual doc: %d\n", expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc); @@ -904,6 +917,7 @@ System.out.printf(Locale.ENGLISH, "Expected score: %f | Actual score: %f\n", expectedTopDocs.scoreDocs[i].score, actualTopDocs.scoreDocs[i].score); } } + assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f); for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) { assertEquals(expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc); @@ -919,14 +933,15 @@ } Directory dir = newDirectory(); + final Random random = random(); RandomIndexWriter w = new RandomIndexWriter( - random(), + random, dir, - newIndexWriterConfig(new MockAnalyzer(random(), MockTokenizer.KEYWORD, false)) + newIndexWriterConfig(new MockAnalyzer(random, MockTokenizer.KEYWORD, false)) ); IndexIterationContext context = new IndexIterationContext(); - int numRandomValues = nDocs / RandomInts.randomIntBetween(random(), 2, 10); + int numRandomValues = nDocs / RandomInts.randomIntBetween(random, 1, 4); context.randomUniqueValues = new String[numRandomValues]; Set trackSet = new HashSet<>(); context.randomFrom = new boolean[numRandomValues]; @@ -933,32 +948,46 @@ for (int i = 0; i < numRandomValues; i++) { String uniqueRandomValue; do { -// uniqueRandomValue = TestUtil.randomRealisticUnicodeString(random()); - uniqueRandomValue = 
TestUtil.randomSimpleString(random()); + // the trick is to generate values which will be ordered similarly for string, ints&longs, positive nums makes it easier + final int nextInt = random.nextInt(Integer.MAX_VALUE); + uniqueRandomValue = String.format(Locale.ROOT, "%08x", nextInt); + assert nextInt == Integer.parseUnsignedInt(uniqueRandomValue,16); } while ("".equals(uniqueRandomValue) || trackSet.contains(uniqueRandomValue)); + // Generate unique values and empty strings aren't allowed. trackSet.add(uniqueRandomValue); - context.randomFrom[i] = random().nextBoolean(); + + context.randomFrom[i] = random.nextBoolean(); context.randomUniqueValues[i] = uniqueRandomValue; + } + List randomUniqueValuesReplica = new ArrayList<>(Arrays.asList(context.randomUniqueValues)); + RandomDoc[] docs = new RandomDoc[nDocs]; for (int i = 0; i < nDocs; i++) { String id = Integer.toString(i); - int randomI = random().nextInt(context.randomUniqueValues.length); + int randomI = random.nextInt(context.randomUniqueValues.length); String value = context.randomUniqueValues[randomI]; Document document = new Document(); - document.add(newTextField(random(), "id", id, Field.Store.YES)); - document.add(newTextField(random(), "value", value, Field.Store.NO)); + document.add(newTextField(random, "id", id, Field.Store.YES)); + document.add(newTextField(random, "value", value, Field.Store.NO)); boolean from = context.randomFrom[randomI]; - int numberOfLinkValues = multipleValuesPerDocument ? 2 + random().nextInt(10) : 1; + int numberOfLinkValues = multipleValuesPerDocument ? Math.min(2 + random.nextInt(10), context.randomUniqueValues.length) : 1; docs[i] = new RandomDoc(id, numberOfLinkValues, value, from); if (globalOrdinalJoin) { document.add(newStringField("type", from ? 
"from" : "to", Field.Store.NO)); } - for (int j = 0; j < numberOfLinkValues; j++) { - String linkValue = context.randomUniqueValues[random().nextInt(context.randomUniqueValues.length)]; + final List subValues; + { + int start = randomUniqueValuesReplica.size()==numberOfLinkValues? 0 : random.nextInt(randomUniqueValuesReplica.size()-numberOfLinkValues); + subValues = randomUniqueValuesReplica.subList(start, start+numberOfLinkValues); + Collections.shuffle(subValues, random); + } + for (String linkValue : subValues) { + + assert !docs[i].linkValues.contains(linkValue); docs[i].linkValues.add(linkValue); if (from) { if (!context.fromDocuments.containsKey(linkValue)) { @@ -970,15 +999,8 @@ context.fromDocuments.get(linkValue).add(docs[i]); context.randomValueFromDocs.get(value).add(docs[i]); - document.add(newTextField(random(), "from", linkValue, Field.Store.NO)); - if (multipleValuesPerDocument) { - document.add(new SortedSetDocValuesField("from", new BytesRef(linkValue))); - } else { - document.add(new SortedDocValuesField("from", new BytesRef(linkValue))); - } - if (globalOrdinalJoin) { - document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); - } + addLinkFields(random, document, "from", linkValue, multipleValuesPerDocument, globalOrdinalJoin); + } else { if (!context.toDocuments.containsKey(linkValue)) { context.toDocuments.put(linkValue, new ArrayList<>()); @@ -989,20 +1011,12 @@ context.toDocuments.get(linkValue).add(docs[i]); context.randomValueToDocs.get(value).add(docs[i]); - document.add(newTextField(random(), "to", linkValue, Field.Store.NO)); - if (multipleValuesPerDocument) { - document.add(new SortedSetDocValuesField("to", new BytesRef(linkValue))); - } else { - document.add(new SortedDocValuesField("to", new BytesRef(linkValue))); - } - if (globalOrdinalJoin) { - document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); - } + addLinkFields(random, document, "to", linkValue, multipleValuesPerDocument, 
globalOrdinalJoin); } } w.addDocument(document); - if (random().nextInt(10) == 4) { + if (random.nextInt(10) == 4) { w.commit(); } if (VERBOSE) { @@ -1010,7 +1024,7 @@ } } - if (random().nextBoolean()) { + if (random.nextBoolean()) { w.forceMerge(1); } w.close(); @@ -1185,6 +1199,30 @@ return context; } + private void addLinkFields(final Random random, Document document, final String fieldName, String linkValue, + boolean multipleValuesPerDocument, boolean globalOrdinalJoin) { + document.add(newTextField(random, fieldName, linkValue, Field.Store.NO)); + + final int linkInt = Integer.parseUnsignedInt(linkValue,16); + document.add(new IntField(fieldName+NumericType.INT, linkInt, Field.Store.NO)); + + final long linkLong = linkInt<<32 | linkInt; + document.add(new LongField(fieldName+NumericType.LONG, linkLong, Field.Store.NO)); + + if (multipleValuesPerDocument) { + document.add(new SortedSetDocValuesField(fieldName, new BytesRef(linkValue))); + document.add(new SortedNumericDocValuesField(fieldName+NumericType.INT, linkInt)); + document.add(new SortedNumericDocValuesField(fieldName+NumericType.LONG, linkLong)); + } else { + document.add(new SortedDocValuesField(fieldName, new BytesRef(linkValue))); + document.add(new NumericDocValuesField(fieldName+NumericType.INT, linkInt)); + document.add(new NumericDocValuesField(fieldName+NumericType.LONG, linkLong)); + } + if (globalOrdinalJoin) { + document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); + } + } + private TopDocs createExpectedTopDocs(String queryValue, final boolean from, final ScoreMode scoreMode, ```
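
The numeric-to-binary adapter in the patch encodes every value into one reused `BytesRefBuilder`, which is only safe if the consumer copies the bytes before the next value is read (earlier in this thread `BytesRefHash` is said to do so). The sketch below shows that constraint with JDK classes only. `ReusedBufferAdapter` and its methods are hypothetical stand-ins, and the plain big-endian encoding is an assumption for illustration; it is not Lucene's prefix coding.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongFunction;

// Hypothetical stand-in for an adapter that reuses a single output buffer.
final class ReusedBufferAdapter {

  /** Returns an encoder that writes every value into the same 8-byte array. */
  static LongFunction<byte[]> sharedBufferEncoder() {
    final byte[] shared = new byte[Long.BYTES];
    return value -> {
      ByteBuffer.wrap(shared).putLong(value); // big-endian; overwrites the previous value
      return shared;
    };
  }

  /** Safe consumer: copies the bytes before asking for the next value. */
  static List<byte[]> collectWithCopy(long[] values) {
    LongFunction<byte[]> encoder = sharedBufferEncoder();
    List<byte[]> out = new ArrayList<>();
    for (long v : values) {
      out.add(encoder.apply(v).clone());
    }
    return out;
  }

  /** Unsafe consumer: keeps references, so every entry ends up holding the last value. */
  static List<byte[]> collectWithoutCopy(long[] values) {
    LongFunction<byte[]> encoder = sharedBufferEncoder();
    List<byte[]> out = new ArrayList<>();
    for (long v : values) {
      out.add(encoder.apply(v));
    }
    return out;
  }

  static long decode(byte[] bytes) {
    return ByteBuffer.wrap(bytes).getLong();
  }

  public static void main(String[] args) {
    long[] values = {1L, 2L, 3L};
    System.out.println("copied first entry:     " + decode(collectWithCopy(values).get(0)));
    System.out.println("referenced first entry: " + decode(collectWithoutCopy(values).get(0)));
  }
}
```

The reuse avoids one allocation per document. The trade-off is that any collector holding on to the returned bytes without copying would silently see later values.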

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

I'm going to commit LUCENE-5868.patch to trunk and 5.x in a couple of hours.

applied all suggested changes
amended CHANGES.txt
checked precommit and javadocs
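
For readers skimming the patch, here is a minimal, Lucene-free sketch of the adapter idea it relies on: each numeric value is encoded into one reused scratch buffer, and the collecting set keeps its own copy of the bytes, so reusing the buffer is safe. It is an illustration under assumptions, not the committed code. `Scratch`, `collect` and the `boolean isInt` flag are hypothetical stand-ins for `BytesRefBuilder`, `BytesRefHash` and `NumericType`, and plain big-endian bytes stand in for the prefix coding done by `NumericUtils`.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.LongConsumer;

public class NumericJoinKeySketch {

  /** Reusable scratch buffer; hypothetical stand-in for Lucene's BytesRefBuilder. */
  static final class Scratch {
    final byte[] bytes = new byte[Long.BYTES];
    int length;
  }

  /** Chooses the encoder once, up front; isInt is a hypothetical stand-in for NumericType. */
  static LongConsumer coder(Scratch scratch, boolean isInt) {
    if (isInt) {
      return l -> {
        // Assumption: plain big-endian bytes, not Lucene's prefix coding.
        ByteBuffer.wrap(scratch.bytes).putInt((int) l);
        scratch.length = Integer.BYTES;
      };
    }
    return l -> {
      ByteBuffer.wrap(scratch.bytes).putLong(l);
      scratch.length = Long.BYTES;
    };
  }

  /** Encodes every value into the shared scratch and stores a private copy of the bytes. */
  static Set<ByteBuffer> collect(long[] values, boolean isInt) {
    Scratch scratch = new Scratch();
    LongConsumer coder = coder(scratch, isInt);
    Set<ByteBuffer> terms = new HashSet<>();
    for (long value : values) {
      coder.accept(value);
      // Copy before storing: the scratch is overwritten by the next value.
      terms.add(ByteBuffer.wrap(Arrays.copyOf(scratch.bytes, scratch.length)));
    }
    return terms;
  }

  public static void main(String[] args) {
    // The duplicate 1 collapses into a single term.
    System.out.println(collect(new long[] {1, 2, 1, 3}, true).size()); // prints 3
  }
}
```

In this sketch the same number encoded as an int and as a long yields different byte strings (4 bytes versus 8), which hints at why the numeric type passed to the join should match how both the from and the to fields were indexed.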

LUCENE-5868.patch

Index: lucene/CHANGES.txt
===================================================================
--- lucene/CHANGES.txt  (revision 1718426)
+++ lucene/CHANGES.txt  (working copy)
@@ -108,6 +108,12 @@

 ======================= Lucene 5.5.0 =======================

+New Features
+
+* LUCENE-5868: JoinUtil.createJoinQuery(..,NumericType,..) query-time join 
+  for LONG and INT fields with NUMERIC and SORTED_NUMERIC doc values.
+  (Alexey Zelin via Mikhail Khludnev) 
+
 API Changes

 * #7958: Grouping sortWithinGroup variables used to allow null to mean
Index: lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (revision 0)
+++ lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (working copy)
@@ -0,0 +1,136 @@
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import java.util.function.LongConsumer;
+
+import org.apache.lucene.document.FieldType.NumericType;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.NumericDocValues;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.SimpleCollector;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefBuilder;
+import org.apache.lucene.util.NumericUtils;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+abstract class DocValuesTermsCollector<DV> extends SimpleCollector {
+  
+  @FunctionalInterface
+  static interface Function<R> {
+      R apply(LeafReader t) throws IOException  ;
+  }
+  
+  protected DV docValues;
+  private final Function<DV> docValuesCall;
+  
+  public DocValuesTermsCollector(Function<DV> docValuesCall) {
+    this.docValuesCall = docValuesCall;
+  }
+
+  @Override
+  protected final void doSetNextReader(LeafReaderContext context) throws IOException {
+    docValues = docValuesCall.apply(context.reader());
+  }
+  
+  static Function<BinaryDocValues> binaryDocValues(String field) {
+      return (ctx) -> DocValues.getBinary(ctx, field);
+  }
+  static Function<SortedSetDocValues> sortedSetDocValues(String field) {
+    return (ctx) -> DocValues.getSortedSet(ctx, field);
+  }
+  
+  static Function<BinaryDocValues> numericAsBinaryDocValues(String field, NumericType numTyp) {
+    return (ctx) -> {
+      final NumericDocValues numeric = DocValues.getNumeric(ctx, field);
+      final BytesRefBuilder bytes = new BytesRefBuilder();
+      
+      final LongConsumer coder = coder(bytes, numTyp, field);
+      
+      return new BinaryDocValues() {
+        @Override
+        public BytesRef get(int docID) {
+          final long lVal = numeric.get(docID);
+          coder.accept(lVal);
+          return bytes.get();
+        }
+      };
+    };
+  }
+  
+  static LongConsumer coder(BytesRefBuilder bytes, NumericType type, String fieldName){
+    switch(type){
+      case INT: 
+        return (l) -> NumericUtils.intToPrefixCoded((int)l, 0, bytes);
+      case LONG: 
+        return (l) -> NumericUtils.longToPrefixCoded(l, 0, bytes);
+      default:
+        throw new IllegalArgumentException("Unsupported "+type+
+            ". Only "+NumericType.INT+" and "+NumericType.LONG+" are supported."
+            + "Field "+fieldName );
+    }
+  }
+  
+  /** this adapter is quite weird. ords are per doc index, don't use ords across different docs*/
+  static Function<SortedSetDocValues> sortedNumericAsSortedSetDocValues(String field, NumericType numTyp) {
+    return (ctx) -> {
+      final SortedNumericDocValues numerics = DocValues.getSortedNumeric(ctx, field);
+      final BytesRefBuilder bytes = new BytesRefBuilder();
+      
+      final LongConsumer coder = coder(bytes, numTyp, field);
+      
+      return new SortedSetDocValues() {
+
+        private int index = Integer.MIN_VALUE;
+
+        @Override
+        public long nextOrd() {
+          return index < numerics.count()-1 ? ++index : NO_MORE_ORDS;
+        }
+
+        @Override
+        public void setDocument(int docID) {
+          numerics.setDocument(docID);
+          index=-1;
+        }
+
+        @Override
+        public BytesRef lookupOrd(long ord) {
+          assert ord>=0 && ord<numerics.count();
+          final long value = numerics.valueAt((int)ord);
+          coder.accept(value);
+          return bytes.get();
+        }
+
+        @Override
+        public long getValueCount() {
+          throw new UnsupportedOperationException("it's just number encoding wrapper");
+        }
+        
+        @Override
+        public long lookupTerm(BytesRef key) {
+          throw new UnsupportedOperationException("it's just number encoding wrapper");
+        }
+      };
+    };
+  }
+}

Property changes on: lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Index: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java   (revision 0)
+++ lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java   (working copy)
@@ -0,0 +1,123 @@
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import java.io.PrintStream;
+
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.Collector;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.join.DocValuesTermsCollector.Function;
+import org.apache.lucene.search.join.TermsWithScoreCollector.MV;
+import org.apache.lucene.search.join.TermsWithScoreCollector.SV;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.BytesRefHash;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+interface GenericTermsCollector extends Collector {
+  
+  BytesRefHash getCollectedTerms() ;
+  
+  float[] getScoresPerTerm();
+  
+  static GenericTermsCollector createCollectorMV(Function<SortedSetDocValues> mvFunction,
+      ScoreMode mode) {
+    
+    switch (mode) {
+      case None:
+        return wrap(new TermsCollector.MV(mvFunction));
+      case Avg:
+          return new MV.Avg(mvFunction);
+      default:
+          return new MV(mvFunction, mode);
+    }
+  }
+
+  static Function<SortedSetDocValues> verbose(PrintStream out, Function<SortedSetDocValues> mvFunction){
+    return (ctx) -> {
+      final SortedSetDocValues target = mvFunction.apply(ctx);
+      return new SortedSetDocValues() {
+        
+        @Override
+        public void setDocument(int docID) {
+          target.setDocument(docID);
+          out.println("\ndoc# "+docID);
+        }
+        
+        @Override
+        public long nextOrd() {
+          return target.nextOrd();
+        }
+        
+        @Override
+        public BytesRef lookupOrd(long ord) {
+          final BytesRef val = target.lookupOrd(ord);
+          out.println(val.toString()+", ");
+          return val;
+        }
+        
+        @Override
+        public long getValueCount() {
+          return target.getValueCount();
+        }
+      };
+      
+    };
+  }
+
+  static GenericTermsCollector createCollectorSV(Function<BinaryDocValues> svFunction,
+      ScoreMode mode) {
+    
+    switch (mode) {
+      case None:
+        return wrap(new TermsCollector.SV(svFunction));
+      case Avg:
+        return new SV.Avg(svFunction);
+      default:
+        return new SV(svFunction, mode);  
+    }
+  }
+  
+  static GenericTermsCollector wrap(final TermsCollector<?> collector) {
+    return new GenericTermsCollector() {
+
+      
+      @Override
+      public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
+        return collector.getLeafCollector(context);
+      }
+
+      @Override
+      public boolean needsScores() {
+        return collector.needsScores();
+      }
+
+      @Override
+      public BytesRefHash getCollectedTerms() {
+        return collector.getCollectorTerms();
+      }
+
+      @Override
+      public float[] getScoresPerTerm() {
+        throw new UnsupportedOperationException("scores are not available for "+collector);
+      }
+    };
+  }
+}

Property changes on: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Index: lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java    (revision 1718426)
+++ lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java    (working copy)
@@ -1,5 +1,14 @@
 package org.apache.lucene.search.join;

+import java.io.IOException;
+import java.util.Locale;
+
+import org.apache.lucene.document.FieldType.NumericType;
+import org.apache.lucene.document.IntField;
+import org.apache.lucene.document.LongField;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValuesType;
+
 /*
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements.  See the NOTICE file distributed with
@@ -21,13 +30,12 @@
 import org.apache.lucene.index.LeafReader;
 import org.apache.lucene.index.MultiDocValues;
 import org.apache.lucene.index.SortedDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.MatchNoDocsQuery;
 import org.apache.lucene.search.Query;
+import org.apache.lucene.search.join.DocValuesTermsCollector.Function;

-import java.io.IOException;
-import java.util.Locale;
-
 /**
  * Utility for query time joining.
  *
@@ -67,28 +75,87 @@
    * @throws IOException If I/O related errors occur
    */
   public static Query createJoinQuery(String fromField,
-                                      boolean multipleValuesPerDocument,
-                                      String toField,
-                                      Query fromQuery,
-                                      IndexSearcher fromSearcher,
-                                      ScoreMode scoreMode) throws IOException {
+      boolean multipleValuesPerDocument,
+      String toField,
+      Query fromQuery,
+      IndexSearcher fromSearcher,
+      ScoreMode scoreMode) throws IOException {
+    
+    final GenericTermsCollector termsWithScoreCollector;
+     
+    if (multipleValuesPerDocument) {
+      Function<SortedSetDocValues> mvFunction = DocValuesTermsCollector.sortedSetDocValues(fromField);
+      termsWithScoreCollector = GenericTermsCollector.createCollectorMV(mvFunction, scoreMode);
+    } else {
+      Function<BinaryDocValues> svFunction = DocValuesTermsCollector.binaryDocValues(fromField);
+      termsWithScoreCollector =  GenericTermsCollector.createCollectorSV(svFunction, scoreMode);
+    }
+    
+    return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode,
+        termsWithScoreCollector);
+    
+  }
+  
+  /**
+   * Method for query time joining for numeric fields. It supports multi- and single- values longs and ints. 
+   * All considerations from {@link JoinUtil#createJoinQuery(String, boolean, String, Query, IndexSearcher, ScoreMode)} are applicable here too,
+   * though memory consumption might be higher.
+   * <p>
+   *
+   * @param fromField                 The from field to join from
+   * @param multipleValuesPerDocument Whether the from field has multiple terms per document
+   *                                  when true fromField might be {@link DocValuesType#SORTED_NUMERIC},
+   *                                  otherwise fromField should be {@link DocValuesType#NUMERIC}
+   * @param toField                   The to field to join to, should be {@link IntField} or {@link LongField}
+   * @param numericType               either {@link NumericType#INT} or {@link NumericType#LONG}, it should correspond to fromField and toField types
+   * @param fromQuery                 The query to match documents on the from side
+   * @param fromSearcher              The searcher that executed the specified fromQuery
+   * @param scoreMode                 Instructs how scores from the fromQuery are mapped to the returned query
+   * @return a {@link Query} instance that can be used to join documents based on the
+   *         terms in the from and to field
+   * @throws IOException If I/O related errors occur
+   */
+  
+  public static Query createJoinQuery(String fromField,
+      boolean multipleValuesPerDocument,
+      String toField, NumericType numericType,
+      Query fromQuery,
+      IndexSearcher fromSearcher,
+      ScoreMode scoreMode) throws IOException {
+    
+    final GenericTermsCollector termsCollector;
+     
+    if (multipleValuesPerDocument) {
+      Function<SortedSetDocValues> mvFunction = DocValuesTermsCollector.sortedNumericAsSortedSetDocValues(fromField,numericType);
+      termsCollector = GenericTermsCollector.createCollectorMV(mvFunction, scoreMode);
+    } else {
+      Function<BinaryDocValues> svFunction = DocValuesTermsCollector.numericAsBinaryDocValues(fromField,numericType);
+      termsCollector =  GenericTermsCollector.createCollectorSV(svFunction, scoreMode);
+    }
+    
+    return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode,
+        termsCollector);
+    
+  }
+  
+  private static Query createJoinQuery(boolean multipleValuesPerDocument, String toField, Query fromQuery,
+      IndexSearcher fromSearcher, ScoreMode scoreMode, final GenericTermsCollector collector)
+          throws IOException {
+    
+    fromSearcher.search(fromQuery, collector);
+    
     switch (scoreMode) {
       case None:
-        TermsCollector termsCollector = TermsCollector.create(fromField, multipleValuesPerDocument);
-        fromSearcher.search(fromQuery, termsCollector);
-        return new TermsQuery(toField, fromQuery, termsCollector.getCollectorTerms());
+        return new TermsQuery(toField, fromQuery, collector.getCollectedTerms());
       case Total:
       case Max:
       case Min:
       case Avg:
-        TermsWithScoreCollector termsWithScoreCollector =
-            TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode);
-        fromSearcher.search(fromQuery, termsWithScoreCollector);
         return new TermsIncludingScoreQuery(
             toField,
             multipleValuesPerDocument,
-            termsWithScoreCollector.getCollectedTerms(),
-            termsWithScoreCollector.getScoresPerTerm(),
+            collector.getCollectedTerms(),
+            collector.getScoresPerTerm(),
             fromQuery
         );
       default:
@@ -96,6 +163,7 @@
     }
   }

+
   /**
    * Delegates to {@link #createJoinQuery(String, Query, Query, IndexSearcher, ScoreMode, MultiDocValues.OrdinalMap, int, int)},
    * but disables the min and max filtering.
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java  (revision 1718426)
+++ lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java  (working copy)
@@ -19,11 +19,8 @@

 import java.io.IOException;

-import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.index.BinaryDocValues;
-import org.apache.lucene.index.DocValues;
 import org.apache.lucene.index.SortedSetDocValues;
-import org.apache.lucene.search.SimpleCollector;
 import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.BytesRefHash;

@@ -32,19 +29,19 @@
  *
  * @lucene.experimental
  */
-abstract class TermsCollector extends SimpleCollector {
+abstract class TermsCollector<DV> extends DocValuesTermsCollector<DV> {

-  final String field;
+  TermsCollector(Function<DV> docValuesCall) {
+    super(docValuesCall);
+  }
+
   final BytesRefHash collectorTerms = new BytesRefHash();

-  TermsCollector(String field) {
-    this.field = field;
-  }
-
   public BytesRefHash getCollectorTerms() {
     return collectorTerms;
   }

+  
   /**
    * Chooses the right {@link TermsCollector} implementation.
    *
@@ -52,55 +49,42 @@
    * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document.
    * @return a {@link TermsCollector} instance
    */
-  static TermsCollector create(String field, boolean multipleValuesPerDocument) {
-    return multipleValuesPerDocument ? new MV(field) : new SV(field);
+  static TermsCollector<?> create(String field, boolean multipleValuesPerDocument) {
+    return multipleValuesPerDocument 
+        ? new MV(sortedSetDocValues(field))
+        : new SV(binaryDocValues(field));
   }
-
+  
   // impl that works with multiple values per document
-  static class MV extends TermsCollector {
-    final BytesRef scratch = new BytesRef();
-    private SortedSetDocValues docTermOrds;
-
-    MV(String field) {
-      super(field);
+  static class MV extends TermsCollector<SortedSetDocValues> {
+    
+    MV(Function<SortedSetDocValues> docValuesCall) {
+      super(docValuesCall);
     }

     @Override
     public void collect(int doc) throws IOException {
-      docTermOrds.setDocument(doc);
       long ord;
-      while ((ord = docTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
-        final BytesRef term = docTermOrds.lookupOrd(ord);
+      docValues.setDocument(doc);
+      while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
+        final BytesRef term = docValues.lookupOrd(ord);
         collectorTerms.add(term);
       }
     }
-
-    @Override
-    protected void doSetNextReader(LeafReaderContext context) throws IOException {
-      docTermOrds = DocValues.getSortedSet(context.reader(), field);
-    }
   }

   // impl that works with single value per document
-  static class SV extends TermsCollector {
+  static class SV extends TermsCollector<BinaryDocValues> {

-    final BytesRef spare = new BytesRef();
-    private BinaryDocValues fromDocTerms;
-
-    SV(String field) {
-      super(field);
+    SV(Function<BinaryDocValues> docValuesCall) {
+      super(docValuesCall);
     }

     @Override
     public void collect(int doc) throws IOException {
-      final BytesRef term = fromDocTerms.get(doc);
+      final BytesRef term = docValues.get(doc);
       collectorTerms.add(term);
     }
-
-    @Override
-    protected void doSetNextReader(LeafReaderContext context) throws IOException {
-      fromDocTerms = DocValues.getBinary(context.reader(), field);
-    }
   }

   @Override
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java    (revision 1718426)
+++ lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java    (working copy)
@@ -18,6 +18,7 @@
  */

 import java.io.IOException;
+import java.io.PrintStream;
 import java.util.Locale;
 import java.util.Set;

@@ -37,6 +38,7 @@
 import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.BytesRefHash;
 import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.NumericUtils;

 class TermsIncludingScoreQuery extends Query {

@@ -268,5 +270,23 @@
       }
     }
   }
-
+  
+  void dump(PrintStream out){
+    out.println(field+":");
+    final BytesRef ref = new BytesRef();
+    for (int i = 0; i < terms.size(); i++) {
+      terms.get(ords[i], ref);
+      out.print(ref+" "+ref.utf8ToString()+" ");
+      try {
+        out.print(Long.toHexString(NumericUtils.prefixCodedToLong(ref))+"L");
+      } catch (Exception e) {
+        try {
+          out.print(Integer.toHexString(NumericUtils.prefixCodedToInt(ref))+"i");
+        } catch (Exception ee) {
+        }
+      }
+      out.println(" score="+scores[ords[i]]);
+      out.println("");
+    }
+  }
 }
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (revision 1718426)
+++ lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (working copy)
@@ -1,5 +1,8 @@
 package org.apache.lucene.search.join;

+import java.io.IOException;
+import java.util.Arrays;
+
 /*
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements.  See the NOTICE file distributed with
@@ -18,22 +21,16 @@
  */

 import org.apache.lucene.index.BinaryDocValues;
-import org.apache.lucene.index.DocValues;
-import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.index.SortedSetDocValues;
 import org.apache.lucene.search.Scorer;
-import org.apache.lucene.search.SimpleCollector;
 import org.apache.lucene.util.ArrayUtil;
 import org.apache.lucene.util.BytesRefHash;

-import java.io.IOException;
-import java.util.Arrays;
+abstract class TermsWithScoreCollector<DV> extends DocValuesTermsCollector<DV> 
+                                    implements GenericTermsCollector {

-abstract class TermsWithScoreCollector extends SimpleCollector {
-
   private final static int INITIAL_ARRAY_SIZE = 0;

-  final String field;
   final BytesRefHash collectedTerms = new BytesRefHash();
   final ScoreMode scoreMode;

@@ -40,8 +37,8 @@
   Scorer scorer;
   float[] scoreSums = new float[INITIAL_ARRAY_SIZE];

-  TermsWithScoreCollector(String field, ScoreMode scoreMode) {
-    this.field = field;
+  TermsWithScoreCollector(Function<DV> docValuesCall, ScoreMode scoreMode) {
+    super(docValuesCall);
     this.scoreMode = scoreMode;
     if (scoreMode == ScoreMode.Min) {
       Arrays.fill(scoreSums, Float.POSITIVE_INFINITY);
@@ -50,10 +47,12 @@
     }
   }

+  @Override
   public BytesRefHash getCollectedTerms() {
     return collectedTerms;
   }
-
+ 
+  @Override
   public float[] getScoresPerTerm() {
     return scoreSums;
   }
@@ -70,36 +69,34 @@
    * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document.
    * @return a {@link TermsWithScoreCollector} instance
    */
-  static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) {
+  static TermsWithScoreCollector<?> create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) {
     if (multipleValuesPerDocument) {
       switch (scoreMode) {
         case Avg:
-          return new MV.Avg(field);
+          return new MV.Avg(sortedSetDocValues(field));
         default:
-          return new MV(field, scoreMode);
+          return new MV(sortedSetDocValues(field), scoreMode);
       }
     } else {
       switch (scoreMode) {
         case Avg:
-          return new SV.Avg(field);
+          return new SV.Avg(binaryDocValues(field));
         default:
-          return new SV(field, scoreMode);
+          return new SV(binaryDocValues(field), scoreMode);
       }
     }
   }
-
+ 
   // impl that works with single value per document
-  static class SV extends TermsWithScoreCollector {
+  static class SV extends TermsWithScoreCollector<BinaryDocValues> {

-    BinaryDocValues fromDocTerms;
-
-    SV(String field, ScoreMode scoreMode) {
-      super(field, scoreMode);
+    SV(Function<BinaryDocValues> docValuesCall, ScoreMode scoreMode) {
+      super(docValuesCall, scoreMode);
     }

     @Override
     public void collect(int doc) throws IOException {
-      int ord = collectedTerms.add(fromDocTerms.get(doc));
+      int ord = collectedTerms.add(docValues.get(doc));
       if (ord < 0) {
         ord = -ord - 1;
       } else {
@@ -133,26 +130,23 @@
               scoreSums[ord] = current;
             }
             break;
+          default:
+            throw new AssertionError("unexpected: " + scoreMode);
         }
       }
     }

-    @Override
-    protected void doSetNextReader(LeafReaderContext context) throws IOException {
-      fromDocTerms = DocValues.getBinary(context.reader(), field);
-    }
-
     static class Avg extends SV {

       int[] scoreCounts = new int[INITIAL_ARRAY_SIZE];

-      Avg(String field) {
-        super(field, ScoreMode.Avg);
+      Avg(Function<BinaryDocValues> docValuesCall) {
+        super(docValuesCall, ScoreMode.Avg);
       }

       @Override
       public void collect(int doc) throws IOException {
-        int ord = collectedTerms.add(fromDocTerms.get(doc));
+        int ord = collectedTerms.add(docValues.get(doc));
         if (ord < 0) {
           ord = -ord - 1;
         } else {
@@ -187,20 +181,18 @@
   }

   // impl that works with multiple values per document
-  static class MV extends TermsWithScoreCollector {
+  static class MV extends TermsWithScoreCollector<SortedSetDocValues> {

-    SortedSetDocValues fromDocTermOrds;
-
-    MV(String field, ScoreMode scoreMode) {
-      super(field, scoreMode);
+    MV(Function<SortedSetDocValues> docValuesCall, ScoreMode scoreMode) {
+      super(docValuesCall, scoreMode);
     }

     @Override
     public void collect(int doc) throws IOException {
-      fromDocTermOrds.setDocument(doc);
+      docValues.setDocument(doc);
       long ord;
-      while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
-        int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord));
+      while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
+        int termID = collectedTerms.add(docValues.lookupOrd(ord));
         if (termID < 0) {
           termID = -termID - 1;
         } else {
@@ -225,29 +217,26 @@
           case Max:
             scoreSums[termID] = Math.max(scoreSums[termID], scorer.score());
             break;
+          default:
+            throw new AssertionError("unexpected: " + scoreMode);
         }
       }
     }

-    @Override
-    protected void doSetNextReader(LeafReaderContext context) throws IOException {
-      fromDocTermOrds = DocValues.getSortedSet(context.reader(), field);
-    }
-
     static class Avg extends MV {

       int[] scoreCounts = new int[INITIAL_ARRAY_SIZE];

-      Avg(String field) {
-        super(field, ScoreMode.Avg);
+      Avg(Function<SortedSetDocValues> docValuesCall) {
+        super(docValuesCall, ScoreMode.Avg);
       }

       @Override
       public void collect(int doc) throws IOException {
-        fromDocTermOrds.setDocument(doc);
+        docValues.setDocument(doc);
         long ord;
-        while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
-          int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord));
+        while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
+          int termID = collectedTerms.add(docValues.lookupOrd(ord));
           if (termID < 0) {
             termID = -termID - 1;
           } else {
Index: lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java
===================================================================
--- lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java    (revision 1718426)
+++ lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java    (working copy)
@@ -1,31 +1,30 @@
 package org.apache.lucene.search.join;

-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Comparator;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Random;
+import java.util.Set;
+import java.util.SortedSet;
+import java.util.TreeSet;

-import com.carrotsearch.randomizedtesting.generators.RandomInts;
-import com.carrotsearch.randomizedtesting.generators.RandomPicks;
-
 import org.apache.lucene.analysis.MockAnalyzer;
 import org.apache.lucene.analysis.MockTokenizer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
+import org.apache.lucene.document.FieldType.NumericType;
+import org.apache.lucene.document.IntField;
+import org.apache.lucene.document.LongField;
 import org.apache.lucene.document.NumericDocValuesField;
 import org.apache.lucene.document.SortedDocValuesField;
+import org.apache.lucene.document.SortedNumericDocValuesField;
 import org.apache.lucene.document.SortedSetDocValuesField;
 import org.apache.lucene.document.StringField;
 import org.apache.lucene.document.TextField;
@@ -78,19 +77,26 @@
 import org.apache.lucene.util.packed.PackedInts;
 import org.junit.Test;

-import java.io.IOException;
-import java.util.ArrayList;
-import java.util.Collections;
-import java.util.Comparator;
-import java.util.HashMap;
-import java.util.HashSet;
-import java.util.List;
-import java.util.Locale;
-import java.util.Map;
-import java.util.Set;
-import java.util.SortedSet;
-import java.util.TreeSet;
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */

+import com.carrotsearch.randomizedtesting.generators.RandomInts;
+import com.carrotsearch.randomizedtesting.generators.RandomPicks;
+
 public class TestJoinUtil extends LuceneTestCase {

   public void testSimple() throws Exception {
@@ -850,10 +856,18 @@
         }

         final Query joinQuery;
-        if (from) {
-          joinQuery = JoinUtil.createJoinQuery("from", multipleValuesPerDocument, "to", actualQuery, indexSearcher, scoreMode);
-        } else {
-          joinQuery = JoinUtil.createJoinQuery("to", multipleValuesPerDocument, "from", actualQuery, indexSearcher, scoreMode);
+        {
+          // single val can be handled by multiple-vals 
+          final boolean muliValsQuery = multipleValuesPerDocument || random().nextBoolean();
+          final String fromField = from ? "from":"to"; 
+          final String toField = from ? "to":"from"; 
+          
+          if (random().nextBoolean()) { // numbers
+            final NumericType numType = random().nextBoolean() ? NumericType.INT: NumericType.LONG ;
+            joinQuery = JoinUtil.createJoinQuery(fromField+numType, muliValsQuery, toField+numType, numType, actualQuery, indexSearcher, scoreMode);
+          } else {
+            joinQuery = JoinUtil.createJoinQuery(fromField, muliValsQuery, toField, actualQuery, indexSearcher, scoreMode);
+          }
         }
         if (VERBOSE) {
           System.out.println("joinQuery=" + joinQuery);
@@ -897,7 +911,6 @@
       return;
     }

-    assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f);
     if (VERBOSE) {
       for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) {
         System.out.printf(Locale.ENGLISH, "Expected doc: %d | Actual doc: %d\n", expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc);
@@ -904,6 +917,7 @@
         System.out.printf(Locale.ENGLISH, "Expected score: %f | Actual score: %f\n", expectedTopDocs.scoreDocs[i].score, actualTopDocs.scoreDocs[i].score);
       }
     }
+    assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f);

     for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) {
       assertEquals(expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc);
@@ -919,14 +933,15 @@
     }

     Directory dir = newDirectory();
+    final Random random = random();
     RandomIndexWriter w = new RandomIndexWriter(
-        random(),
+        random,
         dir,
-        newIndexWriterConfig(new MockAnalyzer(random(), MockTokenizer.KEYWORD, false))
+        newIndexWriterConfig(new MockAnalyzer(random, MockTokenizer.KEYWORD, false))
     );

     IndexIterationContext context = new IndexIterationContext();
-    int numRandomValues = nDocs / RandomInts.randomIntBetween(random(), 2, 10);
+    int numRandomValues = nDocs / RandomInts.randomIntBetween(random, 1, 4);
     context.randomUniqueValues = new String[numRandomValues];
     Set<String> trackSet = new HashSet<>();
     context.randomFrom = new boolean[numRandomValues];
@@ -933,32 +948,46 @@
     for (int i = 0; i < numRandomValues; i++) {
       String uniqueRandomValue;
       do {
-//        uniqueRandomValue = TestUtil.randomRealisticUnicodeString(random());
-        uniqueRandomValue = TestUtil.randomSimpleString(random());
+        // the trick is to generate values which will be ordered similarly for string, ints&longs, positive nums makes it easier 
+        final int nextInt = random.nextInt(Integer.MAX_VALUE);
+        uniqueRandomValue = String.format(Locale.ROOT, "%08x", nextInt);
+        assert nextInt == Integer.parseUnsignedInt(uniqueRandomValue,16);
       } while ("".equals(uniqueRandomValue) || trackSet.contains(uniqueRandomValue));
+     
       // Generate unique values and empty strings aren't allowed.
       trackSet.add(uniqueRandomValue);
-      context.randomFrom[i] = random().nextBoolean();
+      
+      context.randomFrom[i] = random.nextBoolean();
       context.randomUniqueValues[i] = uniqueRandomValue;
+      
     }

+    List<String> randomUniqueValuesReplica = new ArrayList<>(Arrays.asList(context.randomUniqueValues));
+        
     RandomDoc[] docs = new RandomDoc[nDocs];
     for (int i = 0; i < nDocs; i++) {
       String id = Integer.toString(i);
-      int randomI = random().nextInt(context.randomUniqueValues.length);
+      int randomI = random.nextInt(context.randomUniqueValues.length);
       String value = context.randomUniqueValues[randomI];
       Document document = new Document();
-      document.add(newTextField(random(), "id", id, Field.Store.YES));
-      document.add(newTextField(random(), "value", value, Field.Store.NO));
+      document.add(newTextField(random, "id", id, Field.Store.YES));
+      document.add(newTextField(random, "value", value, Field.Store.NO));

       boolean from = context.randomFrom[randomI];
-      int numberOfLinkValues = multipleValuesPerDocument ? 2 + random().nextInt(10) : 1;
+      int numberOfLinkValues = multipleValuesPerDocument ? Math.min(2 + random.nextInt(10), context.randomUniqueValues.length) : 1;
       docs[i] = new RandomDoc(id, numberOfLinkValues, value, from);
       if (globalOrdinalJoin) {
         document.add(newStringField("type", from ? "from" : "to", Field.Store.NO));
       }
-      for (int j = 0; j < numberOfLinkValues; j++) {
-        String linkValue = context.randomUniqueValues[random().nextInt(context.randomUniqueValues.length)];
+      final List<String> subValues;
+      {
+      int start = randomUniqueValuesReplica.size()==numberOfLinkValues? 0 : random.nextInt(randomUniqueValuesReplica.size()-numberOfLinkValues);
+      subValues = randomUniqueValuesReplica.subList(start, start+numberOfLinkValues);
+      Collections.shuffle(subValues, random);
+      }
+      for (String linkValue : subValues) {
+        
+        assert !docs[i].linkValues.contains(linkValue);
         docs[i].linkValues.add(linkValue);
         if (from) {
           if (!context.fromDocuments.containsKey(linkValue)) {
@@ -970,15 +999,8 @@

           context.fromDocuments.get(linkValue).add(docs[i]);
           context.randomValueFromDocs.get(value).add(docs[i]);
-          document.add(newTextField(random(), "from", linkValue, Field.Store.NO));
-          if (multipleValuesPerDocument) {
-            document.add(new SortedSetDocValuesField("from", new BytesRef(linkValue)));
-          } else {
-            document.add(new SortedDocValuesField("from", new BytesRef(linkValue)));
-          }
-          if (globalOrdinalJoin) {
-            document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue)));
-          }
+          addLinkFields(random, document,  "from", linkValue, multipleValuesPerDocument, globalOrdinalJoin);
+          
         } else {
           if (!context.toDocuments.containsKey(linkValue)) {
             context.toDocuments.put(linkValue, new ArrayList<>());
@@ -989,20 +1011,12 @@

           context.toDocuments.get(linkValue).add(docs[i]);
           context.randomValueToDocs.get(value).add(docs[i]);
-          document.add(newTextField(random(), "to", linkValue, Field.Store.NO));
-          if (multipleValuesPerDocument) {
-            document.add(new SortedSetDocValuesField("to", new BytesRef(linkValue)));
-          } else {
-            document.add(new SortedDocValuesField("to", new BytesRef(linkValue)));
-          }
-          if (globalOrdinalJoin) {
-            document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue)));
-          }
+          addLinkFields(random, document,  "to", linkValue, multipleValuesPerDocument, globalOrdinalJoin);
         }
       }

       w.addDocument(document);
-      if (random().nextInt(10) == 4) {
+      if (random.nextInt(10) == 4) {
         w.commit();
       }
       if (VERBOSE) {
@@ -1010,7 +1024,7 @@
       }
     }

-    if (random().nextBoolean()) {
+    if (random.nextBoolean()) {
       w.forceMerge(1);
     }
     w.close();
@@ -1185,6 +1199,30 @@
     return context;
   }

+  private void addLinkFields(final Random random, Document document, final String fieldName, String linkValue,
+      boolean multipleValuesPerDocument, boolean globalOrdinalJoin) {
+    document.add(newTextField(random, fieldName, linkValue, Field.Store.NO));
+
+    final int linkInt = Integer.parseUnsignedInt(linkValue,16);
+    document.add(new IntField(fieldName+NumericType.INT, linkInt, Field.Store.NO));
+
+    final long linkLong = (long) linkInt << 32 | linkInt;
+    document.add(new LongField(fieldName+NumericType.LONG, linkLong, Field.Store.NO));
+
+    if (multipleValuesPerDocument) {
+      document.add(new SortedSetDocValuesField(fieldName, new BytesRef(linkValue)));
+      document.add(new SortedNumericDocValuesField(fieldName+NumericType.INT, linkInt));
+      document.add(new SortedNumericDocValuesField(fieldName+NumericType.LONG, linkLong));
+    } else {
+      document.add(new SortedDocValuesField(fieldName, new BytesRef(linkValue)));
+      document.add(new NumericDocValuesField(fieldName+NumericType.INT, linkInt));
+      document.add(new NumericDocValuesField(fieldName+NumericType.LONG, linkLong));
+    }
+    if (globalOrdinalJoin) {
+      document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue)));
+    }
+  }
+
   private TopDocs createExpectedTopDocs(String queryValue,
                                         final boolean from,
                                         final ScoreMode scoreMode,

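The test data generator in the patch above formats each random non-negative int as fixed-width hex so that the string, int and long link values sort the same way. A minimal, self-contained sketch of that property follows; `HexOrderSketch` and its methods are hypothetical names for illustration and are not part of the patch. It also shows that packing an int into both halves of a long needs a widening cast before the shift, because an `int` shifted by 32 in Java is shifted by 0.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Random;

final class HexOrderSketch {

  /** Fixed-width, zero-padded hex form, like the string link values in the test. */
  static String toHex(int value) {
    return String.format(Locale.ROOT, "%08x", value);
  }

  /** Packs the int into both halves of a long; the cast widens before shifting. */
  static long toLong(int value) {
    return ((long) value << 32) | value;
  }

  /** True when sorting by int, by hex string and by packed long gives the same order. */
  static boolean sameOrder(List<Integer> values) {
    List<Integer> byInt = new ArrayList<>(values);
    Collections.sort(byInt);
    List<Integer> byHex = new ArrayList<>(values);
    byHex.sort((a, b) -> toHex(a).compareTo(toHex(b)));
    List<Integer> byLong = new ArrayList<>(values);
    byLong.sort((a, b) -> Long.compare(toLong(a), toLong(b)));
    return byInt.equals(byHex) && byInt.equals(byLong);
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    List<Integer> values = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      values.add(random.nextInt(Integer.MAX_VALUE)); // non-negative values only
    }
    System.out.println("same order: " + sameOrder(values));
  }
}
```

The equivalence only holds for non-negative values, which is why the generator draws from `random.nextInt(Integer.MAX_VALUE)`.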
asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1718443 from mkhl@apache.org in branch 'dev/trunk' https://svn.apache.org/r1718443

LUCENE-5868: query-time join for numerics
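As a rough mental model of what a query-time join over numeric values does, here is a plain-Java sketch of the two phases: collect the "from" values of the documents matching the from query, then select the documents whose "to" values hit that set. This is not the JoinUtil API; `NumericJoinSketch`, `Doc` and `join` are hypothetical names, and scoring is left out entirely.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

final class NumericJoinSketch {

  /** Hypothetical document: an id plus numeric "from" and "to" link values. */
  static final class Doc {
    final int id;
    final long[] from;
    final long[] to;

    Doc(int id, long[] from, long[] to) {
      this.id = id;
      this.from = from;
      this.to = to;
    }
  }

  /** Returns ids of docs whose "to" values hit a "from" value of any doc matching fromQuery. */
  static List<Integer> join(List<Doc> docs, Predicate<Doc> fromQuery) {
    // Phase 1: collect the distinct "from" values of the matching documents.
    Set<Long> collected = new HashSet<>();
    for (Doc doc : docs) {
      if (fromQuery.test(doc)) {
        for (long value : doc.from) {
          collected.add(value);
        }
      }
    }
    // Phase 2: keep every document that carries one of the collected values in "to".
    List<Integer> hits = new ArrayList<>();
    for (Doc doc : docs) {
      for (long value : doc.to) {
        if (collected.contains(value)) {
          hits.add(doc.id);
          break;
        }
      }
    }
    return hits;
  }
}
```

In the patch the collected values are kept as prefix-coded bytes rather than boxed longs, so the existing terms-based queries can be reused on the "to" side.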

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

Oh my... there are no λ-s in 5.x. The patch is going to lose most of its beauty.

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

Moved to Java 7; the result is LUCENE-5868-5x.patch.
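For readers unfamiliar with the backport mechanics: every lambda has to become an anonymous inner class, and every captured local or parameter has to be declared `final`. A minimal sketch of the two equivalent forms follows; `LongSink` and `Java7BackportSketch` are hypothetical names for illustration, not classes from the patch.

```java
final class Java7BackportSketch {

  /** Minimal stand-in for a one-method callback interface. */
  interface LongSink {
    void accept(long value);
  }

  /** Java 8 style: the lambda captures the effectively final parameter. */
  static LongSink java8Style(StringBuilder out) {
    return value -> out.append(Long.toHexString(value));
  }

  /** Java 7 style: anonymous inner class, captured parameter must be declared final. */
  static LongSink java7Style(final StringBuilder out) {
    return new LongSink() {
      @Override
      public void accept(long value) {
        out.append(Long.toHexString(value));
      }
    };
  }

  public static void main(String[] args) {
    StringBuilder a = new StringBuilder();
    StringBuilder b = new StringBuilder();
    java8Style(a).accept(255L);
    java7Style(b).accept(255L);
    System.out.println(a.toString().equals(b.toString()));
  }
}
```

Both forms behave the same; the Java 7 one is just more verbose, which is what the comment above laments.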

LUCENE-5868-5x.patch

```diff
Index: lucene/CHANGES.txt
===================================================================
--- lucene/CHANGES.txt (revision 1718443)
+++ lucene/CHANGES.txt (working copy)
@@ -6,6 +6,12 @@
 ======================= Lucene 5.5.0 =======================
 
+New Features
+
+* LUCENE-5868: JoinUtil.createJoinQuery(..,NumericType,..) query-time join
+  for LONG and INT fields with NUMERIC and SORTED_NUMERIC doc values.
+  (Alexey Zelin via Mikhail Khludnev)
+
 API Changes
 
 * #7958: Grouping sortWithinGroup variables used to allow null to mean

Property changes on: lucene/CHANGES.txt
___________________________________________________________________
Modified: svn:mergeinfo
   Merged /lucene/dev/trunk/lucene/CHANGES.txt:r1718443
Index: lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (revision 1718443)
+++ lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (working copy)
@@ -1,7 +1,6 @@
 package org.apache.lucene.search.join;
 
 import java.io.IOException;
-import java.util.function.LongConsumer;
 
 import org.apache.lucene.document.FieldType.NumericType;
 import org.apache.lucene.index.BinaryDocValues;
@@ -35,11 +34,14 @@
 
 abstract class DocValuesTermsCollector<DV> extends SimpleCollector {
 
-  @FunctionalInterface
   static interface Function<R> {
       R apply(LeafReader t) throws IOException ;
   }
 
+  static interface LongConsumer {
+    void accept(long value);
+  }
+
   protected DV docValues;
   private final Function<DV> docValuesCall;
 
@@ -52,37 +54,62 @@
     docValues = docValuesCall.apply(context.reader());
   }
 
-  static Function<BinaryDocValues> binaryDocValues(String field) {
-      return (ctx) -> DocValues.getBinary(ctx, field);
-  }
-  static Function<SortedSetDocValues> sortedSetDocValues(String field) {
-    return (ctx) -> DocValues.getSortedSet(ctx, field);
-  }
-
-  static Function<BinaryDocValues> numericAsBinaryDocValues(String field, NumericType numTyp) {
-    return (ctx) -> {
-      final NumericDocValues numeric = DocValues.getNumeric(ctx, field);
-      final BytesRefBuilder bytes = new BytesRefBuilder();
-
-      final LongConsumer coder = coder(bytes, numTyp, field);
-
-      return new BinaryDocValues() {
+  static Function<BinaryDocValues> binaryDocValues(final String field) {
+    return new Function<BinaryDocValues>()
+    {
         @Override
-        public BytesRef get(int docID) {
-          final long lVal = numeric.get(docID);
-          coder.accept(lVal);
-          return bytes.get();
+      public BinaryDocValues apply(LeafReader ctx) throws IOException {
+        return DocValues.getBinary(ctx, field);
         }
       };
+  }
+  static Function<SortedSetDocValues> sortedSetDocValues(final String field) {
+    return new Function<SortedSetDocValues>()
+    {
+      @Override
+      public SortedSetDocValues apply(LeafReader ctx) throws IOException {
+        return DocValues.getSortedSet(ctx, field);
+      }
     };
   }
 
-  static LongConsumer coder(BytesRefBuilder bytes, NumericType type, String fieldName){
+  static Function<BinaryDocValues> numericAsBinaryDocValues(final String field, final NumericType numTyp) {
+    return new Function<BinaryDocValues>() {
+      @Override
+      public BinaryDocValues apply(LeafReader ctx) throws IOException {
+        final NumericDocValues numeric = DocValues.getNumeric(ctx, field);
+        final BytesRefBuilder bytes = new BytesRefBuilder();
+
+        final LongConsumer coder = coder(bytes, numTyp, field);
+
+        return new BinaryDocValues() {
+          @Override
+          public BytesRef get(int docID) {
+            final long lVal = numeric.get(docID);
+            coder.accept(lVal);
+            return bytes.get();
+          }
+        };
+      }
+    };
+  }
+
+  static LongConsumer coder(final BytesRefBuilder bytes, NumericType type, String fieldName){
     switch(type){
       case INT:
-        return (l) -> NumericUtils.intToPrefixCoded((int)l, 0, bytes);
+        return new LongConsumer() {
+          @Override
+          public void accept(long value) {
+            NumericUtils.intToPrefixCoded((int)value, 0, bytes);
+          }
+        };
       case LONG:
-        return (l) -> NumericUtils.longToPrefixCoded(l, 0, bytes);
+        return new LongConsumer() {
+          @Override
+          public void accept(long value) {
+            NumericUtils.longToPrefixCoded(value, 0, bytes);
+          }
+        };
       default:
         throw new IllegalArgumentException("Unsupported "+type+
             ". Only "+NumericType.INT+" and "+NumericType.LONG+" are supported."
@@ -91,46 +118,49 @@
   }
 
   /** this adapter is quite weird. ords are per doc index, don't use ords across different docs*/
-  static Function<SortedSetDocValues> sortedNumericAsSortedSetDocValues(String field, NumericType numTyp) {
-    return (ctx) -> {
-      final SortedNumericDocValues numerics = DocValues.getSortedNumeric(ctx, field);
-      final BytesRefBuilder bytes = new BytesRefBuilder();
-
-      final LongConsumer coder = coder(bytes, numTyp, field);
-
-      return new SortedSetDocValues() {
-
-        private int index = Integer.MIN_VALUE;
-
-        @Override
-        public long nextOrd() {
-          return index < numerics.count()-1 ? ++index : NO_MORE_ORDS;
-        }
-
-        @Override
-        public void setDocument(int docID) {
-          numerics.setDocument(docID);
-          index=-1;
-        }
-
-        @Override
-        public BytesRef lookupOrd(long ord) {
-          assert ord>=0 && ord<numerics.count();
-          final long value = numerics.valueAt((int)ord);
-          coder.accept(value);
-          return bytes.get();
-        }
-
-        @Override
-        public long getValueCount() {
-          throw new UnsupportedOperationException("it's just number encoding wrapper");
-        }
+  static Function<SortedSetDocValues> sortedNumericAsSortedSetDocValues(final String field, final NumericType numTyp) {
+    return new Function<SortedSetDocValues>() {
+      @Override
+      public SortedSetDocValues apply(LeafReader ctx) throws IOException {
+        final SortedNumericDocValues numerics = DocValues.getSortedNumeric(ctx, field);
+        final BytesRefBuilder bytes = new BytesRefBuilder();
 
-        @Override
-        public long lookupTerm(BytesRef key) {
-          throw new UnsupportedOperationException("it's just number encoding wrapper");
-        }
-      };
+        final LongConsumer coder = coder(bytes, numTyp, field);
+
+        return new SortedSetDocValues() {
+
+          private int index = Integer.MIN_VALUE;
+
+          @Override
+          public long nextOrd() {
+            return index < numerics.count() - 1 ? ++index : NO_MORE_ORDS;
+          }
+
+          @Override
+          public void setDocument(int docID) {
+            numerics.setDocument(docID);
+            index = -1;
+          }
+
+          @Override
+          public BytesRef lookupOrd(long ord) {
+            assert ord >= 0 && ord < numerics.count();
+            final long value = numerics.valueAt((int) ord);
+            coder.accept(value);
+            return bytes.get();
+          }
+
+          @Override
+          public long getValueCount() {
+            throw new UnsupportedOperationException("it's just number encoding wrapper");
+          }
+
+          @Override
+          public long lookupTerm(BytesRef key) {
+            throw new UnsupportedOperationException("it's just number encoding wrapper");
+          }
+        };
+      }
     };
   }
 }
Index: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java (revision 1718443)
+++ lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java (working copy)
@@ -1,17 +1,6 @@
 package org.apache.lucene.search.join;
 
-import java.io.IOException;
-import java.io.PrintStream;
-
-import org.apache.lucene.index.BinaryDocValues;
-import org.apache.lucene.index.LeafReaderContext;
-import org.apache.lucene.index.SortedSetDocValues;
 import org.apache.lucene.search.Collector;
-import org.apache.lucene.search.LeafCollector;
-import org.apache.lucene.search.join.DocValuesTermsCollector.Function;
-import org.apache.lucene.search.join.TermsWithScoreCollector.MV;
-import org.apache.lucene.search.join.TermsWithScoreCollector.SV;
-import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.BytesRefHash;
 
 /*
@@ -37,87 +26,4 @@
 
   float[] getScoresPerTerm();
 
-  static GenericTermsCollector createCollectorMV(Function<SortedSetDocValues> mvFunction,
-      ScoreMode mode) {
-
-    switch (mode) {
-      case None:
-        return wrap(new TermsCollector.MV(mvFunction));
-      case Avg:
-        return new MV.Avg(mvFunction);
-      default:
-        return new MV(mvFunction, mode);
-    }
-  }
-
-  static Function<SortedSetDocValues> verbose(PrintStream out, Function<SortedSetDocValues> mvFunction){
-    return (ctx) -> {
-      final SortedSetDocValues target = mvFunction.apply(ctx);
-      return new SortedSetDocValues() {
-
-        @Override
-        public void setDocument(int docID) {
-          target.setDocument(docID);
-          out.println("\ndoc# "+docID);
-        }
-
-        @Override
-        public long nextOrd() {
-          return target.nextOrd();
-        }
-
-        @Override
-        public BytesRef lookupOrd(long ord) {
-          final BytesRef val = target.lookupOrd(ord);
-          out.println(val.toString()+", ");
-          return val;
-        }
-
-        @Override
-        public long getValueCount() {
-          return target.getValueCount();
-        }
-      };
-
-    };
-  }
-
-  static GenericTermsCollector createCollectorSV(Function<BinaryDocValues> svFunction,
-      ScoreMode mode) {
-
-    switch (mode) {
-      case None:
-        return wrap(new TermsCollector.SV(svFunction));
-      case Avg:
-        return new SV.Avg(svFunction);
-      default:
-        return new SV(svFunction, mode);
-    }
-  }
-
-  static GenericTermsCollector wrap(final TermsCollector<?> collector) {
-    return new GenericTermsCollector() {
-
-
-      @Override
-      public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
-        return collector.getLeafCollector(context);
-      }
-
-      @Override
-      public boolean needsScores() {
-        return collector.needsScores();
-      }
-
-      @Override
-      public BytesRefHash getCollectedTerms() {
-        return collector.getCollectorTerms();
-      }
-
-      @Override
-      public float[] getScoresPerTerm() {
-        throw new UnsupportedOperationException("scores are not available for "+collector);
-      }
-    };
-  }
 }
Index: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollectorFactory.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollectorFactory.java (revision 0)
+++ lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollectorFactory.java (working copy)
@@ -0,0 +1,86 @@
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.join.DocValuesTermsCollector.Function;
+import org.apache.lucene.search.join.TermsWithScoreCollector.MV;
+import org.apache.lucene.search.join.TermsWithScoreCollector.SV;
+import org.apache.lucene.util.BytesRefHash;
+
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+final class GenericTermsCollectorFactory {
+
+  private GenericTermsCollectorFactory() {}
+
+  static GenericTermsCollector createCollectorMV(Function<SortedSetDocValues> mvFunction,
+      ScoreMode mode) {
+
+    switch (mode) {
+      case None:
+        return wrap(new TermsCollector.MV(mvFunction));
+      case Avg:
+        return new MV.Avg(mvFunction);
+      default:
+        return new MV(mvFunction, mode);
+    }
+  }
+
+  static GenericTermsCollector createCollectorSV(Function<BinaryDocValues> svFunction,
+      ScoreMode mode) {
+
+    switch (mode) {
+      case None:
+        return wrap(new TermsCollector.SV(svFunction));
+      case Avg:
+        return new SV.Avg(svFunction);
+      default:
+        return new SV(svFunction, mode);
+    }
+  }
+
+  static GenericTermsCollector wrap(final TermsCollector<?> collector) {
+    return new GenericTermsCollector() {
+
+
+      @Override
+      public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
+        return collector.getLeafCollector(context);
+      }
+
+      @Override
+      public boolean needsScores() {
+        return collector.needsScores();
+      }
+
+      @Override
+      public BytesRefHash getCollectedTerms() {
+        return collector.getCollectorTerms();
+      }
+
+      @Override
+      public float[] getScoresPerTerm() {
+        throw new UnsupportedOperationException("scores are not available for "+collector);
+      }
+    };
+  }
+}

Property changes on: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollectorFactory.java
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Index: lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (revision 1718443)
+++ lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (working copy)
@@ -1,5 +1,14 @@
 package org.apache.lucene.search.join;
 
+import java.io.IOException;
+import java.util.Locale;
+
+import org.apache.lucene.document.FieldType.NumericType;
+import org.apache.lucene.document.IntField;
+import org.apache.lucene.document.LongField;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValuesType;
+
 /*
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements.  See the NOTICE file distributed with
@@ -24,9 +33,11 @@
 import org.apache.lucene.index.LeafReader;
 import org.apache.lucene.index.MultiDocValues;
 import org.apache.lucene.index.SortedDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.MatchNoDocsQuery;
 import org.apache.lucene.search.Query;
+import org.apache.lucene.search.join.DocValuesTermsCollector.Function;
 
 /**
  * Utility for query time joining.
@@ -67,28 +78,87 @@
    * @throws IOException If I/O related errors occur
    */
   public static Query createJoinQuery(String fromField,
-                                      boolean multipleValuesPerDocument,
-                                      String toField,
-                                      Query fromQuery,
-                                      IndexSearcher fromSearcher,
-                                      ScoreMode scoreMode) throws IOException {
+      boolean multipleValuesPerDocument,
+      String toField,
+      Query fromQuery,
+      IndexSearcher fromSearcher,
+      ScoreMode scoreMode) throws IOException {
+
+    final GenericTermsCollector termsWithScoreCollector;
+
+    if (multipleValuesPerDocument) {
+      Function<SortedSetDocValues> mvFunction = DocValuesTermsCollector.sortedSetDocValues(fromField);
+      termsWithScoreCollector = GenericTermsCollectorFactory.createCollectorMV(mvFunction, scoreMode);
+    } else {
+      Function<BinaryDocValues> svFunction = DocValuesTermsCollector.binaryDocValues(fromField);
+      termsWithScoreCollector = GenericTermsCollectorFactory.createCollectorSV(svFunction, scoreMode);
+    }
+
+    return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode,
+        termsWithScoreCollector);
+
+  }
+
+  /**
+   * Method for query time joining for numeric fields. It supports multi- and single- values longs and ints.
+   * All considerations from {@link JoinUtil#createJoinQuery(String, boolean, String, Query, IndexSearcher, ScoreMode)} are applicable here too,
+   * though memory consumption might be higher.
+   * <p>
+   *
+   * @param fromField                 The from field to join from
+   * @param multipleValuesPerDocument Whether the from field has multiple terms per document
+   *                                  when true fromField might be {@link DocValuesType#SORTED_NUMERIC},
+   *                                  otherwise fromField should be {@link DocValuesType#NUMERIC}
+   * @param toField                   The to field to join to, should be {@link IntField} or {@link LongField}
+   * @param numericType               either {@link NumericType#INT} or {@link NumericType#LONG}, it should correspond to fromField and toField types
+   * @param fromQuery                 The query to match documents on the from side
+   * @param fromSearcher              The searcher that executed the specified fromQuery
+   * @param scoreMode                 Instructs how scores from the fromQuery are mapped to the returned query
+   * @return a {@link Query} instance that can be used to join documents based on the
+   *         terms in the from and to field
+   * @throws IOException If I/O related errors occur
+   */
+
+  public static Query createJoinQuery(String fromField,
+      boolean multipleValuesPerDocument,
+      String toField, NumericType numericType,
+      Query fromQuery,
+      IndexSearcher fromSearcher,
+      ScoreMode scoreMode) throws IOException {
+
+    final GenericTermsCollector termsCollector;
+
+    if (multipleValuesPerDocument) {
+      Function<SortedSetDocValues> mvFunction = DocValuesTermsCollector.sortedNumericAsSortedSetDocValues(fromField,numericType);
+      termsCollector = GenericTermsCollectorFactory.createCollectorMV(mvFunction, scoreMode);
+    } else {
+      Function<BinaryDocValues> svFunction = DocValuesTermsCollector.numericAsBinaryDocValues(fromField,numericType);
+      termsCollector = GenericTermsCollectorFactory.createCollectorSV(svFunction, scoreMode);
+    }
+
+    return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode,
+        termsCollector);
+
+  }
+
+  private static Query createJoinQuery(boolean multipleValuesPerDocument, String toField, Query fromQuery,
+      IndexSearcher fromSearcher, ScoreMode scoreMode, final GenericTermsCollector collector)
+          throws IOException {
+
+    fromSearcher.search(fromQuery, collector);
+
     switch (scoreMode) {
       case None:
-        TermsCollector termsCollector = TermsCollector.create(fromField, multipleValuesPerDocument);
-        fromSearcher.search(fromQuery, termsCollector);
-        return new TermsQuery(toField, fromQuery, termsCollector.getCollectorTerms());
+        return new TermsQuery(toField, fromQuery, collector.getCollectedTerms());
       case Total:
       case Max:
       case Min:
       case Avg:
-        TermsWithScoreCollector termsWithScoreCollector =
-            TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode);
-        fromSearcher.search(fromQuery, termsWithScoreCollector);
         return new TermsIncludingScoreQuery(
             toField,
             multipleValuesPerDocument,
-            termsWithScoreCollector.getCollectedTerms(),
-            termsWithScoreCollector.getScoresPerTerm(),
+            collector.getCollectedTerms(),
+            collector.getScoresPerTerm(),
             fromQuery
         );
       default:
@@ -96,6 +166,7 @@
     }
   }
 
+
   /**
    * Delegates to {@link #createJoinQuery(String, Query, Query, IndexSearcher, ScoreMode, MultiDocValues.OrdinalMap, int, int)},
    * but disables the min and max filtering.
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (revision 1718443)
+++ lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (working copy)
@@ -19,11 +19,8 @@
 
 import java.io.IOException;
 
-import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.index.BinaryDocValues;
-import org.apache.lucene.index.DocValues;
 import org.apache.lucene.index.SortedSetDocValues;
-import org.apache.lucene.search.SimpleCollector;
 import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.BytesRefHash;
 
@@ -32,19 +29,19 @@
  *
  * @lucene.experimental
  */
-abstract class TermsCollector extends SimpleCollector {
+abstract class TermsCollector<DV> extends DocValuesTermsCollector<DV> {
 
-  final String field;
+  TermsCollector(Function<DV> docValuesCall) {
+    super(docValuesCall);
+  }
+
   final BytesRefHash collectorTerms = new BytesRefHash();
 
-  TermsCollector(String field) {
-    this.field = field;
-  }
-
   public BytesRefHash getCollectorTerms() {
     return collectorTerms;
   }
 
+
   /**
    * Chooses the right {@link TermsCollector} implementation.
    *
@@ -52,55 +49,42 @@
    * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document.
    * @return a {@link TermsCollector} instance
    */
-  static TermsCollector create(String field, boolean multipleValuesPerDocument) {
-    return multipleValuesPerDocument ? new MV(field) : new SV(field);
+  static TermsCollector<?> create(String field, boolean multipleValuesPerDocument) {
+    return multipleValuesPerDocument
+        ? new MV(sortedSetDocValues(field))
+        : new SV(binaryDocValues(field));
   }
-
+
   // impl that works with multiple values per document
-  static class MV extends TermsCollector {
-    final BytesRef scratch = new BytesRef();
-    private SortedSetDocValues docTermOrds;
-
-    MV(String field) {
-      super(field);
+  static class MV extends TermsCollector<SortedSetDocValues> {
+
+    MV(Function<SortedSetDocValues> docValuesCall) {
+      super(docValuesCall);
     }
 
     @Override
     public void collect(int doc) throws IOException {
-      docTermOrds.setDocument(doc);
       long ord;
-      while ((ord = docTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
-        final BytesRef term = docTermOrds.lookupOrd(ord);
+      docValues.setDocument(doc);
+      while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
+        final BytesRef term = docValues.lookupOrd(ord);
         collectorTerms.add(term);
       }
     }
-
-    @Override
-    protected void doSetNextReader(LeafReaderContext context) throws IOException {
-      docTermOrds = DocValues.getSortedSet(context.reader(), field);
-    }
   }
 
   // impl that works with single value per document
-  static class SV extends TermsCollector {
+  static class SV extends TermsCollector<BinaryDocValues> {
 
-    final BytesRef spare = new BytesRef();
-    private BinaryDocValues fromDocTerms;
-
-    SV(String field) {
-      super(field);
+    SV(Function<BinaryDocValues> docValuesCall) {
+      super(docValuesCall);
     }
 
     @Override
     public void collect(int doc) throws IOException {
-      final BytesRef term = fromDocTerms.get(doc);
+      final BytesRef term = docValues.get(doc);
       collectorTerms.add(term);
     }
-
-    @Override
-    protected void doSetNextReader(LeafReaderContext context) throws IOException {
-      fromDocTerms = DocValues.getBinary(context.reader(), field);
-    }
   }
 
   @Override
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (revision 1718443)
+++ lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (working copy)
@@ -18,6 +18,7 @@
  */
 
 import java.io.IOException;
+import java.io.PrintStream;
 import java.util.Locale;
 import java.util.Set;
 
@@ -37,6 +38,7 @@
 import org.apache.lucene.util.BytesRef;
 import org.apache.lucene.util.BytesRefHash;
 import org.apache.lucene.util.FixedBitSet;
+import org.apache.lucene.util.NumericUtils;
 
 class TermsIncludingScoreQuery extends Query {
 
@@ -271,5 +273,23 @@
       }
     }
   }
-
+
+  void dump(PrintStream out){
+    out.println(field+":");
+    final BytesRef ref = new BytesRef();
+    for (int i = 0; i < terms.size(); i++) {
+      terms.get(ords[i], ref);
+      out.print(ref+" "+ref.utf8ToString()+" ");
+      try {
+        out.print(Long.toHexString(NumericUtils.prefixCodedToLong(ref))+"L");
+      } catch (Exception e) {
+        try {
+          out.print(Integer.toHexString(NumericUtils.prefixCodedToInt(ref))+"i");
+        } catch (Exception ee) {
+        }
+      }
+      out.println(" score="+scores[ords[i]]);
+      out.println("");
+    }
+  }
 }
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java
===================================================================
--- lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (revision 1718443)
+++ lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (working copy)
@@ -1,5 +1,8 @@
 package org.apache.lucene.search.join;
 
+import java.io.IOException;
+import java.util.Arrays;
+
 /*
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements.  See the NOTICE file distributed with
@@ -18,22 +21,16 @@
  */
 
 import org.apache.lucene.index.BinaryDocValues;
-import org.apache.lucene.index.DocValues;
-import org.apache.lucene.index.LeafReaderContext;
 import org.apache.lucene.index.SortedSetDocValues;
 import org.apache.lucene.search.Scorer;
-import org.apache.lucene.search.SimpleCollector;
 import org.apache.lucene.util.ArrayUtil;
 import org.apache.lucene.util.BytesRefHash;
 
-import java.io.IOException;
-import java.util.Arrays;
+abstract class TermsWithScoreCollector<DV> extends DocValuesTermsCollector<DV>
+                                        implements GenericTermsCollector {
 
-abstract class TermsWithScoreCollector extends SimpleCollector {
-
   private final static int INITIAL_ARRAY_SIZE = 0;
 
-  final String field;
   final BytesRefHash collectedTerms = new BytesRefHash();
   final ScoreMode scoreMode;
 
@@ -40,8 +37,8 @@
   Scorer scorer;
   float[] scoreSums = new float[INITIAL_ARRAY_SIZE];
 
-  TermsWithScoreCollector(String field, ScoreMode scoreMode) {
-    this.field = field;
+  TermsWithScoreCollector(Function<DV> docValuesCall, ScoreMode scoreMode) {
+    super(docValuesCall);
     this.scoreMode = scoreMode;
     if (scoreMode == ScoreMode.Min) {
       Arrays.fill(scoreSums, Float.POSITIVE_INFINITY);
@@ -50,10 +47,12 @@
     }
   }
 
+  @Override
   public BytesRefHash getCollectedTerms() {
     return collectedTerms;
   }
-
+
+  @Override
   public float[] getScoresPerTerm() {
     return scoreSums;
   }
@@ -70,36 +69,34 @@
    * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document.
* @return a {@link TermsWithScoreCollector} instance */ - static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { + static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { if (multipleValuesPerDocument) { switch (scoreMode) { case Avg: - return new MV.Avg(field); + return new MV.Avg(sortedSetDocValues(field)); default: - return new MV(field, scoreMode); + return new MV(sortedSetDocValues(field), scoreMode); } } else { switch (scoreMode) { case Avg: - return new SV.Avg(field); + return new SV.Avg(binaryDocValues(field)); default: - return new SV(field, scoreMode); + return new SV(binaryDocValues(field), scoreMode); } } } - + // impl that works with single value per document - static class SV extends TermsWithScoreCollector { + static class SV extends TermsWithScoreCollector { - BinaryDocValues fromDocTerms; - - SV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + SV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ -133,26 +130,23 @@ scoreSums[ord] = current; } break; + default: + throw new AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTerms = DocValues.getBinary(context.reader(), field); - } - static class Avg extends SV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ 
-187,20 +181,18 @@ } // impl that works with multiple values per document - static class MV extends TermsWithScoreCollector { + static class MV extends TermsWithScoreCollector { - SortedSetDocValues fromDocTermOrds; - - MV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + MV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { @@ -225,29 +217,26 @@ case Max: scoreSums[termID] = Math.max(scoreSums[termID], scorer.score()); break; + default: + throw new AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTermOrds = DocValues.getSortedSet(context.reader(), field); - } - static class Avg extends MV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { Index: lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java 
=================================================================== --- lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (revision 1718443) +++ lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (working copy) @@ -19,6 +19,7 @@ import java.io.IOException; import java.util.ArrayList; +import java.util.Arrays; import java.util.Collections; import java.util.Comparator; import java.util.HashMap; @@ -26,6 +27,7 @@ import java.util.List; import java.util.Locale; import java.util.Map; +import java.util.Random; import java.util.Set; import java.util.SortedSet; import java.util.TreeSet; @@ -37,8 +39,12 @@ import org.apache.lucene.analysis.MockTokenizer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.document.IntField; +import org.apache.lucene.document.LongField; import org.apache.lucene.document.NumericDocValuesField; import org.apache.lucene.document.SortedDocValuesField; +import org.apache.lucene.document.SortedNumericDocValuesField; import org.apache.lucene.document.SortedSetDocValuesField; import org.apache.lucene.document.StringField; import org.apache.lucene.document.TextField; @@ -850,10 +856,18 @@ } final Query joinQuery; - if (from) { - joinQuery = JoinUtil.createJoinQuery("from", multipleValuesPerDocument, "to", actualQuery, indexSearcher, scoreMode); - } else { - joinQuery = JoinUtil.createJoinQuery("to", multipleValuesPerDocument, "from", actualQuery, indexSearcher, scoreMode); + { + // single val can be handled by multiple-vals + final boolean muliValsQuery = multipleValuesPerDocument || random().nextBoolean(); + final String fromField = from ? "from":"to"; + final String toField = from ? "to":"from"; + + if (random().nextBoolean()) { // numbers + final NumericType numType = random().nextBoolean() ? 
NumericType.INT: NumericType.LONG ; + joinQuery = JoinUtil.createJoinQuery(fromField+numType, muliValsQuery, toField+numType, numType, actualQuery, indexSearcher, scoreMode); + } else { + joinQuery = JoinUtil.createJoinQuery(fromField, muliValsQuery, toField, actualQuery, indexSearcher, scoreMode); + } } if (VERBOSE) { System.out.println("joinQuery=" + joinQuery); @@ -897,7 +911,6 @@ return; } - assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f); if (VERBOSE) { for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) { System.out.printf(Locale.ENGLISH, "Expected doc: %d | Actual doc: %d\n", expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc); @@ -904,6 +917,7 @@ System.out.printf(Locale.ENGLISH, "Expected score: %f | Actual score: %f\n", expectedTopDocs.scoreDocs[i].score, actualTopDocs.scoreDocs[i].score); } } + assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f); for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) { assertEquals(expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc); @@ -919,14 +933,15 @@ } Directory dir = newDirectory(); + final Random random = random(); RandomIndexWriter w = new RandomIndexWriter( - random(), + random, dir, - newIndexWriterConfig(new MockAnalyzer(random(), MockTokenizer.KEYWORD, false)) + newIndexWriterConfig(new MockAnalyzer(random, MockTokenizer.KEYWORD, false)) ); IndexIterationContext context = new IndexIterationContext(); - int numRandomValues = nDocs / RandomInts.randomIntBetween(random(), 2, 10); + int numRandomValues = nDocs / RandomInts.randomIntBetween(random, 1, 4); context.randomUniqueValues = new String[numRandomValues]; Set trackSet = new HashSet<>(); context.randomFrom = new boolean[numRandomValues]; @@ -933,32 +948,46 @@ for (int i = 0; i < numRandomValues; i++) { String uniqueRandomValue; do { -// uniqueRandomValue = TestUtil.randomRealisticUnicodeString(random()); - uniqueRandomValue = 
TestUtil.randomSimpleString(random()); + // the trick is to generate values which will be ordered similarly for string, ints&longs, positive nums makes it easier + final int nextInt = random.nextInt(Integer.MAX_VALUE); + uniqueRandomValue = String.format(Locale.ROOT, "%08x", nextInt); + assert nextInt == Integer.parseUnsignedInt(uniqueRandomValue,16); } while ("".equals(uniqueRandomValue) || trackSet.contains(uniqueRandomValue)); + // Generate unique values and empty strings aren't allowed. trackSet.add(uniqueRandomValue); - context.randomFrom[i] = random().nextBoolean(); + + context.randomFrom[i] = random.nextBoolean(); context.randomUniqueValues[i] = uniqueRandomValue; + } + List randomUniqueValuesReplica = new ArrayList<>(Arrays.asList(context.randomUniqueValues)); + RandomDoc[] docs = new RandomDoc[nDocs]; for (int i = 0; i < nDocs; i++) { String id = Integer.toString(i); - int randomI = random().nextInt(context.randomUniqueValues.length); + int randomI = random.nextInt(context.randomUniqueValues.length); String value = context.randomUniqueValues[randomI]; Document document = new Document(); - document.add(newTextField(random(), "id", id, Field.Store.YES)); - document.add(newTextField(random(), "value", value, Field.Store.NO)); + document.add(newTextField(random, "id", id, Field.Store.YES)); + document.add(newTextField(random, "value", value, Field.Store.NO)); boolean from = context.randomFrom[randomI]; - int numberOfLinkValues = multipleValuesPerDocument ? 2 + random().nextInt(10) : 1; + int numberOfLinkValues = multipleValuesPerDocument ? Math.min(2 + random.nextInt(10), context.randomUniqueValues.length) : 1; docs[i] = new RandomDoc(id, numberOfLinkValues, value, from); if (globalOrdinalJoin) { document.add(newStringField("type", from ? 
"from" : "to", Field.Store.NO)); } - for (int j = 0; j < numberOfLinkValues; j++) { - String linkValue = context.randomUniqueValues[random().nextInt(context.randomUniqueValues.length)]; + final List subValues; + { + int start = randomUniqueValuesReplica.size()==numberOfLinkValues? 0 : random.nextInt(randomUniqueValuesReplica.size()-numberOfLinkValues); + subValues = randomUniqueValuesReplica.subList(start, start+numberOfLinkValues); + Collections.shuffle(subValues, random); + } + for (String linkValue : subValues) { + + assert !docs[i].linkValues.contains(linkValue); docs[i].linkValues.add(linkValue); if (from) { if (!context.fromDocuments.containsKey(linkValue)) { @@ -970,15 +999,8 @@ context.fromDocuments.get(linkValue).add(docs[i]); context.randomValueFromDocs.get(value).add(docs[i]); - document.add(newTextField(random(), "from", linkValue, Field.Store.NO)); - if (multipleValuesPerDocument) { - document.add(new SortedSetDocValuesField("from", new BytesRef(linkValue))); - } else { - document.add(new SortedDocValuesField("from", new BytesRef(linkValue))); - } - if (globalOrdinalJoin) { - document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); - } + addLinkFields(random, document, "from", linkValue, multipleValuesPerDocument, globalOrdinalJoin); + } else { if (!context.toDocuments.containsKey(linkValue)) { context.toDocuments.put(linkValue, new ArrayList()); @@ -989,20 +1011,12 @@ context.toDocuments.get(linkValue).add(docs[i]); context.randomValueToDocs.get(value).add(docs[i]); - document.add(newTextField(random(), "to", linkValue, Field.Store.NO)); - if (multipleValuesPerDocument) { - document.add(new SortedSetDocValuesField("to", new BytesRef(linkValue))); - } else { - document.add(new SortedDocValuesField("to", new BytesRef(linkValue))); - } - if (globalOrdinalJoin) { - document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); - } + addLinkFields(random, document, "to", linkValue, multipleValuesPerDocument, 
globalOrdinalJoin); } } w.addDocument(document); - if (random().nextInt(10) == 4) { + if (random.nextInt(10) == 4) { w.commit(); } if (VERBOSE) { @@ -1010,7 +1024,7 @@ } } - if (random().nextBoolean()) { + if (random.nextBoolean()) { w.forceMerge(1); } w.close(); @@ -1185,6 +1199,30 @@ return context; } + private void addLinkFields(final Random random, Document document, final String fieldName, String linkValue, + boolean multipleValuesPerDocument, boolean globalOrdinalJoin) { + document.add(newTextField(random, fieldName, linkValue, Field.Store.NO)); + + final int linkInt = Integer.parseUnsignedInt(linkValue,16); + document.add(new IntField(fieldName+NumericType.INT, linkInt, Field.Store.NO)); + + final long linkLong = (long) linkInt<<32 | linkInt; + document.add(new LongField(fieldName+NumericType.LONG, linkLong, Field.Store.NO)); + + if (multipleValuesPerDocument) { + document.add(new SortedSetDocValuesField(fieldName, new BytesRef(linkValue))); + document.add(new SortedNumericDocValuesField(fieldName+NumericType.INT, linkInt)); + document.add(new SortedNumericDocValuesField(fieldName+NumericType.LONG, linkLong)); + } else { + document.add(new SortedDocValuesField(fieldName, new BytesRef(linkValue))); + document.add(new NumericDocValuesField(fieldName+NumericType.INT, linkInt)); + document.add(new NumericDocValuesField(fieldName+NumericType.LONG, linkLong)); + } + if (globalOrdinalJoin) { + document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); + } + } + private TopDocs createExpectedTopDocs(String queryValue, final boolean from, final ScoreMode scoreMode, Index: lucene/join =================================================================== --- lucene/join (revision 1718443) +++ lucene/join (working copy) Property changes on: lucene/join ___________________________________________________________________ Modified: svn:mergeinfo Merged /lucene/dev/trunk/lucene/join:r1718443 Index: lucene
=================================================================== --- lucene (revision 1718443) +++ lucene (working copy) Property changes on: lucene ___________________________________________________________________ Modified: svn:mergeinfo Merged /lucene/dev/trunk/lucene:r1718443 Index: . =================================================================== --- . (revision 1718443) +++ . (working copy) Property changes on: . ___________________________________________________________________ Modified: svn:mergeinfo Merged /lucene/dev/trunk:r1718443 ```
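
The value-generation trick used by the test can be checked on its own. Below is a minimal JDK-only sketch; the class and method names are illustrative and not part of the patch. It assumes, as the test does, that link values are non-negative ints: formatted as `%08x` they sort the same way as strings and as numbers. Note that packing an int into both halves of a long needs a widening cast before the shift, because Java reduces an `int` shift distance modulo 32.

```java
import java.util.Locale;

// Illustrative helper, not part of the patch: derives the string and long
// forms of a link value from one non-negative int.
final class LinkValueSketch {

    // Zero-padded, fixed-width hex keeps lexicographic order equal to
    // numeric order for non-negative ints.
    static String toHex(int value) {
        return String.format(Locale.ROOT, "%08x", value);
    }

    // Widen first, then shift: on an int, (value << 32) equals value,
    // since only the low five bits of the shift distance are used.
    static long toLong(int value) {
        return ((long) value << 32) | value;
    }

    public static void main(String[] args) {
        int a = 0x0000ffff;
        int b = 0x00010000;
        // String order and numeric order agree for a < b.
        System.out.println(toHex(a).compareTo(toHex(b)) < 0);
        // The long form keeps the same relative order.
        System.out.println(toLong(a) < toLong(b));
    }
}
```

With this widening the long values really exceed the int range, so a test built on them would also exercise the 64-bit encoding path rather than repeating the int case.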

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

Fixed some precommit issues in LUCENE-5868-5x.patch.
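
For readers comparing the two versions: the 5.x patch replaces lambdas and `java.util.function.LongConsumer` with anonymous classes and a local interface, presumably because the branch cannot rely on Java 8. A minimal sketch of that pattern follows. All types are hypothetical stand-ins (a one-slot `long[]` plays the role of the shared `BytesRefBuilder`), so this shows only the shape of the code, not the Lucene API.

```java
// Hypothetical stand-ins; the real types live in org.apache.lucene.search.join.
final class BackportSketch {

    // Local single-method interface, usable without java.util.function.
    interface LongConsumer {
        void accept(long value);
    }

    enum NumType { INT, LONG }

    // Returns a consumer that writes each value into the shared sink,
    // mirroring how coder(...) in the patch reuses a single builder.
    static LongConsumer coder(final long[] sink, NumType type) {
        switch (type) {
            case INT:
                return new LongConsumer() {
                    @Override
                    public void accept(long value) {
                        sink[0] = (int) value; // narrowing is intended for INT only
                    }
                };
            case LONG:
                return new LongConsumer() {
                    @Override
                    public void accept(long value) {
                        sink[0] = value; // keep all 64 bits
                    }
                };
            default:
                throw new IllegalArgumentException("Unsupported " + type);
        }
    }
}
```

Captured locals and parameters have to be declared `final` for anonymous classes before Java 8, which is why the patch adds `final` to several parameters.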

LUCENE-5868-5x.patch

```diff Index: lucene/CHANGES.txt =================================================================== --- lucene/CHANGES.txt (revision 1718443) +++ lucene/CHANGES.txt (working copy) @@ -6,6 +6,12 @@ ======================= Lucene 5.5.0 ======================= +New Features + +* LUCENE-5868: JoinUtil.createJoinQuery(..,NumericType,..) query-time join + for LONG and INT fields with NUMERIC and SORTED_NUMERIC doc values. + (Alexey Zelin via Mikhail Khludnev) + API Changes * #7958: Grouping sortWithinGroup variables used to allow null to mean Property changes on: lucene/CHANGES.txt ___________________________________________________________________ Modified: svn:mergeinfo Merged /lucene/dev/trunk/lucene/CHANGES.txt:r1718443 Index: lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (revision 1718443) +++ lucene/join/src/java/org/apache/lucene/search/join/DocValuesTermsCollector.java (working copy) @@ -1,7 +1,6 @@ package org.apache.lucene.search.join; import java.io.IOException; -import java.util.function.LongConsumer; import org.apache.lucene.document.FieldType.NumericType; import org.apache.lucene.index.BinaryDocValues; @@ -35,11 +34,14 @@ abstract class DocValuesTermsCollector extends SimpleCollector { - @FunctionalInterface static interface Function { R apply(LeafReader t) throws IOException ; } + static interface LongConsumer { + void accept(long value); + } + protected DV docValues; private final Function docValuesCall; @@ -52,37 +54,62 @@ docValues = docValuesCall.apply(context.reader()); } - static Function binaryDocValues(String field) { - return (ctx) -> DocValues.getBinary(ctx, field); - } - static Function sortedSetDocValues(String field) { - return (ctx) -> DocValues.getSortedSet(ctx, field); - } - - static Function numericAsBinaryDocValues(String field, NumericType numTyp) { - 
return (ctx) -> { - final NumericDocValues numeric = DocValues.getNumeric(ctx, field); - final BytesRefBuilder bytes = new BytesRefBuilder(); - - final LongConsumer coder = coder(bytes, numTyp, field); - - return new BinaryDocValues() { + static Function binaryDocValues(final String field) { + return new Function() + { @Override - public BytesRef get(int docID) { - final long lVal = numeric.get(docID); - coder.accept(lVal); - return bytes.get(); + public BinaryDocValues apply(LeafReader ctx) throws IOException { + return DocValues.getBinary(ctx, field); } }; + } + static Function sortedSetDocValues(final String field) { + return new Function() + { + @Override + public SortedSetDocValues apply(LeafReader ctx) throws IOException { + return DocValues.getSortedSet(ctx, field); + } }; } - static LongConsumer coder(BytesRefBuilder bytes, NumericType type, String fieldName){ + static Function numericAsBinaryDocValues(final String field, final NumericType numTyp) { + return new Function() { + @Override + public BinaryDocValues apply(LeafReader ctx) throws IOException { + final NumericDocValues numeric = DocValues.getNumeric(ctx, field); + final BytesRefBuilder bytes = new BytesRefBuilder(); + + final LongConsumer coder = coder(bytes, numTyp, field); + + return new BinaryDocValues() { + @Override + public BytesRef get(int docID) { + final long lVal = numeric.get(docID); + coder.accept(lVal); + return bytes.get(); + } + }; + } + }; + } + + static LongConsumer coder(final BytesRefBuilder bytes, NumericType type, String fieldName){ switch(type){ case INT: - return (l) -> NumericUtils.intToPrefixCoded((int)l, 0, bytes); + return new LongConsumer() { + @Override + public void accept(long value) { + NumericUtils.intToPrefixCoded((int)value, 0, bytes); + } + }; case LONG: - return (l) -> NumericUtils.longToPrefixCoded(l, 0, bytes); + return new LongConsumer() { + @Override + public void accept(long value) { + NumericUtils.longToPrefixCoded(value, 0, bytes); + } + }; default:
throw new IllegalArgumentException("Unsupported "+type+ ". Only "+NumericType.INT+" and "+NumericType.LONG+" are supported." @@ -91,46 +118,49 @@ } /** this adapter is quite weird. ords are per doc index, don't use ords across different docs*/ - static Function sortedNumericAsSortedSetDocValues(String field, NumericType numTyp) { - return (ctx) -> { - final SortedNumericDocValues numerics = DocValues.getSortedNumeric(ctx, field); - final BytesRefBuilder bytes = new BytesRefBuilder(); - - final LongConsumer coder = coder(bytes, numTyp, field); - - return new SortedSetDocValues() { - - private int index = Integer.MIN_VALUE; - - @Override - public long nextOrd() { - return index < numerics.count()-1 ? ++index : NO_MORE_ORDS; - } - - @Override - public void setDocument(int docID) { - numerics.setDocument(docID); - index=-1; - } - - @Override - public BytesRef lookupOrd(long ord) { - assert ord>=0 && ord sortedNumericAsSortedSetDocValues(final String field, final NumericType numTyp) { + return new Function() { + @Override + public SortedSetDocValues apply(LeafReader ctx) throws IOException { + final SortedNumericDocValues numerics = DocValues.getSortedNumeric(ctx, field); + final BytesRefBuilder bytes = new BytesRefBuilder(); - @Override - public long lookupTerm(BytesRef key) { - throw new UnsupportedOperationException("it's just number encoding wrapper"); - } - }; + final LongConsumer coder = coder(bytes, numTyp, field); + + return new SortedSetDocValues() { + + private int index = Integer.MIN_VALUE; + + @Override + public long nextOrd() { + return index < numerics.count() - 1 ? 
++index : NO_MORE_ORDS; + } + + @Override + public void setDocument(int docID) { + numerics.setDocument(docID); + index = -1; + } + + @Override + public BytesRef lookupOrd(long ord) { + assert ord >= 0 && ord < numerics.count(); + final long value = numerics.valueAt((int) ord); + coder.accept(value); + return bytes.get(); + } + + @Override + public long getValueCount() { + throw new UnsupportedOperationException("it's just number encoding wrapper"); + } + + @Override + public long lookupTerm(BytesRef key) { + throw new UnsupportedOperationException("it's just number encoding wrapper"); + } + }; + } }; } } Index: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java (revision 1718443) +++ lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollector.java (working copy) @@ -1,17 +1,6 @@ package org.apache.lucene.search.join; -import java.io.IOException; -import java.io.PrintStream; - -import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.LeafReaderContext; -import org.apache.lucene.index.SortedSetDocValues; import org.apache.lucene.search.Collector; -import org.apache.lucene.search.LeafCollector; -import org.apache.lucene.search.join.DocValuesTermsCollector.Function; -import org.apache.lucene.search.join.TermsWithScoreCollector.MV; -import org.apache.lucene.search.join.TermsWithScoreCollector.SV; -import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefHash; /* @@ -37,87 +26,4 @@ float[] getScoresPerTerm(); - static GenericTermsCollector createCollectorMV(Function mvFunction, - ScoreMode mode) { - - switch (mode) { - case None: - return wrap(new TermsCollector.MV(mvFunction)); - case Avg: - return new MV.Avg(mvFunction); - default: - return new MV(mvFunction, mode); - } - } - - static Function verbose(PrintStream out, Function 
mvFunction){ - return (ctx) -> { - final SortedSetDocValues target = mvFunction.apply(ctx); - return new SortedSetDocValues() { - - @Override - public void setDocument(int docID) { - target.setDocument(docID); - out.println("\ndoc# "+docID); - } - - @Override - public long nextOrd() { - return target.nextOrd(); - } - - @Override - public BytesRef lookupOrd(long ord) { - final BytesRef val = target.lookupOrd(ord); - out.println(val.toString()+", "); - return val; - } - - @Override - public long getValueCount() { - return target.getValueCount(); - } - }; - - }; - } - - static GenericTermsCollector createCollectorSV(Function svFunction, - ScoreMode mode) { - - switch (mode) { - case None: - return wrap(new TermsCollector.SV(svFunction)); - case Avg: - return new SV.Avg(svFunction); - default: - return new SV(svFunction, mode); - } - } - - static GenericTermsCollector wrap(final TermsCollector collector) { - return new GenericTermsCollector() { - - - @Override - public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException { - return collector.getLeafCollector(context); - } - - @Override - public boolean needsScores() { - return collector.needsScores(); - } - - @Override - public BytesRefHash getCollectedTerms() { - return collector.getCollectorTerms(); - } - - @Override - public float[] getScoresPerTerm() { - throw new UnsupportedOperationException("scores are not available for "+collector); - } - }; - } } Index: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollectorFactory.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollectorFactory.java (revision 0) +++ lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollectorFactory.java (working copy) @@ -0,0 +1,86 @@ +package org.apache.lucene.search.join; + +import java.io.IOException; + +import org.apache.lucene.index.BinaryDocValues; +import 
org.apache.lucene.index.LeafReaderContext; +import org.apache.lucene.index.SortedSetDocValues; +import org.apache.lucene.search.LeafCollector; +import org.apache.lucene.search.join.DocValuesTermsCollector.Function; +import org.apache.lucene.search.join.TermsWithScoreCollector.MV; +import org.apache.lucene.search.join.TermsWithScoreCollector.SV; +import org.apache.lucene.util.BytesRefHash; + +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +final class GenericTermsCollectorFactory { + + private GenericTermsCollectorFactory() {} + + static GenericTermsCollector createCollectorMV(Function mvFunction, + ScoreMode mode) { + + switch (mode) { + case None: + return wrap(new TermsCollector.MV(mvFunction)); + case Avg: + return new MV.Avg(mvFunction); + default: + return new MV(mvFunction, mode); + } + } + + static GenericTermsCollector createCollectorSV(Function svFunction, + ScoreMode mode) { + + switch (mode) { + case None: + return wrap(new TermsCollector.SV(svFunction)); + case Avg: + return new SV.Avg(svFunction); + default: + return new SV(svFunction, mode); + } + } + + static GenericTermsCollector wrap(final TermsCollector collector) { + return new GenericTermsCollector() { + + + @Override + public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException { + return collector.getLeafCollector(context); + } + + @Override + public boolean needsScores() { + return collector.needsScores(); + } + + @Override + public BytesRefHash getCollectedTerms() { + return collector.getCollectorTerms(); + } + + @Override + public float[] getScoresPerTerm() { + throw new UnsupportedOperationException("scores are not available for "+collector); + } + }; + } +} Property changes on: lucene/join/src/java/org/apache/lucene/search/join/GenericTermsCollectorFactory.java ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Index: lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (revision 1718443) +++ lucene/join/src/java/org/apache/lucene/search/join/JoinUtil.java (working copy) @@ -1,5 +1,14 @@ package org.apache.lucene.search.join; +import java.io.IOException; +import java.util.Locale; + +import org.apache.lucene.document.FieldType.NumericType; +import 
org.apache.lucene.document.IntField; +import org.apache.lucene.document.LongField; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValuesType; + /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with @@ -24,9 +33,11 @@ import org.apache.lucene.index.LeafReader; import org.apache.lucene.index.MultiDocValues; import org.apache.lucene.index.SortedDocValues; +import org.apache.lucene.index.SortedSetDocValues; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.MatchNoDocsQuery; import org.apache.lucene.search.Query; +import org.apache.lucene.search.join.DocValuesTermsCollector.Function; /** * Utility for query time joining. @@ -67,28 +78,87 @@ * @throws IOException If I/O related errors occur */ public static Query createJoinQuery(String fromField, - boolean multipleValuesPerDocument, - String toField, - Query fromQuery, - IndexSearcher fromSearcher, - ScoreMode scoreMode) throws IOException { + boolean multipleValuesPerDocument, + String toField, + Query fromQuery, + IndexSearcher fromSearcher, + ScoreMode scoreMode) throws IOException { + + final GenericTermsCollector termsWithScoreCollector; + + if (multipleValuesPerDocument) { + Function mvFunction = DocValuesTermsCollector.sortedSetDocValues(fromField); + termsWithScoreCollector = GenericTermsCollectorFactory.createCollectorMV(mvFunction, scoreMode); + } else { + Function svFunction = DocValuesTermsCollector.binaryDocValues(fromField); + termsWithScoreCollector = GenericTermsCollectorFactory.createCollectorSV(svFunction, scoreMode); + } + + return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode, + termsWithScoreCollector); + + } + + /** + * Method for query time joining for numeric fields. It supports multi- and single- values longs and ints. 
+ * All considerations from {@link JoinUtil#createJoinQuery(String, boolean, String, Query, IndexSearcher, ScoreMode)} are applicable here too, + * though memory consumption might be higher. + *

+ * + * @param fromField The from field to join from + * @param multipleValuesPerDocument Whether the from field has multiple terms per document + * when true fromField might be {@link DocValuesType#SORTED_NUMERIC}, + * otherwise fromField should be {@link DocValuesType#NUMERIC} + * @param toField The to field to join to, should be {@link IntField} or {@link LongField} + * @param numericType either {@link NumericType#INT} or {@link NumericType#LONG}, it should correspond to fromField and toField types + * @param fromQuery The query to match documents on the from side + * @param fromSearcher The searcher that executed the specified fromQuery + * @param scoreMode Instructs how scores from the fromQuery are mapped to the returned query + * @return a {@link Query} instance that can be used to join documents based on the + * terms in the from and to field + * @throws IOException If I/O related errors occur + */ + + public static Query createJoinQuery(String fromField, + boolean multipleValuesPerDocument, + String toField, NumericType numericType, + Query fromQuery, + IndexSearcher fromSearcher, + ScoreMode scoreMode) throws IOException { + + final GenericTermsCollector termsCollector; + + if (multipleValuesPerDocument) { + Function mvFunction = DocValuesTermsCollector.sortedNumericAsSortedSetDocValues(fromField,numericType); + termsCollector = GenericTermsCollectorFactory.createCollectorMV(mvFunction, scoreMode); + } else { + Function svFunction = DocValuesTermsCollector.numericAsBinaryDocValues(fromField,numericType); + termsCollector = GenericTermsCollectorFactory.createCollectorSV(svFunction, scoreMode); + } + + return createJoinQuery(multipleValuesPerDocument, toField, fromQuery, fromSearcher, scoreMode, + termsCollector); + + } + + private static Query createJoinQuery(boolean multipleValuesPerDocument, String toField, Query fromQuery, + IndexSearcher fromSearcher, ScoreMode scoreMode, final GenericTermsCollector collector) + throws IOException { + + 
fromSearcher.search(fromQuery, collector); + switch (scoreMode) { case None: - TermsCollector termsCollector = TermsCollector.create(fromField, multipleValuesPerDocument); - fromSearcher.search(fromQuery, termsCollector); - return new TermsQuery(toField, fromQuery, termsCollector.getCollectorTerms()); + return new TermsQuery(toField, fromQuery, collector.getCollectedTerms()); case Total: case Max: case Min: case Avg: - TermsWithScoreCollector termsWithScoreCollector = - TermsWithScoreCollector.create(fromField, multipleValuesPerDocument, scoreMode); - fromSearcher.search(fromQuery, termsWithScoreCollector); return new TermsIncludingScoreQuery( toField, multipleValuesPerDocument, - termsWithScoreCollector.getCollectedTerms(), - termsWithScoreCollector.getScoresPerTerm(), + collector.getCollectedTerms(), + collector.getScoresPerTerm(), fromQuery ); default: @@ -96,6 +166,7 @@ } } + /** * Delegates to {@link #createJoinQuery(String, Query, Query, IndexSearcher, ScoreMode, MultiDocValues.OrdinalMap, int, int)}, * but disables the min and max filtering. 
Index: lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (revision 1718443) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsCollector.java (working copy) @@ -19,11 +19,8 @@ import java.io.IOException; -import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.DocValues; import org.apache.lucene.index.SortedSetDocValues; -import org.apache.lucene.search.SimpleCollector; import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefHash; @@ -32,19 +29,19 @@ * * @lucene.experimental */ -abstract class TermsCollector extends SimpleCollector { +abstract class TermsCollector extends DocValuesTermsCollector { - final String field; + TermsCollector(Function docValuesCall) { + super(docValuesCall); + } + final BytesRefHash collectorTerms = new BytesRefHash(); - TermsCollector(String field) { - this.field = field; - } - public BytesRefHash getCollectorTerms() { return collectorTerms; } + /** * Chooses the right {@link TermsCollector} implementation. * @@ -52,55 +49,42 @@ * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document. * @return a {@link TermsCollector} instance */ - static TermsCollector create(String field, boolean multipleValuesPerDocument) { - return multipleValuesPerDocument ? new MV(field) : new SV(field); + static TermsCollector create(String field, boolean multipleValuesPerDocument) { + return multipleValuesPerDocument + ? 
new MV(sortedSetDocValues(field)) + : new SV(binaryDocValues(field)); } - + // impl that works with multiple values per document - static class MV extends TermsCollector { - final BytesRef scratch = new BytesRef(); - private SortedSetDocValues docTermOrds; - - MV(String field) { - super(field); + static class MV extends TermsCollector { + + MV(Function docValuesCall) { + super(docValuesCall); } @Override public void collect(int doc) throws IOException { - docTermOrds.setDocument(doc); long ord; - while ((ord = docTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - final BytesRef term = docTermOrds.lookupOrd(ord); + docValues.setDocument(doc); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + final BytesRef term = docValues.lookupOrd(ord); collectorTerms.add(term); } } - - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - docTermOrds = DocValues.getSortedSet(context.reader(), field); - } } // impl that works with single value per document - static class SV extends TermsCollector { + static class SV extends TermsCollector { - final BytesRef spare = new BytesRef(); - private BinaryDocValues fromDocTerms; - - SV(String field) { - super(field); + SV(Function docValuesCall) { + super(docValuesCall); } @Override public void collect(int doc) throws IOException { - final BytesRef term = fromDocTerms.get(doc); + final BytesRef term = docValues.get(doc); collectorTerms.add(term); } - - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTerms = DocValues.getBinary(context.reader(), field); - } } @Override Index: lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (revision 1718443) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsIncludingScoreQuery.java (working 
copy) @@ -18,6 +18,7 @@ */ import java.io.IOException; +import java.io.PrintStream; import java.util.Locale; import java.util.Set; @@ -37,6 +38,7 @@ import org.apache.lucene.util.BytesRef; import org.apache.lucene.util.BytesRefHash; import org.apache.lucene.util.FixedBitSet; +import org.apache.lucene.util.NumericUtils; class TermsIncludingScoreQuery extends Query { @@ -271,5 +273,23 @@ } } } - + + void dump(PrintStream out){ + out.println(field+":"); + final BytesRef ref = new BytesRef(); + for (int i = 0; i < terms.size(); i++) { + terms.get(ords[i], ref); + out.print(ref+" "+ref.utf8ToString()+" "); + try { + out.print(Long.toHexString(NumericUtils.prefixCodedToLong(ref))+"L"); + } catch (Exception e) { + try { + out.print(Integer.toHexString(NumericUtils.prefixCodedToInt(ref))+"i"); + } catch (Exception ee) { + } + } + out.println(" score="+scores[ords[i]]); + out.println(""); + } + } } Index: lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java =================================================================== --- lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (revision 1718443) +++ lucene/join/src/java/org/apache/lucene/search/join/TermsWithScoreCollector.java (working copy) @@ -1,5 +1,8 @@ package org.apache.lucene.search.join; +import java.io.IOException; +import java.util.Arrays; + /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. 
See the NOTICE file distributed with @@ -18,22 +21,16 @@ */ import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.DocValues; -import org.apache.lucene.index.LeafReaderContext; import org.apache.lucene.index.SortedSetDocValues; import org.apache.lucene.search.Scorer; -import org.apache.lucene.search.SimpleCollector; import org.apache.lucene.util.ArrayUtil; import org.apache.lucene.util.BytesRefHash; -import java.io.IOException; -import java.util.Arrays; +abstract class TermsWithScoreCollector extends DocValuesTermsCollector + implements GenericTermsCollector { -abstract class TermsWithScoreCollector extends SimpleCollector { - private final static int INITIAL_ARRAY_SIZE = 0; - final String field; final BytesRefHash collectedTerms = new BytesRefHash(); final ScoreMode scoreMode; @@ -40,8 +37,8 @@ Scorer scorer; float[] scoreSums = new float[INITIAL_ARRAY_SIZE]; - TermsWithScoreCollector(String field, ScoreMode scoreMode) { - this.field = field; + TermsWithScoreCollector(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall); this.scoreMode = scoreMode; if (scoreMode == ScoreMode.Min) { Arrays.fill(scoreSums, Float.POSITIVE_INFINITY); @@ -50,10 +47,12 @@ } } + @Override public BytesRefHash getCollectedTerms() { return collectedTerms; } - + + @Override public float[] getScoresPerTerm() { return scoreSums; } @@ -70,36 +69,34 @@ * @param multipleValuesPerDocument Whether the field to collect terms for has multiple values per document. 
* @return a {@link TermsWithScoreCollector} instance */ - static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { + static TermsWithScoreCollector create(String field, boolean multipleValuesPerDocument, ScoreMode scoreMode) { if (multipleValuesPerDocument) { switch (scoreMode) { case Avg: - return new MV.Avg(field); + return new MV.Avg(sortedSetDocValues(field)); default: - return new MV(field, scoreMode); + return new MV(sortedSetDocValues(field), scoreMode); } } else { switch (scoreMode) { case Avg: - return new SV.Avg(field); + return new SV.Avg(binaryDocValues(field)); default: - return new SV(field, scoreMode); + return new SV(binaryDocValues(field), scoreMode); } } } - + // impl that works with single value per document - static class SV extends TermsWithScoreCollector { + static class SV extends TermsWithScoreCollector { - BinaryDocValues fromDocTerms; - - SV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + SV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ -133,26 +130,23 @@ scoreSums[ord] = current; } break; + default: + throw new AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTerms = DocValues.getBinary(context.reader(), field); - } - static class Avg extends SV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - int ord = collectedTerms.add(fromDocTerms.get(doc)); + int ord = collectedTerms.add(docValues.get(doc)); if (ord < 0) { ord = -ord - 1; } else { @@ 
-187,20 +181,18 @@ } // impl that works with multiple values per document - static class MV extends TermsWithScoreCollector { + static class MV extends TermsWithScoreCollector { - SortedSetDocValues fromDocTermOrds; - - MV(String field, ScoreMode scoreMode) { - super(field, scoreMode); + MV(Function docValuesCall, ScoreMode scoreMode) { + super(docValuesCall, scoreMode); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { @@ -225,29 +217,26 @@ case Max: scoreSums[termID] = Math.max(scoreSums[termID], scorer.score()); break; + default: + throw new AssertionError("unexpected: " + scoreMode); } } } - @Override - protected void doSetNextReader(LeafReaderContext context) throws IOException { - fromDocTermOrds = DocValues.getSortedSet(context.reader(), field); - } - static class Avg extends MV { int[] scoreCounts = new int[INITIAL_ARRAY_SIZE]; - Avg(String field) { - super(field, ScoreMode.Avg); + Avg(Function docValuesCall) { + super(docValuesCall, ScoreMode.Avg); } @Override public void collect(int doc) throws IOException { - fromDocTermOrds.setDocument(doc); + docValues.setDocument(doc); long ord; - while ((ord = fromDocTermOrds.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { - int termID = collectedTerms.add(fromDocTermOrds.lookupOrd(ord)); + while ((ord = docValues.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) { + int termID = collectedTerms.add(docValues.lookupOrd(ord)); if (termID < 0) { termID = -termID - 1; } else { Index: lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java 
=================================================================== --- lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (revision 1718443) +++ lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java (working copy) @@ -19,6 +19,7 @@ import java.io.IOException; import java.util.ArrayList; +import java.util.Arrays; import java.util.Collections; import java.util.Comparator; import java.util.HashMap; @@ -26,6 +27,7 @@ import java.util.List; import java.util.Locale; import java.util.Map; +import java.util.Random; import java.util.Set; import java.util.SortedSet; import java.util.TreeSet; @@ -37,8 +39,12 @@ import org.apache.lucene.analysis.MockTokenizer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; +import org.apache.lucene.document.FieldType.NumericType; +import org.apache.lucene.document.IntField; +import org.apache.lucene.document.LongField; import org.apache.lucene.document.NumericDocValuesField; import org.apache.lucene.document.SortedDocValuesField; +import org.apache.lucene.document.SortedNumericDocValuesField; import org.apache.lucene.document.SortedSetDocValuesField; import org.apache.lucene.document.StringField; import org.apache.lucene.document.TextField; @@ -850,10 +856,18 @@ } final Query joinQuery; - if (from) { - joinQuery = JoinUtil.createJoinQuery("from", multipleValuesPerDocument, "to", actualQuery, indexSearcher, scoreMode); - } else { - joinQuery = JoinUtil.createJoinQuery("to", multipleValuesPerDocument, "from", actualQuery, indexSearcher, scoreMode); + { + // single val can be handled by multiple-vals + final boolean muliValsQuery = multipleValuesPerDocument || random().nextBoolean(); + final String fromField = from ? "from":"to"; + final String toField = from ? "to":"from"; + + if (random().nextBoolean()) { // numbers + final NumericType numType = random().nextBoolean() ? 
NumericType.INT: NumericType.LONG ; + joinQuery = JoinUtil.createJoinQuery(fromField+numType, muliValsQuery, toField+numType, numType, actualQuery, indexSearcher, scoreMode); + } else { + joinQuery = JoinUtil.createJoinQuery(fromField, muliValsQuery, toField, actualQuery, indexSearcher, scoreMode); + } } if (VERBOSE) { System.out.println("joinQuery=" + joinQuery); @@ -897,7 +911,6 @@ return; } - assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f); if (VERBOSE) { for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) { System.out.printf(Locale.ENGLISH, "Expected doc: %d | Actual doc: %d\n", expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc); @@ -904,6 +917,7 @@ System.out.printf(Locale.ENGLISH, "Expected score: %f | Actual score: %f\n", expectedTopDocs.scoreDocs[i].score, actualTopDocs.scoreDocs[i].score); } } + assertEquals(expectedTopDocs.getMaxScore(), actualTopDocs.getMaxScore(), 0.0f); for (int i = 0; i < expectedTopDocs.scoreDocs.length; i++) { assertEquals(expectedTopDocs.scoreDocs[i].doc, actualTopDocs.scoreDocs[i].doc); @@ -919,14 +933,15 @@ } Directory dir = newDirectory(); + final Random random = random(); RandomIndexWriter w = new RandomIndexWriter( - random(), + random, dir, - newIndexWriterConfig(new MockAnalyzer(random(), MockTokenizer.KEYWORD, false)) + newIndexWriterConfig(new MockAnalyzer(random, MockTokenizer.KEYWORD, false)) ); IndexIterationContext context = new IndexIterationContext(); - int numRandomValues = nDocs / RandomInts.randomIntBetween(random(), 2, 10); + int numRandomValues = nDocs / RandomInts.randomIntBetween(random, 1, 4); context.randomUniqueValues = new String[numRandomValues]; Set trackSet = new HashSet<>(); context.randomFrom = new boolean[numRandomValues]; @@ -933,32 +948,46 @@ for (int i = 0; i < numRandomValues; i++) { String uniqueRandomValue; do { -// uniqueRandomValue = TestUtil.randomRealisticUnicodeString(random()); - uniqueRandomValue = 
TestUtil.randomSimpleString(random()); + // the trick is to generate values which will be ordered similarly for string, ints&longs, positive nums makes it easier + final int nextInt = random.nextInt(Integer.MAX_VALUE); + uniqueRandomValue = String.format(Locale.ROOT, "%08x", nextInt); + assert nextInt == Integer.parseUnsignedInt(uniqueRandomValue,16); } while ("".equals(uniqueRandomValue) || trackSet.contains(uniqueRandomValue)); + // Generate unique values and empty strings aren't allowed. trackSet.add(uniqueRandomValue); - context.randomFrom[i] = random().nextBoolean(); + + context.randomFrom[i] = random.nextBoolean(); context.randomUniqueValues[i] = uniqueRandomValue; + } + List randomUniqueValuesReplica = new ArrayList<>(Arrays.asList(context.randomUniqueValues)); + RandomDoc[] docs = new RandomDoc[nDocs]; for (int i = 0; i < nDocs; i++) { String id = Integer.toString(i); - int randomI = random().nextInt(context.randomUniqueValues.length); + int randomI = random.nextInt(context.randomUniqueValues.length); String value = context.randomUniqueValues[randomI]; Document document = new Document(); - document.add(newTextField(random(), "id", id, Field.Store.YES)); - document.add(newTextField(random(), "value", value, Field.Store.NO)); + document.add(newTextField(random, "id", id, Field.Store.YES)); + document.add(newTextField(random, "value", value, Field.Store.NO)); boolean from = context.randomFrom[randomI]; - int numberOfLinkValues = multipleValuesPerDocument ? 2 + random().nextInt(10) : 1; + int numberOfLinkValues = multipleValuesPerDocument ? Math.min(2 + random.nextInt(10), context.randomUniqueValues.length) : 1; docs[i] = new RandomDoc(id, numberOfLinkValues, value, from); if (globalOrdinalJoin) { document.add(newStringField("type", from ? 
"from" : "to", Field.Store.NO)); } - for (int j = 0; j < numberOfLinkValues; j++) { - String linkValue = context.randomUniqueValues[random().nextInt(context.randomUniqueValues.length)]; + final List subValues; + { + int start = randomUniqueValuesReplica.size()==numberOfLinkValues? 0 : random.nextInt(randomUniqueValuesReplica.size()-numberOfLinkValues); + subValues = randomUniqueValuesReplica.subList(start, start+numberOfLinkValues); + Collections.shuffle(subValues, random); + } + for (String linkValue : subValues) { + + assert !docs[i].linkValues.contains(linkValue); docs[i].linkValues.add(linkValue); if (from) { if (!context.fromDocuments.containsKey(linkValue)) { @@ -970,15 +999,8 @@ context.fromDocuments.get(linkValue).add(docs[i]); context.randomValueFromDocs.get(value).add(docs[i]); - document.add(newTextField(random(), "from", linkValue, Field.Store.NO)); - if (multipleValuesPerDocument) { - document.add(new SortedSetDocValuesField("from", new BytesRef(linkValue))); - } else { - document.add(new SortedDocValuesField("from", new BytesRef(linkValue))); - } - if (globalOrdinalJoin) { - document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); - } + addLinkFields(random, document, "from", linkValue, multipleValuesPerDocument, globalOrdinalJoin); + } else { if (!context.toDocuments.containsKey(linkValue)) { context.toDocuments.put(linkValue, new ArrayList()); @@ -989,20 +1011,12 @@ context.toDocuments.get(linkValue).add(docs[i]); context.randomValueToDocs.get(value).add(docs[i]); - document.add(newTextField(random(), "to", linkValue, Field.Store.NO)); - if (multipleValuesPerDocument) { - document.add(new SortedSetDocValuesField("to", new BytesRef(linkValue))); - } else { - document.add(new SortedDocValuesField("to", new BytesRef(linkValue))); - } - if (globalOrdinalJoin) { - document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); - } + addLinkFields(random, document, "to", linkValue, multipleValuesPerDocument, 
globalOrdinalJoin); } } w.addDocument(document); - if (random().nextInt(10) == 4) { + if (random.nextInt(10) == 4) { w.commit(); } if (VERBOSE) { @@ -1010,7 +1024,7 @@ } } - if (random().nextBoolean()) { + if (random.nextBoolean()) { w.forceMerge(1); } w.close(); @@ -1185,6 +1199,30 @@ return context; } + private void addLinkFields(final Random random, Document document, final String fieldName, String linkValue, + boolean multipleValuesPerDocument, boolean globalOrdinalJoin) { + document.add(newTextField(random, fieldName, linkValue, Field.Store.NO)); + + final int linkInt = Integer.parseUnsignedInt(linkValue,16); + document.add(new IntField(fieldName+NumericType.INT, linkInt, Field.Store.NO)); + + final long linkLong = linkInt<<32 | linkInt; + document.add(new LongField(fieldName+NumericType.LONG, linkLong, Field.Store.NO)); + + if (multipleValuesPerDocument) { + document.add(new SortedSetDocValuesField(fieldName, new BytesRef(linkValue))); + document.add(new SortedNumericDocValuesField(fieldName+NumericType.INT, linkInt)); + document.add(new SortedNumericDocValuesField(fieldName+NumericType.LONG, linkLong)); + } else { + document.add(new SortedDocValuesField(fieldName, new BytesRef(linkValue))); + document.add(new NumericDocValuesField(fieldName+NumericType.INT, linkInt)); + document.add(new NumericDocValuesField(fieldName+NumericType.LONG, linkLong)); + } + if (globalOrdinalJoin) { + document.add(new SortedDocValuesField("join_field", new BytesRef(linkValue))); + } + } + private TopDocs createExpectedTopDocs(String queryValue, final boolean from, final ScoreMode scoreMode, Index: lucene/join =================================================================== --- lucene/join (revision 1718443) +++ lucene/join (working copy) Property changes on: lucene/join ___________________________________________________________________ Modified: svn:mergeinfo Merged /lucene/dev/trunk/lucene/join:r1718443 Index: lucene 
=================================================================== --- lucene (revision 1718443) +++ lucene (working copy) Property changes on: lucene ___________________________________________________________________ Modified: svn:mergeinfo Merged /lucene/dev/trunk/lucene:r1718443 Index: . =================================================================== --- . (revision 1718443) +++ . (working copy) Property changes on: . ___________________________________________________________________ Modified: svn:mergeinfo Merged /lucene/dev/trunk:r1718443 ```
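One detail in the `addLinkFields` test helper of the patch above may be worth a sketch. The expression `linkInt<<32 | linkInt` is evaluated in `int` arithmetic, where Java masks the shift distance to its low five bits, so the result is `linkInt` itself. The test still joins correctly, since the indexed field and the doc value receive the same number, but the LONG fields never leave the int range. The names `ShiftWidthDemo`, `asWritten` and `widened` below are hypothetical and not part of the patch.

```java
public class ShiftWidthDemo {

    // Mirrors the expression from the patch: the shift happens on an int,
    // so the distance 32 is masked to 0 and the result is just linkInt.
    static long asWritten(int linkInt) {
        return linkInt << 32 | linkInt;
    }

    // Widening before the shift places the value in both 32-bit halves.
    // The mask keeps a negative low half from sign-extending into the high half.
    static long widened(int linkInt) {
        return ((long) linkInt << 32) | (linkInt & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        int linkInt = 0x12345678;
        System.out.println(Long.toHexString(asWritten(linkInt))); // prints 12345678
        System.out.println(Long.toHexString(widened(linkInt)));   // prints 1234567812345678
    }
}
```

Whether the wider value is actually wanted is a judgment call for the test; the sketch only shows what each form computes.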

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1718473 from mkhl@apache.org in branch 'dev/branches/branch_5x' https://svn.apache.org/r1718473

LUCENE-5868: query-time join for numerics

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

it passed https://builds.apache.org/job/Lucene-Solr-NightlyTests-trunk/875/console

asfimport commented 8 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Compilation is failing on branch_5x: https://builds.apache.org/job/Lucene-Artifacts-5.x/1037/ (java8 constructs on a java7 build I imagine):

```
   [javac] Compiling 6 source files to /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/build/join/classes/test
   [javac] /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java:954: error: cannot find symbol
   [javac]         assert nextInt == Integer.parseUnsignedInt(uniqueRandomValue,16);
   [javac]                                  ^
   [javac]   symbol:   method parseUnsignedInt(String,int)
   [javac]   location: class Integer
   [javac] /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/join/src/test/org/apache/lucene/search/join/TestJoinUtil.java:1206: error: cannot find symbol
   [javac]     final int linkInt = Integer.parseUnsignedInt(linkValue,16);
   [javac]                                ^
   [javac]   symbol:   method parseUnsignedInt(String,int)
   [javac]   location: class Integer
   [javac] Note: /x1/jenkins/jenkins-slave/workspace/Lucene-Artifacts-5.x/lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoin.java uses or overrides a deprecated API.
   [javac] Note: Recompile with -Xlint:deprecation for details.
   [javac] 2 errors
```

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

I'm going to fix it

asfimport commented 8 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1718517 from mkhl@apache.org in branch 'dev/branches/branch_5x' https://svn.apache.org/r1718517

LUCENE-5868: removing Java8's parseUnsignedInt
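
The diff for this commit is not included in the thread. As a rough sketch of one Java 7-compatible replacement, assuming the input is the eight-digit hex string the test builds with `String.format(Locale.ROOT, "%08x", nextInt)`, the value can be parsed through `Long.parseLong` and narrowed to `int`. The names `Java7HexParse` and `parseUnsignedHex` are hypothetical and may not match what was committed.

```java
import java.util.Locale;

public class Java7HexParse {

    // Java 7 stand-in for Integer.parseUnsignedInt(s, 16).
    // Long.parseLong accepts the full unsigned 32-bit range ("ffffffff" included);
    // the range check rejects anything that does not fit in 32 bits.
    static int parseUnsignedHex(String s) {
        long value = Long.parseLong(s, 16);
        if (value < 0 || value > 0xFFFFFFFFL) {
            throw new NumberFormatException("out of unsigned int range: " + s);
        }
        return (int) value;
    }

    public static void main(String[] args) {
        // Same formatting the test uses for its random link values.
        String hex = String.format(Locale.ROOT, "%08x", 123456789);
        System.out.println(hex + " -> " + parseUnsignedHex(hex));
    }
}
```

Because the test draws its values from `random.nextInt(Integer.MAX_VALUE)`, they are never negative, so plain `Integer.parseInt(s, 16)` would also work there.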

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

Oh. Yeah!! 5.x is fixed https://builds.apache.org/job/Lucene-Solr-Tests-5.x-Java7/3822/changes

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

Suggestions for javadoc are accepted

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

```diff
Index: lucene/common-build.xml
===================================================================
--- lucene/common-build.xml (revision 1718516)
+++ lucene/common-build.xml (working copy)
@@ -321,6 +321,12 @@
     </condition>
   </fail>

+  <fail message="Maximum supported Java version is 1.7.">
+    <condition>
+      <hasmethod classname="java.lang.Integer" method="parseUnsignedInt"/>
+    </condition>
+  </fail>
+   
   <!-- temporary for cleanup of java.specification.version, to be in format "x.y" -->
   <loadresource property="-cleaned.specification.version">
     <propertyresource name="java.specification.version"/>
```

This amendment will keep me from such mistakes in 5.x. Let me know if there is a more regular approach.
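
For readers unfamiliar with Ant's `<hasmethod>` condition, the check in this amendment amounts to asking whether the running JDK's `java.lang.Integer` declares a method named `parseUnsignedInt`, which was added in Java 8. A reflective Java sketch of the same idea follows; `MethodProbe` and `hasMethod` are hypothetical names, and this illustrates the concept rather than how Ant implements it.

```java
import java.lang.reflect.Method;

public class MethodProbe {

    // Returns true if the named class declares a method with the given name,
    // regardless of its parameter types.
    static boolean hasMethod(String className, String methodName) {
        try {
            for (Method m : Class.forName(className).getDeclaredMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
        } catch (ClassNotFoundException e) {
            // Unknown class: treat as "method not present".
        }
        return false;
    }

    public static void main(String[] args) {
        // On a Java 7 runtime this prints false; on Java 8 and later, true.
        System.out.println(hasMethod("java.lang.Integer", "parseUnsignedInt"));
    }
}
```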

asfimport commented 8 years ago

Mikhail Khludnev (@mkhludnev) (migrated from JIRA)

raised SOLR-8395 as Solr enablement
