Query-time join collectors could maybe be more efficient [LUCENE-4771]

asfimport commented 11 years ago

I was looking @ these collectors on #5830 and I noticed:

SingleValued collector (SV) pulls FieldCache.getTerms and adds the bytes to a bytesrefhash per-collect.
MultiValued collector (MV) pulls FieldCache.getDocTermsOrds, but doesnt use the ords, just looks up each value and adds the bytes per-collect.

I think instead its worth investigating if SV should use getTermsIndex, and both collectors just collect-up their per-segment ords in something like a BitSet[maxOrd].

When asked for the terms at the end in getCollectorTerms(), they could merge these into one BytesRefHash.

Of course, if you are going to turn around and execute the query against the same searcher anyway (is this the typical case?), this could even be more efficient: No need to hash or instantiate all the terms in memory, we could do postpone the lookups to SeekingTermSetTermsEnum.accept()/nextSeekTerm() i think... somehow :)

Migrated from LUCENE-4771 by Robert Muir (@rmuir), updated Sep 17 2013 Attachments: LUCENE-4771_prototype_without_bug.patch, LUCENE-4771_prototype.patch, LUCENE-4771-prototype.patch

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

here's a prototype... it seems to mostly work but there is a bug eventually in the random test :)

its against branches/lucene4765 because thats just what i was working with.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

updated patch without the bug. now the random test passes always.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

for this to be fast the BitsFilteredTermsEnum would have to seek to nextSetBit or whatever... I think thats easy to fix.

Another thing i dont like is that then the Query returned by JoinUtil then depends on the old reader.

But i think it demonstrates the idea... a faster collect() and not having to hash and load up all the term bytes in RAM.

asfimport commented 11 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1 this looks very compelling!

asfimport commented 11 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

+1 Very nice improvement.

Another thing i dont like is that then the Query returned by JoinUtil then depends on the old reader.

I thought about it and the only solution I can come up with is to just keep the reference of the reader around. Just put the atomic reader somewhere in the BytesRefIterable impl during the first phase.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

hmm I think there are alternative solutions: but we have to change the API.

Really the TermsQuery generated here is not good, because it holds reader-dependent state (its like a already-rewritten multitermquery). Then its allowed to have a 'delayed rewrite'.

A simple solution could be to take the 'target reader' as a parameter too, and rewrite it ourselves, returning what TermsQuery rewrites to back to the user (e.g. ConstantScore or whatever).

asfimport commented 11 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

A simple solution could be to take the 'target reader' as a parameter too, and rewrite it ourselves, returning what TermsQuery rewrites to back to the user (e.g. ConstantScore or whatever).

Adding the toSearcher as parameter to JoinUtil, seems like a good idea. If we do this then we can automatically enable the second optimisation you mentioned in the issue description (postpone lookups).

asfimport commented 11 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Updated patch with the current trunk codebase.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks for updating Martijn! I plan to look at this later tonight and work on pulling out the BitsFilteredTermsEnum and making it more efficient. After that, I think we should revisit the intersection (I started with something ultra-simple here) to make sure its optimal too.

Somehow actually we should try to come up with a standard benchmark (luceneutil or similar) so that we can test the approach for the single-valued case there too. My intuition says I think it can be a win in both cases.

asfimport commented 11 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

I also think that this approach will improve single-valued joining too. Just collecting the ordinals and fetching the actual terms on the fly without hashing should be much faster.

Just wondering how to make a standard benchmark. Usually when I test joining I generate random docs. Simple docs with with random from values and docs with matching to values and also have different from to to docs ratios. Maybe we can use the stackoverflow dataset (join questions and answers) as test dataset with relational like data. Not sure if this is possible licence wise.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

By the way, I havent gone further here but I think this approach here can be done even faster.

The patch here would simple reduce some hashing and ram-usage, but still be slow for each query if there are many terms, because its 'aligning' terms per-query.

instead, someone could simply make an OrdinalMap (it takes TermsEnum[]) of the fields they are joining, which will "align" terms across all those segments/fields into a global space, and the joining can just work on simple ords.

this way you only align terms per-reopen instead of per-query. I think this would be much faster.

asfimport commented 11 years ago

Martijn van Groningen (@martijnvg) (migrated from JIRA)

Yes, that would work really nice. The TermsCollector can than just collect global ordinals in a BitSet impl and the TermsQuery can just iterate from this BitSet.

apache / lucene

Query-time join collectors could maybe be more efficient [LUCENE-4771] #5836