apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Add CollectorRescorer [LUCENE-8791] #9835

Open asfimport opened 5 years ago

asfimport commented 5 years ago

This is another implementation of the query rescorer API (#6552). It adds rescoring functionality based on a provided CollectorManager.


Migrated from LUCENE-8791 by Elbek Kamoliddinov, updated Jun 14 2019 Attachments: LUCENE-8791.patch (versions: 5)
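Since the patch itself is only attached to the Jira issue, here is a rough, self-contained sketch of the kind of per-segment bookkeeping such a rescorer needs: walking sorted first-pass hits and converting global docIDs to segment-local ones. All names are hypothetical, not taken from the patch.

```java
import java.util.Arrays;

// Hypothetical sketch: walk first-pass hits (global docIDs, sorted) segment by
// segment, converting each to a segment-local docID the way a rescorer must
// before handing docs to a per-segment collector. Names are illustrative.
public class SegmentWalkSketch {
    // docBases[i] = first global docID of segment i; maxDocs[i] = doc count of segment i.
    static int[][] groupBySegment(int[] sortedGlobalDocs, int[] docBases, int[] maxDocs) {
        int numSegments = docBases.length;
        int[][] perSegment = new int[numSegments][];
        int hit = 0;
        for (int seg = 0; seg < numSegments; seg++) {
            int end = docBases[seg] + maxDocs[seg];
            int start = hit;
            while (hit < sortedGlobalDocs.length && sortedGlobalDocs[hit] < end) {
                hit++;
            }
            int[] local = new int[hit - start];
            for (int i = start; i < hit; i++) {
                // After skipping to this segment, every docID must be >= its docBase.
                assert sortedGlobalDocs[i] >= docBases[seg] : "docID went backwards across segments";
                local[i - start] = sortedGlobalDocs[i] - docBases[seg];
            }
            perSegment[seg] = local;
        }
        return perSegment;
    }

    public static void main(String[] args) {
        int[] docBases = {0, 10, 25};
        int[] maxDocs = {10, 15, 5};
        int[] hits = {3, 9, 12, 26, 29};
        System.out.println(Arrays.deepToString(groupBySegment(hits, docBases, maxDocs)));
        // prints [[3, 9], [2], [1, 4]]
    }
}
```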

asfimport commented 5 years ago

Elbek Kamoliddinov (migrated from JIRA)

Attached the patch; it adds CollectorRescorer and a test case. The test case became a bit verbose because of the inlined CollectorManager and LeafCollector classes.

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

Hi Elbek Kamoliddinov,

A few comments:

1) A better naming scheme would be helpful (e.g. exec could be renamed to execManager for better context; ditto for the pending docIDs list, and using verbose names for objects instead of "Collector c").

2) Please have another look at the spacing. In general, the code would be more readable with consistent spacing around braces and with the code broken into logical paragraphs.

3) Could the class Job have a better name? Maybe RescoreTask?

4) Same for the class Jobs.

5) Please add a descriptive error message to the assert ensuring that the docID, after skipping a segment, is higher than the maxDoc.

6) In general, I am a bit unsure about the class name and description. The class advertises itself as a Collector-based rescorer implementation, whereas the only constructor of the class takes a CollectorManager instance, which is specific to parallel segment traversal. Should the class say that it is used for parallel rescoring of hits across multiple segments?

7) I am a bit wary of the way the patch maps the global jobs array to the jobs associated with segments belonging to one slice. In particular, the buildJobs method does not consider the leaf ordinal when assigning values to the global array, but rescore depends on that during retrieval.

asfimport commented 5 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Please have another look at the spacing. In general, the code would be more readable with consistent spacing around braces and with the code broken into logical paragraphs.

The spacing/indenting looks correct to me – it seems to match Lucene's coding guidelines (https://wiki.apache.org/lucene-java/DeveloperTips).

 

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

Apologies then, my formatter must be screwing things up :)

asfimport commented 5 years ago

Elbek Kamoliddinov (migrated from JIRA)

@atris

Hi Atri, thanks for taking a look at the patch. I added more docs and also renamed Job and Jobs to SegmentRescorer and RescorerTask respectively; hopefully this is more descriptive. I left docIDs as it is, since that looks pretty standard. I am not sure I got the point here:

In general, I am a bit unsure about the class name and description. The class advertises itself as a Collector-based rescorer implementation, whereas the only constructor of the class takes a CollectorManager instance, which is specific to parallel segment traversal. Should the class say that it is used for parallel rescoring of hits across multiple segments?

Yes, it takes a CollectorManager instance and that is all it needs. It also makes sense to me that the constructor takes a CollectorManager, because without it there is no point in creating a CollectorRescorer. Also, as you can see, it doesn't necessarily run in parallel: if the ExecutorService is null then it runs sequentially, so calling it parallel would be wrong. The Javadoc on the constructor highlights this.

I am a bit wary of the way the patch maps the global jobs array to the jobs associated with segments belonging to one slice. In particular, the buildJobs method does not consider the leaf ordinal when assigning values to the global array, but rescore depends on that during retrieval.

Good one. Corrected it to use ord-based assignment.

Thanks.

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

Hi Elbek Kamoliddinov,

Yes, it takes a CollectorManager instance and that is all it needs. It also makes sense to me that the constructor takes a CollectorManager, because without it there is no point in creating a CollectorRescorer. Also, as you can see, it doesn't necessarily run in parallel: if the ExecutorService is null then it runs sequentially, so calling it parallel would be wrong. The Javadoc on the constructor highlights this.

Ok, I am fine with it as long as it is clearly advertised (maybe add it to the class-level Javadoc itself?)

Corrected it to use ord-based assignment.

Could you add a test for this?

I will take another deeper look into the updated patch today and post comments.

asfimport commented 5 years ago

Elbek Kamoliddinov (migrated from JIRA)

testMultiSegment creates multiple segments, and testRandom commits conditioned on rarely() to produce a multi-segment index. So both of these tests should be sufficient?

Thanks again for looking into it.

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I like the idea of a collector-based rescorer so that you can do things that you can't do with queries, such as ensuring there is some diversity on the first page (we have a DiversifiedTopDocsCollector). However, I'm curious about the fact that this rescorer takes an executor: are you running rescorers that are very CPU-intensive?

asfimport commented 5 years ago

Elbek Kamoliddinov (migrated from JIRA)

Yes, we run CPU-intensive ML models to rescore hits; it always runs with an executor. The idea is that we first collect the top X hits with a static scorer, then run this rescorer on those X hits to reduce them to, say, the top 20 best hits.
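As a toy, self-contained illustration of that two-pass flow (the model function and all names here are made up; the real setup would go through an IndexSearcher and the rescorer in the patch):

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of the two-pass flow described above: collect the top X
// hits with a cheap static score, then rerank only those X with an expensive
// model and keep the top K. The toy model and all names are illustrative.
public class TwoPassSketch {
    static int[] rescoreTopHits(int[] docs, double[] staticScores, int x, int k) {
        // First pass: take the top X docs by static score.
        Integer[] order = new Integer[docs.length];
        for (int i = 0; i < docs.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -staticScores[i]));
        int firstPass = Math.min(x, docs.length);

        // Second pass: apply the expensive scorer only to those X hits.
        Integer[] top = Arrays.copyOf(order, firstPass);
        Arrays.sort(top, Comparator.comparingDouble((Integer i) -> -expensiveModelScore(docs[i])));

        int[] result = new int[Math.min(k, firstPass)];
        for (int i = 0; i < result.length; i++) result[i] = docs[top[i]];
        return result;
    }

    // Stand-in for a CPU-intensive ML model.
    static double expensiveModelScore(int doc) {
        return doc % 7; // toy scoring function only
    }

    public static void main(String[] args) {
        int[] top = rescoreTopHits(new int[]{1, 2, 3, 4, 5}, new double[]{0.1, 0.9, 0.5, 0.8, 0.2}, 3, 2);
        System.out.println(Arrays.toString(top)); // [4, 3]
    }
}
```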

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

Hi Elbek Kamoliddinov,

Thanks for another iteration. A few other comments:

1) In buildSegmentScorers, could we rename maxDoc to endDoc? Since maxDoc is typically used to denote the number of documents in a segment, the code can be confusing to read.

2) When mapping LeafSlices to SegmentScorers, we do another loop after the construction of the SegmentScorers. Is there a way we could do the mapping in buildSegmentScorers itself, at the time of scorer task allocation?

3) I am a bit unsure about how the rescore method works. If IndexSearcher has non-empty slices, that implies that a non-null ExecutorService was passed to IndexSearcher. Is there any correlation between that and CollectorRescorer using an ExecutorService?

If yes, then a plain assert checking that the ExecutorService is not null should suffice, and there is no need for the if-else block.

If not, then we should ignore IndexSearcher's LeafSlices when CollectorRescorer is going single-threaded, i.e. we should not do the slice-to-segment mapping, since we are not going to use it anyway.

4) Could you please break up the rescore method into more modular methods? Ideally, one each for the sequential and parallel cases. Also, there is some code duplication at the tail where you do the sequential execution; that should be eliminated.

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

Thinking more about this, using LeafReaderContext.ord directly as a positional parameter might be risky, since LeafReaderContext has a constructor that sets the ord to 0. Ideally that should not happen given that these segments were already scored once, but it might be one case worth thinking about?

asfimport commented 5 years ago

Elbek Kamoliddinov (migrated from JIRA)

Sorry, I was a bit busy and could not work on the feedback. I created another iteration with some changes.

1) In buildSegmentScorers, could we rename maxDoc to endDoc? Since maxDoc is typically used to denote the number of documents in a segment, the code can be confusing to read.

I left the name as maxDoc; it denotes the max doc and is used to check whether we went beyond it so we advance. Is that convincing?

When mapping LeafSlices to SegmentScorers, we do another loop after the construction of the SegmentScorers. Is there a way we could do the mapping in buildSegmentScorers itself, at the time of scorer task allocation?

The thing is, a slice may contain out-of-order segments; for example, slice 0 may contain segment 0 and segment 3. So we have to do the full iteration here first, then map segments to the slices.
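A minimal sketch of the ord-based assignment being described, with simplified stand-in types (not the patch's actual classes): per-segment tasks land in a global array indexed by leaf ordinal, so a slice holding out-of-order segments still finds each task at its own ord.

```java
import java.util.Arrays;

// Hypothetical sketch of ord-based assignment: tasks are stored at their leaf
// ordinal in a global array, independent of iteration order within a slice.
public class OrdMappingSketch {
    // slices[s] lists the leaf ordinals belonging to slice s (possibly out of order).
    static String[] buildTasks(int numLeaves, int[][] slices) {
        String[] tasksByOrd = new String[numLeaves];
        for (int slice = 0; slice < slices.length; slice++) {
            for (int ord : slices[slice]) {
                // Index by leaf ord, not by iteration position, so retrieval
                // by ord later always finds the right task.
                tasksByOrd[ord] = "slice" + slice + "/leaf" + ord;
            }
        }
        return tasksByOrd;
    }

    public static void main(String[] args) {
        // Slice 0 holds leaves 0 and 3; slice 1 holds leaves 1 and 2.
        System.out.println(Arrays.toString(buildTasks(4, new int[][]{{0, 3}, {1, 2}})));
        // prints [slice0/leaf0, slice1/leaf1, slice1/leaf2, slice0/leaf3]
    }
}
```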

3) I am a bit unsure about how the rescore method works. If IndexSearcher has non-empty slices, that implies that a non-null ExecutorService was passed to IndexSearcher. Is there any correlation between that and CollectorRescorer using an ExecutorService?

I think it depends on the use case; one may want to run the index searcher with an executor but want this to run sequentially. I made some changes there: now slices work only with an executor, and if the executor is null then the slices are ignored. Also, if the slices are null but the executor is not (and the segment count is > 1), then it rescores each segment in a separate thread.
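The sequential-vs-parallel policy described here can be sketched in isolation; this is a simplified stand-in, not the patch's code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the execution policy: a null executor means everything
// runs sequentially on the calling thread (slices would be ignored); otherwise
// each task is submitted as a separate job.
public class ExecutionPolicySketch {
    static List<Integer> runAll(List<Callable<Integer>> tasks, ExecutorService executor) throws Exception {
        List<Integer> results = new ArrayList<>();
        if (executor == null) {
            for (Callable<Integer> task : tasks) {
                results.add(task.call()); // sequential path
            }
        } else {
            List<Future<Integer>> futures = new ArrayList<>();
            for (Callable<Integer> task : tasks) {
                futures.add(executor.submit(task)); // parallel path
            }
            for (Future<Integer> f : futures) {
                results.add(f.get()); // preserve task order in the results
            }
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            final int v = i;
            tasks.add(() -> v);
        }
        System.out.println(runAll(tasks, null)); // sequential: [1, 2, 3]
        ExecutorService pool = Executors.newFixedThreadPool(2);
        System.out.println(runAll(tasks, pool)); // parallel: [1, 2, 3]
        pool.shutdown();
    }
}
```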

4) Could you please break up the rescore method into more modular methods? Ideally, one each for the sequential and parallel cases. Also, there is some code duplication at the tail where you do the sequential execution; that should be eliminated.

Done, but I split it into slice-based and segment-based methods, because in both new methods execution may happen in parallel.

Thinking more about this, using LeafReaderContext.ord directly as a positional parameter might be risky, since LeafReaderContext has a constructor that sets the ord to 0. Ideally that should not happen given that these segments were already scored once, but it might be one case worth thinking about?

I don't think I understood this concern. Are you saying there might be a duplicate leaf with ord 0, or there might be a gap in the leaf ords? I don't think either is possible. A lot of internal classes use ord this way, for example OrdinalMap.

Let me know what you think.

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Yes, we run CPU-intensive ML models to rescore hits; it always runs with an executor.

I understand how ML models can be costly to build, but running them against a small set of hits should perform reasonably fast?

using LeafReaderContext.ord directly as a positional parameter might be risky, since LeafReaderContext has a constructor that sets the ord to 0

This constructor is pkg-private and only used for single-segment reader contexts, such as the ones that you get from LeafReader#getContext. Reader context ords should be reliable.

asfimport commented 5 years ago

Elbek Kamoliddinov (migrated from JIRA)

I understand how ML models can be costly to build, but running them against a small set of hits should perform reasonably fast

Sometimes this set is not that small. Documents may be enriched at indexing time, hence too many may match. We do early termination when we have too many results, but this upper bound is still a bit costly when running sequentially. We tested the whole query pipeline running a query per thread and had mixed results, but overall multi-threaded query serving offered better results.

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

We do early termination when we have too many results, but this upper bound is still a bit costly when running sequentially

Interesting. How do you early terminate across multiple slices running concurrently?

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Sometimes this set is not that small. Documents may be enriched at indexing time, hence too many may match.

We don't do anything to prevent using rescorers on lots of hits, but I'd rather not optimize for this use-case. In my opinion, rescorers should be used on the very top hits only.

asfimport commented 5 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

In my opinion, rescorers should be used on the very top hits only.

Is the concern that this rescorer optionally accepts ExecutorService to distribute the work across threads?

Maybe we could just add another ctor not taking ExecutorService for those use cases that want to run single-threaded?

 

asfimport commented 5 years ago

Elbek Kamoliddinov (migrated from JIRA)

After being MIA, I am back at this. I was thinking along the same lines as Mike, but even more brutal: the change is to have a single sole ctor which takes one CollectorManager param, plus a setter for the ExecutorService object. This way running single-threaded gets first-class treatment? I have attached an updated patch.

@atris

Interesting. How do you early terminate across multiple slices running concurrently?

We distribute the total number of results we are looking for evenly across segments, plus some static number for overhead (I think), and terminate when each segment has collected enough docs.
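A toy sketch of that budget calculation; the overhead constant is illustrative, since the comment only says "some static number":

```java
// Hypothetical sketch of the early-termination budget described above: the
// requested hit count is spread evenly across segments, plus a static cushion,
// and each segment stops once it has collected its share.
public class BudgetSketch {
    static final int OVERHEAD = 10; // illustrative static cushion per segment

    static int perSegmentBudget(int totalWanted, int numSegments) {
        // Ceiling division so the shares cover totalWanted even when it does
        // not divide evenly across segments.
        return (totalWanted + numSegments - 1) / numSegments + OVERHEAD;
    }

    public static void main(String[] args) {
        System.out.println(perSegmentBudget(100, 8)); // ceil(100/8) + 10 = 23
    }
}
```

As msokolov notes later in the thread, this pro-rating is only accurate when documents are randomly distributed among segments; skew (e.g. a time-series index) breaks the prediction.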

asfimport commented 5 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

My concern is that we are introducing complexity for a use-case that seems to collect a number of hits in the first pass that is above what I would consider reasonable. I understand why you might want to set an executor service on an IndexSearcher, since it needs to do work over potentially millions of hits. However, the number of documents that I expect rescoring to have to look at is more on the order of hundreds, which should already run in just a few milliseconds with a single thread.

That concern aside, I'm supportive of the ability to rescore hits based on a collector, which adds flexibility compared to what you can do with a query rescorer. I won't veto this change, but could we make the single-threaded constructor take a Collector rather than a CollectorManager and document that the one that takes an ExecutorService is expert and usually doesn't help significantly?

We distribute the total number of results we are looking for evenly across segments, plus some static number for overhead

This way of working would be a nice enhancement to IndexSearcher when constructed with an ExecutorService! Do you just accept the decrease of precision if one slice didn't collect enough hits, or do you re-run the search on this slice with a greater number of hits?

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

could we make the single-threaded constructor take a Collector rather than a CollectorManager and document that the one that takes an ExecutorService is expert and usually doesn't help significantly?

That has also been one of my main gripes with the approach this patch takes – the default interface takes an ExecutorManager. Maybe it is just a matter of personal taste, but it still catches the eye. It would be great if we could improve upon that.

asfimport commented 5 years ago

Elbek Kamoliddinov (migrated from JIRA)

Do you just accept the decrease of precision if one slice didn't collect enough hits, or do you re-run the search on this slice with a greater number of hits?

We don't rerun, just accept the result. This approach has been working fine.

I won't veto this change, but could we make the single-threaded constructor take a Collector rather than a CollectorManager and document that the one that takes an ExecutorService is expert and usually doesn't help significantly?

Hmm, how do we return TopDocs from the Collector interface? As I understand it, Collector/LeafCollector doesn't make any assumptions about how the collected docs are sorted, scored, etc. So is it even possible to build TopDocs from a Collector? With CollectorManager, the user plugs in this logic. Taking a Collector object in the constructor feels like a bad API and feels confusing. I added an expert comment in the last patch. So the patch has a sole ctor:

public CollectorRescorer(CollectorManager<C, TopDocs> manager) {
  this.manager = manager;
}
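To illustrate why the reduce step has to live on the manager rather than on a single Collector, here is a miniature stand-in for the CollectorManager contract (simplified types, not Lucene's actual interfaces):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical miniature of the CollectorManager contract: a per-slice
// collector has no way to say how its hits combine, while the manager owns
// both creation and the reduce step that produces the final result.
public class ManagerSketch {
    interface MiniManager<C, R> {
        C newCollector();             // one collector per slice/thread
        R reduce(List<C> collectors); // merge partial results into the final form
    }

    // A manager that merges per-slice score lists into one sorted "top docs" list.
    static MiniManager<List<Integer>, List<Integer>> topScores() {
        return new MiniManager<List<Integer>, List<Integer>>() {
            public List<Integer> newCollector() {
                return new ArrayList<>();
            }
            public List<Integer> reduce(List<List<Integer>> collectors) {
                List<Integer> all = new ArrayList<>();
                for (List<Integer> c : collectors) all.addAll(c);
                all.sort(Collections.reverseOrder()); // user-supplied merge logic
                return all;
            }
        };
    }

    public static void main(String[] args) {
        MiniManager<List<Integer>, List<Integer>> m = topScores();
        List<Integer> c1 = m.newCollector();
        c1.add(3);
        c1.add(1);
        List<Integer> c2 = m.newCollector();
        c2.add(5);
        System.out.println(m.reduce(Arrays.asList(c1, c2))); // [5, 3, 1]
    }
}
```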
asfimport commented 5 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

the default interface takes an ExecutorManager.

Hmm, did you mean ExecutorService or CollectorManager here? There is no ExecutorManager that I can see.

+1 to mark the ExecutorService ctor/setter as expert w/ javadocs that explain that it is not often needed to distribute collection work across concurrent threads.

asfimport commented 5 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

We distribute the total number of results we are looking for evenly across segments, plus some static number for overhead

I think this is the same pro-rated idea from #9727; when the documents are randomly distributed among segments, the prediction can be quite accurate. In the case of a time-series index though (or any index where the distribution among segments is correlated with the rank), this approach to early termination is not directly applicable.

asfimport commented 5 years ago

Elbek Kamoliddinov (migrated from JIRA)

Happy to address any feedback on the last patch.

asfimport commented 5 years ago

Atri Sharma (@atris) (migrated from JIRA)

I will take a look today