apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

Add search timeout support to Lucene [LUCENE-997] #2073

Closed asfimport closed 16 years ago

asfimport commented 17 years ago

This patch is based on Nutch-308.

This patch adds support for a maximum search time limit. After this time is exceeded, the search thread is stopped, partial results (if any) are returned and the total number of results is estimated.

This patch tries to minimize the overhead related to time-keeping by using a version of safe unsynchronized timer.

This was also discussed in an e-mail thread. http://www.nabble.com/search-timeout-tf3410206.html#a9501029


Migrated from LUCENE-997 by Sean Timm, 3 votes, resolved Feb 12 2008 Attachments: HitCollectorTimeoutDecorator.java, LuceneTimeoutTest.java (versions: 2), MyHitCollector.java, timeout.patch (versions: 8), TimerThreadTest.java Linked issues:

asfimport commented 17 years ago

Sean Timm (migrated from JIRA)

Patch against trunk revision 575451.

asfimport commented 17 years ago

Sean Timm (migrated from JIRA)

Simple test case. Run by passing in the index directory as an argument.

asfimport commented 17 years ago

Sean Timm (migrated from JIRA)

Here are some additional details on the changes.

New files: TimeLimitedCollector.java

Extends HitCollector and detects timeouts resulting in a TimeLimitedCollector.TimeExceeded exception being thrown.

TimerThread.java

TimerThread provides a pseudo-clock service to all searching threads, so that they can count elapsed time with less overhead than repeatedly calling System.currentTimeMillis.  A single thread should be created to be used for all searches.

Modified Files: Hits.java

Added partial result flag.

IndexSearcher.java

Catches TimeLimitedCollector.TimeExceeded, sets partial results flag on TopDocs and estimates the total hit count (if we hadn't timed out partway through).  Returns TopDocs with partial results.

Searcher.java

Added methods to set and get the timeout parameters.  This implementation decision has the limitation of only permitting a single timeout value per Searcher instance (of which there is usually only one).  However, this greatly minimizes the number of search methods that would need to be added.  In practice, I have not needed the functionality to change the timeout settings on a per query basis.

TopFieldDocCollector.java

Uses TimeLimitedCollector functionality.

TopDocCollector.java

Uses TimeLimitedCollector functionality and exposes it to child class TopFieldDocCollector.

TopDocs.java

Added partial results flag.  Note, TopFieldDocs extends this class and inherits the new functionality.
asfimport commented 17 years ago

Daniel Naber (migrated from JIRA)

Thanks for the patch. I didn't have a very close look, just one small thing: it's probably no good idea to catch and ignore the InterruptedException. See http://www-128.ibm.com/developerworks/java/library/j-jtp05236.html

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I'm not entirely convinced it makes sense to modify these classes to include timeouts as core functionality ... would it make more sense to deal with this in subclasses of IndexSearcher/TopDocs/Hits ?

asfimport commented 17 years ago

Sean Timm (migrated from JIRA)

http://www-128.ibm.com/developerworks/java/library/j-jtp05236.html

TimerThread Now follows Brian Goetz's best practice for a noncancelable task that restores interrupted status before returning rather than ignoring the InterruptedException.

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

Two issues are addressed in this latest patch:

1) Timeout support was not added to: public TopFieldDocs search(Weight weight, Filter filter, final int nDocs, Sort sort)

2) getCounter() in TimerThread was replaced by getMilliseconds()

asfimport commented 16 years ago

Doron Cohen (migrated from JIRA)

I think this is a nice feature to have.

But I am not sure about the propsed API - the application creates a TimerThread, starts it, and the timer thread is then passed to the searcher with setTimeOut(timer,ticks). Not so simple.

I think my preference for the API and implementation would be in HitCllector.collect() - in other words, we consider this new feature as an advanced one, and so only allow applications to provide their "timed" hitCollector. The modified collect() would either throw a TimeoutException or return a timedOut indication. If this is a (subclass of) RunTimeException (thuogh I am not crazy about this alternative) then there's no API change (a plus) but we need to verify that the code below propagates the RuntimeException gracefully and closes all the streams and everything (which I believe it does with all last careful changes by Mike and Michael). If RuntimeEXception is not acceptable, then this is an API change (a minus) and also many (simple) changes will be required in scorers (callers to collect).

The application's timedColletor will have all the logic in that collector for both announcing and detecting the timeout. Next we can add a TimedCollector for the benefit of applications, and last, consider adding search() methods with timeOut, but I doubt that last step.

asfimport commented 16 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> allow applications to provide their "timed" hitCollector

+1

asfimport commented 16 years ago

Lance Norskog (migrated from JIRA)

I just requested a more fancy feature in the Solr Jira. My apologies, I did not think to search the Lucene Jira.

1) timeout: stop searching after N milliseconds and return results using only those hits already found 2) hit limit: stop searching after N milliseconds and return results using only those hits already found 3) ram limit: estimate the amount of ram used so far and stop searching at a given amount

Here is the complete request:

asfimport commented 16 years ago

Lance Norskog (migrated from JIRA)

I stumbled above; I do not yet know Jira :) The Solr code is SOLR-392.

This request is inspired by a public search engine with millions of records. There are three different aspects mentioned above that can cause a query to "go rogue": timing out, finding too many records to give a truly useable result, and using up too much memory. The point is that if a search is going to find 14 million hits, Google does not go and tally them. It stops quickly and estimates how many might remain. I would like to have similar control.

The HitCollector implementation mentioned above would allow all three of these control options. If they could be pipelined together we could use any or all of them.

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

> I think my preference for the API and implementation would be in HitCollector.collect()

This would be simpler, but I don't see how it would be possible to estimate the total number of results and return partial results in that case. I think that is an important feature.

If the concern is complexity for the application, perhaps it is possible to hide the TimerThread altogether. The TimerThread could be created and started via a searcher setTimeOut(tick, numTicks) method.

To simplify it further, ticks could be fixed at a reasonable number, e.g., 100 ms, and a timeout in milliseconds could be passed in: setTimeout(milliseconds).

asfimport commented 16 years ago

Doron Cohen (migrated from JIRA)

TimerThread provides a pseudo-clock service to all searching threads, so that they can count elapsed time with less overhead than repeatedly calling System.currentTimeMillis. A single thread should be created to be used for all searches.

Is this really faster than calling System.currentTimeMillis()? I quick searched but found no references supporting this. This one says the opposite: http://www.devx.com/Java/Article/28685 Because if this is not the case, you could do without the TimerThread?

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think one benefit of a dedicated timer thread is not being affected by clock shift on the machine. System.currentTimeMillis() is not guaranteed to be monotonic. System.nanoTime() (1.5 only!) tries to be (I think), but it's still not guaranteed.

Without a monotonic clock, if the clock shifts forward then it could timeout in-flight queries (way) too early.

But: what happens when TimerThread overflows the int (a 2*1024*1024*1024)? Is it the caller's job to deal with the wraparound?

asfimport commented 16 years ago

Doron Cohen (migrated from JIRA)

Nice, I didn't think of this.

So with this understanding the timer thread (with long vs int) makes sense while in Java 1.4, then in 1.5 System.nanoTime will do.

The suggested patch relied on collect() being called, and so if a scorer takes long going over all the posting lists but fails to find a single match after the time passed, the search operation will not be stopped. I guess it is a fair assumption that this would be very rare... (so would be a system clock shift... : - ) )

Also important to understand is what happens with IO resources once search is aborted with timeout exception. Current patch does not close the underlying streams (I mean IndexInput clones). I think this is ok, because once the search is aborted and there are no more references to the weights&scorers, the IndexInput clones would be eventually garbage collected. Others?

asfimport commented 16 years ago

Nadav Har'El (migrated from JIRA)

I'd like to add my 2 cents on this issue.

The more I use Lucene in various ways, the more I'm getting convinced that the "Hits" API should be de-emphasized (if not outright depracated), and users should be encouraged to use the HitCollector API (search() taking a hitcollector, TopDocCollector, and so on) - especially for advanced usage.

I believe that your TimeLimitedCollector is an interesting idea. However, there is simply no justification to change TopDocCollector, TopFieldDocCollector, Topdocs, Hits and Searcher. There's a MUCH simpler, and in my opinion cleaner, approach: Just make TimeLimitedCollector be a front-end for another collector (for example, TopDocCollector) which will be passed to TimeLimitedCollector's constructor. For every document that is presented to TimeLimitedCollector's collect(), it would call the inner collector's collect().

This way, without any changes to Lucene's core APIs, and just the addition of a new class, you can add the new functionality that you wanted. This, I think, is the beauty of the HitCollector interface over the "monolithic" Hits approach, and I think we should encourage this way of thinking instead of adding more and more features to Hits.

asfimport commented 16 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

make TimeLimitedCollector be a front-end for another collector (for example, TopDocCollector) which will be passed to TimeLimitedCollector's constructor.

+1

asfimport commented 16 years ago

Doron Cohen (migrated from JIRA)

Sean, can you revise your patch to follow the suggestions above?

That is, create a TimeLimitedCollector that takes and Collector parameter for its constructor. You should be able to hide all the TimerThread details (use long instead of int) within the implementation of this new collector class, and so when moving to Java5 the TimeThread can be replaced by nanoTime.

Then we can add to either core search or under contrib.

On a related point - I am not happy with programming by a runtimException - if others agree that this should become a standard functionality, how about modifying Collector.collect() to return a stop-or-continue indicator?

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

Thanks for all of the feedback. I'll take another stab at this. I'm on vacation now and probably won't have time until I get back. It'll probably be early Jan. before I have a new patch ready.

asfimport commented 16 years ago

Timo Nentwig (migrated from JIRA)

IMHO it definitely should be part of the core. Being able to control the runtime of queries/your ressources is crucial for every live system and I really wonder this it has taken so long to address this.

Otherwise I totally agree with Navdav: that Hits thingie is nice and fine for simple full-text queries but as soon as things become somewhat more complex you don't get around writing your own HitCollector (and do stuff like Facets).

I also strongly agree that the timeout HC should be implemented as an decorator (what's been called "front-end" here), I just quickly wrote an example and attached it (no, I'm not happy throwing an runtime exception either):

MyHitCollector hc = new MyHitCollector(); s.search(q, null, HitCollectorTimeoutDecorator.decorate( hc, 10 ) );

And finally, why return partial results? I don't think that this is reasonable.

BTW I'm not sure whether volatile in the timer thread is really working reliably in 1.4...

asfimport commented 16 years ago

Timo Nentwig (migrated from JIRA)

Example of the timeout HitCollector implemented as an decorator.

asfimport commented 16 years ago

Timo Nentwig (migrated from JIRA)

Example HitCollector to be decorated by HitCollectorTimeoutDecorator.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

By using only a HitCollector it cannot be guaranteed that searches will not take too long. The reason is that there are searches that take a long time but do not collect any docs.

For example consider the case where every even doc has term A and every odd doc has term B, and the query requires both A and B. This is going to take an amount of time that is linear in the size of the index but no document will be collected.

This means that every conjunction (boolean queries with more than one required clause, phrase queries, span near queries) will need to check for timeout. Also a HitCollector with a timeout facility will need a way to be informed of a timeout when no document is collected.

asfimport commented 16 years ago

Timo Nentwig (migrated from JIRA)

True, unfortunately, but still better than nothing (->current situation). This approach isn't very precise in matters of timing either. Also, throwing a RuntimeException feels more like a hack than well thought out code...

I don't know Lucene's code good enough in order to estimate whether it's possible to build a real timeout machanism at all/without changing the API/rewriting a lot of code but it's incredibly important to be able to cancel running queries. You don't want to servers under high load suffering from lucene queries running up to multiple minutes at the same time consuming quite a lot of memory. And it makes no sense either because nobody is waiting so long for results...

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

Created a patch using Timo's HitCollectorTimeoutDecorator. This patch just has a few mostly minor changes. The biggest changes are that it uses a long instead of an int for the counter in TimerThread and the TimerThread interval is fixed at 10 ms. It also throws a TimeLimitedCollector.TimeExceeded exception on a timeout.

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

Updated bad patch with good copy.

Created a patch using Timo's HitCollectorTimeoutDecorator. This patch just has a few mostly minor changes. The biggest changes are that it uses a long instead of an int for the counter in TimerThread and the TimerThread interval is fixed at 10 ms. It also throws a TimeLimitedCollector.TimeExceeded exception on a timeout.

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

Updated to work with latest patch.

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

The TimerThreadTest illustrates the accuracy of the TimerThread under load. On my 2GHz Xeon 4 CPU dual core RH AS 4 Linux box, it get a 20% difference between the TimerThread implementation and System.currentTimeMillis() and is independent of load.

java TimerThreadTest 8 [...] 10010 12020 [...]

With my single core single CPU Windows XP laptop I see a 20% difference at load, but when adding additional threads, I see an increasing difference.

java TimerThreadTest 0 [...] 10000 11819 [...]

java TimerThreadTest 2 [...] 10040 18890 [...]

asfimport commented 16 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

I believe this version of the patch won't work properly unless you add synchronization - writes to long values are non-atomic (Java Language Specification 17.7, as the comment says), that's why Nutch uses an int there. Perhaps using AtomicLong would be an answer, if you really need a long value.

asfimport commented 16 years ago

Nicolas Lalevée (migrated from JIRA)

AtomicLong is a Java 1.5 feature, so it doesn't fit either.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

After #1662 more work will be needed to get all conjunctions in the same place, but it is a starting point.

Once all conjunctions are in the same place, it would make sense to add a timeout there.

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

Andrzej--

JLS 17.7 Non-atomic Treatment of double and long

"Writes and reads of volatile long and double values are always atomic."

asfimport commented 16 years ago

Andrzej Bialecki (@sigram) (migrated from JIRA)

Indeed, thanks for the correction - I forgot about the special treatment of volatile values.

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

You are right that I forgot to change the comment when I changed it from an int to a long though. "* updates to 32-bit-sized variables are atomic" is a pretty pointless comment now. :-)

asfimport commented 16 years ago

Timo Nentwig (migrated from JIRA)

@Sean: "The biggest changes are that it uses a long instead of an int for the counter in TimerThread" - didn't I use a volatile long already? :)

I hope this will become part of the next release. IMHO this is Priority Major or above...

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

In the timeout.patch, instead of this:

time += resolution;

I'd rather have this:

time = System.currentTimeMillis();

because all of the wait() methods can become unreliable, especially at high loads.

With (or without) this change, 100 msecs or even 200 msecs could be used as the update frequency instead of the current 10 msecs.

By computing the time out moment up front, one subtraction can be saved at each document collection. Then only TIMER_THREAD.getMilliseconds() method is needed at document collection time, and the getElapsedMilliseconds() method is superfluous.

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

@Timo: "didn't I use a volatile long already?" Indeed. I guess that wasn't a very big change then. ;-)

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

This is a minor update to timeout.patch which fixes the comment about updates to 32-bit-sized variables being atomic and instead talks about volatile longs, as pointed out by Andrzej. It also computes the time out moment up front to save a subtraction on each document collection as suggested by Paul.

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

I could go either way on the System.currentTimeMillis() versus a TimerThread issue. I agree nanoTime is the correct implementation when using 1.5.

It doesn't seem like on Linux running ntp it matters much either way. NTP tries to perform smoothing and makes clock changes slowly over a longer period of time when it can rather than have an abrupt change, but YMMV if your system is having clock issues. On a really overloaded Windows box, the TimerThread implementation will not behave well as demonstrated above. I can't speak to any other platforms.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

The idea of System.currentTimeMillis() is to guard against misbehaviour of the java wait() method and against unexpected delays because of thread scheduling during the jump back for the loop around the wait() call. One way to reduce such misbehaviour under heavy load is by increasing the scheduling priority of the timing thread, but I don't think that is necessary.

Also System.currentTimeMillis() is obviously correct, whereas timeout += resolution is never more than an assumption about correct wait() behaviour.

Clock changes by NTP are normally so slow that they don't really matter for query time outs.

asfimport commented 16 years ago

Doron Cohen (migrated from JIRA)

Sean, can you add a Junit test to timeout.patch?

I think such test should check (1) search correctness (regardless of timeout), (2) expected timeout behavior, and (3) some sanity test with multiple searching threads. This can also serve as an example for using this new collector.

Cheers. Doron

asfimport commented 16 years ago

Timo Nentwig (migrated from JIRA)

IIRC I did time=System.currentTimeMillis() first but was surprised how slow this method actally is.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

Would that still be a problem with a 200ms resolution?

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

Patch adds JUnit test cases as suggested by Doron.

asfimport commented 16 years ago

Timo Nentwig (migrated from JIRA)

200ms? No, probably not. I don't recall what resolution I used in my test but actually the timeout check took more time than the Lucene query...

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

Paul, I think that if we were to use System.currentTimeMillis(), we would eschew the TimerThread as Doron suggests in his Dec. 15 comment. I haven't seen any performance issues with System.currentTimeMillis().

As far as 200ms, I think that is too large of a default resolution (and with the current implementation it is not configurable). With a 200 ms resolution, a query with a 1 second time allowed could timeout in 800 ms, and one with a time allowed of 500 ms could timeout in 300 ms. I think it is much worse to timeout a query early than to timeout late.

asfimport commented 16 years ago

Doron Cohen (migrated from JIRA)

Sean thanks for adding the test.

In the attached I tightened the check of allowed elapsed time until timeout. Also added info in the exception, and added ability to modify the resolution - default is 20ms (was 5ms). Please let me know what you think.

As for System.currentTimeMillis() vs. Timer thread - IMHO Mike's comment on 'system clock changes' makes the timer thread favorable.

I checked this with up to 10,000 threads and with that number the test sometimes fails because it is quite tight on the max elapsed time required comparing to the timeout, so I don't see this is a problem. In the attached N_THREADS = 50 and this number of threads always passes for me.

If there are no more major concerns I think this is now ready to go in, question is where to - under core o.a.l.search or under contrib (query or misc). Others?

asfimport commented 16 years ago

Sean Timm (migrated from JIRA)

Doron, your comment for setResolution(long) says "The default timer resolution is 50 milliseconds", however, the default is actually 20 ms (public static final int DEFAULT_RESOLUTION = 20;). Other than that, everything looks great.

asfimport commented 16 years ago

Doron Cohen (migrated from JIRA)

Oh wrote comment that was before I decided to change the default... Thanks for catching this.

asfimport commented 16 years ago

Doron Cohen (migrated from JIRA)

Attached patch corrects default resolution comment.