castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

How to set maxClauseCount #745

Closed abhirut closed 4 years ago

abhirut commented 5 years ago

Hi,

I get the following error for longer queries -

org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024

Is this a parameter I can set while building the index? Suggestions for getting around this apart from query truncation?

lintool commented 5 years ago

This is an underlying Lucene setting...

As for suggestions, if you could describe what you're trying to do, that would help.

Finally -

If you've found Anserini to be helpful, we have a simple request for you to contribute back. In the course of replicating baseline results on standard test collections, please let us know if you're successful by sending us a pull request with a simple note, like what appears at the bottom of the Robust04 page. Replicability is important to us, and we'd like to know about successes as well as failures. Since the regression documentation is auto-generated, pull requests should be sent against the raw templates. In turn, you'll be recognized as a contributor.

Please consider contributing.

abhirut commented 5 years ago

Sure, thank you.

I have an index of documents which I'm trying to query with really long questions. A snippet from my tsv file with queries -

109846  I have just installed Kubuntu on ASUS U46E BAL7 . The installation went fine and I am able to use the desktop . But it seems to recognize only one monitor at a time , either the laptop or the secondary one and not both . Besides this the resolution being shown is far less than the capability of either monitors .     I was able to generate a xorg.conf file , but that seems to be mostly empty . I am skeptical about editing it and screwing up what 's working now . Is there a way to dump existing configuration that kubuntu is using into a file , so I can work by modifying those settings as opposed to writing everything by hand ?   Please find my xorg.conf file attached below .    Thanks million in advance .    Section " ServerLayout "   Identifier      " X.org Configured "   Screen       0   " Screen0 " 0 0   Screen       1   " Screen1 " RightOf " Screen0 "   Screen       2   " Screen2 " RightOf " Screen1 "   InputDevice     " Mouse0 " " CorePointer "   InputDevice     " Keyboard0 " " CoreKeyboard " EndSection   Section " Files "   ModulePath    " /usr / lib / xorg / modules "   FontPath      " /usr / share / fonts / X11/misc "   FontPath      " /usr / share / fonts / X11/cyrillic "   FontPath      " /usr / share / fonts / X11/100dpi/:unscaled "   FontPath      " /usr / share / fonts / X11/75dpi/:unscaled "   FontPath      " /usr / share / fonts / X11/Type1 "   FontPath      " /usr / share / fonts / X11/100dpi "   FontPath      " /usr / share / fonts / X11/75dpi "   FontPath      " /var / lib / defoma / x - ttcidfont - conf.d / dirs / TrueType "   FontPath      " built - ins " EndSection   Section " Module "   Load   " glx "   Load   " dbe "   Load   " dri "   Load   " record "   Load   " dri2 "   Load   " extmod " EndSection   Section " InputDevice "   Identifier   " Keyboard0 "   Driver       " kbd " EndSection   Section " InputDevice

As you can imagine some of these questions can be over 1024 tokens, and the command I use for retrieval -

nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index <index_loc> -topics <tsv_file_location> -output ./askubuntu/train.run -hits 10 -bm25 &

Stops retrieving when it encounters this error.

My questions are -

  1. Is there any way to change this Lucene setting?
  2. Else, is there a flag I can give to SearchCollection to ignore errors and continue processing the rest of the file?

Thank you for your super quick reply.

lintool commented 5 years ago

It's unlikely that such long queries will give you good quality results, so I think your best bet is to shorten the queries. Simple truncation is a reasonable baseline. Better would be term selection (keyphrase extraction) by some algorithm; tf.idf is a good start.
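
A minimal sketch of the truncation baseline, assuming a two-column TSV (id, tab, query) like the snippet earlier in this thread. The function names and the whitespace tokenizer are illustrative and not part of Anserini. Lucene's analyzer may split text differently from `str.split`, so a cap somewhat below 1024 is probably safer in practice.

```python
import io

MAX_TERMS = 1024  # default maxClauseCount reported in the error message


def truncate_query(text, max_terms=MAX_TERMS):
    """Keep at most max_terms whitespace-separated tokens, in original order."""
    tokens = text.split()
    return " ".join(tokens[:max_terms])


def truncate_topics(src, dst, max_terms=MAX_TERMS):
    """Copy an (id, query) TSV stream from src to dst, truncating each query."""
    for line in src:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        qid, _, query = line.partition("\t")
        dst.write(f"{qid}\t{truncate_query(query, max_terms)}\n")


if __name__ == "__main__":
    # Tiny in-memory demo; real use would pass open file handles instead.
    demo = io.StringIO("109846\tI have just installed Kubuntu on ASUS U46E\n")
    out = io.StringIO()
    truncate_topics(demo, out, max_terms=5)
    print(out.getvalue(), end="")
```

The truncated file can then be passed to SearchCollection through the same -topics option used in the command above.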

abhirut commented 5 years ago

Hi @lintool I completely agree! This argument is in fact the basis for my current research. I also wanted to study the drift in results from such long queries.

lintool commented 5 years ago

Possibly related: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-423

lintool commented 5 years ago

Having seen no follow up, I'm closing this issue.

rodrigonogueira4 commented 4 years ago

@lintool, could we please reopen this issue? I'm working on a task in which more query terms lead to higher effectiveness, but I couldn't find a way to increase maxClauseCount in anserini/pyserini.

lintool commented 4 years ago

hey @rodrigonogueira4 what do you need? Is this an issue of exposing the setting in Anserini/Pyserini, or is this a fundamental Lucene limitation?

rodrigonogueira4 commented 4 years ago

It is an issue in exposing the setting in Anserini/Pyserini. It would also work if the default is set to a larger number, such as 4096.

I wonder why Lucene has a limit for this (optimization, perhaps?).
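
For readers who hit the same limit: the clause limit is a global search-time setting in Lucene, not something chosen when the index is built. To my knowledge it is changed through the static `BooleanQuery.setMaxClauseCount(int)` in Lucene 8 and earlier, and through the static `IndexSearcher.setMaxClauseCount(int)` from Lucene 9 on. The sketch below is a hypothetical helper (not an Anserini or Pyserini API) that raises the limit on whichever class is passed in. The `FakeSearcher` stand-in exists only so the sketch runs without a JVM.

```python
def raise_clause_limit(lucene_class, limit=4096):
    """Raise the global clause limit to `limit` if it is currently lower.

    `lucene_class` is any object exposing static getMaxClauseCount() and
    setMaxClauseCount(int), e.g. a Java class proxy obtained via pyjnius.
    Returns the limit in effect after the call.
    """
    if lucene_class.getMaxClauseCount() < limit:
        lucene_class.setMaxClauseCount(limit)
    return lucene_class.getMaxClauseCount()


class FakeSearcher:
    """Stand-in for illustration only; this is NOT a Lucene class."""

    _limit = 1024  # mirrors Lucene's default

    @staticmethod
    def getMaxClauseCount():
        return FakeSearcher._limit

    @staticmethod
    def setMaxClauseCount(value):
        FakeSearcher._limit = value


if __name__ == "__main__":
    print(raise_clause_limit(FakeSearcher, 4096))
```

With Pyserini, one would presumably pass `autoclass('org.apache.lucene.search.IndexSearcher')` obtained from `pyserini.pyclass` before issuing any searches. That wiring is an untested assumption and depends on the Lucene version bundled with the installed release.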

lintool commented 4 years ago

Do you want to send a PR?

rodrigonogueira4 commented 4 years ago

Sure, I will start working on it.

lintool commented 4 years ago

Not having heard any follow up, closing issue. Please re-open otherwise.

ArvinZhuang commented 5 months ago

Hi @lintool, I'm also running into this error with the latest Pyserini. I think it will occur more often as people use learned sparse methods like SPLADE (that's what I'm doing now), since queries can have many expanded terms.

lintool commented 5 months ago

hey @ArvinZhuang which SPLADE model?

ArvinZhuang commented 5 months ago

Hi @lintool. It is my own trained SPLADE model :) I currently keep only the top-k tokens to avoid this issue, but in my case a larger k is better, and at some point this error occurs.
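
The top-k workaround described here can be sketched as follows, assuming the expanded query is available as a mapping from token to weight. The function name and the example weights are made up for illustration and do not come from any SPLADE implementation.

```python
import heapq


def top_k_terms(term_weights, k):
    """Keep the k highest-weighted terms of a sparse query vector."""
    if k <= 0:
        return {}
    # nlargest returns all items (sorted) when k exceeds the vector size.
    kept = heapq.nlargest(k, term_weights.items(), key=lambda kv: kv[1])
    return dict(kept)


if __name__ == "__main__":
    weights = {"ubuntu": 2.1, "monitor": 1.7, "display": 0.9, "screen": 0.4}
    print(top_k_terms(weights, 2))
```

Keeping k at or below the clause limit (1024 by default) avoids the exception, at the cost of dropping the lowest-weighted expansion terms.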

lintool commented 5 months ago

But if you're bumping into this limit, then it probably means your model isn't very efficient?

Sounds like for this use case you really want score-at-a-time?

ArvinZhuang commented 4 months ago

yeah, my splade is not that 'sparse'...

yanran-tang commented 3 months ago

Hi @lintool, the same error occurred when I was using Pyserini to run BM25 search with extremely long documents (5k words on average), such as legal documents. Is there any way to resolve this error or remove the length limit? Thanks :)

lintool commented 3 months ago

hi @yanran-tang this is a limit on queries, right? So if you're bumping up against this limitation, you might want to consider your approach... do you really need queries that are so long?

yanran-tang commented 3 months ago

Hi @lintool, thanks for your reply, and I agree with you. Unfortunately, this is the baseline setting and the reviewers are asking for it. Any help with this setting would be greatly appreciated. Thank you.

lintool commented 3 months ago

What I would do: sort terms by some simple statistic (e.g., tf-idf) and truncate at the limit. Explain that it's a technical limitation of the system. This is no different from truncating passages due to context window limitations of BERT.
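
A minimal sketch of that suggestion, assuming document frequencies are already available as a plain dictionary. The `doc_freq` and `num_docs` inputs, the smoothed idf formula, and the function name are illustrative choices; in practice the statistics would come from the index.

```python
import math
from collections import Counter


def select_terms(query_tokens, doc_freq, num_docs, limit=1024):
    """Return at most `limit` distinct terms, highest tf-idf first."""
    tf = Counter(query_tokens)

    def score(term):
        df = doc_freq.get(term, 0)
        # Smoothed idf so unseen terms do not divide by zero.
        idf = math.log((num_docs + 1) / (df + 1)) + 1.0
        return tf[term] * idf

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:limit]


if __name__ == "__main__":
    tokens = "the the xorg monitor".split()
    df = {"the": 1000, "xorg": 3, "monitor": 50}
    print(select_terms(tokens, df, num_docs=1000, limit=2))
```

Note that this returns distinct terms, so repeated query terms collapse to a single entry and their within-query frequency only influences the ranking.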

yanran-tang commented 3 months ago

Thank you for your suggestion:)