abhirut closed this issue 4 years ago
This is an underlying Lucene setting...
As for suggestions, if you could describe what you're trying to do, that would help.
Finally -
If you've found Anserini to be helpful, we have a simple request for you to contribute back. In the course of replicating baseline results on standard test collections, please let us know if you're successful by sending us a pull request with a simple note, like what appears at the bottom of the Robust04 page. Replicability is important to us, and we'd like to know about successes as well as failures. Since the regression documentation is auto-generated, pull requests should be sent against the raw templates. In turn, you'll be recognized as a contributor.
Please consider contributing.
Sure, thank you.
I have an index of documents which I'm trying to query with really long questions. A snippet from my tsv file with queries -
109846 I have just installed Kubuntu on ASUS U46E BAL7 . The installation went fine and I am able to use the desktop . But it seems to recognize only one monitor at a time , either the laptop or the secondary one and not both . Besides this the resolution being shown is far less than the capability of either monitors . I was able to generate a xorg.conf file , but that seems to be mostly empty . I am skeptical about editing it and screwing up what 's working now . Is there a way to dump existing configuration that kubuntu is using into a file , so I can work by modifying those settings as opposed to writing everything by hand ? Please find my xorg.conf file attached below . Thanks million in advance . Section " ServerLayout " Identifier " X.org Configured " Screen 0 " Screen0 " 0 0 Screen 1 " Screen1 " RightOf " Screen0 " Screen 2 " Screen2 " RightOf " Screen1 " InputDevice " Mouse0 " " CorePointer " InputDevice " Keyboard0 " " CoreKeyboard " EndSection Section " Files " ModulePath " /usr / lib / xorg / modules " FontPath " /usr / share / fonts / X11/misc " FontPath " /usr / share / fonts / X11/cyrillic " FontPath " /usr / share / fonts / X11/100dpi/:unscaled " FontPath " /usr / share / fonts / X11/75dpi/:unscaled " FontPath " /usr / share / fonts / X11/Type1 " FontPath " /usr / share / fonts / X11/100dpi " FontPath " /usr / share / fonts / X11/75dpi " FontPath " /var / lib / defoma / x - ttcidfont - conf.d / dirs / TrueType " FontPath " built - ins " EndSection Section " Module " Load " glx " Load " dbe " Load " dri " Load " record " Load " dri2 " Load " extmod " EndSection Section " InputDevice " Identifier " Keyboard0 " Driver " kbd " EndSection Section " InputDevice
As you can imagine some of these questions can be over 1024 tokens, and the command I use for retrieval -
nohup target/appassembler/bin/SearchCollection -topicreader Tsv -index <index_loc> -topics <tsv_file_location> -output ./askubuntu/train.run -hits 10 -bm25 &
Stops retrieving when it encounters this error.
My questions are -
Is there a way for SearchCollection to ignore errors and continue processing the rest of the file?
Thank you for your super quick reply.
It's unlikely that such long queries will give you good quality results, so I think your best bet is to shorten the queries. Simple truncation is a reasonable baseline. Better would be term selection (keyphrase extraction) by some algorithm; tf-idf is a good start.
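A minimal sketch of the simple-truncation baseline, assuming the topics file holds one `id<TAB>query` pair per line and that whitespace tokens are a close enough proxy for the terms Lucene ends up with after analysis (they are not identical). The 1024 limit is taken from the figure mentioned earlier in this thread; `truncate_query` and `truncate_topics` are hypothetical helper names, not part of Anserini or Pyserini.

```python
import io

# Assumed limit, based on the "over 1024 tokens" remark in this thread.
MAX_TERMS = 1024


def truncate_query(text, max_terms=MAX_TERMS):
    """Keep only the first max_terms whitespace-separated tokens."""
    return " ".join(text.split()[:max_terms])


def truncate_topics(src, dst, max_terms=MAX_TERMS):
    """Copy id<TAB>query lines from src to dst, truncating each query."""
    for line in src:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        qid, tab, text = line.partition("\t")
        if not tab:
            dst.write(line + "\n")  # no tab found: pass the line through unchanged
            continue
        dst.write(f"{qid}\t{truncate_query(text, max_terms)}\n")


if __name__ == "__main__":
    # Tiny in-memory demo; real use would pass open file handles instead.
    src = io.StringIO("109846\tI have just installed Kubuntu on ASUS\n")
    dst = io.StringIO()
    truncate_topics(src, dst, max_terms=4)
    print(dst.getvalue(), end="")
```

The truncated file can then be passed to `-topics` in place of the original. Because the analyzer may split or drop tokens differently, leaving some headroom below the limit is probably wise.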
Hi @lintool I completely agree! This argument is in fact the basis for my current research. I also wanted to study the drift in results from such long queries.
Having seen no follow up, I'm closing this issue.
@lintool, could we please reopen this issue? I'm working on a task in which more query terms lead to higher effectiveness, but I couldn't find a way to increase maxClauseCount in anserini/pyserini.
hey @rodrigonogueira4 what do you need? Is this an issue of exposing the setting in Anserini/Pyserini, or is this a fundamental Lucene limitation?
It is an issue in exposing the setting in Anserini/Pyserini. It would also work if the default is set to a larger number, such as 4096.
I wonder why Lucene has a limit for this (optimization, perhaps?).
Do you want to send a PR?
Sure, I will start working on it.
Not having heard any follow up, closing issue. Please re-open otherwise.
Hi @lintool, I've also run into this error with the latest Pyserini. I think this error will occur more often as people use learned sparse methods like SPLADE (that's what I'm doing now), since queries can have many expanded terms.
hey @ArvinZhuang which SPLADE model?
Hi @lintool. It is my own trained SPLADE model :) I currently keep only the top-k tokens to avoid this issue, but in my case a larger k is better, and at some point this error occurs.
But if you're bumping into this limit, then it probably means your model isn't very efficient?
Sounds like for this use case you really want score-at-a-time?
yeah, my splade is not that 'sparse'...
Hi @lintool, the same error also occurred when I was using Pyserini to do BM25 search with extremely long documents (5k words on average), such as legal documents. Is there any way to solve this error or remove the length limit? Thanks :)
hi @yanran-tang this is a limit on queries, right? So if you're bumping up against this limitation, you might want to consider your approach... do you really need queries that are so long?
Hi @lintool, thanks for your reply and I agree with you. But very unfortunately, it is the baseline setting and the reviewers are asking for this. If you could help with this setting, it will be of great help. Thank you.
What I would do: sort terms by some simple statistic (e.g., tf-idf) and truncate at the limit. Explain that it's a technical limitation of the system. This is no different from truncating passages due to context window limitations of BERT.
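One way to sketch the "sort by tf-idf and truncate" idea in Python. Everything here is an assumption for illustration: `select_terms` is a hypothetical helper, `doc_freq` is a plain dict of document frequencies that you would have to collect from your own index, and the smoothed idf formula is just one common variant, not necessarily what Lucene's BM25 uses.

```python
import math
from collections import Counter


def select_terms(query_tokens, doc_freq, num_docs, limit):
    """Return up to `limit` distinct query terms, highest tf-idf first.

    query_tokens: list of already-tokenized query terms
    doc_freq:     dict mapping term -> number of documents containing it (assumed input)
    num_docs:     total number of documents in the collection
    """
    tf = Counter(query_tokens)

    def score(term):
        # Smoothed idf; terms missing from doc_freq are treated as df = 0.
        idf = math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))
        return tf[term] * idf

    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:limit]


if __name__ == "__main__":
    tokens = ["the", "the", "xorg", "monitor", "the"]
    df = {"the": 1000, "xorg": 5, "monitor": 50}
    print(select_terms(tokens, df, num_docs=1000, limit=2))
```

Note that terms absent from `doc_freq` get the highest idf, so typos and other out-of-vocabulary tokens float to the top; filtering those out first may be preferable. The selected terms can be joined with spaces to form the shortened query.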
Thank you for your suggestion:)
Hi,
I get the following error for longer queries -
Is this a parameter I can set while building the index? Suggestions for getting around this apart from query truncation?