castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Unable to read topics: MIRACL_V10_FI_DEV java.io.IOException #1850

Closed wxywb closed 6 months ago

wxywb commented 6 months ago

I executed the following command:

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --language fi \
  --topics miracl-v1.0-fi-dev \
  --index miracl-v1.0-fi \
  --output run.miracl.bm25.fi.dev.txt2 \
  --bm25 --hits 1000

Traceback (most recent call last):
  File "/home/xuyu/anaconda3/envs/pyserini/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/xuyu/anaconda3/envs/pyserini/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/xuyu/anaconda3/envs/pyserini/lib/python3.10/site-packages/pyserini/search/lucene/__main__.py", line 152, in <module>
    query_iterator = get_query_iterator(args.topics, TopicsFormat(args.topics_format))
  File "/home/xuyu/anaconda3/envs/pyserini/lib/python3.10/site-packages/pyserini/query_iterator.py", line 187, in get_query_iterator
    return mapping[topics_format].from_topics(topics_path)
  File "/home/xuyu/anaconda3/envs/pyserini/lib/python3.10/site-packages/pyserini/query_iterator.py", line 104, in from_topics
    topics = get_topics(topics_path)
  File "/home/xuyu/anaconda3/envs/pyserini/lib/python3.10/site-packages/pyserini/search/_base.py", line 583, in get_topics
    topics = JTopicReader.getTopicsWithStringIds(topics_mapping[collection_name])
  File "jnius/jnius_export_class.pxi", line 876, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 1042, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: Unable to read topics: MIRACL_V10_FI_DEV java.io.IOException

I attempted to evaluate the Finnish ('fi') language in the MIRACL dataset, but encountered the error above. Could someone explain how Pyserini handles topics, and why the JVM might raise this exception? I ran the same evaluation for the Arabic ('ar') language, and it worked fine. Thank you.
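
For anyone trying to narrow this down: the traceback shows the failure inside `get_topics`, before any search runs, so it may help to call that step on its own. The following is only a minimal sketch. `try_load_topics` is a hypothetical helper name, and `miracl-v1.0-ar-dev` is assumed to be the Arabic counterpart of the Finnish topic name used in the command.

```python
def try_load_topics(name):
    """Return (True, topics) on success, or (False, error message) on failure."""
    try:
        # Imported lazily so that a missing Pyserini install or JVM problem
        # is reported as a message instead of crashing the script.
        from pyserini.search import get_topics
        topics = get_topics(name)
    except Exception as exc:  # also covers jnius.JavaException from the JVM side
        return False, f"{type(exc).__name__}: {exc}"
    return True, topics


if __name__ == "__main__":
    # Compare the failing topic set with one reported to work.
    # The Arabic name is an assumption, formed by analogy with the Finnish one.
    for name in ("miracl-v1.0-fi-dev", "miracl-v1.0-ar-dev"):
        ok, result = try_load_topics(name)
        print(name, "->", f"{len(result)} topics" if ok else result)
```

If only the Finnish set fails here, the problem is probably in reading that topics file rather than in the index or the search options.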

lintool commented 6 months ago

I just tried the command on master - works fine for me... what version are you on? A dev release? Or try v0.35.0?
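
A quick way to answer the version question, as a sketch using only the standard library (`installed_version` is a hypothetical helper name, and it assumes Pyserini was installed as the `pyserini` distribution in the active environment):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="pyserini"):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Run this inside the same conda environment that produced the traceback.
print(installed_version() or "pyserini is not installed in this environment")
```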

Reopen the issue if you're still having trouble?