castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Problem with indexing ACLAnthology #2109

Closed mobehbooei closed 1 year ago

mobehbooei commented 1 year ago

Hi @ygorg and @aryamancodes, I tried to follow https://github.com/castorini/anserini/blob/master/docs/acl-anthology.md, but something seems to be missing from the instructions. First, I followed the Solr deployment instructions in https://github.com/castorini/anserini/blob/master/docs/solrini.md . I then ran python bin/create_hugo_yaml.py successfully and have the generated files in the acl-anthology/build/data/ directory. However, when I tried to run the sh src/main/resources/solr/setup/acl-anthology.sh step, I found that none of the previous steps had created the solr/setup directory, and I don't have the acl-anthology.sh file anywhere!

(I hit the same problem when following https://github.com/castorini/anserini/blob/master/docs/solrini.md , at the step that runs pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd)

Am I missing some part of the instructions? I was wondering if you could help me through this?

aryamancodes commented 1 year ago

Hi @mobehbooei, let me try to set up Solr and get back to you on that issue. In the meantime, you can build an index for the ACL Anthology collection using Pyserini as described in this guide. A sample of building the index using Pyserini can also be seen in the "Steps to reproduce" section of this issue.

ygorg commented 1 year ago

I think this is because the README is not up to date. Since Anserini moved to Lucene 9.3 on 02/08/22 (https://github.com/castorini/anserini/commit/27256551e958f39495b04e89ef55de9d27f33414), support for Elasticsearch and Solr has been dropped (https://github.com/castorini/anserini/pull/1951). Instead of the instructions you followed, you need to create the index using target/appassembler/bin/IndexCollection, without Solr being involved. Alternatively, you may be able to clone Anserini from right before 02/08/2022 (or anserini-0.14.4) while keeping the latest relevant source file for ACL indexing (see https://github.com/castorini/anserini/pull/2084). I have not tried that but was thinking of it, so please let me know if it works!

lintool commented 1 year ago

Hi @mobehbooei - yes, support for Solr has been dropped, so we should go back to direct Anserini indexing. The commands on this issue should work: https://github.com/castorini/anserini/issues/2069

i.e.,

python -m pyserini.index -collection AclAnthology -generator AclAnthologyGenerator -threads 8 -input build/data/ -index index/lucene-index-acl-paragraph -storePositions -storeDocvectors -storeContents -storeRaw -optimize

Can you please try it out and then update this page accordingly? https://github.com/castorini/anserini/blob/master/docs/acl-anthology.md

Send PR directly please.
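As a rough illustration of the indexing command quoted above, the sketch below assembles the same arguments as a Python list so that paths and thread count can be varied without retyping the flags. The helper name build_index_command is hypothetical and not part of pyserini; the flags are copied verbatim from the command in this comment.

```python
import sys


def build_index_command(input_dir, index_dir, threads=8):
    """Return the argv list for the pyserini indexing command quoted in this thread.

    Hypothetical helper: only the flags come from the thread, the function does not
    exist in pyserini.
    """
    return [
        sys.executable, "-m", "pyserini.index",
        "-collection", "AclAnthology",
        "-generator", "AclAnthologyGenerator",
        "-threads", str(threads),
        "-input", input_dir,
        "-index", index_dir,
        "-storePositions", "-storeDocvectors", "-storeContents", "-storeRaw",
        "-optimize",
    ]


cmd = build_index_command("build/data/", "index/lucene-index-acl-paragraph")

# To actually launch the indexer (requires pyserini and a Java runtime):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if the input or index path contains spaces.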

mobehbooei commented 1 year ago

Hi everyone @ygorg @aryamancodes @lintool - thanks for the responses. I tried the pyserini approach, but I still have the same issues as in #2069. I get this error first:

2023-05-01 16:36:51,108 ERROR [main] collection.AclAnthology (AclAnthology.java:60) - Unable to open volumes.yaml

and then lots of this error:

2023-05-01 16:36:51,582 ERROR [pool-2-thread-6] index.IndexCollection$LocalIndexerThread (IndexCollection.java:348) - pool-2-thread-6: Unexpected Exception:
java.lang.NullPointerException: null
        at io.anserini.collection.AclAnthology$Document.<init>(AclAnthology.java:154) ~[anserini-0.21.0-fatjar.jar:?]
        at io.anserini.collection.AclAnthology$Segment.readNext(AclAnthology.java:115) ~[anserini-0.21.0-fatjar.jar:?]
        at io.anserini.collection.FileSegment$1.hasNext(FileSegment.java:136) ~[anserini-0.21.0-fatjar.jar:?]
        at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:287) [anserini-0.21.0-fatjar.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
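The first error in the log above reports that volumes.yaml could not be opened, so one cheap check before indexing is whether that file is actually present in the directory passed to -input. The sketch below is only a pre-flight check under that assumption; the helper name missing_inputs is hypothetical, and the list of required files contains only volumes.yaml because that is the only file name the log mentions.

```python
from pathlib import Path


def missing_inputs(input_dir, required=("volumes.yaml",)):
    """Return the names in `required` that are not regular files in input_dir.

    Hypothetical helper: the default list holds only the file named in the error log.
    """
    root = Path(input_dir)
    return [name for name in required if not (root / name).is_file()]


missing = missing_inputs("build/data/")
if missing:
    print("Missing from build/data/:", ", ".join(missing))
```

An empty result only shows that the file exists; it does not show that the indexer can parse it.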

I tried this and this to prevent YAML from creating aliases, which can't be parsed by Anserini, but I still get the same error! I am using WSL on Windows and installed the pyserini package according to this detailed version. @aryamancodes, as you mentioned here, it worked for you, so do you have any idea what my problem is? Thanks.
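To see whether the generated files still contain YAML anchors or aliases after such a change, a heuristic scan like the one below may help. It assumes the files were written by a dumper that uses PyYAML-style generated names (anchors written as &id001, aliases as *id001); the function name find_alias_lines is hypothetical, and a quoted string that happens to contain such a token would be reported as a false positive.

```python
import re

# PyYAML names generated anchors id001, id002, ...; an anchor is written as
# "&id001" and a later reference (alias) to it as "*id001".
ALIAS_TOKEN = re.compile(r"(?:^|[\s\[{,])([&*]id\d{3,})\b")


def find_alias_lines(text):
    """Return (line_number, token) pairs for lines containing an anchor or alias token."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = ALIAS_TOKEN.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


# Example use on the generated files (path taken from the thread):
#   from pathlib import Path
#   for path in Path("build/data").rglob("*.yaml"):
#       for number, token in find_alias_lines(path.read_text(encoding="utf-8")):
#           print(f"{path}:{number}: {token}")
```

Only the first token on each line is reported, which is enough to tell whether a file is affected.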

ygorg commented 1 year ago

You might not have the latest version of the AclAnthology.java file, because in the latest version the error message is more verbose. Try updating the file or cloning the latest version of Anserini.
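One way to tell which Anserini build produced a stack trace is to read the version out of the fatjar file name that appears in each frame. The sketch below does that with a regular expression; the function name anserini_versions is hypothetical, and the sample line is copied from the trace earlier in this thread.

```python
import re

# Matches jar names such as "anserini-0.21.0-fatjar.jar" and captures the version.
JAR_VERSION = re.compile(r"anserini-(\d+(?:\.\d+)*)-fatjar\.jar")


def anserini_versions(log_text):
    """Return the sorted, de-duplicated Anserini versions named in fatjar file names."""
    return sorted(set(JAR_VERSION.findall(log_text)))


trace_line = (
    "at io.anserini.collection.AclAnthology$Document.<init>"
    "(AclAnthology.java:154) ~[anserini-0.21.0-fatjar.jar:?]"
)
print(anserini_versions(trace_line))  # ['0.21.0']
```

Comparing that version against the latest Anserini source shows whether the installed jar could predate a fix.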

mobehbooei commented 1 year ago

Thanks @ygorg. That worked. I needed the development installation of pyserini to get the latest versions.

lintool commented 1 year ago

Closing - ref: https://github.com/castorini/anserini/pull/2126 and https://github.com/castorini/pyserini/pull/1537