Status: Closed (closed by mobehbooei 1 year ago)
Hi @mobehbooei, let me try to set up Solr and get back to you on that issue. In the meantime, you can build an index for the ACL Anthology collection using Pyserini, as described in this guide. A sample of building the index with Pyserini can also be seen in the "Steps to reproduce" section of this issue.
I think this is because the README is not up to date: since Anserini moved to Lucene 9.3 on 02/08/22 (https://github.com/castorini/anserini/commit/27256551e958f39495b04e89ef55de9d27f33414), support for Elasticsearch and Solr was dropped (https://github.com/castorini/anserini/pull/1951).
Following the instructions you used, you need to create the index using target/appassembler/bin/IndexCollection, without Solr being involved.
You may be able to clone Anserini from right before 02/08/2022 (or use anserini-0.14.4) while keeping the latest relevant source file for ACL indexing (see https://github.com/castorini/anserini/pull/2084). I have not tried that myself but was thinking of it; please let me know if it works!
Hi @mobehbooei - yes, support for Solr has been dropped, so we should go back to direct Anserini indexing. The commands on this issue should work: https://github.com/castorini/anserini/issues/2069
i.e.,
python -m pyserini.index -collection AclAnthology -generator AclAnthologyGenerator -threads 8 -input build/data/ -index index/lucene-index-acl-paragraph -storePositions -storeDocvectors -storeContents -storeRaw -optimize
Can you please try it out and then update this page accordingly? https://github.com/castorini/anserini/blob/master/docs/acl-anthology.md
Send PR directly please.
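For anyone scripting this step, here is a minimal sketch that assembles the indexing command quoted above as an argument list. The flags and default paths are copied from that command; the helper name `build_index_command` and its parameters are hypothetical, introduced only for illustration.

```python
import shlex


def build_index_command(input_dir="build/data/",
                        index_dir="index/lucene-index-acl-paragraph",
                        threads=8):
    """Assemble the pyserini.index invocation quoted in this thread.

    build_index_command is a hypothetical helper, not part of Pyserini;
    the flags are copied verbatim from the command above.
    """
    return [
        "python", "-m", "pyserini.index",
        "-collection", "AclAnthology",
        "-generator", "AclAnthologyGenerator",
        "-threads", str(threads),
        "-input", input_dir,
        "-index", index_dir,
        "-storePositions", "-storeDocvectors", "-storeContents", "-storeRaw",
        "-optimize",
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before it is run.
    print(shlex.join(build_index_command()))
```

The list could then be handed to `subprocess.run(cmd, check=True)`, assuming it is run from the directory that contains `build/data/`.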
Hi everyone @ygorg @aryamancodes @lintool - thanks for the responses. I tried the Pyserini approach, but I still have the same issues as in #2069. I get this error first:
2023-05-01 16:36:51,108 ERROR [main] collection.AclAnthology (AclAnthology.java:60) - Unable to open volumes.yaml
and then lots of this error:
2023-05-01 16:36:51,582 ERROR [pool-2-thread-6] index.IndexCollection$LocalIndexerThread (IndexCollection.java:348) - pool-2-thread-6: Unexpected Exception:
java.lang.NullPointerException: null
at io.anserini.collection.AclAnthology$Document.<init>(AclAnthology.java:154) ~[anserini-0.21.0-fatjar.jar:?]
at io.anserini.collection.AclAnthology$Segment.readNext(AclAnthology.java:115) ~[anserini-0.21.0-fatjar.jar:?]
at io.anserini.collection.FileSegment$1.hasNext(FileSegment.java:136) ~[anserini-0.21.0-fatjar.jar:?]
at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:287) [anserini-0.21.0-fatjar.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
I tried this and this to prevent YAML from creating aliases, which Anserini can't parse, but I still get the same error! I am using WSL on Windows and installed the pyserini package following this detailed version. @aryamancodes, as you mentioned here, it worked for you, so do you have any idea what my problem is? tnx
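Since the first error in the log is "Unable to open volumes.yaml", a quick pre-flight check before indexing may help narrow things down. The sketch below only tests that the file exists and is non-empty. It assumes the file is expected directly inside the directory passed as -input (build/data/ in the command from this thread), which the log does not confirm; `missing_inputs` is a hypothetical helper name.

```python
from pathlib import Path


def missing_inputs(input_dir, required=("volumes.yaml",)):
    """Return the required file names that are absent or empty in input_dir.

    Only volumes.yaml is taken from the error log; looking for it directly
    under input_dir is an assumption.
    """
    root = Path(input_dir)
    missing = []
    for name in required:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_inputs("build/data/")
    if problems:
        print("Missing or empty:", ", ".join(problems))
    else:
        print("All required files found.")
```

A clean result here does not rule out a parsing problem inside the file; it only separates "file not found" from "file found but unreadable".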
You might not have the latest version of the AclAnthology.java file: in the latest version, the error message is more verbose. Try updating the file or cloning the latest version of Anserini.
Thanks @ygorg. That worked. It needed the Development Installation of pyserini to have the latest versions.
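To confirm which Anserini build is actually running, the jar name in the stack trace can be read off programmatically. This is a minimal sketch that assumes the jar follows the anserini-<version>-fatjar.jar naming seen in the trace above; `anserini_versions` is a hypothetical helper name.

```python
import re

# Matches jar names such as "anserini-0.21.0-fatjar.jar" and captures the
# part between "anserini-" and "-fatjar.jar" (assumed naming pattern).
JAR_PATTERN = re.compile(r"anserini-([\w.\-]+?)-fatjar\.jar")


def anserini_versions(log_text):
    """Return the sorted, de-duplicated fatjar versions named in a log."""
    return sorted(set(JAR_PATTERN.findall(log_text)))


if __name__ == "__main__":
    line = ("at io.anserini.collection.AclAnthology$Segment.readNext"
            "(AclAnthology.java:115) ~[anserini-0.21.0-fatjar.jar:?]")
    print(anserini_versions(line))
```

If more than one version shows up in the same log, that may indicate several jars on the classpath.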
Hi @ygorg and @aryamancodes, I tried to follow https://github.com/castorini/anserini/blob/master/docs/acl-anthology.md but it seems something is missing in the instructions. First, I followed the Solr deployment instructions in https://github.com/castorini/anserini/blob/master/docs/solrini.md . I executed
python bin/create_hugo_yaml.py
successfully and have the generated files in the acl-anthology/build/data/ directory. I then tried to run the
sh src/main/resources/solr/setup/acl-anthology.sh
step, but the solr/setup directory was not created in the previous steps, and I don't have the acl-anthology.sh file anywhere! The same problem occurs when I follow https://github.com/castorini/anserini/blob/master/docs/solrini.md and run
pushd src/main/resources/solr && ./solr.sh ../../../../solrini localhost:9983 && popd
Am I missing some part of the instructions? I was wondering if you could help me with this.