castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

ClassCastException when indexing ACL Anthology #2069

Closed ygorg closed 1 year ago

ygorg commented 1 year ago

When following the "Indexing the ACL Anthology with Anserini" tutorial, the indexing step fails with the following exception (see AclAnthology.java:158):

java.lang.ClassCastException: class com.fasterxml.jackson.databind.node.TextNode cannot be cast to class com.fasterxml.jackson.databind.node.ArrayNode (com.fasterxml.jackson.databind.node.TextNode and com.fasterxml.jackson.databind.node.ArrayNode are in unnamed module of loader 'app')
    at io.anserini.collection.AclAnthology$Document.<init>(AclAnthology.java:158) ~[anserini-0.20.0-fatjar.jar:?]
    at io.anserini.collection.AclAnthology$Segment.readNext(AclAnthology.java:115) ~[anserini-0.20.0-fatjar.jar:?]
    at io.anserini.collection.FileSegment$1.hasNext(FileSegment.java:136) ~[anserini-0.20.0-fatjar.jar:?]
    at io.anserini.index.IndexCollection$LocalIndexerThread.run(IndexCollection.java:298) [anserini-0.20.0-fatjar.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]

There seems to have been a change in how venues are processed in acl-org/acl-anthology, which breaks Anserini's AclAnthology collection (io.anserini.collection.AclAnthology).

My use case is using castorini/covidex with ACL documents. My workaround was to index the BibTeX export of the ACL Anthology, but the text then contains a lot of braces and LaTeX markup, so I'd rather use the AclAnthology collection.

Steps to reproduce:

git clone https://github.com/acl-org/acl-anthology
conda create -n acl_anth python=3.8
conda activate acl_anth
cd acl-anthology
pip install -r bin/requirements.txt
python bin/create_hugo_yaml.py 

pip install pyserini
python -m pyserini.index -collection AclAnthology -generator AclAnthologyGenerator -threads 8 -input build/data/ -index index/lucene-index-acl-paragraph -storePositions -storeDocvectors -storeContents -storeRaw -optimize

However, everything works when the acl-anthology checkout dates from around the time the "Indexing the ACL Anthology with Anserini" tutorial was written:

git clone https://github.com/acl-org/acl-anthology
cd acl-anthology
git checkout -b same_date 9b3f001d2e705d6751118046643de71075836379
# 16/04/2020 acl-anthology commit 9b3f001d2e705d6751118046643de71075836379
# 07/04/2020 creation of tutorial in anserini https://github.com/castorini/anserini/blob/master/docs/acl-anthology.md

ygorg commented 1 year ago

It seems that the venue names are now stored in volume.get("venue"), and that volume.get("venues") is a YAML pointer (?) to volume.get("venue").
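
A minimal sketch of the defensive handling this suggests, written in Python for brevity and not as the actual Java fix in AclAnthology.java. It accepts a field that may hold either a single string or a list of strings, which is the kind of mismatch behind the TextNode/ArrayNode cast error. The function name `venue_names` and the lookup order ("venues" first, then "venue") are assumptions for illustration; the real layout should be checked against the YAML that create_hugo_yaml.py currently produces.

```python
from typing import Any, Dict, List


def venue_names(volume: Dict[str, Any]) -> List[str]:
    """Return venue names as a list, whichever shape the volume entry uses.

    Hypothetical helper: the key names and their order are assumptions,
    not taken from the acl-anthology schema.
    """
    for key in ("venues", "venue"):
        value = volume.get(key)
        if value is None:
            continue
        if isinstance(value, str):
            # A single venue stored as plain text: wrap it in a list.
            return [value]
        if isinstance(value, (list, tuple)):
            # Several venues stored as a sequence: copy them as strings.
            return [str(v) for v in value]
    # Neither key is present (or both are empty): no venues.
    return []
```

The same idea carries over to Jackson: check whether the node is textual or an array before reading it, instead of casting it directly.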

lintool commented 1 year ago

Hi @ygorg, thanks for your interest in Anserini, and apologies for the late reply on this. If you've figured out the fix, could you perhaps send a PR?