BERTserini
https://github.com/castorini/bertserini
Apache License 2.0

Lucene 10 (needs to be between 7 and 9): org.apache.lucene.index.IndexFormatTooNewException when using a self-created corpus #32

Open Reijarmo opened 2 years ago

Reijarmo commented 2 years ago

Hello all,

I tried to use BERTserini for question answering with a self-created corpus. The base example works perfectly (with transformers == 3.4.0), but I cannot find a solution for the Lucene problem. I know BERTserini depends on Lucene 8, while pyserini switched to Lucene 9 in its latest version, so I installed https://pypi.org/project/pyserini/0.16.0/ in a separate conda environment and created a new index with it, but the problem stays the same.

When I tried to build an index with the pyserini version that was installed along with BERTserini, I was stopped by "/home/user/anaconda3/envs/bertserini/bin/python: No module named pyserini.index.lucene". The only solution I found for that is upgrading pyserini, which isn't an option because of the base BERTserini problem.

Is there an easy way around this? Sorry if this is a stupid question, but as a psychologist I have a rather weak computer science background.

edit1: forgot to mention which command I used to create the index:

    python -m pyserini.index.lucene \
      --collection JsonCollection \
      --input tests/resources/sample_collection_jsonl \
      --index indexes/sample_collection_jsonl \
      --generator DefaultLuceneDocumentGenerator \
      --threads 1 \
      --storePositions --storeDocvectors --storeRaw
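
For anyone trying to reproduce this, a small check like the one below might help confirm which pyserini version each conda environment really has, and whether the `pyserini.index.lucene` module exists there. It is only a sketch: the helper names (`installed_version`, `module_available`, `describe_environment`) are made up for illustration, and it assumes the packages are installed under the distribution names `pyserini` and `bertserini`. It uses only the Python standard library and does not inspect the Lucene index itself.

```python
import importlib.util
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def module_available(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located in the current environment."""
    try:
        # find_spec imports the parent packages of a dotted name, so a missing
        # or broken parent raises; any such failure is treated as "not available".
        return importlib.util.find_spec(dotted_name) is not None
    except Exception:
        return False


def describe_environment() -> dict:
    """Collect the facts relevant to the version mismatch described above."""
    return {
        # Distribution names are assumptions; adjust if installed differently.
        "pyserini": installed_version("pyserini"),
        "bertserini": installed_version("bertserini"),
        "pyserini.index.lucene": module_available("pyserini.index.lucene"),
    }


if __name__ == "__main__":
    for name, value in describe_environment().items():
        print(f"{name}: {value}")
```

Running the script once with each environment's interpreter (for example the `/home/user/anaconda3/envs/bertserini/bin/python` from the error message) should show whether the index was built and read by different pyserini versions.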