castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Merge a large index with a small index / add a small collection of docs to a large index #1883

Closed tommymordo33 closed 2 months ago

tommymordo33 commented 2 months ago

Hi,

I would like to add a small collection of documents to a large prebuilt index. I tried to use the add_doc_raw() method of LuceneIndexer, but I got an exception:

jnius.JavaException: JVM exception occurred: cannot change field "id" from doc values type=SORTED to inconsistent doc values type=BINARY java.lang.IllegalArgumentException

Is there another way to do this?

Thanks!

Full code:

import random
import tarfile
from urllib.request import urlretrieve

from pyserini.index.lucene import IndexReader, LuceneIndexer
from pyserini.search.lucene import LuceneSearcher

# Download and unpack the prebuilt CACM index into a uniquely named directory.
r = random.randint(0, 10000000)
collection_url = 'https://github.com/castorini/anserini-data/raw/master/CACM/lucene-index.cacm.tar.gz'
tarball_name = 'lucene-index.cacm-{}.tar.gz'.format(r)
index_dir = 'index{}/'.format(r)
_, _ = urlretrieve(collection_url, tarball_name)
tarball = tarfile.open(tarball_name)
tarball.extractall(index_dir)
tarball.close()

# Open the prebuilt index for searching and for reading its statistics.
searcher = LuceneSearcher(f'{index_dir}lucene-index.cacm')
index_utils = IndexReader(f'{index_dir}lucene-index.cacm')

# Reopen the same index in append mode and try to add one new document.
lucene_index = LuceneIndexer(f'{index_dir}lucene-index.cacm', append=True)
x = '{ "id": "doc99910294", "contents": "this is the content of bob"}'
print(index_utils.stats())
lucene_index.add_doc_raw(x)  # this call raises the JavaException shown above
lucene_index.close()
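
One possible workaround, which nothing in this thread confirms, would be to leave the prebuilt index untouched, index the new documents into a separate small index, search both, and merge the two ranked lists at query time. Below is a minimal sketch of only the merging step. The helper name `merge_hits` and the scores are hypothetical; in practice the `(docid, score)` pairs would come from the `docid` and `score` attributes of the hits returned by each searcher. Note that scores from two indexes are computed with different collection statistics, so they are not strictly comparable.

```python
from heapq import nlargest
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (docid, score)


def merge_hits(large_hits: List[Hit], small_hits: List[Hit], k: int = 10) -> List[Hit]:
    """Merge two ranked lists, keeping the best score seen for each docid.

    Hypothetical helper for illustration; not part of Pyserini.
    """
    best: Dict[str, float] = {}
    for docid, score in list(large_hits) + list(small_hits):
        if docid not in best or score > best[docid]:
            best[docid] = score
    # Return the k highest-scoring (docid, score) pairs, best first.
    return nlargest(k, best.items(), key=lambda hit: hit[1])


# Made-up scores standing in for results from the large and small indexes.
large = [("CACM-2319", 4.2), ("CACM-1410", 3.9)]
small = [("doc99910294", 4.0)]
merged = merge_hits(large, small, k=2)
print(merged)
```

This avoids writing to the prebuilt index at all, at the cost of running two searches per query.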