capreolus-ir / capreolus

A toolkit for end-to-end neural ad hoc retrieval
https://capreolus.ai
Apache License 2.0

When to set is_large_collection to True? #133

Closed by ali-abz 3 years ago

ali-abz commented 3 years ago

Hi there, I am trying to create a new collection from Persian Wikipedia. It contains about 1.4 million paragraphs, which I would like to index and use. I did not find any documentation on when to set is_large_collection to True or on what it does.

I would appreciate any comments.

andrewyates commented 3 years ago

Hi Ali, currently is_large_collection should always be False. When it is True, Capreolus creates a lean Anserini index (no raw text, term positions, etc.), and you will then run into trouble because retrieving raw text from the index fails.
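
To illustrate the difference between a lean and a full index, here is a minimal sketch and not Capreolus's actual code. The helper build_index_flags is hypothetical, and the flag names are assumed from the storage options of Anserini's IndexCollection command, which may differ between Anserini versions. How Capreolus itself maps is_large_collection to index options is not shown in this thread.

```python
from typing import List


def build_index_flags(is_large_collection: bool) -> List[str]:
    """Hypothetical helper: choose the optional storage flags for an index build.

    The flag names are assumed from Anserini's IndexCollection options
    and should be checked against the Anserini version in use.
    """
    if is_large_collection:
        # Lean index: positions, document vectors, and raw text are not stored,
        # so later attempts to read raw document text from the index will fail.
        return []
    # Full index: keep everything needed to retrieve document text later.
    return ["-storePositions", "-storeDocvectors", "-storeRaw"]


if __name__ == "__main__":
    print(build_index_flags(False))
    print(build_index_flags(True))
```

Under these assumptions, only the False setting produces an index from which raw text can be read back, which is why it is the safe choice here.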

ali-abz commented 3 years ago

I see, thanks. I wonder why collections like nf or antique do not set it to True, since is_large_collection is, by definition, set to False.

andrewyates commented 3 years ago

Sorry, I got that backwards (and edited above). Collections should not be considered large, so they should all set is_large_collection=False.

ali-abz commented 3 years ago

Thanks.