castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Unique terms not available in IndexReaderUtils #2052

Open djoerd opened 1 year ago

djoerd commented 1 year ago

I want to know the number of unique terms in my index and got: -1

Steps:

IndexCollection -collection TrecCollection -input /home/hiemstra/Data/robust04/ -index lucene-index.robust04.pos+docvectors -threads 16 -storePositions -storeDocvectors

IndexReaderUtils -stats -index lucene-index.robust04.pos+docvectors/

Results:

Index statistics

documents:             528030
documents (non-empty): 528030
unique terms:          -1
total terms:           174540872

It turns out that: "Terms.size(): (...) may be unavailable (returns -1) for some Terms implementations such as MultiTerms where it cannot be efficiently computed."

I already solved this myself and will open a pull request.

lintool commented 1 year ago

To get an accurate count of the vocab size, you have to use the -optimize flag, which merges all the index segments down into a single one.

djoerd commented 1 year ago

Thanks a lot (also for doing this on SIGIR deadline day!): That solves my problem. The optimize flag is costly for a large index though, so the PR may still be helpful. I checked and it gives the exact same number of unique terms for my (non-optimized) Robust04 index: 923436 unique terms, i.e., the Lucene term iterator seems to work correctly on multiple segments.

BTW, I forgot to remove the "terms" declaration from the original getIndexStats() method (I should have installed Eclipse directly).
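
For readers wondering why the count is not simply available on a multi-segment index: each segment keeps its own sorted term dictionary, and the same term can occur in several segments, so the per-segment sizes cannot just be added up. The sketch below is a toy model in plain Java, not Lucene or Anserini code. The class name UniqueTermCount, the method countUniqueTerms, and the example segments are all hypothetical. It assumes every segment is a sorted, duplicate-free list of terms and counts distinct terms with a merged iteration over all segments, which is roughly the kind of work a term iterator has to do when the index has not been merged into one segment.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class UniqueTermCount {

    // Cursor over one segment: the term it currently points at,
    // plus an iterator over the remaining terms of that segment.
    private static final class Cursor {
        String current;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.current = it.next();
        }

        // Moves to the next term; returns false when the segment is exhausted.
        boolean advance() {
            if (!rest.hasNext()) {
                return false;
            }
            current = rest.next();
            return true;
        }
    }

    // Counts distinct terms across segments.
    // Assumption: each segment is sorted and contains no duplicates.
    public static long countUniqueTerms(List<List<String>> segments) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> a.current.compareTo(b.current));
        for (List<String> segment : segments) {
            if (!segment.isEmpty()) {
                queue.add(new Cursor(segment.iterator()));
            }
        }

        long unique = 0;
        String previous = null;
        while (!queue.isEmpty()) {
            // The smallest term over all segments comes out first, so equal
            // terms from different segments are seen back to back.
            Cursor cursor = queue.poll();
            if (!cursor.current.equals(previous)) {
                unique++;
                previous = cursor.current;
            }
            if (cursor.advance()) {
                queue.add(cursor);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Three hypothetical segments with overlapping vocabularies.
        List<List<String>> segments = Arrays.asList(
            Arrays.asList("index", "lucene", "term"),
            Arrays.asList("lucene", "segment", "term"),
            Arrays.asList("anserini", "index"));

        long sumOfSizes = 0;
        for (List<String> segment : segments) {
            sumOfSizes += segment.size();
        }

        System.out.println("sum of per-segment sizes: " + sumOfSizes);
        System.out.println("unique terms (merged):    " + countUniqueTerms(segments));
    }
}
```

In this toy example the per-segment sizes add up to 8 while the merged count is 5. The merged pass touches every term in every segment, which fits the observation above that an exact count on a non-optimized index is possible but not free.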