Open djoerd opened 1 year ago
To get an accurate count of the vocab size, you have to use the -optimize
flag, which merges all the index segments down into a single one.
Thanks a lot (also for doing this on SIGIR deadline day!). That solves my problem. The -optimize flag is costly for a large index, though, so the PR may still be helpful. I checked, and the PR gives exactly the same number of unique terms for my (non-optimized) Robust04 index: 923436 unique terms, i.e., the Lucene term iterator seems to work correctly across multiple segments.
BTW, I forgot to remove the "terms" declaration from the original getIndexStats() method (I should have installed Eclipse directly).
I want to know the number of unique terms in my index and got: -1
Steps:
IndexCollection -collection TrecCollection -input /home/hiemstra/Data/robust04/ -index lucene-index.robust04.pos+docvectors -threads 16 -storePositions -storeDocvectors
IndexReaderUtils -stats -index lucene-index.robust04.pos+docvectors/
Results: Index statistics
It turns out that: "Terms.size(): (...) may be unavailable (returns -1) for some Terms implementations such as MultiTerms where it cannot be efficiently computed."
I already solved this myself: I will add a pull request.
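To illustrate why a multi-segment index cannot report its vocabulary size cheaply, here is a toy sketch using only the Java standard library. It is not Lucene or Anserini code: the class `SegmentTermCount`, its methods, and the sample terms are made up for illustration, and each "segment" is modelled as a sorted set of strings. Summing per-segment sizes overcounts terms that occur in more than one segment, whereas iterating over the merged, sorted term streams counts each term once, which is presumably what a term-iterator-based count amounts to.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

// Hypothetical toy model: each segment is a sorted set of terms.
final class SegmentTermCount {

    // Cursor over one segment's sorted terms.
    private static final class Cursor {
        String term;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.term = it.next();
        }

        // Moves to the next term; returns false when the segment is exhausted.
        boolean advance() {
            if (rest.hasNext()) {
                term = rest.next();
                return true;
            }
            return false;
        }
    }

    // Naive count: adds up per-segment sizes, so shared terms are counted repeatedly.
    static long sumOfSegmentSizes(List<TreeSet<String>> segments) {
        long total = 0;
        for (TreeSet<String> segment : segments) {
            total += segment.size();
        }
        return total;
    }

    // Exact count: k-way merge of the sorted term streams, counting each distinct term once.
    static long countByMergedIteration(List<TreeSet<String>> segments) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<Cursor>((a, b) -> a.term.compareTo(b.term));
        for (TreeSet<String> segment : segments) {
            if (!segment.isEmpty()) {
                heap.add(new Cursor(segment.iterator()));
            }
        }
        long unique = 0;
        String previous = null;
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            if (!smallest.term.equals(previous)) {
                unique++;
                previous = smallest.term;
            }
            if (smallest.advance()) {
                heap.add(smallest);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<TreeSet<String>> segments = Arrays.asList(
            new TreeSet<String>(Arrays.asList("index", "lucene", "term")),
            new TreeSet<String>(Arrays.asList("lucene", "segment", "term")));
        System.out.println("sum of per-segment sizes: " + sumOfSegmentSizes(segments));           // 6
        System.out.println("unique terms via merged iteration: " + countByMergedIteration(segments)); // 4
    }
}
```

The merged iteration has to visit every term in every segment, which is why it is a linear scan rather than a constant-time lookup, but it avoids rewriting the index into a single segment.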