MikeHopcroft closed this issue 7 years ago
Since documents, maxdocsize, and occurrences are the same, but the chunk index has more terms and postings, my working assumption is that the HtmlDocumentFactory removes duplicate terms.
Another possibility is that I need to pass "-f HtmlDocumentFactory" to GenerateBitFunnelChunks.
Commit 7d2c598 fixes this issue. Now the .properties files from the chunk and the collection have identical counts in all fields.
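For anyone repeating this comparison, a minimal sketch of a field-by-field check is below. It assumes the .properties files are plain `key=value` text with optional `#` comment lines, which this thread does not confirm. The helper names `parse_properties` and `diff_properties` are hypothetical and not part of BitFunnel, and the sample numbers are made up for illustration.

```python
def parse_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments.

    Assumption: the index .properties files use this simple layout.
    """
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def diff_properties(a, b):
    """Return {field: (a_value, b_value)} for every field whose values differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


if __name__ == "__main__":
    # Made-up sample values, not the real counts from this issue.
    collection = parse_properties("documents=100\nterms=5000\npostings=20000\n")
    chunk = parse_properties("documents=100\nterms=5200\npostings=21000\n")
    for field, (left, right) in diff_properties(collection, chunk).items():
        print(f"{field}: collection={left} chunk={right}")
```

An empty result from `diff_properties` would correspond to the state described above, where both files agree in all fields.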
Chunk file was produced from the first bundle of gov2:
Index from collection was built as follows:
Index from chunk was built as follows:
The process ran to completion, but the properties of the two indexes (out2 is from the collection and out3 is from the chunk) were different. Notably, the number of documents and occurrences and the maxsize are the same in both indexes, but terms, postings, and maxcount are different.
Index built from collection:
Index built from chunk:
Index built from collection:
Index built from chunk: