BitFunnel / mg4j-workbench

Java tools for evaluating BitFunnel performance compared to an mg4j baseline.
GNU Lesser General Public License v3.0

Index built directly from chunk differs from index built from collection. #14

Closed: MikeHopcroft closed this issue 7 years ago

MikeHopcroft commented 7 years ago

Chunk file was produced from the first bundle of gov2:

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     it.unimi.di.big.mg4j.document.TRECDocumentCollection ^
     -f HtmlDocumentFactory -p encoding=iso-8859-1 ^
     d:\data\work\out2.collection d:\data\gov2\gx000\gx000\00.txt
java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     org.bitfunnel.reproducibility.GenerateBitFunnelChunks ^
     -S d:\data\work\out2.collection d:\data\work\out2.chunk

Index from collection was built as follows:

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     it.unimi.di.big.mg4j.tool.IndexBuilder ^
     --keep-batches --downcase -S d:\data\work\out2.collection d:\data\work\out2

Index from chunk was built as follows:

java -cp target/mg4j-1.0-SNAPSHOT-jar-with-dependencies.jar ^
     it.unimi.di.big.mg4j.tool.IndexBuilder ^
     -o org.bitfunnel.reproducibility.ChunkDocumentSequence(d:\data\work\out2.chunk) ^
     d:\data\work\out3

Both builds ran to completion, but the properties of the two indexes differ (out2 is built from the collection, out3 from the chunk). Notably, documents, occurrences, and maxdocsize are the same in both indexes, while terms, postings, and maxcount differ. The termprocessor and size values differ as well.

Index built from collection (text field):

d:\data\work>type out2-text.properties
documents=1092
terms=21604
postings=158659
maxcount=652
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.DowncaseTermProcessor
batches=1
field=text
size=4417176
maxdocsize=10622
occurrences=295570

Index built from chunk (text field):

d:\data\work>type out3-text.properties
documents=1092
terms=292005
postings=295570
maxcount=1
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.NullTermProcessor
batches=1
field=text
size=11473690
maxdocsize=10622
occurrences=295570

Index built from collection (title field):

D:\data\work>more out2-title.properties
documents=1092
terms=1631
postings=5740
maxcount=4
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.DowncaseTermProcessor
batches=1
field=title
size=96009
maxdocsize=27
occurrences=5904

Index built from chunk (title field):

D:\data\work>more out3-title.properties
documents=1092
terms=4725
postings=5904
maxcount=1
indexclass=it.unimi.di.big.mg4j.index.QuasiSuccinctIndex
skipquantum=256
byteorder=LITTLE_ENDIAN
termprocessor=it.unimi.di.big.mg4j.index.NullTermProcessor
batches=1
field=title
size=154267
maxdocsize=27
occurrences=5904

MikeHopcroft commented 7 years ago

Since documents, maxdocsize, and occurrences match, while the chunk index has far more terms and postings, my working assumption is that HtmlDocumentFactory removes duplicate terms.
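
To make the relationship between these counters concrete, here is a minimal sketch that computes them for a toy collection. It assumes the usual reading of the fields: terms is the number of distinct terms, postings the number of (term, document) pairs, occurrences the total token count, and maxcount the largest within-document count of any term. ToyIndexStats and its downcase flag are hypothetical and not part of mg4j or mg4j-workbench. The flag only mirrors the DowncaseTermProcessor versus NullTermProcessor difference visible in the listings, and the sketch does not claim that this difference is the cause of the mismatch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not mg4j code.
class ToyIndexStats {
    final int terms;        // distinct terms across all documents
    final long postings;    // number of (term, document) pairs
    final long occurrences; // total number of tokens
    final int maxcount;     // largest count of one term inside one document

    ToyIndexStats(int terms, long postings, long occurrences, int maxcount) {
        this.terms = terms;
        this.postings = postings;
        this.occurrences = occurrences;
        this.maxcount = maxcount;
    }

    static ToyIndexStats of(List<List<String>> docs, boolean downcase) {
        Set<String> vocabulary = new HashSet<>();
        long postings = 0;
        long occurrences = 0;
        int maxcount = 0;
        for (List<String> doc : docs) {
            // Per-document term counts; each key becomes one posting.
            Map<String, Integer> counts = new HashMap<>();
            for (String token : doc) {
                String term = downcase ? token.toLowerCase(Locale.ROOT) : token;
                counts.merge(term, 1, Integer::sum);
                occurrences++;
            }
            vocabulary.addAll(counts.keySet());
            postings += counts.size();
            for (int c : counts.values()) {
                maxcount = Math.max(maxcount, c);
            }
        }
        return new ToyIndexStats(vocabulary.size(), postings, occurrences, maxcount);
    }

    @Override
    public String toString() {
        return "terms=" + terms + " postings=" + postings
            + " occurrences=" + occurrences + " maxcount=" + maxcount;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("The", "the", "index"),
            Arrays.asList("Index", "index", "the"));
        // prints terms=2 postings=4 occurrences=6 maxcount=2
        System.out.println(of(docs, true));
        // prints terms=4 postings=6 occurrences=6 maxcount=1
        System.out.println(of(docs, false));
    }
}
```

In both runs occurrences stays at 6, just as occurrences is identical in out2 and out3. When postings equals occurrences and maxcount is 1, as in the out3 listings, no term was counted more than once within any single document.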

MikeHopcroft commented 7 years ago

Another possibility is that I need to pass "-f HtmlDocumentFactory" to GenerateBitFunnelChunks.

MikeHopcroft commented 7 years ago

Commit 7d2c598 fixes this issue. Now the .properties files from the chunk and the collection have identical counts in all fields.
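
For anyone who wants to repeat the comparison without eyeballing the listings, the following is a rough sketch of a field-by-field diff of two .properties files. PropertiesDiff is a hypothetical helper, not something in this repository; it relies only on java.util.Properties and assumes the files are in the standard Java properties format, as the listings above appear to be.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Objects;
import java.util.Properties;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper; not part of mg4j or mg4j-workbench.
class PropertiesDiff {
    // Returns key -> "left | right" for every key whose values differ,
    // including keys present in only one of the two files (shown as null).
    static Map<String, String> diff(Properties left, Properties right) {
        TreeSet<String> keys = new TreeSet<>(left.stringPropertyNames());
        keys.addAll(right.stringPropertyNames());
        Map<String, String> differences = new TreeMap<>();
        for (String key : keys) {
            String l = left.getProperty(key);
            String r = right.getProperty(key);
            if (!Objects.equals(l, r)) {
                differences.put(key, l + " | " + r);
            }
        }
        return differences;
    }

    static Properties load(String path) throws IOException {
        Properties properties = new Properties();
        try (Reader in = Files.newBufferedReader(Paths.get(path), StandardCharsets.ISO_8859_1)) {
            properties.load(in);
        }
        return properties;
    }

    // Usage: java PropertiesDiff <first.properties> <second.properties>
    public static void main(String[] args) throws IOException {
        Map<String, String> differences = diff(load(args[0]), load(args[1]));
        differences.forEach((key, values) -> System.out.println(key + ": " + values));
        if (differences.isEmpty()) {
            System.out.println("no differences");
        }
    }
}
```

Run against out2-text.properties and out3-text.properties from the original report, this should list terms, postings, maxcount, termprocessor, and size, which are the fields that differ in the listings above.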