BitFunnel / mg4j-workbench

Java tools for evaluating BitFunnel performance compared to an mg4j baseline.
GNU Lesser General Public License v3.0

Chunks generated from Gov2 seem to have lots of duplicate DocId values. #31

Open MikeHopcroft opened 7 years ago

MikeHopcroft commented 7 years ago

Chunks generated from Gov2 seem to have lots of duplicate DocId values. For instance, index-273-100-150 (all 273 Gov2 directories, retaining only documents with 100-150 terms) contains 3,870,096 documents but only 123,544 unique DocIds. When ingesting the documents in Gov2 directory order (GX000, GX001, GX002, ...), the first duplicate is DocId=24 in GX001.
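For anyone trying to reproduce the counts above, here is a minimal sketch of the check in Java. It assumes the DocIds have already been read from the chunks, in ingestion order, into a `long[]`. The class `DocIdDuplicateCheck` and its `summarize` method are hypothetical names used for illustration; they are not part of mg4j-workbench, and the chunk parsing step is deliberately left out.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not part of mg4j-workbench): given DocIds in ingestion
// order, report the total count, the unique count, and the first repeated DocId.
class DocIdDuplicateCheck {

    static final class Summary {
        final long total;               // number of DocIds examined
        final long unique;              // number of distinct DocIds
        final long firstDuplicate;      // first DocId seen a second time (valid only if index >= 0)
        final long firstDuplicateIndex; // position of that second occurrence, or -1 if none

        Summary(long total, long unique, long firstDuplicate, long firstDuplicateIndex) {
            this.total = total;
            this.unique = unique;
            this.firstDuplicate = firstDuplicate;
            this.firstDuplicateIndex = firstDuplicateIndex;
        }
    }

    static Summary summarize(long[] docIds) {
        Set<Long> seen = new HashSet<>();
        long firstDuplicate = -1;
        long firstDuplicateIndex = -1;
        for (int i = 0; i < docIds.length; i++) {
            // Set.add returns false when the value is already present.
            boolean isNew = seen.add(docIds[i]);
            if (!isNew && firstDuplicateIndex < 0) {
                firstDuplicate = docIds[i];
                firstDuplicateIndex = i;
            }
        }
        return new Summary(docIds.length, seen.size(), firstDuplicate, firstDuplicateIndex);
    }

    public static void main(String[] args) {
        // Made-up sample data, only to show the call shape.
        long[] sample = {1, 2, 24, 3, 24, 2};
        Summary s = summarize(sample);
        System.out.println("total=" + s.total
                + " unique=" + s.unique
                + " firstDuplicate=" + s.firstDuplicate
                + " atIndex=" + s.firstDuplicateIndex);
    }
}
```

If `unique` comes out far below `total`, as reported here, recording the source directory alongside each DocId would help show whether the ids restart in each Gov2 directory.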