BitFunnel / mg4j-workbench

Java tools for evaluating BitFunnel performance compared to an mg4j baseline.
GNU Lesser General Public License v3.0

Index-build needs to filter out documents not in the BitFunnel shard. #8

Open MikeHopcroft opened 7 years ago

MikeHopcroft commented 7 years ago

The workflow will be as follows:

  1. Create a collection from some subset of gov2.
  2. Create a BitFunnel chunk from this collection.
  3. Filter the BitFunnel chunk to include only those documents slated for a particular shard.
  4. Somehow build an mg4j index that contains only those documents.

One option is to have the chunk-filtering program emit a file that tells the index build which documents to incorporate. This could be as simple as a file with one line per document, each line containing the character 'T' or 'F' to indicate whether the document should be included in the index. This approach would require modifying the mg4j indexer.
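For illustration, here is a minimal sketch of how an indexer might consume such a filter file. The class name ShardFilter and its methods are hypothetical; nothing here is part of mg4j or BitFunnel, and the only assumption about the format is the one stated above (one 'T' or 'F' per line, in document order).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.BitSet;

/** Hypothetical reader for the one-flag-per-line filter file described above. */
final class ShardFilter {
    private final BitSet included;
    private final int documentCount;

    private ShardFilter(BitSet included, int documentCount) {
        this.included = included;
        this.documentCount = documentCount;
    }

    /** Reads one 'T' or 'F' per line; line k (0-based) describes document k. */
    static ShardFilter read(BufferedReader reader) throws IOException {
        BitSet bits = new BitSet();
        int index = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            String flag = line.trim();
            if (flag.equals("T")) {
                bits.set(index);
            } else if (!flag.equals("F")) {
                throw new IOException("Line " + (index + 1)
                        + ": expected 'T' or 'F' but found '" + line + "'");
            }
            index++;
        }
        return new ShardFilter(bits, index);
    }

    /** Returns true if the document at this position should be indexed. */
    boolean includes(int documentIndex) {
        if (documentIndex < 0 || documentIndex >= documentCount) {
            throw new IndexOutOfBoundsException("No flag for document " + documentIndex);
        }
        return included.get(documentIndex);
    }

    int documentCount() {
        return documentCount;
    }

    int includedCount() {
        return included.cardinality();
    }

    public static void main(String[] args) throws IOException {
        ShardFilter filter = read(new BufferedReader(new StringReader("T\nF\nT\n")));
        // prints "2 of 3 documents included"
        System.out.println(filter.includedCount() + " of "
                + filter.documentCount() + " documents included");
    }
}
```

A BitSet keeps the whole filter in memory at one bit per document, so the indexer could test each document position without rereading the file.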

Another approach would be to convert the filtered chunk back into something that looks like a gov2 file and then rerun the mg4j collection builder.

MikeHopcroft commented 7 years ago

Another option is to use the --objectSequence option to supply our own DocumentSequence class to IndexBuilder. This DocumentSequence class would read the BitFunnel chunk file. At a minimum, we would need to implement a DocumentSequence, a DocumentIterator, and a Document.
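The relationship between the three pieces can be sketched with stand-in interfaces. These are deliberately NOT the mg4j interfaces: the names SketchDocument, SketchDocumentIterator, SketchDocumentSequence, and DecodedChunkSequence are hypothetical, the real mg4j types have more methods, and the chunk parsing itself is not shown (the documents are assumed to be already decoded).

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for one document; NOT the mg4j Document interface. */
interface SketchDocument {
    String title();
    String content();
}

/** Stand-in iterator; nextDocument() returns null once the sequence is exhausted. */
interface SketchDocumentIterator {
    SketchDocument nextDocument();
}

/** Stand-in sequence; each call to iterator() starts again at the first document. */
interface SketchDocumentSequence {
    SketchDocumentIterator iterator();
}

/** Hypothetical sequence over documents that a chunk parser (not shown) has decoded. */
final class DecodedChunkSequence implements SketchDocumentSequence {
    private final List<SketchDocument> documents;

    DecodedChunkSequence(List<SketchDocument> documents) {
        this.documents = new ArrayList<>(documents);
    }

    @Override
    public SketchDocumentIterator iterator() {
        return new SketchDocumentIterator() {
            private int next = 0;

            @Override
            public SketchDocument nextDocument() {
                return next < documents.size() ? documents.get(next++) : null;
            }
        };
    }

    /** Convenience factory for a document with a fixed title and content. */
    static SketchDocument document(String title, String content) {
        return new SketchDocument() {
            @Override public String title() { return title; }
            @Override public String content() { return content; }
        };
    }

    /** Drains a fresh iterator and returns the document titles in order. */
    static List<String> titles(SketchDocumentSequence sequence) {
        List<String> result = new ArrayList<>();
        SketchDocumentIterator iterator = sequence.iterator();
        for (SketchDocument d = iterator.nextDocument(); d != null; d = iterator.nextDocument()) {
            result.add(d.title());
        }
        return result;
    }
}
```

The point of the split is that the sequence owns the data source, the iterator owns the read position, and the document owns the per-document fields, so an indexer can make a pass over the sequence without knowing it came from a chunk file.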

MikeHopcroft commented 7 years ago

Commit 7d2c5987c38f712a9292efbce7bdd31399590f28 implements ChunkDocumentSequence, so we can now create an mg4j index directly from a BitFunnel chunk file. Issue #23 will allow us to build an mg4j index from multiple chunk files listed in a manifest file. This enables a workflow in which we build a collection corresponding to each gov2 bundle:

GX000/00.txt ==> GX000-00.collection

then build a chunk from each collection:

GX000-00.collection ==> GX000-00.chunk

and then build an index from a manifest of chunks:

find GX*.chunk > manifest.txt
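
As a rough sketch of the bookkeeping in this workflow, the following shows the bundle-to-name mapping and a manifest reader. The class ChunkManifest and its methods are hypothetical and are not the implementation tracked in issue #23; the manifest is assumed to hold one chunk path per line, as produced by the find command above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helpers for the bundle -> collection -> chunk -> manifest workflow. */
final class ChunkManifest {
    /** Maps a bundle path such as "GX000/00.txt" to the base name "GX000-00". */
    static String baseName(String bundlePath) {
        String withoutExtension = bundlePath.replaceFirst("\\.txt$", "");
        return withoutExtension.replace('/', '-');
    }

    /** Turns manifest lines into chunk paths, skipping blank lines. */
    static List<Path> parse(List<String> lines) {
        List<Path> chunks = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                chunks.add(Paths.get(trimmed));
            }
        }
        return chunks;
    }

    /** Reads a manifest file containing one chunk path per line. */
    static List<Path> read(Path manifest) throws IOException {
        return parse(Files.readAllLines(manifest));
    }

    public static void main(String[] args) {
        String base = baseName("GX000/00.txt");
        // prints "GX000-00.collection ==> GX000-00.chunk"
        System.out.println(base + ".collection ==> " + base + ".chunk");
        System.out.println(parse(Arrays.asList("GX000-00.chunk", "GX000-01.chunk")));
    }
}
```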