Open MikeHopcroft opened 7 years ago
Another option is to use the --objectSequence option to supply our own DocumentSequence class to IndexBuilder. This DocumentSequence class would read the BitFunnel chunk file. At the very least, we would need to implement a DocumentSequence, a DocumentIterator, and a Document.
Commit 7d2c5987c38f712a9292efbce7bdd31399590f28 implements ChunkDocumentSequence, so we can now create an mg4j index directly from a BitFunnel chunk file. Issue #23 will allow us to build an mg4j index from multiple chunk files listed in a manifest file. That enables a workflow in which we build a collection for each gov2 bundle:
GX000/00.txt ==> GX000-00.collection
then build a chunk from each collection:
GX000-00.collection ==> GX000-00.chunk
and then build an index from a manifest of chunks:
find GX*.chunk > manifest.txt
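The naming convention and the manifest step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `bundle_to_names` and `write_manifest` are hypothetical, and the manifest is assumed to be a plain text file with one chunk path per line, which is what the `find` command produces. It does not invoke the mg4j collection builder or the chunk builder.

```python
from pathlib import Path
from typing import List, Tuple


def bundle_to_names(bundle: str) -> Tuple[str, str]:
    """Map a gov2 bundle path such as 'GX000/00.txt' to the collection
    and chunk file names used in the workflow (hypothetical helper)."""
    path = Path(bundle)
    # 'GX000/00.txt' -> directory name 'GX000' and file stem '00'
    stem = f"{path.parent.name}-{path.stem}"
    return f"{stem}.collection", f"{stem}.chunk"


def write_manifest(chunk_dir: str, manifest_path: str) -> List[str]:
    """Write one chunk path per line, sorted by name (hypothetical helper).

    Assumption: the manifest is a plain list of paths, one per line.
    """
    chunks = sorted(str(p) for p in Path(chunk_dir).glob("GX*.chunk"))
    Path(manifest_path).write_text("".join(c + "\n" for c in chunks))
    return chunks
```

For example, `bundle_to_names("GX000/00.txt")` yields the pair `GX000-00.collection` and `GX000-00.chunk`. Sorting the chunk paths keeps the manifest order stable from run to run, which may matter if document ids are assigned in manifest order.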
The workflow will be as follows:
One option is to make the chunk filtering program emit a file that tells the index builder which documents to incorporate. This could be as simple as a file with one line per document, each line consisting of the single character 'T' or 'F' to indicate whether the document should be included in the index. This approach would require a modification to the mg4j indexer.
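A minimal sketch of the one-line-per-document filter file, in Python for brevity. The function names are hypothetical, and the sketch assumes that line N of the file corresponds to the Nth document in the order the indexer visits them. The real change would live inside the mg4j indexer, which this does not attempt to show.

```python
from typing import Iterable, List, Sequence


def write_filter(flags: Iterable[bool], path: str) -> None:
    """Emit one line per document: 'T' to include it, 'F' to skip it."""
    with open(path, "w", encoding="ascii") as f:
        for keep in flags:
            f.write("T\n" if keep else "F\n")


def read_filter(path: str) -> List[bool]:
    """Parse the filter file, rejecting anything other than 'T' or 'F'."""
    flags = []
    with open(path, encoding="ascii") as f:
        for line_number, line in enumerate(f, start=1):
            value = line.strip()
            if value not in ("T", "F"):
                raise ValueError(f"line {line_number}: expected 'T' or 'F', got {value!r}")
            flags.append(value == "T")
    return flags


def select_documents(documents: Sequence, flags: Sequence[bool]) -> list:
    """Keep only the documents whose flag is True.

    Assumption: flags[i] refers to documents[i], so the lengths must match.
    """
    if len(documents) != len(flags):
        raise ValueError("filter file and document sequence differ in length")
    return [doc for doc, keep in zip(documents, flags) if keep]
```

Checking that the two lengths match is worth keeping in any real implementation: if the filter file and the document sequence drift out of step, every later document would be included or excluded by the wrong flag.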
Another approach would be to convert the filtered chunk back into something that looks like a gov2 file and then rerun the mg4j collection builder.