BitFunnel / mg4j-workbench

Java tools for evaluating BitFunnel performance compared to an mg4j baseline.
GNU Lesser General Public License v3.0

Collection-building pipeline should be based on gz files. #7

Open MikeHopcroft opened 7 years ago

MikeHopcroft commented 7 years ago

Our collection-building pipeline should be based on .gz files instead of uncompressed .txt files. There are too many .txt files to pass as command-line arguments (the Windows command-line limit is 8191 characters).

This mainly involves updates to shell scripts and the README.md to use the -z flag.

MikeHopcroft commented 7 years ago

On further investigation, it seems that the -z flag to the mg4j collection builder expects the 27,204 individual document bundles as input, rather than tar.gz files of the directories of bundles.

It looks as though this may not be a problem for the command line after all, because the list of file names is piped into the collection builder's stdin rather than passed as arguments. For more information, see "A TREC Index" in the mg4j documentation.
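
To illustrate why piping the names sidesteps the 8191-character limit, here is a minimal Python sketch. It is not the actual mg4j invocation: `feed_file_list`, `COUNT_LINES`, and the `bundles/bundle-NNNNN.txt.gz` names are hypothetical, and a small line-counting child process stands in for the collection builder.

```python
import subprocess
import sys


def feed_file_list(paths, command):
    """Write one file name per line to the child's stdin; return its stdout."""
    listing = "".join(f"{p}\n" for p in paths)
    result = subprocess.run(
        command,
        input=listing,        # names travel over stdin, not argv
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Stand-in consumer (assumption): counts the lines it receives on stdin.
# In the real pipeline this would be the collection-builder command.
COUNT_LINES = [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"]

if __name__ == "__main__":
    # Hypothetical bundle names; the real layout may differ.
    names = [f"bundles/bundle-{i:05d}.txt.gz" for i in range(27204)]
    print("characters in list:", sum(len(n) + 1 for n in names))
    print("names received by child:", feed_file_list(names, COUNT_LINES).strip())
```

The list is far longer than any command line could hold, but the child only ever sees a short argv, so the limit does not apply.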

This is good news as it means that we can extract each directory from its .7z file and then recompress each of the 27,204 bundles with gzip.
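
The recompression step could be sketched roughly as follows in Python. This assumes the .7z archive has already been extracted to a directory and that the bundles are the `*.txt` files beneath it; `gzip_bundles` and its parameters are hypothetical names, not part of the existing scripts.

```python
import gzip
import shutil
from pathlib import Path


def gzip_bundles(directory, pattern="*.txt", remove_original=False):
    """Compress every file matching `pattern` under `directory` to `<name>.gz`.

    Returns the list of .gz paths written, in sorted order.
    """
    written = []
    for source in sorted(Path(directory).rglob(pattern)):
        if not source.is_file():
            continue  # skip directories that happen to match the pattern
        target = source.with_name(source.name + ".gz")
        with open(source, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream, so large bundles are fine
        if remove_original:
            source.unlink()
        written.append(target)
    return written
```

The returned paths can then be written one per line to the collection builder's stdin.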