BitFunnel / mg4j-workbench

Java tools for evaluating BitFunnel performance compared to an mg4j baseline.
GNU Lesser General Public License v3.0

Trec terabyte topics contain characters that are illegal in mg4j queries. #15

Closed MikeHopcroft closed 7 years ago

MikeHopcroft commented 7 years ago

The 2006 Trec Terabyte Topics contain the following characters that are illegal in mg4j (and BitFunnel) queries: '-', ';', ':', '\'', and '+'. Right now QueryLogRunner.LoadQueries() replaces each of these characters with a space:

line.replaceAll("[-;:'\\+]", " ")

We should preprocess these input files to remove these characters, and then update LoadQueries() to remove the regex code.

MikeHopcroft commented 7 years ago

Commit ebfee71f41e28d333daf4964b29cc5fcaab2e42c removed commas as well. These aren't a problem for the query parser, but they would require escaping in the query performance results output file, which is in CSV format.

MikeHopcroft commented 7 years ago

Also removing '/' and coalescing multiple spaces into one.
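
For reference, the cumulative cleanup described in this thread could be sketched roughly as below. This is an illustration only, not the code from the repository: the class name QuerySanitizer and the method sanitize() are hypothetical, the sample query string is made up, and the final trim() is an added assumption that the thread does not mention.

```java
import java.util.regex.Pattern;

public class QuerySanitizer {
    // Characters named in this thread: - ; : ' + , /
    // A leading '-' inside a character class is taken literally.
    private static final Pattern ILLEGAL = Pattern.compile("[-;:'+,/]");

    // Runs of two or more spaces, to be coalesced into one.
    private static final Pattern SPACES = Pattern.compile(" {2,}");

    // Hypothetical helper: replaces each listed character with a space,
    // then collapses repeated spaces. The trim() is an assumption.
    public static String sanitize(String line) {
        String replaced = ILLEGAL.matcher(line).replaceAll(" ");
        return SPACES.matcher(replaced).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Made-up sample input, not taken from the TREC topics.
        System.out.println(sanitize("U.S. oil-price: supply/demand, 2006"));
    }
}
```

Running the same function once over the topic files would produce preprocessed input, after which the regex in LoadQueries() would no longer be needed.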