BitFunnel / mg4j-workbench

Java tools for evaluating BitFunnel performance compared to an mg4j baseline.
GNU Lesser General Public License v3.0

Description of query log preprocessing #24

Open MikeHopcroft opened 7 years ago

MikeHopcroft commented 7 years ago

This issue documents our pre-processing of the query logs.

  1. Started with 05.efficiency_topics and 06.efficiency_topics.all from the 2005 and 2006 TREC Terabyte Tracks.

  2. Topic files contain one query per line where each query is preceded by a line number and a colon. The line numbers were removed.

  3. The mg4j and BitFunnel query languages do not allow certain punctuation characters. These were filtered out by replacing every match of the regular expression [-;:'\+] with the empty string. We may want to consider replacing punctuation with a space instead, since deleting a character such as a hyphen joins the terms on either side of it into a single term.

  4. Query terms were not stemmed.
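
Steps 2 and 3 above can be sketched in Java roughly as follows. This is a minimal illustration, not the code that was actually run: the class name `QueryLogPreprocessor` and its methods are hypothetical, and the line-number pattern assumes each topic line starts with digits immediately followed by a colon. Only the punctuation character class is taken from step 3.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative, not part of mg4j-workbench.
final class QueryLogPreprocessor {
    // Assumption: the prefix is one or more digits followed by a colon, e.g. "42:".
    private static final Pattern LINE_NUMBER = Pattern.compile("^\\d+:");

    // Character class from step 3: hyphen, semicolon, colon, apostrophe, plus.
    private static final Pattern PUNCTUATION = Pattern.compile("[-;:'\\+]");

    // Step 2: drop the leading line number and colon, if present.
    static String stripLineNumber(String line) {
        return LINE_NUMBER.matcher(line).replaceFirst("");
    }

    // Step 3: replace each disallowed character with the given string.
    // Pass "" to reproduce the described behavior, or " " for the
    // space-replacement alternative mentioned in step 3.
    static String removePunctuation(String query, String replacement) {
        return PUNCTUATION.matcher(query)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Steps 2 and 3 combined, using the empty-string replacement.
    // Step 4 needs no code: terms are left unstemmed.
    static String preprocess(String line) {
        return removePunctuation(stripLineNumber(line), "");
    }
}
```

For a made-up input line such as `7:state-of-the-art tv's`, `preprocess` yields `stateoftheart tvs`, while `removePunctuation(stripLineNumber(line), " ")` yields `state of the art tv s`. The difference shows why the choice of replacement string affects which terms end up in the query.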