This issue documents our pre-processing of the query logs.
Started with 05.efficiency_topics and 06.efficiency_topics.all from the 2005 and 2006 TREC Terabyte Tracks.
Topic files contain one query per line where each query is preceded by a line number and a colon. The line numbers were removed.
The mg4j and BitFunnel query languages do not allow certain punctuation characters. These were filtered out by replacing matches to the regular expression
[-;:'\+]
with the empty string. We may want to replace punctuation with a space instead, since deleting it fuses the text on either side into a single token (a hyphenated term becomes one word).
Query terms were not stemmed.
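The steps described above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the function names `clean_query` and `clean_topics` are hypothetical, the exact spacing around the line-number prefix is an assumption, and the whitespace normalization at the end is an extra step not mentioned in this issue. The `replacement` parameter makes it easy to compare the empty-string and space variants.

```python
import re

# Leading "<number>:" prefix on each topic line (spacing tolerance is an assumption).
LINE_PREFIX = re.compile(r"^\s*\d+\s*:")
# The character class quoted in this issue: hyphen, semicolon, colon, apostrophe, plus.
PUNCTUATION = re.compile(r"[-;:'\+]")


def clean_query(line, replacement=""):
    """Strip the line-number prefix, then replace disallowed punctuation."""
    query = LINE_PREFIX.sub("", line, count=1)
    query = PUNCTUATION.sub(replacement, query)
    # Extra step (assumption): collapse runs of whitespace and trim the ends.
    return " ".join(query.split())


def clean_topics(lines, replacement=""):
    """Clean every line and drop queries that end up empty."""
    cleaned = (clean_query(line, replacement) for line in lines)
    return [query for query in cleaned if query]
```

With a made-up input line (not taken from the TREC files), `clean_query("12:e-mail o'neill")` returns `"email oneill"`, while `clean_query("12:e-mail o'neill", " ")` returns `"e mail o neill"`, which shows how the two replacement choices change the resulting terms.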