TheDataStation / ver

Data Discovery Tools and Systems
MIT License

Fix FTS incomplete result and implement enable/disable exact search #61

Closed kevindharmawan closed 8 months ago

kevindharmawan commented 9 months ago

This PR will:

  1. Fix FTS problem in #59 (explanation below).
  2. Implement enable/disable exact search in fulltext_index_duckdb.py's fts_query function.

With stopwords='english', DuckDB's FTS index removes stopwords before storing the searchable text. However, when the client sends a search keyword, the stopwords in that keyword are not removed. For example, "University of Chicago" is stored as "University Chicago" (the actual implementation may differ slightly), but the query "University of Chicago" is not reduced to "University Chicago", so FTS returns an empty result. In contrast, with stopwords='none', "University of Chicago" is stored unchanged, and querying "University Chicago" still finds "University of Chicago".
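The mismatch described above can be reproduced with a toy inverted index. This is only a sketch of the general mechanism, not DuckDB's implementation: the `STOPWORDS` set, the helper names (`tokenize`, `build_index`, `search`), and the all-terms-must-match rule are assumptions made for illustration.

```python
# Toy model only: NOT DuckDB's FTS implementation.
# Tiny illustrative stopword list (assumption, not DuckDB's English list).
STOPWORDS = frozenset({"of", "the", "a", "an", "in"})


def tokenize(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, and drop any stopwords."""
    return [t for t in text.lower().split() if t not in stopwords]


def build_index(docs, stopwords=frozenset()):
    """Map each token to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text, stopwords):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, stopwords=frozenset()):
    """Return ids of documents that contain every query token."""
    tokens = tokenize(query, stopwords)
    if not tokens:
        return set()
    postings = [index.get(t, set()) for t in tokens]
    return set.intersection(*postings)


docs = {1: "University of Chicago", 2: "Chicago Public Library"}

english_index = build_index(docs, STOPWORDS)  # "of" is dropped at index time
plain_index = build_index(docs)               # nothing is dropped

# Query keeps "of", which the index never stored, so nothing matches.
mismatch = search(english_index, "University of Chicago")
# Applying the same stopword filter to the query restores the match.
filtered = search(english_index, "University of Chicago", STOPWORDS)
# Without stopword removal, a query that omits "of" still matches.
no_stopwords = search(plain_index, "University Chicago")

print(mismatch, filtered, no_stopwords)
```

In this model the result is empty only when the index and the query are tokenized differently, which is why either disabling stopwords or filtering the query the same way avoids the problem.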

snowgy commented 9 months ago

This looks good! BTW, have you investigated the score produced by match_bm25? Are there still negative values and can we rank candidates based on that score?

kevindharmawan commented 9 months ago

have you investigated the score produced by match_bm25? Are there still negative values and can we rank candidates based on that score?

Unfortunately, there are still negative values, and I don't think the score is reliable enough to use for ranking.
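One possible source of negative scores, assuming match_bm25 uses the classic Okapi BM25 inverse document frequency term (not verified here against DuckDB's source), is that this term goes negative for any keyword that appears in more than half of the indexed documents. The function name `okapi_idf` and the document counts below are hypothetical.

```python
import math


def okapi_idf(num_docs, doc_freq):
    """Classic Okapi BM25 IDF: log((N - n + 0.5) / (n + 0.5)).

    The value is negative whenever doc_freq > num_docs / 2.
    """
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


rare = okapi_idf(num_docs=10, doc_freq=1)    # term in 1 of 10 documents
common = okapi_idf(num_docs=10, doc_freq=8)  # term in 8 of 10 documents

print(f"rare term IDF:   {rare:.3f}")
print(f"common term IDF: {common:.3f}")
```

Under that assumption, a negative score would reflect a very common keyword rather than a failed match.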