CdC-SI / eak-copilot

The official repository of the EAK-Copilot project as part of the Innovation Fellowship 2024.
https://cdc-si.github.io/eak-copilot/
GNU General Public License v3.0

implement fuzzy search for autocomplete #149

Closed K-Schubert closed 1 month ago

K-Schubert commented 2 months ago

Description

Implement fuzzy search with Damerau-Levenshtein distance, based on this Cython implementation (https://github.com/lanl/pyxDamerauLevenshtein).

NOTE: Computing the Damerau-Levenshtein distance against every record is efficient only while there is not too much data. The library is implemented in Cython.

Fuzzy search is executed in parallel (async) with the exact match search, and duplicate question suggestions are removed. This Damerau-Levenshtein implementation is intended to handle spelling mistakes in user queries (e.g. "Wre ist ..." typed instead of "Wer ist ...").

All records from the postgres table with FAQ QA pairs are retrieved (postgres doesn't provide Damerau-Levenshtein matching out of the box; in the future, an implementation of fuzzy search with ElasticSearch might be considered). The Damerau-Levenshtein distance is then computed between the query and every retrieved record. If the distance is <= 5 (5 or fewer edits are required to transform the query string into the suggestion), the suggestion is returned.
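The thresholding step can be sketched as follows. To stay self-contained, this uses a pure-Python optimal string alignment distance (a restricted form of Damerau-Levenshtein) as a stand-in for the Cython library. In the real service the library's distance function (`damerau_levenshtein_distance`, as far as I know) would presumably be swapped in. The names `osa_distance`, `fuzzy_match` and `MAX_DISTANCE`, the lowercasing, and the sample questions are assumptions for illustration:

```python
from typing import Iterable, List

MAX_DISTANCE = 5  # threshold from the description: 5 or fewer edits


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "re" <-> "er".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(query: str, questions: Iterable[str],
                max_distance: int = MAX_DISTANCE) -> List[str]:
    """Return the questions within max_distance edits of the query, closest first."""
    scored = [(osa_distance(query.lower(), q.lower()), q) for q in questions]
    return [q for dist, q in sorted(scored) if dist <= max_distance]


if __name__ == "__main__":
    # Made-up records standing in for the rows fetched from the FAQ table.
    faq = ["Wer ist versichert?", "Wie hoch sind die Beiträge für Selbständigerwerbende?"]
    print(fuzzy_match("Wre ist versichert?", faq))
```

Note that the distance is at least the difference in length between the two strings, so with a fixed threshold of 5 a short, partially typed query will not match a much longer question, while very short strings match each other almost regardless of content.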

Implement fuzzy search matching in autocomplete/app/main.py for the /fuzzy_match API endpoint. Update autocomplete/Dockerfile to install gcc, which is needed to build pyxDamerauLevenshtein. Update autocomplete/requirements.txt.