Chemical-Curation / chemcurator_django

Backend services for chemical curation
https://api.chemreg.epa.gov
MIT License
1 stars 0 forks source link

Misspelled resolution #208

Open cmgrulke opened 4 years ago

cmgrulke commented 4 years ago

As a user I want to be able to able to find chemicals even if I misspell the name So that I can be error prone in my work, but still be successful in finding chemicals on ChemReg

GET https://api.chemreg.epa.gov/search?identifer=$$$$$$$

misspelled names like "Caffiene" still return the correct record

Acceptance Criteria

mwfrost commented 3 years ago

https://www.postgresql.org/docs/current/functions-textsearch.html

mwfrost commented 3 years ago

More relevant than the textsearch: https://www.postgresql.org/docs/current/fuzzystrmatch.html

mwfrost commented 3 years ago

After loading the fuzzystrmatch extension with CREATE EXTENSION fuzzystrmatch, this query works as hoped:

SELECT substances.id AS substances_id, 
substances.identifiers AS substances_identifiers,
levenshtein(substances.identifiers ->> 'preferred_name', 'Hydrogen Peroxide') as dist_match,
levenshtein(substances.identifiers ->> 'preferred_name', 'Hydrgoen Peroxide') as dist_close,
levenshtein(substances.identifiers ->> 'preferred_name', 'Foodrygen Baroxide') as dist_bad
FROM substances 
WHERE (substances.identifiers ->> 'preferred_name' = 'Hydrogen Peroxide' ) 

The distance expressions return 0, 2, and 6.

I don't know what the performance cost will be to apply this to every record in the table before the WHERE clause returns the results for scoring, but I think that's what we have to do.