Searching multiple tokens in initial search

OHDSI / Hermes

(DEPRECATED) HERMES is a vocabulary browser tool for OMOP CDM v5

http://www.github.com/ATLAS

Apache License 2.0


Searching multiple tokens in initial search #3

Open pbr6cornell opened 9 years ago

pbr6cornell commented 9 years ago

Can the search do something more advanced than plain substring matching, perhaps substring matching on each token? Ex: a search for 'Type 2 diabetes' currently doesn't find 'Diabetes mellitus Type 2'.

fdefalco commented 9 years ago

Are you thinking this would be an "inclusive and" of all tokens found in the search criteria? For example we wouldn't want 'type 2 diabetes' to return all concepts with concept names or concept codes that include the number 2, would we?
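As a minimal sketch of the "inclusive and" reading, the snippet below keeps a concept only when every query token occurs somewhere in its name, regardless of order. The function names and the sample concept names (other than 'Diabetes mellitus Type 2') are made up for illustration and are not part of Hermes.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all_tokens(query, concept_name):
    """Return True if every query token is a substring of the concept name."""
    name = concept_name.lower()
    return all(token in name for token in tokenize(query))


# Hypothetical sample data, for illustration only.
concepts = [
    "Diabetes mellitus Type 2",
    "Type 1 diabetes",
    "Stage 2 hypertension",
]

hits = [c for c in concepts if matches_all_tokens("Type 2 diabetes", c)]
print(hits)
```

With this rule, a name that merely contains the number 2 is not returned, because the other tokens are missing from it.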

pbr6cornell commented 9 years ago

That's a good point, and I don't know the 'right' solution. I do want to return everything that contains 'type 2 diabetes' and 'diabetes type 2'. I suppose I want to deprioritize results that only contain '2', though your question rightly asks the binary question, 'to show or not to show', and I suppose a default heuristic that might make sense is >=50% of tokens match...
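One way to read the "at least half of the tokens match" heuristic is sketched below: score each concept name by the fraction of query tokens it contains, drop anything under the threshold, and list the best matches first. This sketch compares whole tokens rather than substrings; the function names, the 0.5 default, and the sample names are assumptions for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_match_fraction(query, concept_name):
    """Fraction of distinct query tokens that appear as tokens in the name."""
    query_tokens = set(tokenize(query))
    if not query_tokens:
        return 0.0
    name_tokens = set(tokenize(concept_name))
    return len(query_tokens & name_tokens) / len(query_tokens)


def ranked_search(query, concept_names, min_fraction=0.5):
    """Keep names scoring at least min_fraction, highest score first."""
    scored = [(token_match_fraction(query, n), n) for n in concept_names]
    kept = [(score, name) for score, name in scored if score >= min_fraction]
    return sorted(kept, key=lambda pair: -pair[0])


# Hypothetical sample data, for illustration only.
concepts = [
    "Stage 2 hypertension",
    "Type 1 diabetes",
    "Diabetes mellitus Type 2",
]

for score, name in ranked_search("type 2 diabetes", concepts):
    print(f"{score:.2f}  {name}")
```

Here a name that only shares the token '2' scores one third and is hidden, while partial matches above the threshold are still shown, just lower in the list.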

On Thu, Jan 1, 2015 at 2:25 PM, Frank DeFalco wrote:

> Are you thinking this would be an "inclusive and" of all tokens found in the search criteria? For example we wouldn't want 'type 2 diabetes' to return all concepts with concept names or concept codes that include the number 2, would we?


schuemie commented 9 years ago

I don't know if this is overkill, but we could use the same approach used in Usagi: load all concept names (and synonyms) in a Lucene index, and use Lucene to propose matches. It does all (or at least most) of the fancy searching stuff we'd want to support, including Patrick's example, and it's extremely fast.
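To illustrate the general idea behind an index-based search, here is a toy inverted index in plain Python. It is not Lucene and does not use Lucene's API or scoring; it only shows the "index all concept names once, then look up tokens" pattern. The concept ids and names are made up.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(concepts):
    """Map each token to the set of concept ids whose name contains it."""
    index = defaultdict(set)
    for concept_id, name in concepts.items():
        for token in tokenize(name):
            index[token].add(concept_id)
    return index


def search(index, query):
    """Return concept ids ordered by how many distinct query tokens they match."""
    counts = Counter()
    for token in set(tokenize(query)):
        for concept_id in index.get(token, ()):
            counts[concept_id] += 1
    # Most matched tokens first; ties broken by concept id for stable output.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept_id for concept_id, _ in ordered]


# Hypothetical ids and names, for illustration only.
concepts = {
    1: "Diabetes mellitus Type 2",
    2: "Type 1 diabetes",
    3: "Stage 2 hypertension",
}

index = build_index(concepts)
print(search(index, "Type 2 diabetes"))
```

A real Lucene index would add analyzers, relevance scoring, and fuzzy or phrase queries on top of this basic structure.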

cgreich commented 9 years ago

Sounds like a good idea to me.

fdefalco commented 9 years ago

We could also consider leveraging the built-in full-text index capabilities of the database platforms that we support. This would require adding functionality to SqlTranslate to support the different full-text clauses, since their syntax is not standard across platforms.
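A rough sketch of why translation would be needed: each platform spells its full-text predicate differently. The lookup table below is a hypothetical illustration in Python, not SqlTranslate code; the dialect keys, the `full_text_clause` helper, and the `concept_name` column are assumptions. The SQL Server and Oracle predicates also require a full-text index on the column, and the search-string syntax each platform accepts differs as well.

```python
# Hypothetical per-dialect templates for a full-text predicate.
FULL_TEXT_TEMPLATES = {
    "sql server": "CONTAINS({column}, {param})",
    "postgresql": (
        "to_tsvector('english', {column}) @@ plainto_tsquery('english', {param})"
    ),
    "oracle": "CONTAINS({column}, {param}) > 0",
}


def full_text_clause(dialect, column, param="?"):
    """Render the full-text predicate for one dialect (illustrative only)."""
    try:
        template = FULL_TEXT_TEMPLATES[dialect.lower()]
    except KeyError:
        raise ValueError(f"no full-text template for dialect: {dialect}")
    return template.format(column=column, param=param)


for dialect in FULL_TEXT_TEMPLATES:
    print(dialect, "->", full_text_clause(dialect, "concept_name"))
```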

This is a good discussion for the next architecture call.

rkboyce commented 9 years ago

+1 for the Lucene approach. Something simpler in the short term might be just to allow passing the SQL wildcards for "LIKE" queries so I could search "%fall%admission%" to get results like "Any falls since admission or prior assessment MDSv3" and "Any fracture related to a fall in the 6 months prior to admission MDSv3"
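The wildcard idea can be tried out with a small in-memory SQLite table, as sketched below. The table layout and the `search_like` helper are assumptions for illustration; the first two concept names come from the comment above and the third is made up. The pattern is passed as a bound parameter rather than concatenated into the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept (concept_name TEXT)")
conn.executemany(
    "INSERT INTO concept (concept_name) VALUES (?)",
    [
        ("Any falls since admission or prior assessment MDSv3",),
        ("Any fracture related to a fall in the 6 months prior to admission MDSv3",),
        ("Diabetes mellitus Type 2",),  # hypothetical non-matching row
    ],
)


def search_like(connection, pattern):
    """Return concept names matching a user-supplied LIKE pattern."""
    rows = connection.execute(
        "SELECT concept_name FROM concept "
        "WHERE concept_name LIKE ? ORDER BY concept_name",
        (pattern,),
    )
    return [row[0] for row in rows]


for name in search_like(conn, "%fall%admission%"):
    print(name)
```

Note that a pattern like this is order-sensitive: 'fall' has to appear before 'admission', so it would not help with the 'Type 2 diabetes' versus 'Diabetes mellitus Type 2' case.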