DataBiosphere / data-explorer

BSD 3-Clause "New" or "Revised" License

Search strings should also work for variable substrings #297

Closed by wnojopra 5 years ago

wnojopra commented 5 years ago

Nicole brought this up: in Baseline, searching for 'urine' doesn't match variables like standard_labs_albumin_urine_mgl or standard_labs_creatinine_urine_mgdl. Another example: in AMP PD, searching for 'territory' doesn't match genome_territory.

Until this is supported, a quick workaround is to add the substrings to the column descriptions.

melissachang commented 5 years ago

I assume we're using the standard tokenizer. It might be worth using a different tokenizer that also splits on underscores. For standard_labs_albumin_urine_mgl, it would be nice if the terms were ['standard_labs_albumin_urine_mgl', 'standard', 'labs', 'albumin', 'urine', 'mgl'].

Testing should include NHS and UKBB.
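To make the desired term list concrete, here is a minimal pure-Python sketch of that splitting rule. It is not Elasticsearch code, and `underscore_terms` is a hypothetical helper name used only for illustration.

```python
def underscore_terms(name):
    """Return the full name followed by its underscore-separated parts."""
    parts = [p for p in name.split("_") if p]
    # Only append the parts when splitting actually produced something new.
    if len(parts) > 1:
        return [name] + parts
    return [name]


print(underscore_terms("standard_labs_albumin_urine_mgl"))
# ['standard_labs_albumin_urine_mgl', 'standard', 'labs', 'albumin', 'urine', 'mgl']
```

With terms like these, a search for 'urine' or 'territory' would have a matching term for the variables mentioned above.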

wnojopra commented 5 years ago

0) Tokenizers are components of analyzers, and we currently aren't specifying an analyzer.
1) Elasticsearch uses the standard analyzer by default, which does not split on underscores. The simple analyzer, however, does.
2) We need to specify which analyzer to use in the mappings. We specify a mapping for the main index but not for the fields index; we need one for both.
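The difference between the two analyzers can be approximated in plain Python. This is only a rough stand-in under stated assumptions: the regexes below mimic the relevant behavior (underscores kept vs. splitting on non-letters) and are not the real Elasticsearch implementations. `approx_standard` and `approx_simple` are hypothetical names.

```python
import re


def approx_standard(text):
    # Rough stand-in for the standard analyzer: \w includes "_",
    # so underscore-joined names stay as a single token.
    return [t.lower() for t in re.findall(r"\w+", text)]


def approx_simple(text):
    # Rough stand-in for the simple analyzer: break on anything that
    # is not a letter (digits and "_" included), then lowercase.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


name = "standard_labs_albumin_urine_mgl"
print(approx_standard(name))  # ['standard_labs_albumin_urine_mgl']
print(approx_simple(name))    # ['standard', 'labs', 'albumin', 'urine', 'mgl']
```

Note that the simple analyzer does not keep the full name as a term, and it also drops digits, so the result is close to, but not identical to, the term list suggested earlier in the thread.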

I have https://github.com/DataBiosphere/data-explorer-indexers/compare/wn/underscore_fields?expand=1 in progress, which adds the simple analyzer to both indexes. I've tested with 1000 Genomes and AMP PD, and will test with NHS.
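For readers unfamiliar with how an analyzer is set in a mapping, here is a hedged sketch of the general shape, not the contents of the branch above. The helper `with_simple_analyzer` and the field names `name` and `description` are assumptions for illustration, and how the `properties` fragment is nested inside the full mapping depends on the Elasticsearch version.

```python
def with_simple_analyzer(field_names):
    """Build a 'properties' fragment that sets the simple analyzer on text fields."""
    return {
        "properties": {
            # "analyzer" is a per-field mapping parameter for text fields.
            name: {"type": "text", "analyzer": "simple"}
            for name in field_names
        }
    }


# Hypothetical field names; the real indexes may use different ones.
main_index_props = with_simple_analyzer(["description"])
fields_index_props = with_simple_analyzer(["name", "description"])
print(fields_index_props)
```

Because the analyzer is applied at index time, existing indexes would likely need to be rebuilt for the new terms to become searchable.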

melissachang commented 5 years ago

Can you also test with Baseline internal? Thanks.