Autosuggest is not returning correct results for /

KateMashkinaNIH commented 4 years ago

Issue description

The Autosuggest API is not returning results that contain 2/3 or /cd; instead, it matches against 2, 3, and just cd.

ESTIMATE 20

Steps to reproduce the issue

Go to https://webapis-dev.cancer.gov/drugdictionary/v1/
Select autosuggest and set the query param to 2/3, matchType to Contains, and includeNameTypes to "PreferredName", "Synonym", "USbrandname", "Codename", "Chemicalstructurename", "Abbreviation", "Foreignbrandname", "CASRegistryname", "NSCnumber", "Lexicalvariant"
Verify returned results

What's the expected result?

The response should include only terms that contain 2/3.

What's the actual result?

The response returns all terms that contain either 2 or 3.

Additional details / screenshot

Related Tickets

Issue #9999

blairlearn commented 4 years ago

@KateMashkinaNIH - Please verify that this was addressed by PR #30.

blairlearn commented 4 years ago

per @zhuomingao (via Slack)

We need to change the drug index mapping to use the classic tokenizer instead of the standard tokenizer. The new mapping follows.

{
    "settings": {
      "index": {
        "number_of_shards": "1",
        "analysis": {
          "filter": {
            "autocomplete_filter": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 30,
              "token_chars": [
                "letter",
                "digit",
                "punctuation",
                "symbol"
              ]
            },
            "ngram_filter": {
              "type": "ngram",
              "min_gram": 1,
              "max_gram": 30,
              "token_chars": [
                "letter",
                "digit",
                "punctuation",
                "symbol"
              ]
            }
          },
          "analyzer": {
            "autocomplete_index": {
              "type": "custom",
              "tokenizer": "classic",
              "filter": [
                "lowercase",
                "autocomplete_filter",
                "asciifolding"
              ]
            },
            "lowercase_search": {
              "filter": [
                "lowercase",
                "asciifolding"
              ],
              "tokenizer": "keyword"
            },
            "ngram_analyzer": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": [
                "lowercase",
                "ngram_filter",
                "asciifolding"
              ]
            },
            "autocomplete_search": {
              "type": "custom",
              "tokenizer": "classic",
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            }
          },
          "normalizer": {
            "caseinsensitive_normalizer": {
              "type": "custom",
              "char_filter": [],
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            }
          }
        }
      }
    },
    "mappings": {
      "terms": {
        "dynamic": "strict",
        "properties": {
          "name": {
            "type": "keyword",
            "normalizer": "caseinsensitive_normalizer",
            "fields": {
              "_autocomplete": {
                "type": "text",
                "analyzer": "autocomplete_index",
                "search_analyzer": "autocomplete_search"
              },
              "_contain": {
                "type": "text",
                "analyzer": "ngram_analyzer",
                "search_analyzer": "lowercase_search"
              }
            }
          },
          "type": {
            "type": "keyword"
          },
          "term_name_type": {
            "type": "keyword"
          },
          "first_letter": {
            "type": "keyword",
            "normalizer": "caseinsensitive_normalizer"
          },
          "preferred_name": {
            "type": "keyword",
            "normalizer": "caseinsensitive_normalizer"
          },
          "aliases": {
            "type": "nested",
            "include_in_root": true,
            "properties": {
              "type": {
                "type": "keyword"
              },
              "name": {
                "type": "keyword",
                "normalizer": "caseinsensitive_normalizer",
                "fields": {
                  "_contain": {
                    "type": "text",
                    "analyzer": "ngram_analyzer",
                    "search_analyzer": "lowercase_search"
                  }
                }
              }
            }
          },
          "definition": {
            "properties": {
              "html": {
                "type": "keyword"
              },
              "text": {
                "type": "keyword"
              }
            }
          },
          "term_id": {
            "type": "long"
          },
          "pretty_url_name": {
            "type": "keyword"
          },
          "nci_concept_id": {
            "type": "keyword"
          },
          "nci_concept_name": {
            "type": "keyword"
          },
          "drug_info_summary_link": {
            "properties": {
              "text": {
                "type": "keyword"
              },
              "url": {
                "type": "keyword"
              }
            }
          }
        }
      }
    }
}
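
To illustrate how the `_contain` sub-field in the mapping above would be expected to behave, here is a minimal Python sketch that mimics `ngram_analyzer` (keyword tokenizer, lowercase, then n-grams of length 1 to 30) at index time and `lowercase_search` (keyword tokenizer, lowercase) at search time. It is an approximation and not Elasticsearch itself: `asciifolding` is left out, and the function names and sample drug names are hypothetical, chosen only for illustration.

```python
def index_ngrams(name, min_gram=1, max_gram=30):
    """Approximate ngram_analyzer: the whole name is one lowercased token,
    which is then expanded into every n-gram of length min_gram..max_gram."""
    token = name.lower()
    grams = set()
    for start in range(len(token)):
        for length in range(min_gram, max_gram + 1):
            if start + length > len(token):
                break
            grams.add(token[start:start + length])
    return grams


def search_token(text):
    """Approximate lowercase_search: the whole search text stays one lowercased token."""
    return text.lower()


def contains_match(name, text):
    """A name matches when the single search token equals one of its indexed n-grams."""
    return search_token(text) in index_ngrams(name)


# Hypothetical names, for illustration only.
print(contains_match("Example 2/3 Agent", "2/3"))  # True
print(contains_match("Example 23 Agent", "2/3"))   # False
```

Under these assumptions the / survives on both the index side and the search side, so a "contains" search for 2/3 would only match names that literally include 2/3.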

blairlearn commented 4 years ago

@zhuomingao - there's still a problem: a "contains" search whose text starts with a / matches terms that don't contain one.

Example: A "contains" search for /cd matches (among other things) the drug term "allogeneic CD123-specific universal CAR123-expressing T lymphocytes"

Neither the term nor any of its aliases contains a / character.

  "aliases": [
    {
      "type": "CodeName",
      "name": "UCART123"
    },
    {
      "type": "Synonym",
      "name": "UCART123 T cells"
    },
    {
      "type": "Synonym",
      "name": "universal chimeric antigen receptor T cell 123"
    },
    {
      "type": "Synonym",
      "name": "universal TALEN gene-edited CART123 cells"
    },
    {
      "type": "Synonym",
      "name": "allogeneic engineered T cells expressing anti-CD123 chimeric antigen receptor"
    },
    {
      "type": "Synonym",
      "name": "universal chimeric antigen receptor T cells targeting CD123"
    }
  ],
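
The claim above can be checked mechanically. The following Python sketch takes the term name and the alias names quoted in this comment and runs a plain case-insensitive substring test; the helper `names_containing` is a hypothetical name used only here and is not part of the API.

```python
term = "allogeneic CD123-specific universal CAR123-expressing T lymphocytes"
aliases = [
    "UCART123",
    "UCART123 T cells",
    "universal chimeric antigen receptor T cell 123",
    "universal TALEN gene-edited CART123 cells",
    "allogeneic engineered T cells expressing anti-CD123 chimeric antigen receptor",
    "universal chimeric antigen receptor T cells targeting CD123",
]
names = [term] + aliases


def names_containing(candidates, text):
    """Return the candidates that contain text as a case-insensitive substring."""
    needle = text.lower()
    return [name for name in candidates if needle in name.lower()]


print(names_containing(names, "/cd"))      # []
print(len(names_containing(names, "cd")))  # 3
```

None of the names contains /cd, while three of them contain cd, which is consistent with the leading / being ignored during matching.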

blairlearn commented 4 years ago

Per conversation with @zhuomingao, this is because we are treating / as a delimiter.
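
The thread does not say where in the pipeline the / is split off, so the following Python sketch only illustrates the effect and is not the actual implementation. It assumes the search text is broken on / (and whitespace) and that a term matches when any resulting piece appears in it; `delimiter_tokens`, `matches_any_token`, and `matches_literal` are hypothetical names.

```python
import re


def delimiter_tokens(text):
    """Split search text on "/" and whitespace, dropping empty pieces (assumed behaviour)."""
    return [piece for piece in re.split(r"[/\s]+", text.lower()) if piece]


def matches_any_token(name, text):
    """Match when any piece of the split search text occurs in the name."""
    lowered = name.lower()
    return any(piece in lowered for piece in delimiter_tokens(text))


def matches_literal(name, text):
    """Match only when the full search text, including "/", occurs in the name."""
    return text.lower() in name.lower()


term = "allogeneic CD123-specific universal CAR123-expressing T lymphocytes"
print(delimiter_tokens("/cd"))         # ['cd']
print(delimiter_tokens("2/3"))         # ['2', '3']
print(matches_any_token(term, "/cd"))  # True
print(matches_literal(term, "/cd"))    # False
```

With / treated as a delimiter, /cd reduces to cd and 2/3 reduces to 2 and 3, which lines up with both symptoms reported in this issue.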

blairlearn commented 4 years ago

The current production system includes the / in the search criteria; however, the glossary API also treats / as a delimiter rather than as part of the search text.

After conversation with @lburack, we have decided to accept this difference from the current system.

NCIOCPL / drug-dictionary-api