clowder-framework / clowder

A data management system that allows users to share, annotate, organize and analyze large collections of datasets. It provides support for extensible metadata annotation using JSON-LD and a distributed analytics event bus for automatic curation of uploaded data.
https://clowderframework.org/
University of Illinois/NCSA Open Source License

Strings with dashes aren't indexed properly #359

Open lmarini opened 2 years ago

lmarini commented 2 years ago

When creating a resource with a dash in the name (a dataset, for example), search doesn't find the resource, even though a query directly against Elasticsearch will find it. The assumption is that Clowder is escaping dashes in such a way that the query can't match the name. Underscores work and can be used as a workaround.
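To make the escaping hypothesis concrete, here is a minimal sketch of what escaping reserved characters for an Elasticsearch `query_string` query could look like. The helper name `escape_query_string` is hypothetical and this is not Clowder's actual code; it only illustrates how a dash in a name might end up as `\-` in the query that is sent.

```python
import re

# Characters that Elasticsearch documents as reserved in query_string syntax.
# "<" and ">" cannot be escaped at all, so this sketch simply removes them.
_RESERVED = re.compile(r'([+\-=!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_query_string(text: str) -> str:
    """Hypothetical helper: backslash-escape reserved characters."""
    text = text.replace("<", "").replace(">", "")
    # r"\\\1" emits a literal backslash followed by the matched character(s).
    return _RESERVED.sub(r"\\\1", text)


print(escape_query_string("Test-Dashes"))
print(escape_query_string("Test_Dashes"))
```

An underscore is not a reserved character, so a name like "Test_Dashes" passes through unchanged, which would be consistent with underscores working as a workaround.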

To Reproduce

Steps to reproduce the behavior:

  1. Create a dataset named "Test-Dashes"
  2. Search for it in the GUI. No results are returned, whether or not the name is in quotes
  3. Searching Elasticsearch directly finds it: http://localhost:9200/clowder/_search?q=Test-Dashes

Clowder v1.20.2

max-zilla commented 1 year ago

I spent some time on this. I believe the issue is that the dashed term is split into multiple tokens in ES (it treats the - as a stop character), which results in weird behavior when the search is evaluated. I was able to find the dataset using "Dashes", for example, but not "Test-Dashes" or "Test Dashes".
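The token splitting described here can be approximated in a few lines. This is only a rough stand-in for the ES standard analyzer (lowercase, break on characters that are not letters, digits, or underscores), not the real implementation, and the function name `approx_standard_tokens` is made up for illustration.

```python
import re
from typing import List


def approx_standard_tokens(text: str) -> List[str]:
    """Rough approximation of standard-analyzer output.

    Lowercases the input and keeps runs of letters, digits, and
    underscores; any other character (including "-") ends a token.
    """
    return re.findall(r"\w+", text.lower())


for name in ("Test-Dashes", "Test_Dashes"):
    print(name, "->", approx_standard_tokens(name))
```

Under this approximation the dashed name is indexed as two tokens while the underscored name stays a single token. It does not by itself explain why "Test Dashes" failed to match, so the query side is probably involved as well. The tokens ES actually produces for a given analyzer can be checked with its `_analyze` API.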

I tried changing the default tokenizer to the whitespace tokenizer and a couple of other tokenizers suggested by ES, but they did not return the results I expected. The only other idea I had was to change the query we build (from "query_string" to something else), but that would likely have ripple effects on other cases we've encountered in the past and could break other things, so I am hesitant.
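For reference, a sketch of what "query_string versus something else" might look like as request bodies. The field name `name` and both function names are assumptions made for illustration; Clowder's real query construction is not shown in this thread, and `match_phrase` is just one possible alternative, not a tested fix.

```python
import json
from typing import Any, Dict


def query_string_body(user_input: str) -> Dict[str, Any]:
    """Body using query_string, where the input is parsed as query syntax."""
    return {"query": {"query_string": {"query": user_input}}}


def match_phrase_body(user_input: str, field: str = "name") -> Dict[str, Any]:
    """Body using match_phrase, where the input is analyzed as plain text.

    The default field "name" is an assumption, not Clowder's actual mapping.
    """
    return {"query": {"match_phrase": {field: user_input}}}


print(json.dumps(query_string_body("Test-Dashes"), indent=2))
print(json.dumps(match_phrase_body("Test-Dashes"), indent=2))
```

With `match_phrase` the input is run through the field's analyzer rather than the query-syntax parser, so operator characters in the name are not interpreted as syntax. Whether that would preserve the other search behaviors mentioned above would need to be verified before switching.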