IHTSDO / snowstorm

Scalable SNOMED CT Terminology Server using Elasticsearch

International characters: Diacritics normalization in text search #41

Closed alopezo closed 5 years ago

alopezo commented 5 years ago

The Elasticsearch index does not normalize diacritics. For example, in the Spanish edition, using the "findConcepts" API to search for "vías resp" and "vias resp" (from "vías respiratorias", "respiratory tract") produces different results.

Example:

https://snowstorm.msal.gov.ar/MAIN/concepts?activeFilter=true&term=v%C3%ADas%20resp&offset=0&limit=1

https://snowstorm.msal.gov.ar/MAIN/concepts?activeFilter=true&term=vias%20resp&offset=0&limit=1

The browser implementation applies a diacritics normalization algorithm at index creation and at search time, and Spanish users expect that writing a word with or without an accent produces the same results (vía vs via).

Searching the latest Elasticsearch documentation, one way to resolve this is to use multiple fields with different analyzers and a multi_match query with "most_fields":

> **most_fields**: The most_fields type is most useful when querying multiple fields that contain the same text analyzed in different ways. For instance, the main field may contain synonyms, stemming and terms without diacritics. A second field may contain the original terms, and a third field might contain shingles. By combining scores from all three fields we can match as many documents as possible with the main field, but use the second and third fields to push the most similar results to the top of the list.
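That approach could be sketched as a query body like the following. This is a minimal illustration, not Snowstorm's actual mapping: the `term` and `term.folded` sub-field names are hypothetical and assume an index where the same text is analyzed both as-is and with diacritics folded.

```python
# Sketch of a multi_match "most_fields" query against two hypothetical
# sub-fields: "term" (original text) and "term.folded" (ASCII-folded).
# Field names are illustrative, not Snowstorm's real index mapping.
def build_most_fields_query(search_text: str) -> dict:
    return {
        "query": {
            "multi_match": {
                "query": search_text,
                "type": "most_fields",
                # A document matching on both the original and the folded
                # field scores higher, pushing exact matches to the top.
                "fields": ["term", "term.folded"],
            }
        }
    }

query = build_most_fields_query("vias resp")
```

With this shape, "vias resp" would still match the folded field, while "vías resp" would match both fields and rank exact accented matches first.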

danka74 commented 5 years ago

Yes, Alejandro, this is the case, but it (likely) has to be addressed at both indexing and querying time. I have done some experiments in this branch: https://github.com/danka74/snowstorm/tree/swedish-experimental-dk, but there the indexing is hard-coded, which is not what we want; see this commit: https://github.com/danka74/snowstorm/commit/701f0827b73252881ff88851444ee0dafa77f41f /Daniel

alopezo commented 5 years ago

Hi Daniel, that's exactly what is needed, you are right. I wonder if a generic analyzer that "folds" all accented characters to plain ASCII would be good enough for different languages; it would certainly be for Spanish. We can provide a list of Spanish accented letters that could be added to that configuration. Does this modification affect the results for English in any way? It would be ideal to add this as the standard way to index and search.
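The generic fold-everything idea can be sketched with Unicode decomposition: decompose each character (NFD) and drop the combining marks. This is a minimal illustration of the general technique, not the analyzer Snowstorm or Elasticsearch actually uses; note it only handles characters that decompose into a base letter plus marks (á, é, ñ, ü), not ligature-like letters such as æ or ø.

```python
import unicodedata

def fold_to_ascii(text: str) -> str:
    """Fold accented characters to their plain ASCII base letter by
    decomposing each character (NFD) and dropping the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "vías" and "vias" normalize to the same token:
print(fold_to_ascii("vías respiratorias"))  # -> vias respiratorias
```

Applied at both index and search time, this would make "vías resp" and "vias resp" produce the same matches.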

Thanks

danka74 commented 5 years ago

@alopezo, this would unfortunately not work for Swedish, where the characters ÅÄÖ should not be folded. I see that in the few English words that use Scandinavian characters (like Ångström, the length unit) SNOMED has used a folded term (here Angstrom), so maybe there is a "universal" set of characters which should be folded (e.g. É to E) that excludes ÅÄÖ. /Daniel

alopezo commented 5 years ago

Defining that set would be a great first step, and it would be much simpler to implement as it would not require additional configuration for indexing and search. For example, in Spanish we would like to fold:

áéíóúüñ

Maybe we can start with a short list of these and check use cases from other languages.

/Alejandro

kaicode commented 5 years ago

@alopezo Elasticsearch has built-in support for appropriate character folding in each language. We plan to add a feature to Snowstorm that allows search to work for all languages, where the correct language index analyser is picked at index time using the description's language code field.

The correct analyser would also need to be used at search time in some cases. I'm still thinking about the best way to achieve this. Perhaps the Accept-Language header in the search request could be used to select a set of language-specific search analysers?
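The header-driven selection could look something like the sketch below. Everything here is hypothetical: the analyzer names and the mapping are illustrative placeholders, not Snowstorm's or Elasticsearch's actual configuration.

```python
# Hypothetical mapping from a primary language tag to a search analyzer
# name; both the keys' coverage and the analyzer names are illustrative.
ANALYZERS_BY_LANGUAGE = {
    "es": "spanish_folding_analyzer",
    "sv": "swedish_analyzer",
    "en": "standard",
}

def pick_search_analyzer(accept_language_header: str) -> str:
    """Pick a search analyzer from an Accept-Language header value,
    e.g. "es-AR,es;q=0.9", falling back to the standard analyzer."""
    # Take the first language tag, ignoring quality weights like ";q=0.9".
    first_tag = accept_language_header.split(",")[0].split(";")[0].strip()
    primary = first_tag.split("-")[0].lower()
    return ANALYZERS_BY_LANGUAGE.get(primary, "standard")
```

A Spanish client sending `Accept-Language: es-AR,es;q=0.9` would then be routed to the Spanish folding analyzer, while unknown languages fall back to the default.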

alopezo commented 5 years ago

Yes, this would be a good solution for us: accent folding for Spanish based on the Accept-Language header.

I'm reading the documentation on the Elasticsearch language analyzers:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html

They don't propose folding, but it seems like a straightforward task to add.
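Adding folding to a custom analyzer could be sketched as index settings like the following (shown as a Python dict to keep one language throughout this thread's examples). The `asciifolding` token filter and its `preserve_original` option are real Elasticsearch features; the analyzer and filter names here are illustrative, not Snowstorm's configuration.

```python
# Sketch of Elasticsearch index settings that add an "asciifolding"
# token filter to a custom analyzer. "asciifolding" and
# "preserve_original" exist in Elasticsearch; the names
# "spanish_folding" and "spanish_with_folding" are made up for this
# example.
spanish_folding_settings = {
    "analysis": {
        "filter": {
            "spanish_folding": {
                "type": "asciifolding",
                # Emit both the folded and the original token so exact
                # accented matches still score well.
                "preserve_original": True,
            }
        },
        "analyzer": {
            "spanish_with_folding": {
                "tokenizer": "standard",
                "filter": ["lowercase", "spanish_folding"],
            }
        },
    }
}
```

These settings would be supplied at index creation, and the same analyzer applied at search time so query terms are folded consistently.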

Thanks!

kaicode commented 5 years ago

I'm compiling the list of characters for each language which should not be folded/simplified because they are unique to that language.

In Swedish the characters which should not be folded are: åäö. In Spanish I think the characters which should not be folded are: áéíóúüñ.

I'm making the assumption that all characters can be made lowercase during processing for search, regardless of diacritics, so we only need to capture the lowercase versions of each character which must not be folded.
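Combining the lowercase assumption with per-language "do not fold" sets could be sketched like this. The character sets are taken from this discussion; the dictionary itself and the function are illustrative, not Snowstorm's implementation.

```python
import unicodedata

# Hypothetical per-language sets of lowercase characters that must NOT
# be folded; the values come from this thread, not from Snowstorm config.
CHARACTERS_NOT_FOLDED = {
    "sv": set("åäö"),
    "da": set("æøå"),
}

def fold_for_language(text: str, language: str) -> str:
    """Lowercase the text, then ASCII-fold every character except those
    that are significant (must be kept) in the given language."""
    protected = CHARACTERS_NOT_FOLDED.get(language, set())
    out = []
    for ch in text.lower():
        if ch in protected:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)

# The Ångström example from this thread:
print(fold_for_language("Ångström", "en"))  # -> angstrom
print(fold_for_language("Ångström", "sv"))  # -> ångström
```

Because everything is lowercased first, only the lowercase protected characters need to be listed, matching the assumption above.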

danka74 commented 5 years ago

Some more: Danish/Norwegian: å æ ø. Finnish: same as Swedish.

Perhaps a request to the Content Managers AG? /Daniel

danka74 commented 5 years ago

Posted a discussion item on the CMAG discussion page!

kaicode commented 5 years ago

This feature is working, so I'll close this ticket. Only Swedish and Spanish characters are in the configuration so far; see "Search International Character Handling" in application.properties. Looking forward to adding more languages to the configuration via another issue or pull request.

CWdanielsen commented 5 years ago

The order of the Danish letters is: æ ø å / Æ Ø Å.

kaicode commented 5 years ago

Thanks @CWdanielsen, I've added these in the develop branch. They will go out in the next release.

CWdanielsen commented 5 years ago

Thanks Kai. And they are the last three letters in the Danish alphabet, after a-z/A-Z.