Collation for non-English languages

IHTSDO / snowstorm

Scalable SNOMED CT Terminology Server using Elasticsearch

Collation for non-English languages #9

Closed danka74 closed 5 years ago

danka74 commented 5 years ago

Hi,

in what way is specification of collation for non-English languages supported?

https://www.elastic.co/guide/en/elasticsearch/guide/master/sorting-collations.html

/Daniel

kaicode commented 5 years ago

Hi Daniel,

Currently the results of a term search in Snowstorm are sorted by length only, so shorter terms come back first. Alphabetical sorting is not implemented for any language, so sorting for languages other than English has not come up.

Sorting by length seems to work well. Is this sufficient for you?

Kai

danka74 commented 5 years ago

Hi Kai, this is not so much about sorting (which is relevant as well) as it is about character matching. In Swedish, o and ö are distinct characters and should not match, whereas in German ö is just a variant (umlaut) of o, so there they do match. These differences are captured in separate collation rules for each language. This is quite an important feature for non-English languages, and Elasticsearch has basic support for it; see https://www.elastic.co/guide/en/elasticsearch/guide/master/character-folding.html

/Daniel
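To make the Swedish/German distinction concrete, here is a minimal Python sketch of language-aware folding. It is not Snowstorm or Elasticsearch code; the `PRESERVE` table and the `fold` function are hypothetical names introduced only for illustration, and the table covers just the two languages mentioned above.

```python
import unicodedata

# Hypothetical table: letters each language treats as distinct, which must
# therefore survive folding. German has none here, so "ö" folds to "o".
PRESERVE = {
    "sv": set(unicodedata.normalize("NFC", "åäöÅÄÖ")),
    "de": set(),
}

def fold(text, lang):
    """Strip diacritics except from letters the language treats as distinct."""
    keep = PRESERVE.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            # Decompose the character and drop any combining marks.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out).lower()

print(fold("ö", "sv") == fold("o", "sv"))  # False: distinct in Swedish
print(fold("ö", "de") == fold("o", "de"))  # True: variants in German
```

Characters outside the preserved set are still folded for Swedish, so an "é" would match "e" in both languages under this sketch.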

danka74 commented 5 years ago

For reference, I've added MongoDB collations to the sct-snapshot-rest-api in this commit: https://github.com/danka74/sct-snapshot-rest-api/commit/e704c32f7386e636dc2bf19dbe679f5d97394a70

danka74 commented 5 years ago

Did some testing with a local installation of Snowstorm. Currently it seems that strings are matched byte for byte, without any collation rules: searching for "magyar agar" returns no hits, whereas "magyar agår" returns 132436001 | Magyar Agår dog breed (organism) |. The snapshot-api, by contrast, uses hardcoded folding of characters (e.g. 'å' becomes 'a') if selected.
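The difference between the two behaviours can be reproduced in a few lines of Python. This is only a sketch: it uses plain substring matching as a stand-in for a real term search, and the helper names `strip_marks`, `exact_match` and `folded_match` are hypothetical.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks, so that 'å' becomes 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def exact_match(query, term):
    # Case-insensitive comparison with no folding of special characters.
    return query.lower() in term.lower()

def folded_match(query, term):
    # Fold both sides before comparing.
    return strip_marks(query).lower() in strip_marks(term).lower()

term = "Magyar Agår dog breed (organism)"
print(exact_match("magyar agar", term))   # False
print(exact_match("magyar agår", term))   # True
print(folded_match("magyar agar", term))  # True
```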

kaicode commented 5 years ago

Hi @danka74, sorry for the slow response, I've been away.

Yes, the current behaviour is to not convert any special characters to a simpler form during search but match using the variant given. It sounds like this is not adequate for some languages like German. Thanks for your example to help me understand this.

Although we have the language code to hand when we index Description components, it may not be necessary to change the analyser at index time. The simplest approach may be to rely on the request language header and to use a different search analyser based on the language being requested. If terms in the German language are being requested, both the exact characters in the search string and the folded version could be used to match descriptions. Matches against the original search characters should probably be given a higher search score. Would that work for you?
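A rough sketch of that search-time idea, written as a Python function that builds an Elasticsearch query body. The field name `term`, the `FOLDING_LANGUAGES` set, the boost value and the function names are all assumptions made for illustration; none of them is taken from Snowstorm.

```python
import unicodedata

# Hypothetical: languages for which a folded variant of the search string
# should also be tried.
FOLDING_LANGUAGES = {"de"}

def fold(text):
    """Remove combining marks, so that 'ö' becomes 'o'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def build_query(search, language, exact_boost=2.0):
    """Build a bool query that scores exact-character matches higher."""
    # The original characters always take part, with a higher boost.
    clauses = [{"match": {"term": {"query": search, "boost": exact_boost}}}]
    folded = fold(search)
    if language in FOLDING_LANGUAGES and folded != search:
        # The folded variant is added as a second, unboosted clause.
        clauses.append({"match": {"term": {"query": folded}}})
    return {"query": {"bool": {"should": clauses,
                               "minimum_should_match": 1}}}
```

For example, `build_query("Köln", "de")` would produce one boosted clause for "Köln" and one plain clause for "Koln", while the same call with "sv" would produce only the exact clause. Note that folding the search string alone only finds descriptions whose indexed tokens already contain the unaccented form.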

kaicode commented 5 years ago

For the record: Daniel and I have started a branch to collaborate on this feature. We will experiment to find the best Elasticsearch settings. We have identified that it would be best to set the correct Elasticsearch language analyser at index time. Using the Description language code field during import / component creation to select the analyser is a possible solution. We will continue working on this as time allows.
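As one possible shape for such index-time settings, the sketch below builds an Elasticsearch `analysis` block per language code in Python. It assumes the analysis-icu plugin is installed (for the `icu_folding` token filter and its `unicode_set_filter` option) and follows the Swedish example in the character-folding guide linked earlier in this thread. The analyser and filter names, the `EXEMPT` table and the function itself are hypothetical, not what the branch ended up using.

```python
# Hypothetical per-language exemptions: characters that must NOT be folded.
EXEMPT = {
    "sv": "[^åäöÅÄÖ]",  # keep å, ä, ö as distinct letters in Swedish
}

def analysis_settings(language):
    """Return an index 'analysis' block with a folding analyser for a language."""
    folding = {"type": "icu_folding"}  # requires the analysis-icu plugin
    if language in EXEMPT:
        folding["unicode_set_filter"] = EXEMPT[language]
    filter_name = f"{language}_folding"
    return {
        "analysis": {
            "filter": {filter_name: folding},
            "analyzer": {
                f"{language}_folded": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # lowercase runs last so exempted capitals are lowercased too
                    "filter": [filter_name, "lowercase"],
                }
            },
        }
    }
```

With these settings, the Description language code could be used at import time to pick `sv_folded`, `de_folded`, and so on for the term field.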

kaicode commented 5 years ago

Closing as duplicate of #41 which has had more recent chatter and is now fixed in dev.