Hi @sidharthramesh,
Our public instance currently runs the Snowstorm application on an 8G, 2-core Linux machine, with a cluster of two 8G nodes on AWS Elasticsearch. On the public instance the average fetch time for a concept in the minimal format is around 140ms: https://browser.ihtsdotools.org/snowstorm/snomed-ct/MAIN/2020-03-09/concepts/80631005 For a full concept with all descriptions, axioms and relationships it is around 200ms: https://browser.ihtsdotools.org/snowstorm/snomed-ct/browser/MAIN/2020-03-09/concepts/80631005
We are using Elasticsearch on AWS because of the automated backups and easy management for our DevOps team.
I usually find that hosting a single Elasticsearch node on the same machine as Snowstorm makes the API requests 1.5-2 times faster. The main thing is to ensure that Elasticsearch has enough memory. If you have a single machine with 8G I would give Elasticsearch 3G and Snowstorm 2G. Leaving 3G of RAM free on the machine is recommended by the Elasticsearch team, because it will be used by OS-level disk caching, which gives the best performance from Elasticsearch.
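As a rough sketch of that split (the file paths and the Snowstorm jar name below are illustrative; adjust them for your own install):

```
# Elasticsearch heap, set in config/jvm.options:
-Xms3g
-Xmx3g

# Snowstorm heap, set when launching the jar (jar name illustrative):
java -Xms2g -Xmx2g -jar snowstorm.jar
```

Setting -Xms and -Xmx to the same value avoids heap resizing pauses, which the Elasticsearch documentation also recommends.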
Related reading:
I hope that helps. Kind regards, Kai
I'm trying to build an autosuggest feature that prompts the user with related concepts as they type. The GET /{branch}/concepts endpoint was the closest I found to that. The performance of my instance has improved considerably after disabling swap and setting the memory as you suggested.
However, searching for terms (without relations or descriptions), even on the public instance, still takes a long time: https://browser.ihtsdotools.org/snowstorm/snomed-ct/MAIN/concepts?term=cat&offset=0&limit=50
This one takes more than a second; the TTFB alone is about 1.17s.
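For reference, this is roughly how I'm measuring it (a quick sketch with the Python requests library; the parameters match the URL above):

```python
import time

import requests

url = "https://browser.ihtsdotools.org/snowstorm/snomed-ct/MAIN/concepts"
params = {"term": "cat", "offset": 0, "limit": 50}

start = time.time()
response = requests.get(url, params=params)
elapsed = time.time() - start

# The response is a page object; "items" holds the matched concepts.
print(f"{response.status_code} in {elapsed:.2f}s, "
      f"{len(response.json()['items'])} items")
```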
So I tried to debug this and used a reverse proxy to see all the requests being made to Elasticsearch on my local machine. The same search above took about 9.6 seconds (although it's my laptop running Elasticsearch in Docker, so I'm not surprised):
I found that for each terminology search, multiple requests are made to Elasticsearch, some of them really costly. I have included all the slow requests at the end. My question is: can this be improved? Is there a better API for building something like autosuggest?
{
"from": 0,
"query": {
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"filter": [
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"must": [
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"should": [
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"filter": [
{
"simple_query_string": {
"analyze_wildcard": false,
"boost": 1.0,
"default_operator": "and",
"fields": [
"termFolded^1.0"
],
"flags": -1,
"query": "cat*"
}
}
]
}
},
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"filter": [
{
"simple_query_string": {
"analyze_wildcard": false,
"boost": 1.0,
"default_operator": "and",
"fields": [
"termFolded^1.0"
],
"flags": -1,
"query": "cat*"
}
}
],
"must": [
{
"term": {
"languageCode": {
"boost": 1.0,
"value": "no"
}
}
}
]
}
},
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"filter": [
{
"simple_query_string": {
"analyze_wildcard": false,
"boost": 1.0,
"default_operator": "and",
"fields": [
"termFolded^1.0"
],
"flags": -1,
"query": "cat*"
}
}
],
"must": [
{
"term": {
"languageCode": {
"boost": 1.0,
"value": "fi"
}
}
}
]
}
},
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"filter": [
{
"simple_query_string": {
"analyze_wildcard": false,
"boost": 1.0,
"default_operator": "and",
"fields": [
"termFolded^1.0"
],
"flags": -1,
"query": "cat*"
}
}
],
"must": [
{
"term": {
"languageCode": {
"boost": 1.0,
"value": "sv"
}
}
}
]
}
},
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"filter": [
{
"simple_query_string": {
"analyze_wildcard": false,
"boost": 1.0,
"default_operator": "and",
"fields": [
"termFolded^1.0"
],
"flags": -1,
"query": "cat*"
}
}
],
"must": [
{
"term": {
"languageCode": {
"boost": 1.0,
"value": "fr"
}
}
}
]
}
},
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"filter": [
{
"simple_query_string": {
"analyze_wildcard": false,
"boost": 1.0,
"default_operator": "and",
"fields": [
"termFolded^1.0"
],
"flags": -1,
"query": "cat*"
}
}
],
"must": [
{
"term": {
"languageCode": {
"boost": 1.0,
"value": "da"
}
}
}
]
}
},
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"filter": [
{
"simple_query_string": {
"analyze_wildcard": false,
"boost": 1.0,
"default_operator": "and",
"fields": [
"termFolded^1.0"
],
"flags": -1,
"query": "cat*"
}
}
],
"must": [
{
"term": {
"languageCode": {
"boost": 1.0,
"value": "es"
}
}
}
]
}
}
]
}
}
]
}
}
],
"must": [
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"must": [
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"must": [
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"should": [
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"must": [
{
"term": {
"path": {
"boost": 1.0,
"value": "MAIN"
}
}
},
{
"range": {
"start": {
"boost": 1.0,
"from": null,
"include_lower": true,
"include_upper": true,
"to": 1584685935588
}
}
}
],
"must_not": [
{
"exists": {
"boost": 1.0,
"field": "end"
}
}
]
}
}
]
}
}
]
}
}
]
}
}
]
}
},
"size": 10000,
"sort": [
{
"termLen": {
"order": "asc"
}
},
{
"_score": {
"order": "asc"
}
}
],
"stored_fields": [
"descriptionId",
"conceptId"
]
}
{
"from": 0,
"query": {
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"must": [
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"must": [
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"must": [
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"should": [
{
"bool": {
"adjust_pure_negative": true,
"boost": 1.0,
"must": [
{
"term": {
"path": {
"boost": 1.0,
"value": "MAIN"
}
}
},
{
"range": {
"start": {
"boost": 1.0,
"from": null,
"include_lower": true,
"include_upper": true,
"to": 1584685935588
}
}
}
],
"must_not": [
{
"exists": {
"boost": 1.0,
"field": "end"
}
}
]
}
}
]
}
}
]
}
}
]
}
},
{
"terms": {
"additionalFields.acceptabilityId": [
"900000000000548007",
"900000000000549004"
],
"boost": 1.0
}
},
{
"terms": {
"boost": 1.0,
"conceptId": [
"60231008",
"422860002",
"782515007",
"726762008",
"62795009",
"217701002",
"253253009",
"46540009",
"423247009",
"37473008",
"86714001",
"128306009",
"90268004",
"19923001",
"33384004",
"17738004",
"396747005",
"63129006",
"54988005",
"388623001",
"100141008",
"14060003",
"85491003",
"386051003",
"77477000",
"282673009",
"61698003",
"96257008",
"256425001",
"257528009",
"425154009",
"79058000",
"157937004",
"63852007",
"275281000",
"24275002",
"388618001",
"423717008",
"266383007",
"193570009",
"41932008",
"409920005",
"204259006",
"31046007",
"30623001",
"388626009",
"23826000",
"227043000",
"155126003",
"155521003"
]
}
}
]
}
},
"size": 10000
}
{
"_source": {
"excludes": [],
"includes": [
"conceptId"
]
},
"from": 0,
"post_filter": {
"terms": {
"boost": 1.0,
"conceptId": [
257528009,
388618001,
33384004,
388623001,
388626009,
23826000,
30623001,
46540009,
63852007,
90268004,
79058000,
31046007,
24275002,
266383007,
204259006,
256425001,
253253009,
155521003,
227043000,
275281000,
282673009,
425154009,
396747005,
63129006,
86714001,
100141008,
96257008,
17738004,
41932008,
54988005,
726762008,
782515007,
157937004,
155126003,
193570009,
217701002,
...too long to post
Docker is really killing your performance there.
In the first Elasticsearch query, which is against the description index, there is a clause for each language that is configured with special character folding. This allows the search to work as expected in multiple languages. If you are only interested in English (or any language which does not need character folding) you could try removing everything starting with 'search.language' from the 'Search International Character Handling' section of application.properties on your local instance, to see if that helps. This should simplify the first query, but it's unlikely to speed things up much.
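That section looks roughly like this (the property keys and values below are illustrative from memory; check the actual application.properties file for the exact names):

```
# Search International Character Handling
# One entry per language; removing all of the search.language.* lines
# disables the per-language folding clauses in the description query.
search.language.charactersNotFolded.da=æøå
search.language.charactersNotFolded.no=æøå
search.language.charactersNotFolded.sv=åäö
search.language.charactersNotFolded.fi=åäö
```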
The cost usually comes from the number of requests made to Elasticsearch rather than the cost of any single one. The design of the Snowstorm indices allows content to be branched and versioned (you can see the path and start/end clauses doing that filtering in the queries above).
For this reason the information for each concept is not denormalised into a single Elasticsearch document; it is spread over several indices, in a similar way to the RF2 distribution files of SNOMED CT. This means that to fulfil a search request several queries must be made: one to match descriptions, then more to gather all the information to be returned, as sketched below. The member index request is needed to work out which description is the FSN and which is the PT for each matched concept in the language you have requested.
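A minimal sketch of that sequence, using the elasticsearch Python client. This is not Snowstorm's actual code; the index names ("description", "member", "concept") and the simplified branch criteria are assumptions based on the query dumps above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Match descriptions by folded term on the branch (the real query also
#    applies a start-time range per branch version).
descs = es.search(index="description", body={
    "query": {"bool": {
        "filter": [{"simple_query_string": {
            "query": "cat*", "fields": ["termFolded"],
            "default_operator": "and"}}],
        "must": [{"term": {"path": "MAIN"}}],
        "must_not": [{"exists": {"field": "end"}}],
    }},
    "size": 50,
})
concept_ids = [h["_source"]["conceptId"] for h in descs["hits"]["hits"]]

# 2. Fetch language reference set members to work out which description is
#    the FSN/PT for each matched concept (acceptability ids from the dump).
members = es.search(index="member", body={
    "query": {"bool": {"must": [
        {"terms": {"conceptId": concept_ids}},
        {"terms": {"additionalFields.acceptabilityId": [
            "900000000000548007", "900000000000549004"]}},
    ]}},
    "size": 1000,
})

# 3. Load the matched concepts themselves.
concepts = es.search(index="concept", body={
    "query": {"terms": {"conceptId": concept_ids}},
    "size": 50,
})
```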
You could also try the description endpoint: https://browser.ihtsdotools.org/snowstorm/snomed-ct/MAIN/descriptions?term=cat
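For an autosuggest use case this returns the matched terms directly, which is often all a dropdown needs, e.g. (Python requests again; the field names read from the response are assumptions, so check them against your own instance):

```python
import requests

url = "https://browser.ihtsdotools.org/snowstorm/snomed-ct/MAIN/descriptions"
response = requests.get(url, params={"term": "cat", "limit": 50})

# Each item is a matched description with its term and owning concept.
for item in response.json()["items"]:
    print(item["term"], item["conceptId"])
```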
If you need something super fast but very simple, in a single language, for a single release of SNOMED CT, I would consider using something which works with Lucene directly. There is a starter project which you may be interested in; it's something I created before Snowstorm and it's not really in active maintenance at the moment: SNOMED Query Service. It's not mentioned in the readme, but the REST API can accept a term parameter.
I hope that helps. Kind regards, Kai
Good luck with your project! 😄
Thank you! I am now working on using Elasticsearch directly to reduce the time taken per search. This is a great project, especially considering the ECL query implementation!
Each query, including looking up a concept by code, takes ~1 second or more. The minimum response time I was able to get with an 8GB RAM, 4-core dedicated Linux machine running Snowstorm averaged 600ms, which is still a lot. Is this the expected performance? Are there any best practices for optimising search speed?
Also, how exactly does caching work? I see Snowstorm print "Caches are hot" to the console, but the response times for repeated searches are unchanged.