brentgriffin closed this issue 6 years ago.
As I was writing up the details here, it occurred to me that the timeout setting on the tool may be a factor here. I will increase that setting to see what impact, if any, it has on the results. Either way, 4.3.2 was run with the same settings and did not manifest the problems.
After looking at the tool, the setting I was thinking of is not relevant here, so no additional data to add at this time.
If it helps, here is some glowroot output for the contact-us and root pages. Under load, these seem to be the pages that are slowing down the most. https://gist.github.com/brentgriffin/5b17dea5f8c42d2bae1820f9ab9fef49
You have to copy and paste these files into an HTML file and then open it in a browser to see the profile info.
One issue I see is that KeyValueAPIImpl does not use a 404 cache, so it hits ES very hard; the contact-us page uses it for form labels.
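For illustration, here is a minimal sketch of what a negative (404) cache in front of the ES lookup could look like. The class name KeyValue404Cache and the lookupInIndex method are hypothetical, not existing dotCMS APIs; a real implementation would presumably sit behind KeyValueAPIImpl and use the dotCMS cache layer.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a negative ("404") cache so repeated misses do not hit ES.
public class KeyValue404Cache {

    private static final Object NOT_FOUND = new Object();
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    public Optional<String> get(final String key) {
        final Object cached = cache.get(key);
        if (cached == NOT_FOUND) {
            return Optional.empty();             // known miss: skip ES entirely
        }
        if (cached != null) {
            return Optional.of((String) cached);
        }
        final String value = lookupInIndex(key); // the expensive ES search
        cache.put(key, value != null ? value : NOT_FOUND);
        return Optional.ofNullable(value);
    }

    private String lookupInIndex(final String key) {
        return null;                             // placeholder for the real ES query
    }
}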
Based on the committed changes, I wanted to rerun my tests and compare the results. I ran this on my local Ubuntu 16.04 machine (bare metal) without Docker, using a build of commit 0e6c4a3 from the master branch (the latest commit available when I started this testing yesterday) and running against Postgres 9.5.
Things are better than they were, but the average response time per request continues to increase slowly, even after 7.5 hours in this test. (In another test I have running on AWS with Docker and Postgres 9.6, it is still increasing after 16 hours.)
My theory after initially filing this issue was that threads were queueing up for ES searchIndex and not getting processed in order, so some requests were never getting processed, or only after significant delays. Based on the data from this longer run, I believe this has been proven true. Here are thread dumps taken while dotCMS was still under load. The thread dumps were taken every minute for a period of five minutes. The final thread dump was taken at 7:32 this morning and shows these threads waiting since these times:
7:18:11 - 3 Threads
41 threads were waiting on an ES index search for between 14 minutes and 7 hours. Their stacks look like this:
- parking to wait for <0x0000000XXXXXXXXX> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:251)
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:94)
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:39)
at com.dotcms.content.elasticsearch.business.ESContentFactoryImpl.indexSearch(ESContentFactoryImpl.java:1419)
at com.dotcms.content.elasticsearch.business.ESContentletAPIImpl.searchIndex(ESContentletAPIImpl.java:816)
at com.dotcms.content.elasticsearch.business.ESContentletAPIImpl.search(ESContentletAPIImpl.java:678)
at com.dotcms.content.elasticsearch.business.ESContentletAPIImpl.search(ESContentletAPIImpl.java:669)
As you can see from these thread dumps, once the load was stopped, the threads finished processing and were no longer waiting. So, in my thinking, this proves that when dotCMS queues up ES requests, those requests are not processed anywhere near the order in which they enter the queue. Perhaps this is a known ES issue, or it is the way we are invoking ES. It seems that some requests may never be processed until the load on ES decreases, or at the very least only after some very long delays.
Oh, one other thing: the JVM heap was set to 4 GB and GC was not a problem. There had been 0 full GCs, and only 3 to 4 seconds per minute were spent doing minor collections.
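As an aside, the stacks above show actionGet() being called with no timeout, which is why the threads park on BaseFuture$Sync for hours. One way to at least bound the damage (not a fix for the ordering problem itself) would be to pass a timeout to the future. A sketch using the ES 6 transport client API, assuming access to the same Client instance ESContentFactoryImpl already uses:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

// Sketch only: bound the wait on the search future instead of blocking forever.
public class BoundedSearch {

    static SearchResponse search(final Client client, final SearchRequest request) {
        // actionGet(TimeValue) throws ElasticsearchTimeoutException after 30s
        // instead of leaving the request thread parked indefinitely.
        return client.search(request).actionGet(TimeValue.timeValueSeconds(30));
    }
}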
Tested commit with fixes 7a060f9 - still shows the same problems.
Currently testing with the commit prior to the ES upgrade (7fa4c7f) to verify that this is definitely related to the ES upgrade. I should have official results in the morning, but based on the first 10 minutes I do not believe it is going to have the same problem.
As you can see commit 7fa4c7f does NOT have this problem. It ran for over 8.5 hours and its response time is incredibly consistent.
Also of note is the CPU usage. In the tests that have problems, all 8 cores are maxed at 100% utilization, but in this test that does not have the problem, the 8 cores average less than 25% utilization. This leads me to believe two things:
I attempted to recreate this on the commit prior to the ES upgrade (7fa4c7f) and could not, even with additional load added on top of the current load (added Apache Benchmark hitting the homepage: 20 remote and 10 local).
In the current build (7b64b93), glowroot shows that the delays in long transactions come from waiting on ES indexSearch.
Using the ES REST API in both builds, I can see a significant difference in search queue size. In the current build, I regularly see all 50 clients waiting on ES: there are 13 max search threads, all 13 are always busy searching, and it is not unusual to see 30+ requests in the search queue. In the old build (7fa4c7f), it is not uncommon to see single-digit (including 0) active searches, and I have never seen the waiting queue above 8; usually the waiting queue is 0.
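For reference, the same queue numbers can be read from the Java transport client instead of the REST API; a rough sketch, assuming access to the Client instance dotCMS already holds:

import org.elasticsearch.action.admin.cluster.node.stats.NodeStats;
import org.elasticsearch.action.admin.cluster.node.stats.NodesStatsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.threadpool.ThreadPoolStats;

// Sketch: print active/queued/rejected counts for the "search" thread pool on each node.
public class SearchQueueMonitor {

    static void printSearchQueue(final Client client) {
        final NodesStatsResponse stats =
                client.admin().cluster().prepareNodesStats().setThreadPool(true).get();
        for (final NodeStats node : stats.getNodes()) {
            for (final ThreadPoolStats.Stats pool : node.getThreadPool()) {
                if ("search".equals(pool.getName())) {
                    System.out.println(node.getNode().getName()
                            + " active=" + pool.getActive()
                            + " queue=" + pool.getQueue()
                            + " rejected=" + pool.getRejected());
                }
            }
        }
    }
}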
Based on the information I have at this point, it seems our ES searches are taking considerably longer than they used to. My current thought is to see if I can find a specific ES query via the ES REST API that is considerably slower in the current build. If I can find such a query, it would remove our ES client code from consideration and give a much simpler test case for troubleshooting than we currently have.
Thoughts? If so, please share them.
After adding the following slowlog configuration in ESIndexAPI.java:
map.put("index.search.slowlog.threshold.query.warn", "1s");
map.put("index.search.slowlog.threshold.query.info", "1s");
map.put("index.search.slowlog.threshold.query.debug", "1s");
map.put("index.search.slowlog.threshold.query.trace", "1s");
map.put("index.search.slowlog.threshold.fetch.warn", "1s");
map.put("index.search.slowlog.threshold.fetch.info", "1s");
map.put("index.search.slowlog.threshold.fetch.debug", "1s");
map.put("index.search.slowlog.threshold.fetch.trace", "1s");
map.put("index.search.slowlog.level", "info");
I found some slow queries: https://gist.github.com/nollymar/42fd03560f9bc3bbc0f42035962fa611
Nice - reviewing slow queries now
I was making the setting via the ES REST API; for future reference:
curl -X PUT "http://localhost:9200/INDEXNAME/_settings?pretty" -H "Content-Type:application/json" -d '{"index.search.slowlog.threshold.query.warn": "1s"}'
Just replace INDEXNAME with the name of your index. This requires the ES HTTP setting to be enabled, but it is nice in that the config can be changed at runtime.
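For completeness, a sketch of making the same runtime change from the Java transport client instead (the index name is a placeholder, as above):

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;

// Sketch: apply the slowlog threshold to a live index without a restart.
public class SlowlogSettings {

    static void enableQuerySlowlog(final Client client, final String indexName) {
        client.admin().indices()
                .prepareUpdateSettings(indexName)
                .setSettings(Settings.builder()
                        .put("index.search.slowlog.threshold.query.warn", "1s")
                        .build())
                .get();
    }
}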
Additionally, after running curl localhost:9200/_nodes/hot_threads several times, I found this: https://gist.github.com/nollymar/c0768fd408ac55942d152b55cd7b0e01
Another finding: http://localhost:9200/_nodes/stats?pretty shows the miss_count increasing during performance tests, which makes me think that search requests are not being cached correctly. After running locust for 10 minutes I got these results:
"request_cache" : {
"memory_size_in_bytes" : 20943,
"evictions" : 0,
"hit_count" : 207,
"miss_count" : 43
}
I have captured the same queries in both versions for comparison - https://gist.github.com/brentgriffin/dd52f1da1cfb49f3c88d76a590b6e493
I am an ES newbie, so I am not sure whether the differences are significant, but I will highlight them here so others can weigh in. The two differences that seem like they may be significant to me are:
Previously "_cache:true" was included on each query but not in current version
Previously "inode" and "identifier" were included in the fields of the query but the current version includes those in "_source" instead
@brentgriffin I think your last comment is aligned with the request cache issue I found. Regarding your comment, "Previously 'inode' and 'identifier' were included in the fields of the query but the current version includes those in '_source' instead": that is true because in ES 6 those fields have to be read from _source. It is related to my last commit: https://github.com/dotCMS/core/commit/1fc8b638daf04bf254e08a6c948096f938d87a44. Previously, we were retrieving all fields, which was inefficient.
In ES 6, the fields array is used only for stored fields: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-stored-fields.html. We're not using stored fields because it's not recommended; the docs suggest getting fields from _source instead.
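To make that concrete, here is roughly what requesting only inode and identifier via _source filtering looks like with the ES 6 Java client; a sketch, not the actual dotCMS code:

import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

// Sketch: in ES 6, pull inode/identifier out of _source instead of the old fields array.
public class SourceFilteredSearch {

    static SearchRequestBuilder build(final Client client, final String indexName) {
        return client.prepareSearch(indexName)
                .setQuery(QueryBuilders.matchAllQuery())                    // placeholder query
                .setFetchSource(new String[]{"inode", "identifier"}, null); // includes, excludes
    }
}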
Score calculation is taking a lot of time. I think we need to verify which queries really need scoring and prefer filters over queries; many of them will be cached and performance will improve: https://opensourceconnections.com/blog/2017/07/10/caching_in_elasticsearch/ https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-filter-context.html
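As a sketch of what that could look like with the Java client (field names borrowed from the slow query below; this is not the actual dotCMS query builder), the idea is to keep only the relevance part in query context and move the yes/no conditions into filter clauses, which skip scoring and can be cached:

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// Sketch: score only the relevance part, filter (unscored, cacheable) everything else.
public class FilteredQueryExample {

    static BoolQueryBuilder build(final String scoredLuceneQuery) {
        return QueryBuilders.boolQuery()
                .must(QueryBuilders.queryStringQuery(scoredLuceneQuery)) // e.g. the boosted tags clauses
                .filter(QueryBuilders.termQuery("live", true))           // filter context: no scoring
                .filter(QueryBuilders.termQuery("deleted", false))
                .filter(QueryBuilders.termQuery("languageid", 1));
    }
}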
This query is taking almost 3 seconds to execute:
14:22:51.144 WARN slowlog.query - [live_20180419141207][0] took[2.9s], took_millis[2978], types[],
stats[], search_type[QUERY_THEN_FETCH], total_shards[1], source[{"size":5,"query":{"query_string":
{"query":"+structurename:news +(conhost:48190c8c-42c4-46af-8d1a-0cd5db894797
conhost:system_host) tags:fund*^3 tags:asset allocation*^2 tags:earnings*^2 tags:long ideas*^2
tags:index*^2 tags:long & short ideas*^2 tags:services*^2 tags:jewelry stores*^2 +languageid:1
+live:true +deleted:false +((+owner:anonymous +ownercanread:true)
(permissions:pedecd377-2321-4803-aa8b-89797dd0d61f.1p* permissions:p654b0931-1027-41f7-ad4d-173115ed8ec1.1p* )
(permissions:p654b0931-1027-41f7-ad4d-173115ed8ec1.1p*) ) +live:true","fields":
[],"type":"best_fields","default_operator":"or","max_determinized_states":10000,"
enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,
"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,
"auto_generate_synonyms_phrase_query":true,"fuzzy_transpositions":true,"boost":1.0}},
"_source":{"includes":["inode","identifier"],"excludes":[]},"sort":[{"_score":{"order":"desc"}},
{"news.syspublishdate":{"order":"desc"}}]}],
This PR improves the performance of personalization queries: https://github.com/dotCMS/core/pull/14172
Note to QA: Personas feature should be tested after this fix because we were not filtering content with the right tags. This issue is happening in 4.x versions too.
Good work, Nolly! This seems to be the fix we were looking for. Our degrading ES performance issue seems to be resolved.
I verified the performance improvement of this fix by building and testing commit 2927697. As you can see from the diagram, the performance is on par with the performance prior to the ES upgrade.
Nolly and I both see some things with our ES implementation that can be improved; however, the current issue is resolved from a performance perspective. To Nolly's point, persona functionality should be tested to ensure it is working as it should.
Also of note, CPU utilization is down significantly (only about 20%), and there is a significant decrease in queued threads. The most queued ES search threads I saw with the fix was 9, compared to 37 prior to the fix. Most often now with the fix, the ES search queue is at 0.
Fixed. Based on @brentgriffin's comments, we now have the required performance.
I tested filtering content according to persona tags from a widget, and also firing actionlets via the rules engine depending on the persona; both cases work as expected.
I was trying to run some load against our Docker swarm and in certain cases saw performance drop off significantly after a couple of minutes.
Notice how, in just a minute or so, the "requests per second" go down and the "response time" starts going up. Thread dumps show up to 46 threads under com.dotcms.content.elasticsearch.business.ESContentFactoryImpl.indexSearch waiting for a response.
For comparison, I ran the same test against 4.3.2 and did NOT see this behavior:
This was done using the locust_pages.py file that requests all of the pages in the starter and does not include assets, .css, .js, etc. I tried to narrow this down to a specific page but was unable to. I specifically tried just the homepage, just the about-us page, and just the product page, but each of these runs did NOT manifest the problem.
The tar file linked below in step 3 contains thread dumps - of specific interest are imac/master/locust_pages/dump*
Expected Behavior
I would expect 5.0 to perform the same way we see 4.3.2 perform.
Current Behavior
Described above in summary.
Possible Solution
Not sure - probably going to need to profile with glowroot or some other tool to get to the bottom of this one.
Known Workarounds
NONE at this time
Steps to Reproduce (for bugs)
Context
Your Environment