brentgriffin closed this issue 6 years ago.
As I was writing up the details here, it occurred to me that the timeout setting on the tool may be a factor here. I will increase that setting to see what impact, if any, it has on the results. Either way, 4.3.2 was run with the same settings and did not manifest the problems.
After looking at the tool, the setting I was thinking of is not relevant here, so no additional data to add at this time.
If it helps, here is some glowroot output for the contact-us and root pages. Under load, these seem to be the pages that are slowing down the most. https://gist.github.com/brentgriffin/5b17dea5f8c42d2bae1820f9ab9fef49
You have to copy and paste these files into an HTML file and then open it in a browser to see the profile info.
One issue I see is that KeyValueAPIImpl does not use a 404 cache, so it hits ES very hard; the contact-us page uses it for form labels.
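For illustration, here is a minimal sketch of what a negative (404) cache in front of the ES lookup could look like. The class name KeyValue404Cache and the lookupInIndex method are hypothetical, not existing dotCMS APIs; a real implementation would presumably sit behind KeyValueAPIImpl and use the dotCMS cache layer.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a negative ("404") cache so repeated misses do not hit ES.
public class KeyValue404Cache {

    private static final Object NOT_FOUND = new Object();
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    public Optional<String> get(final String key) {
        final Object cached = cache.get(key);
        if (cached == NOT_FOUND) {
            return Optional.empty();             // known miss: skip ES entirely
        }
        if (cached != null) {
            return Optional.of((String) cached);
        }
        final String value = lookupInIndex(key); // the expensive ES search
        cache.put(key, value != null ? value : NOT_FOUND);
        return Optional.ofNullable(value);
    }

    private String lookupInIndex(final String key) {
        return null;                             // placeholder for the real ES query
    }
}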
Based on the committed changes, I wanted to rerun my tests and compare the results. I ran this on my local Ubuntu 16.04 machine (bare metal) without Docker, using a build of commit 0e6c4a3 from the master branch (the latest commit available when I started this testing yesterday) and running against Postgres 9.5.
Things are better than they were, but the average response time per request continues to increase slowly, even after 7.5 hours in this test. (In another test I have running on AWS with Docker and Postgres 9.6, it is still increasing after 16 hours.)
My theory after initially filing this issue was that threads were queueing up for ES searchIndex and not getting processed in order, so some requests were never getting processed, or only after significant delays. Based on the data from this longer run, I believe this has been proven true. Here are thread dumps taken while dotCMS was still under load. The thread dumps were taken every minute for a period of five minutes. The final thread dump was taken at 7:32 this morning and shows these threads waiting since these times:
7:18:11 - 3 Threads
41 threads were waiting on an ES index search for between 14 minutes and 7 hours. Their stacks look like this:
- parking to wait for <0x0000000XXXXXXXXX> (a org.elasticsearch.common.util.concurrent.BaseFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:251)
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:94)
at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:39)
at com.dotcms.content.elasticsearch.business.ESContentFactoryImpl.indexSearch(ESContentFactoryImpl.java:1419)
at com.dotcms.content.elasticsearch.business.ESContentletAPIImpl.searchIndex(ESContentletAPIImpl.java:816)
at com.dotcms.content.elasticsearch.business.ESContentletAPIImpl.search(ESContentletAPIImpl.java:678)
at com.dotcms.content.elasticsearch.business.ESContentletAPIImpl.search(ESContentletAPIImpl.java:669)
As you can see from these thread dumps, once the load was stopped, the threads finished processing and were no longer waiting. So, in my thinking, this proves that when dotCMS queues up ES requests, those requests are not processed anywhere near the order in which they enter the queue. Perhaps this is a known ES issue, or it is the way we are invoking ES. It seems that some requests may never be processed until the load on ES decreases, or at the very least only after some very long delays.
Oh, one other thing: the JVM heap was set to 4 GB and GC was not a problem. There had been 0 full GCs, and only 3 to 4 seconds per minute were spent doing minor collections.
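As an aside, the stacks above show actionGet() being called with no timeout, which is why the threads park on BaseFuture$Sync for hours. One way to at least bound the damage (not a fix for the ordering problem itself) would be to pass a timeout to the future. A sketch using the ES 6 transport client API, assuming access to the same Client instance ESContentFactoryImpl already uses:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

// Sketch only: bound the wait on the search future instead of blocking forever.
public class BoundedSearch {

    static SearchResponse search(final Client client, final SearchRequest request) {
        // actionGet(TimeValue) throws ElasticsearchTimeoutException after 30s
        // instead of leaving the request thread parked indefinitely.
        return client.search(request).actionGet(TimeValue.timeValueSeconds(30));
    }
}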
Tested commit with fixes 7a060f9 - still shows the same problems.
Currently testing with the commit prior to the ES upgrade (7fa4c7f) to verify that this is definitely related to the ES upgrade. I should have official results in the morning, but based on the first 10 minutes I do not believe it is going to have the same problem.
As you can see commit 7fa4c7f does NOT have this problem. It ran for over 8.5 hours and its response time is incredibly consistent.
Also of note is the CPU usage. In the tests that have problems, all 8 cores are maxed at 100% utilization, but in this test that does not have the problem, the 8 cores average less than 25% utilization. This leads me to believe two things:
I attempted to recreate this on the commit prior to the ES upgrade (7fa4c7f) and could not, even with additional load added on top of the current load (added Apache Benchmark hitting the homepage: 20 remote and 10 local).
In the current build (7b64b93), glowroot shows that the delays in long transactions come from waiting on ES indexSearch.
Using the ES REST API in both builds, I can see a significant difference in search queue size. In the current build, I regularly see all 50 clients waiting on ES: there are 13 max search threads, all 13 are always busy searching, and it is not unusual to see 30+ requests in the search queue. In the old build (7fa4c7f), it is not uncommon to see single-digit (including 0) active searches, and I have never seen the waiting queue above 8; usually the waiting queue is 0.
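For reference, the same queue numbers can be read from the Java transport client instead of the REST API; a rough sketch, assuming access to the Client instance dotCMS already holds:

import org.elasticsearch.action.admin.cluster.node.stats.NodeStats;
import org.elasticsearch.action.admin.cluster.node.stats.NodesStatsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.threadpool.ThreadPoolStats;

// Sketch: print active/queued/rejected counts for the "search" thread pool on each node.
public class SearchQueueMonitor {

    static void printSearchQueue(final Client client) {
        final NodesStatsResponse stats =
                client.admin().cluster().prepareNodesStats().setThreadPool(true).get();
        for (final NodeStats node : stats.getNodes()) {
            for (final ThreadPoolStats.Stats pool : node.getThreadPool()) {
                if ("search".equals(pool.getName())) {
                    System.out.println(node.getNode().getName()
                            + " active=" + pool.getActive()
                            + " queue=" + pool.getQueue()
                            + " rejected=" + pool.getRejected());
                }
            }
        }
    }
}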
Based on the information I have at this point, it seems our ES searches are taking considerably longer than they used to. My current thought is to see if I can find a specific ES query via the ES REST API that is considerably slower in the current build. If I can find such a query, it would remove our ES client code from consideration and give a much simpler test case for troubleshooting than we currently have.
Thoughts? If so, please share them.
After adding the following slowlog configuration in ESIndexAPI.java:
map.put("index.search.slowlog.threshold.query.warn", "1s");
map.put("index.search.slowlog.threshold.query.info", "1s");
map.put("index.search.slowlog.threshold.query.debug", "1s");
map.put("index.search.slowlog.threshold.query.trace", "1s");
map.put("index.search.slowlog.threshold.fetch.warn", "1s");
map.put("index.search.slowlog.threshold.fetch.info", "1s");
map.put("index.search.slowlog.threshold.fetch.debug", "1s");
map.put("index.search.slowlog.threshold.fetch.trace", "1s");
map.put("index.search.slowlog.level", "info");
I found some slow queries: https://gist.github.com/nollymar/42fd03560f9bc3bbc0f42035962fa611
Nice - reviewing slow queries now
I was making the setting via the ES REST API; for future reference:
curl -X PUT "http://localhost:9200/INDEXNAME/_settings?pretty" -H "Content-Type:application/json" -d '{"index.search.slowlog.threshold.query.warn": "1s"}'
Just replace INDEXNAME with the name of your index. This requires the ES HTTP setting to be enabled, but it is nice in that the config can be changed at runtime.
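For completeness, a sketch of making the same runtime change from the Java transport client instead (the index name is a placeholder, as above):

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;

// Sketch: apply the slowlog threshold to a live index without a restart.
public class SlowlogSettings {

    static void enableQuerySlowlog(final Client client, final String indexName) {
        client.admin().indices()
                .prepareUpdateSettings(indexName)
                .setSettings(Settings.builder()
                        .put("index.search.slowlog.threshold.query.warn", "1s")
                        .build())
                .get();
    }
}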
Additionally, after running curl localhost:9200/_nodes/hot_threads several times, I found this: https://gist.github.com/nollymar/c0768fd408ac55942d152b55cd7b0e01
Another finding: http://localhost:9200/_nodes/stats?pretty shows the miss_count increasing during performance tests, which makes me think that search requests are not being cached correctly. After running locust for 10 minutes I got these results:
"request_cache" : {
"memory_size_in_bytes" : 20943,
"evictions" : 0,
"hit_count" : 207,
"miss_count" : 43
}
I have captured the same queries in both versions for comparison - https://gist.github.com/brentgriffin/dd52f1da1cfb49f3c88d76a590b6e493
I am an ES newbie, so I am not sure whether the differences are significant, but I will highlight them here so others can weigh in. The two differences that seem like they may be significant to me are:
Previously "_cache:true" was included on each query but not in current version
Previously "inode" and "identifier" were included in the fields of the query but the current version includes those in "_source" instead
@brentgriffin I think your last comment is aligned with the request cache issue I found. Regarding your comment, "Previously 'inode' and 'identifier' were included in the fields of the query but the current version includes those in '_source' instead": that is true because in ES 6 those fields have to be read from _source. It is related to my last commit: https://github.com/dotCMS/core/commit/1fc8b638daf04bf254e08a6c948096f938d87a44. Previously, we were retrieving all fields, which was inefficient.
In ES 6, the fields array is used only for stored fields: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-stored-fields.html. We're not using stored fields because it's not recommended; the docs suggest getting fields from _source instead.
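To make that concrete, here is roughly what requesting only inode and identifier via _source filtering looks like with the ES 6 Java client; a sketch, not the actual dotCMS code:

import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

// Sketch: in ES 6, pull inode/identifier out of _source instead of the old fields array.
public class SourceFilteredSearch {

    static SearchRequestBuilder build(final Client client, final String indexName) {
        return client.prepareSearch(indexName)
                .setQuery(QueryBuilders.matchAllQuery())                    // placeholder query
                .setFetchSource(new String[]{"inode", "identifier"}, null); // includes, excludes
    }
}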
Score calculation is taking a lot of time. I think we need to verify which queries really need scoring and prefer filters over queries; many of them will be cached and performance will improve: https://opensourceconnections.com/blog/2017/07/10/caching_in_elasticsearch/ https://www.elastic.co/guide/en/elasticsearch/reference/6.2/query-filter-context.html
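As a sketch of what that could look like with the Java client (field names borrowed from the slow query below; this is not the actual dotCMS query builder), the idea is to keep only the relevance part in query context and move the yes/no conditions into filter clauses, which skip scoring and can be cached:

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// Sketch: score only the relevance part, filter (unscored, cacheable) everything else.
public class FilteredQueryExample {

    static BoolQueryBuilder build(final String scoredLuceneQuery) {
        return QueryBuilders.boolQuery()
                .must(QueryBuilders.queryStringQuery(scoredLuceneQuery)) // e.g. the boosted tags clauses
                .filter(QueryBuilders.termQuery("live", true))           // filter context: no scoring
                .filter(QueryBuilders.termQuery("deleted", false))
                .filter(QueryBuilders.termQuery("languageid", 1));
    }
}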
This query is taking almost 3 seconds to execute:
14:22:51.144 WARN slowlog.query - [live_20180419141207][0] took[2.9s], took_millis[2978], types[],
stats[], search_type[QUERY_THEN_FETCH], total_shards[1], source[{"size":5,"query":{"query_string":
{"query":"+structurename:news +(conhost:48190c8c-42c4-46af-8d1a-0cd5db894797
conhost:system_host) tags:fund*^3 tags:asset allocation*^2 tags:earnings*^2 tags:long ideas*^2
tags:index*^2 tags:long & short ideas*^2 tags:services*^2 tags:jewelry stores*^2 +languageid:1
+live:true +deleted:false +((+owner:anonymous +ownercanread:true)
(permissions:pedecd377-2321-4803-aa8b-89797dd0d61f.1p* permissions:p654b0931-1027-41f7-ad4d-173115ed8ec1.1p* )
(permissions:p654b0931-1027-41f7-ad4d-173115ed8ec1.1p*) ) +live:true","fields":
[],"type":"best_fields","default_operator":"or","max_determinized_states":10000,"
enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,
"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,
"auto_generate_synonyms_phrase_query":true,"fuzzy_transpositions":true,"boost":1.0}},
"_source":{"includes":["inode","identifier"],"excludes":[]},"sort":[{"_score":{"order":"desc"}},
{"news.syspublishdate":{"order":"desc"}}]}],
This PR improves the performance of personalization queries: https://github.com/dotCMS/core/pull/14172
Note to QA: Personas feature should be tested after this fix because we were not filtering content with the right tags. This issue is happening in 4.x versions too.
Good work, Nolly! This seems to be the fix we were looking for. Our degrading ES performance issue seems to be resolved.
I verified the performance improvement of this fix by building and testing commit 2927697. As you can see from the diagram, the performance is on par with the performance prior to the ES upgrade.
Nolly and I both see some things with our ES implementation that can be improved; however, the current issue is resolved from a performance perspective. To Nolly's point, persona functionality should be tested to ensure it is working as it should.
Also of note, CPU utilization is down significantly (only about 20%), and there is a significant decrease in queued threads. The most queued ES search threads I saw with the fix was 9, compared to 37 prior to the fix. Most often now with the fix, the ES search queue is at 0.
Fixed. Based on @brentgriffin's comments, we now have the required performance.
I tested filtering content according to persona tags from a widget, and also firing actionlets via the rules engine depending on the persona; both cases work as expected.
I was trying to run some load against our Docker swarm and in certain cases saw performance drop off significantly after a couple of minutes.
Notice how, in just a minute or so, the "requests per second" go down and the "response time" starts going up. Thread dumps show up to 46 threads under com.dotcms.content.elasticsearch.business.ESContentFactoryImpl.indexSearch waiting for a response.
For comparison, I ran the same test against 4.3.2 and did NOT see this behavior:
This was done using the locust_pages.py file that requests all of the pages in the starter and does not include assets, .css, .js, etc. I tried to narrow this down to a specific page but was unable to. I specifically tried just the homepage, just the about-us page, and just the product page, but each of these runs did NOT manifest the problem.
The tar file linked below in step 3 contains thread dumps - of specific interest are imac/master/locust_pages/dump*
Expected Behavior
I would expect 5.0 to perform the same way we see 4.3.2 perform.
Current Behavior
Described above in summary.
Possible Solution
Not sure - probably going to need to profile with glowroot or some other tool to get to the bottom of this one.
Known Workarounds
NONE at this time
Steps to Reproduce (for bugs)
Context
Your Environment