cloud-gov / kubernetes-broker

Broker for kubernetes based services
Apache License 2.0
5 stars 6 forks

Resource allocations on Elasticsearch service #34

Closed kynetiv closed 7 years ago

kynetiv commented 7 years ago

In order to enable College Scorecard to migrate to the GovCloud environment, we want to ensure our ES service is reliable and performant enough to handle their use-case, or work with them to re-engineer their use so that it is.


I'm noticing some differences in performance regarding the elasticsearch service instances between GovCloud and cloud.gov environments. I realize that GovCloud is using kubernetes now and I'm wondering how resources are allocated (cpu, memory, and the requests/limits containers get).

One thing I've noticed specifically is that on a brand new Elasticsearch service instance (on versions 1.7 and 2.x alike), ES reports the OS memory as mostly consumed just after creation. Here is a snippet of the response from the ES REST API at <host>/_nodes/stats?human&pretty on a 6x service:

"os" : {
        "timestamp" : 1481736776685,
        "cpu_percent" : 0,
        "load_average" : 0.078125,
        "mem" : {
          "total" : "14.6gb",
          "total_in_bytes" : 15768809472,
          "free" : "625.9mb",
          "free_in_bytes" : 656404480,
          "used" : "14gb",
          "used_in_bytes" : 15112404992,
          "free_percent" : 4,
          "used_percent" : 96
        },
...

I'm not sure if this reflects what is "actually" used in terms of memory, but I'd be curious to know if you all have any ideas why it seems to be reaching the limit already.
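As a quick sanity check on the snippet above, the reported fields are at least internally consistent with each other (note, as an assumption on my part, that OS-level "used" memory on Linux typically includes reclaimable filesystem cache, so it can look near-full without the process actually being out of memory):

```python
# Sanity-check the os.mem figures from the _nodes/stats snippet above.
mem = {
    "total_in_bytes": 15768809472,
    "free_in_bytes": 656404480,
    "used_in_bytes": 15112404992,
}

# free + used should add up exactly to total
assert mem["free_in_bytes"] + mem["used_in_bytes"] == mem["total_in_bytes"]

used_pct = 100 * mem["used_in_bytes"] / mem["total_in_bytes"]
print(f"used: {used_pct:.0f}%")  # matches the reported used_percent of 96
```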

Also, to provide some context on our experience, our import job on cloud.gov with some performance tunings would run anywhere between 2-4 hours, and on GovCloud has taken up to 30 hours to run.

I'm curious to hear if anyone else is seeing issues with performance here.

kynetiv commented 7 years ago

Also, it looks like the max open file descriptors limit has not been increased from whatever default there is on the Docker box.

On an ES 2.4.1 service instance I get 8192:

es:9200/_nodes/stats/process?filter_path=**.max_file_descriptors&pretty

{
  "nodes" : {
    "RcAotz5dST2N47MbWsqGCA" : {
      "process" : {
        "max_file_descriptors" : 8192
      }
    }
  }
}

Elasticsearch recommends setting the ulimit to 65536.
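For reference, checking the limit from inside a container shell is straightforward (a generic sketch, nothing repo-specific; the remediation hints in the comments are common Docker mechanisms, not taken from this repo's Dockerfiles):

```shell
# Generic sketch: report the current soft limit on open file descriptors
# and compare it to Elasticsearch's recommended minimum.
current=$(ulimit -n)
echo "max open files: ${current}"

recommended=65536
if [ "${current}" != "unlimited" ] && [ "${current}" -lt "${recommended}" ]; then
  # Typical fixes: 'docker run --ulimit nofile=65536:65536 <image>', or
  # 'ulimit -n 65536' in a privileged container's entrypoint before starting ES.
  echo "below the recommended ${recommended}"
fi
```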

I think this could probably be done in your ES Dockerfiles similar to how it is done for your FluentD CloudWatch Dockerfile.

kynetiv commented 7 years ago

Looks like the pires/docker-elasticsearch image used here has docs on the ulimit. Also, they mention this became an issue in ES 2.X.

For kubernetes, according to this issue, there may be a different way to set it (link has another solution).

EDIT: looks like the base docker-elasticsearch image does set the ulimit, but the container needs to be run as privileged.

UPDATE: well, I'm at a loss now; your ES container is run as privileged here. I'm not sure why the ES REST API reports a low file descriptor count, then.

kynetiv commented 7 years ago

Hey @jcscottiii any help/insight here would be much appreciated.

This is a follow-up and perhaps the main issue I'm facing in this thread: the ES service resource allocations on GovCloud, as they are, are not usable for our application (College Scorecard) until configuration changes are made.

This atlas comment is essentially what we're running into. ES throws out-of-memory errors consistently and performs very poorly while under heavy indexing. By limiting an instance's memory (1x, 3x, 6x) to the same amount given to ES_HEAP, Lucene is left with little to no memory for its required buffers and caches. Even in normal (non-indexing) use, this configuration won't cover Lucene's necessary memory allocation.

Please let me know if I can provide more info to help move this along. As it stands right now, this is blocking us from migrating to GovCloud.

jcscottiii commented 7 years ago

@jmcarp @cnelson any idea about this?

jmcarp commented 7 years ago

Taking a look now, will update soon.

jmcarp commented 7 years ago

@kynetiv: we just updated our k8s elastic config to allocate half the memory limit to ES_HEAP_SIZE. Our broker doesn't apply that update to existing instances, so you'll have to recreate to get the new configuration. Or I can update your existing instances manually if you let me know which ones need to be changed. Hopefully this will help with the issue, but please file another issue here or email support if performance doesn't improve.
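The "half the memory limit" rule of thumb looks roughly like this (a sketch only; the MEMORY_LIMIT_MB variable and the 6x figure are illustrative assumptions, not taken from the broker's code):

```shell
# Sketch: derive ES_HEAP_SIZE as half of the container memory limit.
# 6912 MB stands in for a hypothetical 6x plan limit, which would yield
# the 3456m heap mentioned elsewhere in this thread.
MEMORY_LIMIT_MB=6912
ES_HEAP_SIZE="$((MEMORY_LIMIT_MB / 2))m"
echo "ES_HEAP_SIZE=${ES_HEAP_SIZE}"   # -> ES_HEAP_SIZE=3456m
```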

kynetiv commented 7 years ago

@jmcarp @jcscottiii Thank you for looking into this! I'm hopeful that the ES_HEAP_SIZE was the main issue. I'll recreate some instances soon and test it out. If need be I'll update this post or start a new one.

kynetiv commented 7 years ago

@jmcarp I just created a few new elasticsearch24 6x instances, but I'm not seeing the update applied. Per #38, I was expecting to see heap_committed and heap_max below set to the new ES_HEAP_SIZE of 3456m, but I'm still seeing what looks like the old configuration:

"jvm" : {
        "timestamp" : 1486066905342,
        "uptime" : "14.7m",
        "uptime_in_millis" : 885931,
        "mem" : {
          "heap_used" : "233mb",
          "heap_used_in_bytes" : 244373824,
          "heap_used_percent" : 3,
          "heap_committed" : "5.9gb",
          "heap_committed_in_bytes" : 6372720640,
          "heap_max" : "5.9gb",
          "heap_max_in_bytes" : 6372720640,
          "non_heap_used" : "53.9mb",
          "non_heap_used_in_bytes" : 56609256,
          "non_heap_committed" : "55.3mb",
          "non_heap_committed_in_bytes" : 57999360,
   ...

Perhaps the new service instances were added to a container that was already built with the old configuration? To test this idea, I created a few more elasticsearch24 instances, in the hope that a completely new container would have to be built, given that the previous box was almost out of OS memory.

...
 "os" : {
        "timestamp" : 1486066905342,
        "cpu_percent" : 0,
        "load_average" : 0.33251953125,
        "mem" : {
          "total" : "14.6gb",
          "total_in_bytes" : 15768698880,
          "free" : "1.5gb",
          "free_in_bytes" : 1680175104,
          "used" : "13.1gb",
          "used_in_bytes" : 14088523776,
          "free_percent" : 11,
          "used_percent" : 89
        },
...

However, the most recent instance I spun up is instead unresponsive to any REST API calls (_nodes/stats, _cluster/health, etc.). The only thing I can think of is that K8s needs not only a limits value but a requests value as well, so that the memory is guaranteed and the instance isn't added to a container that can't provide the 1x, 3x, 6x values requested. This is just an idea, of course, since I'm not experienced enough with K8s or your setup to really know what's going on here.
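In Kubernetes terms, the idea of setting both values would look something like this container spec fragment (a hypothetical sketch; the values are illustrative and not taken from this broker's manifests):

```yaml
# Hypothetical pod spec fragment. Setting "requests" explicitly makes the
# scheduler place the pod only on a node with that much allocatable memory;
# "limits" is the hard cap above which the container is OOM-killed.
resources:
  requests:
    memory: "6912Mi"   # illustrative value for a hypothetical 6x plan
    cpu: "2"
  limits:
    memory: "6912Mi"
    cpu: "2"
```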

brittag commented 7 years ago

Just as a bit of context for cloud.gov team members, this is also now tracked as an internal support ticket here. :)

cnelson commented 7 years ago

There is ongoing discussion/remediation in #cloud-gov-agent-q. TL;DR: because we haven't prioritized https://favro.com/card/1e11108a2da81e3bd7153a7a/18F-1441 and https://favro.com/card/1e11108a2da81e3bd7153a7a/18F-1393, we are now running into issues with the k8s scheduler being unable to fit our workloads onto the requested nodes.

cnelson commented 7 years ago

@kynetiv We've added a 12x plan that more accurately reflects what the 6x plan was in east. I've enabled it for your org, can you give it a try and let us know how it performs?

If it still doesn't meet your requirements, it would be very helpful to have a baseline we can test against. Something as basic as "I can index 100,000 documents averaging 10KiB each in X seconds in East, and it's Y seconds in GovCloud" or "When I perform query N in East against index Z it returns in X seconds, and in GovCloud it takes Y seconds" would help us tune GovCloud to meet your needs without needing you to test each iteration of our configuration.
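A baseline harness for numbers like these can be as simple as timing each indexing call (a generic sketch; index_document is a placeholder for the real HTTP call to the ES instance, not anything from this thread):

```python
import time

def index_document(doc):
    # Placeholder for the real indexing call (e.g. an HTTP POST to
    # <host>:9200/<index>/<type>); stubbed out so this sketch is runnable.
    time.sleep(0)

# Time each of N per-document indexing calls.
durations = []
for doc_id in range(100):
    start = time.perf_counter()
    index_document({"id": doc_id})
    durations.append(time.perf_counter() - start)

durations.sort()
print(f"median: {durations[len(durations) // 2]:.6f}s, "
      f"slowest: {durations[-1]:.6f}s")
```

Running the same harness against both environments gives directly comparable median and worst-case figures.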

cc: @jmcarp @jcscottiii

kynetiv commented 7 years ago

@cnelson thank you and the team for adding the 12x plan. I'll take a look here soon and let you know how it goes.

Regarding requirements, I'll add a todo for our team to get you a baseline. We realize the 12x is just an estimate of what we think we'll need at the moment (per what we seemed to have on East/West). I think we'd both like it to align more closely with what we'll actually use, but we appreciate this stopgap, which will hopefully allow us to migrate here soon.

cnelson commented 7 years ago

@kynetiv Just checking in to see if you've had a chance to see how this plan performs.

jcscottiii commented 7 years ago

@kynetiv sent me a slack message yesterday:

unfortunately, we're still seeing slow performance and ultimately out of memory errors, however the 12x plan does get us further. I'm trying to see if I can get some kind of baseline (per the github request)

rogeruiz commented 7 years ago

@jcscottiii, @kynetiv

is it possible to provide a better baseline? Something along the lines of what @cnelson mentioned?

"I can index 100,000 documents averaging 10KiB each in X seconds in East, and it's Y seconds in GovCloud" or "When I perform query N in east against index Z it returns in X seconds, and in GovCloud it takes Y seconds"

kynetiv commented 7 years ago

@rogeruiz, @cnelson, @jcscottiii - Sorry for the delays, I'm working to get a baseline for you but have had some other deadlines on another project. I hope to have something to share this afternoon / tomorrow morning.

kynetiv commented 7 years ago

@rogeruiz, @cnelson, @jcscottiii

Here's my attempt at an elasticsearch baseline for indexing our application in both cloud.gov environments. See the below caveats for some things to note about how the indexing was run between environments.


East/West | service: elasticsearch-swarm-1.7.5 | plan: 6x 

Average document size: 37KB

Total number of documents: 7703

Total number of indexed documents: 154060 (7703 × 20)¹

Total indexing time: 131.4 minutes (about 2 hours)

Average time to index 1 document: 0.111079 seconds

Median time to index 1 document: 0.068 seconds

Fastest time to index 1 document: 0.003 seconds

Slowest time to index 1 document: 14.906 seconds

Fastest time to process 1 csv file: 242.8 seconds (about 4 minutes)

Longest time to process 1 csv file: 570.4 seconds (about 10 minutes)

GovCloud | service: elasticsearch24 | plan: 12x

Total indexing time (before OOM): 384.57 minutes (about 6.4 hours)²

Average time to index 1 document: 0.627653 seconds

Median time to index 1 document: 0.107 seconds

Fastest time to index 1 document: 0.003 seconds

Slowest time to index 1 document: 554.226 seconds

Fastest time (recorded) to process 1 csv file: 518.9 seconds (about 9 minutes)

Longest time (recorded) to process 1 csv file: 3298.4 seconds (about 54 minutes)

¹ The application processes 19 large csv files but only creates the documents in Elasticsearch on the first pass of the first file. All subsequent files issue an update request that appends a new object to each document (the appended object averaging 37KB). The first file sets up the ES document; hence we process the 7703 documents 20 times.

² GovCloud indexing did not complete due to out-of-memory errors. Out of 19 files, 5 were not indexed at the time of the crash. This data covers the first 14 files.
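Taken together, the per-document averages above work out to roughly a 5.7x slowdown (a quick derivation from the figures reported, nothing more):

```python
# Per-document indexing averages from the baseline above (seconds).
east_west_avg = 0.111079
govcloud_avg = 0.627653

slowdown = govcloud_avg / east_west_avg
print(f"GovCloud averaged {slowdown:.1f}x slower per document")  # -> 5.7x
```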

Caveats: I wasn't able to run the same version of Elasticsearch in both environments, which does pose a problem for this baseline; however, we don't have an ES 2.x service in the E/W environment to baseline against, nor was the 1.x version in GovCloud working at the time of indexing (#43). Additionally, there isn't a 12x plan for the ES 1.x service in GovCloud, so we would need to add that as well. Ideally we would add an ES 2.4 service to E/W in order to get closer parity.

With all that said, I'm happy to provide more detail or clarify these results. I'm also happy to set up an instance of our application for you to run tests against, or pair with you to test different configurations. Thanks for your help and time in looking into this issue.

UPDATE: I updated the E/W averages after finding a mistake.

cnelson commented 7 years ago

Thanks for the info; this will help us figure out what's up with the GovCloud performance. We'll need to do some investigation on our side before it makes sense to pair with you on performance testing, but we'll reach out as soon as we think we have a resolution.

jmcarp commented 7 years ago

@kynetiv: I ran a benchmark roughly similar to your use case on a govcloud elastic instance and didn't reproduce the specific issues you've been running into. I'm guessing we'd make better progress running your specific indexing scripts. Could you either point me to those scripts or let me know when you'd be free to pair on this?

kynetiv commented 7 years ago

@jmcarp thanks for taking a look. It probably would speed things up if we paired on this. I'm available now for about 45 minutes or later today after 2:45pm ET. I'm @kynetiv on our shared slack channel if you want to ping me some other times too.

jmcarp commented 7 years ago

I'll talk with the College Scorecard team later today; moving to "waiting" until then.

suprenant commented 7 years ago

From standup: @cnelson has been working on this as white noise in between other issues, so we can hopefully have a clearer picture of the root cause here soon.

cnelson commented 7 years ago

@kynetiv, Do you have some time to discuss this?

I've run several dozen tests over the last couple of weeks, and this seems to boil down to the fact that ES 2.4 is less performant than 1.7 when it comes to using the update API with large numbers of nested fields. Other than the fact that ES 2.4 needs at least a 4GB heap to complete this process where ES 1.7 could do it with a 3GB heap, adding even more resources to the ES instance isn't going to make those operations any faster.

I'm able to successfully run your import against our "12x" plan (which has a 6GB heap), but as mentioned above it is 2-3 hours slower than an import to 1.7 on the same hardware.

Is this acceptable for the app to migrate to GovCloud?

kynetiv commented 7 years ago

@cnelson, Thanks for your effort here. I'm available today if you want to have a quick call. In my experience we haven't been able to successfully run the import on GovCloud so I'm certainly curious to hear how that was possible. Do you want to ping me in slack when you have a free moment to discuss? Thanks!

cnelson commented 7 years ago

Successfully ran the import process on GovCloud yesterday; waiting on RTI to confirm.

kynetiv commented 7 years ago

@cnelson really appreciate you looking into this and sharing some of the details regarding the performance considerations with ES2.4. I think we're good to close this out. Thanks to everyone on the team as well for the support!