elastic / rally

Macrobenchmarking framework for Elasticsearch

Having to run 12x Elastic Rally instances on the `elastic/logs` track to bottleneck the CPU on the hot data tier #1566

Closed: pquentin closed this issue 2 years ago

pquentin commented 2 years ago

While I don't have anything super useful to add here in terms of replacements, I would just like to throw my anecdotal hat into this ring with respect to the elastic/logs track, which I was trying to run against our new NVMe-backed hot data tier on on-prem hardware within an ECE cluster. The results I was getting when scaling from 1 shard to 2 shards and beyond showed no improvement in overall indexing throughput. I specifically increased the corpus size to around 60 days of data to ensure I had plenty of events to index. My goal was to understand the behaviour of the new cluster with respect to hot spotting, shard and replica counts. Unfortunately, Elastic Rally initially gave me the wrong idea.

It wasn't until I ran multiple copies of Elastic Rally with identical settings concurrently from the same host that I was able to actually start approaching any of the hardware limits in the cluster. In the end, I had to run 12x Elastic Rally instances on the elastic/logs track to bottleneck the CPU on the hot data tier. I executed all 12 instances from a single server (backed by NVMe, 128 GB of RAM, 32c/64t, 10 Gb network). This raised the actual indexing rate from 60-70,000 docs/s to 550-600,000 docs/s. The reality was that the server sending the logs wasn't the limiting factor, nor were the hot data tier nodes; the bottleneck was Elastic Rally not providing documents fast enough to index.

My suspicion was that, similar to the Go stdlib encoding/json package, the performance is not super optimised in Python. This issue seems to validate that theory; I just wanted to provide a real-world example of where Elastic Rally's performance produces results that could easily be misconstrued by naive users such as myself.

Originally posted by @berglh in https://github.com/elastic/rally/issues/1046#issuecomment-1225252763

pquentin commented 2 years ago

@berglh I created this new issue as your performance problem is not JSON-related.

The flamegraph is indeed what I was after. If I focus on what appears to be one process:

(flamegraph screenshot)

You can see, from left to right:

I don't think the first two use any CPU time, so only the third item is really active. Nothing stands out as obviously slow here, so I'm not sure what to suggest next! In our tests a single Rally instance was enough to sustain 300,000 docs/s using that same track, and the bottleneck was not Rally or the underlying load driver but the 6-node cluster we were benchmarking maxing out its own CPU!

pquentin commented 2 years ago

@DJRickyB mentions that the next step is to have more indexing clients: https://github.com/elastic/rally-tracks/tree/master/elastic/logs#indexing-parameters. A good rule of thumb here is to use number of nodes × number of cores × 2. So if you're benchmarking a 3-node cluster with 32 cores each, you should have 192 indexing clients. The default is 8, which is very low. Maybe that should be a required parameter.
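For illustration, a run with a higher client count could look like the sketch below. This is not from the issue itself: the host names are placeholders, and the exact track name, challenge and flags should be checked against your Rally version and the track README.

```bash
# Hypothetical invocation: raise the track's bulk_indexing_clients parameter
# to roughly nodes x cores x 2 (3 nodes x 32 cores x 2 = 192 here).
esrally race --track=elastic/logs \
  --track-params="bulk_indexing_clients:192" \
  --target-hosts=es-node-1:9200,es-node-2:9200,es-node-3:9200 \
  --pipeline=benchmark-only \
  --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'password'"
```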

berglh commented 2 years ago

@pquentin Thanks, I'll give this a go today. I was able to see peak indexing of the http_logs track with 12 instances (so I guess 96 indexing clients), with the no-pipelines track hitting over 2,000,000 docs/s max from the 12 instances running on a single node; the resource utilisation on the sending host was comparatively low. The NVMe-backed hot data tier runs on 6 physical hosts with 32 cores each, so I'm guessing we'll be looking at 6 physical hosts × 32 cores × 2 (384 clients). I spread the 12 Rally instances to POST documents across 6 different data tier nodes to try to avoid hot spotting a single data node.

Due to the current network architecture of our ECE Elastic deployment, our nodes are unable to hit the F5 load balancer to then hit the ECE Proxies (HA Proxy). Do you know if Rally is able to specify HTTP headers?

We have a wildcard domain pointed at the floating IP address of our F5 load balancers, as per the ECE recommended architecture. By changing the headers of Rally's HTTP client, I can configure the "Host: instance.es.elastic.my.domain.edu.au" header to test via the ECE proxies while also bypassing our F5 load balancer, which the host is not authorised to send packets to. I could see there is a query_parameters option, but that will just set the URI parameters for the ES query itself, rather than configure the HTTP client.

Currently, as I'm doing the testing from our frozen tier host behind the F5, I must figure out the random ports on each physical server running the hot data tier instances, which need to be mapped to individual ES instance numbers via the ECE UI/API. If I can target the ECE proxies and specify the Host header, I can "spoof" coming from the load balancer and simultaneously test the performance of the ECE proxies (I'm guessing the three proxies will not be a major bottleneck for us).

I appreciate your help and time in figuring this out @pquentin. If I'm honest, I have found the documentation a little difficult to navigate when testing a more performant cluster. I did find this track in the rally-tracks repo, but I didn't see this specific markdown page for elastic/logs; I jumped into track.json trying to figure out the track options based on the Command Line Reference page. Somehow I missed the page you linked, which is much clearer about all the options of the individual track. Sorry for taking your time.

berglh commented 2 years ago

Ok, here are my results after increasing bulk_indexing_clients to 384 as per your instructions @pquentin. I observed some quite strange regression at 2 and 3 shards, then approaching 7-8 shards I started to notice CPU bottlenecking on the shipping host. In this instance, I supplied all twelve data nodes' HTTP ports in the target hosts; it seemed to do a relatively OK job of distributing the pipeline load across those servers, and the throughput even on a single shard was surprising to me. I guess I should try reducing the bulk clients to see if I observe better scaling.

(chart: elastic_logs_0_replicas_384_clients)

pquentin commented 2 years ago

Due to the current network architecture of our ECE Elastic deployment, our nodes are unable to hit the F5 load balancer to then hit the ECE Proxies (HA Proxy). Do you know if Rally is able to specify HTTP headers?

Sure, you can do this through the --client-options flag that you are already using. Any unknown options are passed to the Elasticsearch Python client, which supports specifying custom headers as a Python dictionary. See the Advanced topics section to understand how to pass that value to --client-options. Two things to note:

Here is what your JSON would look like if in its own file:

{
  "default": {
    "use_ssl": true,
    "verify_certs": false,
    "basic_auth_user": "elastic",
    "basic_auth_password": "password",
    "headers": {"Host": "instance.es.elastic.my.domain.edu.au"}
  }
}
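Assuming that JSON is saved to a file (the name client-options.json below is just a placeholder), the invocation could look roughly like the sketch that follows; the Advanced topics section of the docs describes the exact mechanics of passing a file to --client-options.

```bash
# Hypothetical: point --client-options at the JSON file so the custom
# Host header is sent with every request.
esrally race --track=elastic/logs \
  --target-hosts=10.20.30.40:9243 \
  --pipeline=benchmark-only \
  --client-options=./client-options.json
```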

I appreciate your help and time in figuring this out @pquentin. If I'm honest, I have found the documentation a little difficult to navigate when testing a more performant cluster. I did find this track in the rally-tracks repo, but I didn't see this specific markdown page for elastic/logs; I jumped into track.json trying to figure out the track options based on the Command Line Reference page. Somehow I missed the page you linked, which is much clearer about all the options of the individual track. Sorry for taking your time.

No worries! We open-sourced this track just last month and I think you're the first person outside Elastic to use it, so some rough edges are expected. We could do a better job explaining how to use the track and how to tune its parameters. The first step we'll be discussing is making it crystal clear that the number of indexing clients has to be tuned. We never use the default, so this took me by surprise too.

How could we make the documentation at https://github.com/elastic/rally-tracks/tree/master/elastic/logs easier to find? GitHub shows it when browsing the directory and it's called "README" already :) Should we link to it from https://github.com/elastic/rally-tracks/tree/master/elastic maybe?

Ok, here are my results after increasing bulk_indexing_clients to 384 as per your instructions @pquentin. I observed some quite strange regression at 2 and 3 shards, then approaching 7-8 shards I started to notice CPU bottlenecking on the shipping host. In this instance, I supplied all twelve data nodes' HTTP ports in the target hosts; it seemed to do a relatively OK job of distributing the pipeline load across those servers, and the throughput even on a single shard was surprising to me. I guess I should try reducing the bulk clients to see if I observe better scaling.

Thanks for sharing the results. I'm also not sure about the behavior at 2/3 shards; it could just be noise. And yes, tuning the indexing clients can only be a good idea (e.g. divide it by two if the shipping host is bottlenecked on CPU).

Regarding bottlenecking the shipping host, I think switching to a single Rally instance will help. But for larger clusters like yours, you may have to use multiple load drivers to achieve peak performance.

berglh commented 2 years ago

Sure, you can do this through the --client-options flag

Brilliant, that does make a lot of sense. I shall give that a go.

No worries! We open-sourced this track just last month and I think you're the first person outside Elastic to use it, so some rough edges are expected.

Interesting, I thought it was a fantastic track for simulating a workload similar to our legacy cluster. I'm grateful it's been added, as we have previously been using Logstash for all ETL, so having some Elasticsearch pipeline-based workloads is very useful as we're going to leverage Elastic Common Schema and Beats processors as much as possible.

How could we make the documentation at https://github.com/elastic/rally-tracks/tree/master/elastic/logs easier to find? GitHub shows it when browsing the directory and it's called "README" already :) Should we link to it from https://github.com/elastic/rally-tracks/tree/master/elastic maybe?

I think the main thing that made me not look for documentation was the message on the repository main page. It gave me the impression I shouldn't be there unless I want to create my own tracks.

You should not need to use this repository directly, except if you want to look under the hood or create your own tracks.

The content of the track READMEs is excellent, so there is no issue with the README or markdown documentation; it just wasn't clear to me from the main website or the main page of the repository that tracks are well documented inside the repository. I was just looking directly for the track.json as mentioned on the website, which didn't really give me the context I needed. Of course, I get a list of parameters in the Rally output if I type one wrong, but from memory they were not well described there. Perhaps linking to each track/subtrack README from the primary repository README would be useful, at a minimum to show that there is good documentation there.

Optionally, the main website could include some clear information on configuring the tracks, mentioning that for each track in the rally-tracks repository you can find all the parameters explained in exquisite detail in the track's subdirectory README.

I will go through and adjust bulk_indexing_clients and ship via the proxy as per your advice. Before I left work I had fired off a run with 64 bulk_indexing_clients, and I'm seeing half the throughput I saw with 384. This would indicate to me that it's just a matter of finding the sweet spot. It still got me wondering, though: where is the latency on each individual client, if nproc * 2 indexing clients on the shipping host is unable to saturate the CPU of the host system (assuming storage/RAM/network isn't the bottleneck, which in my case we know it isn't)? There still seems to be an inefficiency we're fighting with the log shipper; it's possible this is already well known to you, it was simply a curiosity to me.

With respect to the 2-3 shard performance degradation: we have six physical servers, and due to the way ECE works, the Elasticsearch instances running on the same physical host must share the same underlying logical storage volume for persistent storage. My current hypothesis is that the primary shards may have been scheduled on the same physical host, on two different data instances, and not distributed across the physical hosts optimally, subsequently using the same underlying logical storage volume. This may also explain the trail-off in indexing at shard counts above 6 when the shipper isn't CPU bound, because I had not set up any of the cluster-level shard routing settings. Honestly, I was surprised that ECE didn't configure this for me by default when creating a deployment, as it inherently knows about both the hosts and the availability zones. I've since added host- and availability-zone-based awareness, so hopefully the primary shard scheduling will be a bit more optimal and I can at least eliminate this as a cause; the first couple of multi-shard results seemed to show the expected increase in throughput from 1 to 2 shards, and the primary shards avoided landing on the same underlying host, as expected.
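For reference, a minimal sketch of what enabling allocation awareness can look like at the Elasticsearch level. The attribute names, endpoint and credentials below are placeholders, not taken from this deployment; use whatever node attributes your ECE setup actually exposes for host and zone.

```bash
# Illustrative only: enable shard allocation awareness on host- and
# zone-level node attributes (names are placeholders).
curl -u elastic:password -X PUT "https://my-cluster.example:9243/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "server_name,availability_zone"
  }
}'
```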

pquentin commented 2 years ago

I think the main thing that made me not look for documentation was the message on the repository main page. It gave me the impression I shouldn't be there unless I want to create my own tracks.

I opened https://github.com/elastic/rally-tracks/pull/309, please tell me what you think!

Perhaps linking to each track/subtrack README from the primary repository README would be useful, at a minimum to show that there is good documentation there.

Not sure listing them would help as much; I think the directory layout makes it clear that all subdirectories are tracks, and the list would quickly get stale.

Optionally, the main website could include some clear information on configuring the tracks, mentioning that for each track in the rally-tracks repository you can find all the parameters explained in exquisite detail in the track's subdirectory README.

I opened https://github.com/elastic/rally/pull/1568, please tell me what you think!

This would indicate to me that it's just a matter of finding the sweet spot. It still got me wondering, though: where is the latency on each individual client, if nproc * 2 indexing clients on the shipping host is unable to saturate the CPU of the host system (assuming storage/RAM/network isn't the bottleneck, which in my case we know it isn't)? There still seems to be an inefficiency we're fighting with the log shipper; it's possible this is already well known to you, it was simply a curiosity to me.

By default, indexing goes "as fast as possible", but each client still waits for a request to complete before sending the next one. This is why you need more indexing clients to saturate the CPU: the work the load driver does is I/O-bound. Does that make sense? I'm not sure I've understood your point.
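As a back-of-the-envelope illustration (the numbers below are assumptions, not measurements from these runs):

```
Assume each bulk request carries ~1,000 docs and takes ~150 ms round trip.
One synchronous client then tops out at about 1,000 / 0.15 ≈ 6,700 docs/s,
so sustaining ~600,000 docs/s needs on the order of 90 concurrent clients,
no matter how much CPU the load driver has left over.
```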

Now, if you want to know the latency for each client, you can configure a metrics store and look at the metrics.
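As a rough sketch, the reporting section of ~/.rally/rally.ini can be pointed at an Elasticsearch metrics store along these lines; the host and credentials are placeholders, and the exact keys should be double-checked against the Rally docs for your version.

```ini
[reporting]
datastore.type = elasticsearch
datastore.host = metrics-store.example
datastore.port = 9200
datastore.secure = false
datastore.user = rally_metrics
datastore.password = password
```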

My current hypothesis is that the primary shards may have been scheduled on the same physical host, on two different data instances and not distributed across the physical hosts optimally.

Makes sense!

Anyway, I'm going to close this issue now as there's nothing actionable for Rally left here. Thanks!

berglh commented 2 years ago

Thanks for your help @pquentin