TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

Unstable Performance Among Some Java Test Implementations #5612

Open msmith-techempower opened 4 years ago

msmith-techempower commented 4 years ago

I was troubleshooting what I believed to be a performance degradation in Gemini (and spent a lot of time doing so) before coming to the realization that the problem is not in Gemini proper. This issue lays out all the information we have gathered.

For those unfamiliar, it is my pleasure to introduce the Framework Timeline which graphs the continuous benchmark results over time. This tool is great for illustrating the arguments that I will be laying out. This link is to the plaintext results for gemini.

The following is an annotated graph from gemini's Framework Timeline:

[image: annotated plaintext Framework Timeline for gemini]

  1. ServerCentral hardware/network - everything is relatively stable
     - 0a. #3292 was merged and the project was officially Dockerified
  2. There are several of these dips on the graph, but the graph is a combination of all environments, so these are actually the Azure runs, which are more modestly provisioned compared to the Citrine environment
  3. Migrated out of ServerCentral and started running continuously on Citrine on prem.
  4. Starts with a big dip which is Azure, then is relatively stable but much lower than 2. After going through emails and chat messages, we believe this is due to applying the Spectre/Meltdown kernel patches.
  5. Ubuntu 16 is replaced with CentOS 7.6.1810 (Core) and I forgot to apply the Spectre/Meltdown kernel patches (side story: I was trying to get upgraded networking hardware working, which ended up being unusable, so I was busy and had a great excuse)
  6. Unclear what this is - it does not appear to be low enough to be Azure runs and it aligns with some later bullets, so I'll discuss below.
  7. Our best guess is that this is a dip from #4850, which changed the base image of many Java test implementations. The timing lines up pretty much exactly, though it is a bit of a mystery as to why moving from openjdk-11.0.3-jre-slim to openjdk-11.0.3-jdk-slim would have a performance impact. I found an email chain wherein @nbrady-techempower confirmed that he once again applied the Spectre/Meltdown patches and an iptables rule from this.
  8. Nov 8, 2019 - last continuous run on CentOS - we then brought down the machines and began installing Ubuntu 18 LTS
  9. Nov 20, 2019 - first continuous run on Ubuntu LTS with Spectre/Meltdown kernel patches applied but not this iptables rule
  10. This is the high-water mark for gemini on Citrine (Ubuntu) - roughly 1.2M plaintext RPS
  11. This is the low-water mark for gemini on Citrine (Ubuntu) - roughly 700K plaintext RPS

The following shows the data table for Servlet frameworks written in Java for Round 18, published July 9, 2019, which falls between numbers 6 and 7 on the above graph.

[image: Round 18 data table for Java Servlet frameworks]

Compare that with the data table for the same test implementations from the run completed on April 1, 2020, which is the last graphed day (as of this writing) on gemini's Framework Timeline.

[image: data table for the same test implementations from the April 1, 2020 run]

This shows degradation across the board for Java applications, but some are impacted more than others.

For comparison, the following is servlet's plaintext Framework Timeline:

[image: servlet plaintext Framework Timeline]

  1. The same dip we believe is due to the base Java image being changed in #4850
  2. Nov 20, 2019 - first continuous run on Ubuntu LTS with the Spectre/Meltdown kernel patches applied, and as I indicate with the horizontal line it has been "relatively" stable, though the data tables above do show some degradation
  3. Azure runs

We merged in some updates to Gemini today, including updating the Java base image to openjdk-11.0.7-slim, which should be the same as openjdk-11.0.7-jdk-slim. So, if there was some weirdness with openjdk-11.0.3-jdk-slim from #4850, the next run should show improved plaintext numbers for Gemini.
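As a sanity check (a sketch only, not something from the original post; the tag names are taken from the sentence above), whether the two tags really are the same image can be confirmed by comparing the image IDs they resolve to:

```bash
# Sketch: pull both tags and compare the image IDs they resolve to.
# If the two IDs match, the tags point at the same underlying image.
docker pull openjdk:11.0.7-slim
docker pull openjdk:11.0.7-jdk-slim
docker inspect --format '{{.Id}}' openjdk:11.0.7-slim openjdk:11.0.7-jdk-slim
```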

However, that may be unrelated, so here are other tests I will probably do in the next hour or two:

- [ ] Downgrade tapestry to openjdk:11.0.3-jre-stretch, which was the version prior to #4850
- [ ] Upgrade wicket to openjdk:11.0.7-slim, which would eliminate any question if both gemini and wicket improve
- [x] Verify that openjdk:11.0.3-jre-stretch and openjdk:11.0.3-jdk-stretch have the same underlying JRE (see below)
- [x] Verify that gemini plaintext is not leaking connections (see below)

msmith-techempower commented 4 years ago

It turns out that openjdk:11.0.3-jre-stretch and openjdk:11.0.3-jdk-stretch do not have the same underlying JRE (thanks to @nbrady-techempower for finding these):

openjdk:11.0.3-jre-stretch: [image]

openjdk:11.0.3-jdk-stretch: [image]
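For anyone wanting to reproduce the comparison, a minimal sketch (my commands, not the screenshots above) is to print the runtime version bundled in each image:

```bash
# Sketch: print the bundled Java runtime version for each base image so the
# underlying JREs can be compared directly.
docker run --rm openjdk:11.0.3-jre-stretch java -version
docker run --rm openjdk:11.0.3-jdk-stretch java -version
```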

msmith-techempower commented 4 years ago

gemini is not leaking connections in its plaintext test (thanks to @michaelhixson for finding this):

Repro steps:

1. `tfb --test gemini --mode debug`
2. `docker ps | grep gemini` to find the container id
3. `docker exec -it <container-id> bash`
4. Inside the gemini-mysql container, run `watch 'ss -tan | wc -l'` to continuously print out the total number of connections
5. From another bash session on the host, run `docker run --rm techempower/tfb.wrk wrk -H 'Host: host.docker.internal' -H 'Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 512 --timeout 8 -t 8 "http://host.docker.internal:8080/update?queries=20"`
6. In the terminal running the `watch` command, watch the number of connections climb continuously

joanhey commented 4 years ago

The same pattern exists in PHP and nginx, and perhaps in more languages.

https://tfb-status.techempower.com/timeline/php/plaintext https://tfb-status.techempower.com/timeline/nginx/plaintext

And there is a big drop on June 18, 2019. I thought it was due to the CVE-2019-1147x patches, but those only apply to Microsoft systems.

The Framework Timeline is a really good tool :wave: . It would be even better with annotation marks for the big changes in the benchmark.

joanhey commented 4 years ago

I have been investigating a strange problem for some time, and after checking the Timeline, curiously, it also starts on June 18, 2019.

The problem

In the last runs, Kumbiaphp-raw is slower than Kumbiaphp with ORM. That does not make any sense, and I think it affects plain PHP as well.

| Fortunes Test | Round 18 | Current runs |
| --- | --- | --- |
| PHP | 129,288 | 95,832 |
| Kumbiaphp raw | 90,377 | 73,245 |
| Kumbiaphp orm | 76,710 | 73,752 |

https://tfb-status.techempower.com/timeline/php/fortune https://tfb-status.techempower.com/timeline/kumbiaphp-raw/fortune https://tfb-status.techempower.com/timeline/kumbiaphp/fortune

It should be impossible for the raw version to be slower than the ORM version, yet it is in all the runs after June 18.

I was thinking it was a bad PHP stack config, but after reading this issue I think it may be a problem with the benchmark stack. I'll investigate the problem further.

msmith-techempower commented 4 years ago

@joanhey Below is the graph for Kumbiaphp, for reference, and it does indeed show that dip on June 18, 2019. Curiously, it seems to recover on Nov 20, 2019.

[image: Kumbiaphp Framework Timeline]

msmith-techempower commented 4 years ago

I have edited the original post to indicate that on Jun 18, 2019, @nbrady-techempower applied the Spectre/Meltdown kernel patches, and we believe that those account for the dip.

joanhey commented 4 years ago

Yes, it recovers on Nov 20, like plain PHP, but I can't understand the reason. There were no changes to the nginx config or PHP code, and no new minor versions (PHP 7.3.x or nginx). In Jan 2020 we moved to PHP 7.4 and we can see a small rise.

Curiously, nginx alone drops on Nov 20, 2019.

msmith-techempower commented 4 years ago

I believe we have an answer to that now.

Nov 20 is when we switched back from CentOS to Ubuntu, and we did not apply [this iptables rule](https://news.ycombinator.com/item?id=20205566) which was previously applied on the CentOS install.

That dip from Jun 18 to Nov 20 appears to be directly related to that particular rule being in place.
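The rule itself is only referenced by the link above and is not quoted in this thread. Purely as an illustration (my assumption, not necessarily the linked rule), the kind of iptables change that can swing plaintext throughput this much is a connection-tracking bypass for the benchmark port:

```bash
# Illustrative sketch only -- NOT confirmed to be the rule from the link above.
# Disabling conntrack for benchmark traffic removes per-connection tracking
# overhead, which can noticeably change high-connection-rate results.
iptables -t raw -A PREROUTING -p tcp --dport 8080 -j NOTRACK
iptables -t raw -A OUTPUT     -p tcp --sport 8080 -j NOTRACK
```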

joanhey commented 4 years ago

I think there should be a timeline with all of those changes in one place.

A chronological history of the changes on a web page.

NateBrady23 commented 4 years ago

Yes, I want that: https://github.com/TechEmpower/tfb-status/issues/21