TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

New Citrine Setup Shows Lower Numbers #8397

Closed · synopse closed this issue 1 year ago

synopse commented 1 year ago

The new Citrine setup is up and running! Thanks a lot! But with this new setup, the results are lower than they were previously.

This is clearly visible on the /plaintext endpoint, for instance. With the previous setup, the top frameworks seemed to be limited by the network layer and were all around 7,000,000 RPS with a very small margin: the top 16 were within 99-100% of the leader. https://www.techempower.com/benchmarks/#section=test&runid=273fa177-fc53-43a9-a97d-d6f3f2ade99a&test=plaintext

With the new setup, the last two runs show something else: the numbers are 5,700,000 RPS at best and spread across a 92-100% range. https://www.techempower.com/benchmarks/#section=test&runid=42049ab2-dacb-4d95-b292-2c2ffdf0a623&test=plaintext

What happened in between?

NateBrady23 commented 1 year ago

Interesting. Nothing changed except the machines were off for a while. I'll be in the office on Thursday and start investigating.

joanhey commented 1 year ago

After the machines were moved, the first run and the next one (the current run) also show very different numbers.

Plaintext, first run vs. current run: (screenshots attached)

Single query, first run vs. current run: (screenshots attached)

In this test, check the numbers for: atreugo (with and without prefork), fasthttp (with and without prefork), mormot (all variants), asp.net core [platform, pg], kumbiaphp-workerman, ngx-php postgres, lithium-mysql, openresty, ...

These frameworks use different languages and databases, and the difference is very big in the current run.

sebastienros commented 1 year ago

Nothing changed except the machines were off for a while.

I thought it was because it got updated to 22.04, so scratch that. Is it possible that security patches got installed on restart?
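If it helps narrow that down, one low-effort check would be to scan the APT history on the Citrine machines for anything installed around the restart. A minimal sketch, assuming the standard Ubuntu log location /var/log/apt/history.log (rotated .gz logs are ignored here):

```python
# Minimal sketch: list recent APT transactions on an Ubuntu host to see
# whether security patches were pulled in around the restart.
# Keeps only the last Install/Upgrade line per transaction.
from pathlib import Path

log = Path("/var/log/apt/history.log")
entry = {}
for line in log.read_text(errors="replace").splitlines():
    if line.startswith("Start-Date:"):
        entry = {"start": line.split(": ", 1)[1]}
    elif line.startswith(("Install:", "Upgrade:")):
        entry["what"] = line[:120] + ("..." if len(line) > 120 else "")
    elif line.startswith("End-Date:") and "what" in entry:
        print(entry["start"], "->", entry["what"])
```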

joanhey commented 1 year ago

The difference in numbers between the first run and the current run:

| Single query | First run | Actual run |
| --- | ---: | ---: |
| asp.net core [platform, pg] | 392,347 | 337,796 |
| atreugo | 344,013 | 264,916 |
| atreugo prefork | 354,397 | 268,649 |
| fasthttp | 330,326 | 261,255 |
| fasthttp prefork | 358,220 | 264,667 |
| mormot [orm] | 359,919 | 256,946 |
| mormot [direct] | 358,800 | 256,737 |
| kumbiaphp-workerman | 312,106 | 266,121 |
| ngx-php postgres | 350,236 | 261,836 |
| lithium-mysql | 316,587 | 227,726 |
| openresty | 266,503 | 189,251 |

Some have lost nearly a third of their throughput.
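For reference, the relative drop for each entry can be computed straight from the table above; a quick sketch, with the numbers copied from the table:

```python
# Quick check of the relative drop between the first and the current run,
# using the single query numbers from the table above.
results = {
    "asp.net core [platform, pg]": (392_347, 337_796),
    "atreugo": (344_013, 264_916),
    "atreugo prefork": (354_397, 268_649),
    "fasthttp": (330_326, 261_255),
    "fasthttp prefork": (358_220, 264_667),
    "mormot [orm]": (359_919, 256_946),
    "mormot [direct]": (358_800, 256_737),
    "kumbiaphp-workerman": (312_106, 266_121),
    "ngx-php postgres": (350_236, 261_836),
    "lithium-mysql": (316_587, 227_726),
    "openresty": (266_503, 189_251),
}
for name, (first, actual) in sorted(results.items(),
                                    key=lambda kv: kv[1][1] / kv[1][0]):
    drop = 100 * (1 - actual / first)
    print(f"{name:30s} -{drop:4.1f}%")   # openresty is worst, around -29%
```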

There are more affected frameworks, but some show no difference or even a small gain.

PS: both runs were after the machines were moved.

joanhey commented 1 year ago

The first run also shows abnormal numbers for Workerman in the single query test.

It's the first time that two full PHP frameworks built on Workerman are faster than the Workerman platform itself :thinking: . Since Workerman runs near the end of the round, perhaps there is a progressive performance degradation, or a problem in the middle of the round that then affects the next run as well.

(screenshot attached)

franz1981 commented 1 year ago

Our Quarkus entry, which is limited on the application side for plaintext, didn't take any hit. So it should be something related to syscall/IRQ handling or networking settings.

franz1981 commented 1 year ago

It's worth collecting some system-wide performance data, I think.


joanhey commented 1 year ago

JSON test

Before the move: (screenshot attached)

After the move, first run: (screenshot attached)

Current run: (screenshot attached)

The numbers keep dancing around. It looks like a progressive degradation.

MichalPetryka commented 1 year ago

Ubuntu 22.04 moved to OpenSSL 3.0, which is known for severe performance degradation compared to 1.1, which was used in 20.04 (sources: https://github.com/openssl/openssl/issues/20286, https://github.com/openssl/openssl/issues/20715, https://github.com/openssl/openssl/issues/21005, https://github.com/openssl/openssl/issues/21833). The issues could possibly be caused by this (and the unaffected frameworks could be using different TLS libraries). EDIT: this should probably go in https://github.com/TechEmpower/FrameworkBenchmarks/issues/8038 instead, but it's possibly related here too.

fakeshadow commented 1 year ago

Ubuntu 22.04 moved to OpenSSL 3.0, which is known for severe performance degradation compared to 1.1, which was used in 20.04. [...]

AFAIK, TFB does not utilize TLS in any benchmark category.

franz1981 commented 1 year ago

Collecting flamegraphs from inside the Docker containers could really help; maybe there is some seccomp weirdness on the network syscalls. The JSON test's profile is dominated by send/write cost, since it doesn't exercise pipelining, so syscall/IRQ handling seems the most likely cause to me.
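For what it's worth, here is a rough sketch of how such a system-wide capture could be taken from the host while a plaintext or JSON run is in flight (container processes are visible to the host profiler since they share the kernel). It assumes `perf` (linux-tools) is installed on the Citrine app server and that Brendan Gregg's FlameGraph scripts are cloned at ~/FlameGraph; neither is part of the TFB toolset:

```python
# Rough sketch of a system-wide profile capture during a benchmark run.
# Assumes linux-tools (perf) is installed on the host and that
# https://github.com/brendangregg/FlameGraph is cloned at ~/FlameGraph.
# Needs root (or kernel.perf_event_paranoid relaxed) for the -a option.
import subprocess
from pathlib import Path

FLAMEGRAPH_DIR = Path.home() / "FlameGraph"  # assumption, adjust as needed

# Sample all CPUs at 99 Hz with call graphs for 30 seconds (writes perf.data).
subprocess.run(["perf", "record", "-F", "99", "-a", "-g", "--", "sleep", "30"],
               check=True)

# Fold the stacks and render an SVG flame graph.
script = subprocess.run(["perf", "script"], check=True,
                        capture_output=True, text=True).stdout
folded = subprocess.run([str(FLAMEGRAPH_DIR / "stackcollapse-perf.pl")],
                        input=script, capture_output=True, text=True,
                        check=True).stdout
svg = subprocess.run([str(FLAMEGRAPH_DIR / "flamegraph.pl")],
                     input=folded, capture_output=True, text=True,
                     check=True).stdout
Path("plaintext.svg").write_text(svg)
```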

NateBrady23 commented 1 year ago

It looks like it could be a CPU throttling problem. The servers are in a new soundproof rack and there may be too much heat. Additional foam was added after the first run as well. We're looking into it.
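For whenever someone can get on the machine: a minimal sketch of a throttling check that only reads the standard cpufreq/thermal sysfs files during a run, assuming the cpufreq driver exposes scaling_cur_freq (exact thermal zone names and core counts will differ on Citrine):

```python
# Minimal sketch: sample per-core frequency and temperatures during a run
# to see whether the cores are being thermally throttled.
import glob
import time

def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

for _ in range(10):                      # ten samples, five seconds apart
    freqs = [read_int(p) // 1000 for p in
             sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq"))]
    temps = [read_int(p) // 1000 for p in
             sorted(glob.glob("/sys/class/thermal/thermal_zone*/temp"))]
    print(f"freq MHz min/max: {min(freqs)}/{max(freqs)}  temp C: {temps}")
    time.sleep(5)
```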

sebastienros commented 1 year ago

This should be a new column in the results: "energy consumption". Joking aside, it might even be a deciding factor for some companies.

synopse commented 1 year ago

Energy consumption, memory consumption, docker container disk size, building time...

volyrique commented 1 year ago

My impression is that there is a performance degradation across all benchmarks except the cached queries one, where the bottleneck is still the available network bandwidth (as before). The JSON serialization and plaintext tests seem to be the most severely affected (with the latter surprisingly no longer being limited by network bandwidth).

However, there is another major difference between the latest runs and the previous one: the kernel version has changed from Linux 5.15.0-70-generic #77-Ubuntu SMP Tue Mar 21 14:02:37 UTC 2023 to Linux 5.15.0-76-generic #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023, and a couple of speculative execution vulnerabilities became public quite recently that might be fixed by the newer kernel (but I haven't tried finding a changelog or anything). Unfortunately, even if those vulnerabilities are the culprit, downgrading to an older kernel release might not help, because I think some of the fixes also involve changes to the CPU microcode.
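If anyone wants to check that angle, the kernel reports the active mitigations under sysfs, so dumping them on both the application and database servers (and ideally diffing against the output from the old kernel) would show whether anything new was turned on. A small sketch:

```python
# Small sketch: dump the kernel version and active CPU vulnerability
# mitigations so the output can be diffed between kernels and between
# the application and database servers.
import platform
from pathlib import Path

print("kernel:", platform.release())
for entry in sorted(Path("/sys/devices/system/cpu/vulnerabilities").iterdir()):
    print(f"{entry.name:25s} {entry.read_text().strip()}")
```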

fakeshadow commented 1 year ago

At least for xitca-web, the regression happened during the first run, and the score is relatively similar in the second run. Something must have caused an in-flight performance regression, which is unlikely to be the kernel update.

joanhey commented 1 year ago

The benchmark is executed in alphabetical order of framework name. In the first run, the last frameworks show the performance regression: xitca-web, workerman, ...

The second run affects almost all frameworks, and the third run (the current one) again shows a disparate set of results.

NateBrady23 commented 1 year ago

In the first run, the last frameworks show the performance regression: xitca-web, workerman, ...

Yeah, this is where we added the extra soundproofing because of the new office space. Not really sure what to do here yet. Talking to the team.

synopse commented 1 year ago

CPU throttling problem

Seems like a really plausible explanation, especially since the slowdown follows the alphabetical order: "asp dotnet" was reaching its usual 7,000,000 RPS in the first run (when the machine was cold), then 5,600,000 RPS in the second run, and 5,300,000 RPS during the current run, on the same HW and SW configuration.

volyrique commented 1 year ago

As another data point, h2o also experienced the performance regression immediately during the first run after the server move, even though it executes before the midpoint of a run (around number 300), and its results stayed pretty much the same during the second run except for the JSON serialization test, which regressed further by 4.52%.

synopse commented 1 year ago

As another data point, h2o also experienced the performance regression

That is the point about CPU throttling: it is pretty erratic and never consistent. If the speed decrease were consistent (say, always 20%), it would not hurt the benchmark results much. But since the CPU frequency can go up and down at any time, the whole idea of ranking and benchmarking becomes unfair.

For the benchmark to reflect a business-like environment, we expect the servers to run at their best potential, as any high-end server HW setup should in the real world. Otherwise, no company would spend so much money on the latest professional computer and network components.

franz1981 commented 1 year ago

This seems a duplicate of https://github.com/TechEmpower/FrameworkBenchmarks/issues/8038

synopse commented 1 year ago

This seems a duplicate of #8038

No, it is not. #8038 is about the SW upgrade of the OS, and only concerns the frameworks that may be affected by the latest security mitigation patches, whereas this issue appeared afterwards, when the HW was moved to a new location.

After the SW upgrade, the results were still consistent between runs, whereas since the HW relocation, the results are pretty inconsistent between runs. So this issue is clearly something else, and it needs to be resolved, because CPU performance consistency is a requirement for any fair benchmark. ;)

franz1981 commented 1 year ago

After the SW upgrade, the results were still consistent between runs,

I don't agree here; it depends on the framework. For Vert.x, despite no version change, there was quite a bit of variance (for plaintext), even if not as bad as now. You can check for yourself for vertx; I didn't track other frameworks, actually (except Quarkus, which wasn't affected).

synopse commented 1 year ago

My guess is that if it affects only one or a few frameworks, it is more likely a framework or RTL stability issue. If some frameworks are stable at high performance, all the others should be too. Or at least, when all frameworks have inconsistent numbers, there is a genuine problem with the setup they run on. ;)

franz1981 commented 1 year ago

when all frameworks have inconsistent numbers, there is a genuine problem with the setup they run on. ;)

It depends on how much headroom different frameworks have to spare, based on the type of test. For example, the top plaintext entries were able to max out the network interface (and beyond, likely, looking at the dstat CPU usage) while still keeping CPU usage below 100%. Let's say that in the early stage of the performance degradation, the issue (we still haven't discovered the cause) was not yet making them consume the servers' full CPU: their total throughput was still enough to max out the NIC and look reasonably good. But in the later stage, if the problem got worse and the CPU available to the server left such frameworks unable to max out the NIC, the regression became evident there as well.

Vert.x was already maxing out the CPU without maxing out the NIC; since it had much less room to absorb a CPU performance hit, the regression was evident there immediately. This is just an example, but it makes clear that if the problem is CPU-related, what we observe now for the top-tier frameworks (and all the others) could be the effect of a degradation process that started way before, but that, depending on the characteristics of the framework or the test, looked like random noise for specific frameworks... and so passed unnoticed.
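To put rough, purely illustrative numbers on that (not measurements from the runs): a framework with CPU headroom that is capped by the NIC hides a CPU slowdown until the headroom is exhausted, while a CPU-bound one shows it immediately:

```python
# Purely illustrative numbers (not measurements) showing how CPU headroom can
# hide a CPU slowdown for a NIC-limited framework, while a CPU-bound one
# regresses from the very first percent.
NIC_LIMIT_RPS = 7_000_000          # hypothetical network ceiling

def observed_rps(cpu_capacity_rps: float, slowdown: float) -> float:
    """Throughput seen in the results: CPU capacity after the slowdown,
    capped by the NIC."""
    return min(cpu_capacity_rps * (1 - slowdown), NIC_LIMIT_RPS)

for slowdown in (0.0, 0.10, 0.20, 0.30):
    nic_limited = observed_rps(9_000_000, slowdown)   # had ~30% CPU headroom
    cpu_bound = observed_rps(6_000_000, slowdown)     # was already CPU-bound
    print(f"slowdown {slowdown:4.0%}: NIC-limited {nic_limited:>9,.0f} RPS, "
          f"CPU-bound {cpu_bound:>9,.0f} RPS")
```

With these made-up figures, the NIC-limited entry still reports 7,000,000 RPS at a 10% or 20% CPU slowdown and only drops at 30%, while the CPU-bound entry regresses linearly from the start.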

jordanbray commented 1 year ago

I'm just chiming in to say I hope it's the sound proofing. Not on any technical grounds - it would just make the best story.

NateBrady23 commented 1 year ago

OK, we got the temperature way down at 6pm PT on Sept 7. I haven't been able to get into the office to take a look at some /proc or /sys CPU info, but hopefully we see a correction from here.

Edit: Faf numbers came in at 7M RPS for plaintext. Seems likely it's the story @jordanbray wants ;)

Edit: All frameworks starting with "f" and after are starting to look like the pre-move results.

volyrique commented 1 year ago

@nbrady-techempower I still see a roughly 4% regression in the single query, multiple queries, and database updates results for h2o (but, curiously, not in the fortunes test) - have the temperature fixes also been applied to the database server? Other high performers such as just-js and may-minihttp are similarly affected.

Alternatively, the cause might be the potential security fixes that I mentioned, which people were sceptical would have a particular effect on the database server. If that is the case, then it makes sense that the fortunes test is affected the least, because it requires the greatest amount of processing on the application server.

NateBrady23 commented 1 year ago

Alternatively, the cause might be the potential security fixes that I mentioned, which people were sceptical would have a particular effect on the database server. If that is the case, then it makes sense that the fortunes test is affected the least, because it requires the greatest amount of processing on the application server.

This makes sense to me, as the temp is fine for all the servers now.

synopse commented 1 year ago

It sounds to me like performance is likely to be stable now.

Perhaps we could wait for a whole new round; then, if there is no more slowdown with respect to the numbers from before the Citrine move, it will be time to close this issue.

Thanks a lot for taking the time to find the cause and fix it physically. :) And sorry for the noise in your office, due to the not-so-soundproof Citrine servers!

NateBrady23 commented 1 year ago

And sorry for the noise in your office, due to the not-so-soundproof Citrine servers!

hahaha it's not too bad, but I'm not physically in the office most of the time so it's easy for me to say; if you knew where they were before 😱

Agreed on waiting another full round to consider this closed, and then we can set a lockdown date for the next round. Hopefully that's the end of the interruptions!

KostyaTretyak commented 1 year ago

Could this issue be related to exhaustion of the SSDs' write/erase cycles and the corresponding garbage collection process?

synopse commented 1 year ago

@KostyaTretyak I don't think so, because there are few writes to the disk in terms of volume. And the current state, after removing the sound-proofing hardware, gives consistent numbers.

@nbrady-techempower The last round just finished: https://www.techempower.com/benchmarks/#section=test&runid=57f6d119-5d08-4716-a2d8-8665d10839d3&test=composite The numbers seem consistent with the performance observed before the server move.

I guess I can close this issue now. Perhaps it is one step toward https://github.com/TechEmpower/FrameworkBenchmarks/issues/7475