TechEmpower / FrameworkBenchmarks

Source for the TechEmpower Framework Benchmarks project
https://www.techempower.com/benchmarks/

Framework tests can negatively impact other tests in the same framework #1228

Closed: hamiltont closed this issue 9 years ago

hamiltont commented 9 years ago

For example, the plaintext test runs at large concurrency and often causes servers to either a) hang completely or b) begin massively underperforming (while still responding to small checks, like curling the URLs). This causes big problems for the other test types (json, single, multiple, fortune, etc.) if plaintext is run first -- all of the following tests either don't complete or produce bad numbers.

This is a nasty issue because it's subtle to notice (no big errors are raised), has major performance implications, and limits TFB's ability to add future test types. Given that many of our tests intentionally push servers to failure, this is definitely a problem we want to solve categorically.

The only robust solution I see is to isolate tests from each other, e.g. by turning our current pattern of this:

start_server --> db --> json --> plaintext --> fortune --> etc --> stop_server

into

start_server --> db --> stop_server --> start_server --> json --> stop_server --> etc

This allows us to use things like Benchmarker#__forciblyEndPortBoundProcesses to ensure the server is totally reset between each test. However, this would drastically increase the testing time.
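For illustration, here is a minimal sketch of that isolated pattern. The command strings, test list, and 600-second timeout are placeholders of mine, not TFB's real configuration; the actual orchestration lives in the Benchmarker class.

```python
import subprocess

# Hypothetical sketch; the command strings, test list, and timeout are
# placeholders, not TFB's real configuration.
TEST_TYPES = ["db", "json", "plaintext", "fortune"]

def run_isolated(start_cmd, bench_cmd_template):
    results = {}
    for test_type in TEST_TYPES:
        server = subprocess.Popen(start_cmd, shell=True)  # fresh server per test
        try:
            bench = subprocess.run(
                bench_cmd_template.format(test=test_type),
                shell=True, capture_output=True, timeout=600)
            results[test_type] = bench.stdout
        except subprocess.TimeoutExpired:
            results[test_type] = None  # this test hung, but the next one
                                       # still gets a clean server
        finally:
            server.terminate()  # plus a forcible port sweep between tests
            server.wait()
    return results
```

The cost is that every test type now pays the full framework start/stop time, which is exactly where the drastic increase in testing time comes from.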

msmith-techempower commented 9 years ago

Additionally, __forciblyEndPortBoundProcesses doesn't currently seem to succeed every time; we will need it to work reliably before tackling this issue.

msmith-techempower commented 9 years ago

Ugh, more hell. Plaintext can reliably get otherwise reasonable frameworks to completely hang in amazing ways. bottle is the one I've been using to test, but basically we can never kill these processes because they no longer carry a PPID we can trace back to our launcher, so we have no idea whether they should be terminated.
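For what it's worth, a port-based sweep sidesteps the PPID problem entirely, since a hung worker still holds its listening socket even after being orphaned. Below is a rough sketch using psutil; this is my assumption of the approach, not the actual __forciblyEndPortBoundProcesses implementation.

```python
import psutil

def forcibly_end_port_bound_processes(ports):
    # Kill anything holding a listening socket on a benchmark port,
    # regardless of parentage. Note: on Linux, enumerating other users'
    # sockets requires root, which feeds straight into the sudo debate.
    for conn in psutil.net_connections(kind="tcp"):
        if (conn.status == psutil.CONN_LISTEN
                and conn.laddr.port in ports
                and conn.pid is not None):
            try:
                psutil.Process(conn.pid).kill()
            except psutil.NoSuchProcess:
                pass  # already gone
```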

hamiltont commented 9 years ago

I think this is the right solution, and it needs to happen between R10 and R11. Let's be clear -- the testrunner account is not itself the fundamental fix -- the fundamental thing a testrunner account would do is remove frameworks' ability to use sudo. If frameworks could still use sudo inside the testrunner account, then this new user account wouldn't solve many of the problems we are having. If frameworks cannot use sudo inside testrunner, we will suddenly see lots of frameworks failing to run. This will be a huge PITA to fix, but I truly hope that every framework we test can be launched without requiring sudo (if one cannot, should we exclude it?)

If frameworks cannot use sudo, then most of the methods we've proposed or tried (recursively searching PPIDs, running frameworks under a different user account so we can track their processes, searching for processes bound to ports that should be open, etc.) would work reliably. But we do need a separate account for this, because TFB itself needs to run with permissions to install software.
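As a sketch of the user-account idea: a reparented process loses its useful PPID but keeps its owning UID, so ownership remains a reliable kill criterion. The testrunner account name and psutil usage below are my assumptions, not existing TFB code.

```python
import psutil

TESTRUNNER_USER = "testrunner"  # hypothetical dedicated account

def kill_all_processes_for(username=TESTRUNNER_USER):
    # A framework's processes keep this UID even after being reparented
    # to init, so ownership works as a kill criterion where PPID-walking
    # does not.
    victims = []
    for proc in psutil.process_iter(attrs=["username"]):
        try:
            if proc.info["username"] == username:
                proc.terminate()
                victims.append(proc)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    # Allow a short graceful-exit window, then SIGKILL any survivors.
    _, alive = psutil.wait_procs(victims, timeout=5)
    for proc in alive:
        proc.kill()
```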

hamiltont commented 9 years ago

Thought about this some more.

FYI, the precise problem is here: we run test after test against a framework without restarting it in between.

I'd expect that changing the order of the tests could result in substantial differences for some underperforming frameworks. This is obviously an issue with those frameworks - if a framework starts to underperform after experiencing a period of heavy load, that's a huge problem.

We could help frameworks avoid this challenge by restarting the framework between each test type, although this would increase our total benchmark running time from 1 day to 6 days.

Conclusion: This isn't TFB's battle, and if a framework starts to fail after experiencing load that is something that should be reflected in the results. I vote to close this issue as wontfix. Thoughts?

msmith-techempower commented 9 years ago

Ugh... this again.

There are two prevailing schools of thought on the issue I mentioned above (plaintext can reliably get otherwise reasonable frameworks to completely hang in amazing ways):

  1. Nuke everything between test iterations (there are essentially 5 plaintext runs, 5 db runs, etc., at different concurrencies) and reissue another small priming run before each iteration (sketched after this list).
  2. It's the fault of the framework/stack author, and they should resolve it.
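To make option 1 concrete, here is a minimal sketch; the concurrency levels and the start/prime/measure hooks are stand-ins of mine, not TFB's actual values or APIs.

```python
CONCURRENCY_LEVELS = [8, 16, 32, 64, 128]  # placeholder levels

# Stub hooks standing in for the real framework setup and load generator.
def start_server(): pass
def stop_server(): pass
def prime(concurrency): pass
def measure(concurrency): return {"concurrency": concurrency, "rps": 0}

def run_with_resets(test_type):
    # Option 1: full teardown plus a short priming pass before every
    # iteration, so a hang at one concurrency level cannot taint the next.
    samples = []
    for concurrency in CONCURRENCY_LEVELS:
        start_server()
        try:
            prime(concurrency)
            samples.append(measure(concurrency))
        finally:
            stop_server()
    return samples
```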

There are MANY problems with the first, but they almost entirely boil down to "more work for TE et al.", whereas the problems with the second boil down to unfairness. For instance, many of the Python frameworks that suffer from this problem do not seem to be stuck in actual framework code, but rather somewhere in the network stack of a Python library. While this does show a limitation of choosing such a framework, it's ultimately something in the stack outside the author's control... and punishing a framework for a shortcoming of its language stack feels overly harsh.

That said, it is somewhat infeasible for us to run rounds if they take 6 days to complete. We cannot justify the cost of running Amazon EC2 instances for that long on this type of project (currently, we spin them up for the first preview and the final run, then shut them down immediately once we've captured their logs).

Sigh.

Let's remember this issue exists for future discussions on the matter, but I'm inclined to agree with @hamiltont; let's close it as wontfix for now.