Actually the above problem seems to be bigger. I've checked the logs from the last three tfb-status runs: two from Citrine and one from Azure (round 17 branch).
# Download and unzip the result.zip files
grep -r "unable to connect to" ./results/20180925161627/ > problems.txt
There are a lot of entries with the following Cannot assign requested address message:
./results/20180925161627/ffead-cpp-apache-mysql/query/raw.txt:unable to connect to 10.0.0.1:8080 Cannot assign requested address
./results/20180925161627/ffead-cpp-apache-mysql/json/raw.txt:unable to connect to 10.0.0.1:8080 Cannot assign requested address
./results/20180925161627/ffead-cpp-apache-mysql/json/raw.txt:unable to connect to 10.0.0.1:8080 Cannot assign requested address
./results/20180925161627/api_hour-mysql/fortune/raw.txt:unable to connect to 10.0.0.1:8080 Cannot assign requested address
./results/20180925161627/api_hour-mysql/fortune/raw.txt:unable to connect to 10.0.0.1:8080 Cannot assign requested address
./results/20180925161627/api_hour-mysql/update/raw.txt:unable to connect to 10.0.0.1:8080 Cannot assign requested address
If the server application is not responding, we should see only Connection refused:
./results/20180925161627/wizzardo-http/plaintext/raw.txt:unable to connect to 10.0.0.1:8080 Connection refused
./results/20180925161627/tornado-postgresql-raw/db/raw.txt:unable to connect to 10.0.0.1:8080 Connection refused
./results/20180925161627/tornado-postgresql-raw/db/raw.txt:unable to connect to 10.0.0.1:8080 Connection refused
./results/20180925161627/tornado-postgresql-raw/db/raw.txt:unable to connect to 10.0.0.1:8080 Connection refused
./results/20180925161627/tornado-postgresql-raw/db/raw.txt:unable to connect to 10.0.0.1:8080 Connection refused
./results/20180925161627/tornado-postgresql-raw/db/raw.txt:unable to connect to 10.0.0.1:8080 Connection refused
./results/20180925161627/tornado-postgresql-raw/db/raw.txt:unable to connect to 10.0.0.1:8080 Connection refused
./results/20180925161627/tornado-postgresql-raw/db/raw.txt:unable to connect to 10.0.0.1:8080 Connection refused
In this particular case the JVM died while running wizzardo-http.
The servlet3 raw log from the same run (20180925) is a nice illustration of the problem:
---------------------------------------------------------
Running Primer plaintext
wrk -H 'Host: 10.0.0.1' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 5 -c 8 --timeout 8 -t 8 http://10.0.0.1:8080/plaintext
---------------------------------------------------------
unable to connect to 10.0.0.1:8080 Cannot assign requested address
---------------------------------------------------------
Running Warmup plaintext
wrk -H 'Host: 10.0.0.1' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 512 --timeout 8 -t 28 http://10.0.0.1:8080/plaintext
---------------------------------------------------------
Running 15s test @ http://10.0.0.1:8080/plaintext
28 threads and 512 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.06ms 4.12ms 204.58ms 96.60%
Req/Sec 4.49k 540.59 10.67k 88.42%
Latency Distribution
50% 3.86ms
75% 5.64ms
90% 7.23ms
99% 8.79ms
1883257 requests in 15.10s, 232.02MB read
Requests/sec: 124729.70
Transfer/sec: 15.37MB
---------------------------------------------------------
The primer doesn't produce a result, but the first warmup is OK?!
Another manifestation of the same problem is the logs of various frameworks complaining with the same Cannot assign requested address message when trying to talk to the database. I've used:
grep -r "Cannot assign requested address" ./results/20180925161627/ > problem-db-25.txt
@nbrady-techempower @michaelhixson This seems to be a serious problem.
This seems to suggest that sometimes the, presumably non-blocking, docker shutdown of a container outlasts the next docker start, so trying to bind with the same hostname (or port, or address) fails?
@nbrady-techempower @michaelhixson Is it possible to reproduce locally? If so, can we add some checking similar to how we wait until curl responds from the server before starting the primer? I no longer fear waiting forever on an attempted shutdown of an image since we are relying on Docker instead of TFBReaper.
Something like: run curl against the app server until it fails (not sure about how to handle the database and client containers).
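For illustration only, such a check might look like the snippet below (tfb-server:8080 stands in for wherever the previous framework was listening):
# Hypothetical sketch: after stopping the previous framework's container, poll its port
# until nothing answers any more, so the next framework never starts while the port is busy.
while curl --silent --max-time 2 --output /dev/null http://tfb-server:8080/; do
  echo "previous framework still answering on 8080, waiting..."
  sleep 2
done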
Would it be possible to run an experiment where we map each docker container to a different port over the course of the test? E.g. the first one maps 8080:8081, then the next 8080:8082, etc. The load tester would need to know where each test is moving to, but it would eliminate any possibility that the last "old" container could impact the next test. The results may be more stable.
@nathantippy If the old container is up and running and using resources then the current test is still impacted. Changing the ports would make that harder to see. At least with the ports the same, we see the collision and know we have a problem with those tests.
I think we caught the rest of these issues in #4585. This should be resolved.
@nbrady-techempower My quick check shows that Cannot assign... is still a problem. See the Gemini logs for example. The latest finished run in Citrine is: https://tfb-status.techempower.com/results/131672ca-061d-4cbc-b68e-c2d4a2018cf4 It includes the mentioned commit for #4585.
Scroll down quite a lot (or search for Cannot assign requested address) to skip the logs about closing connections:
gemini-postgres: GH 06:01:26 jdCA: none - 702171 - [c266;t-1] Closing connection profile 266.
... ... ...
gemini-postgres: GH 06:04:57 jdCA: none - 41596493 - [c50226;t-1] Closing connection profile 50226.
gemini-postgres: GH 06:04:57 jdCA: none - 41612616 - none - 41612616 - [c50464;t-1] SQL Exception while connecting. org.postgresql.util.PSQLException: Connection to tfb-database:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
gemini-postgres: at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:280)
gemini-postgres: at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
gemini-postgres: at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
gemini-postgres: at org.postgresql.Driver.makeConnection(Driver.java:454)
gemini-postgres: at org.postgresql.Driver.connect(Driver.java:256)
gemini-postgres: at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:677)
gemini-postgres: at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:228)
gemini-postgres: at com.techempower.data.jdbc.JdbcConnectionProfile.establishDatabaseConnection(JdbcConnectionProfile.java:180)
gemini-postgres: at com.techempower.data.jdbc.JdbcConnectionManager.createConnectedProfile(JdbcConnectionManager.java:597)
gemini-postgres: at com.techempower.data.jdbc.JdbcConnectionManager.addNewProfile(JdbcConnectionManager.java:195)
gemini-postgres: at com.techempower.data.jdbc.JdbcConnectionManager.getAnyAvailableProfile(JdbcConnectionManager.java:464)
gemini-postgres: at com.techempower.data.jdbc.JdbcConnectionManager.getProfile(JdbcConnectionManager.java:404)
gemini-postgres: at com.techempower.data.jdbc.JdbcConnectionManager.getConnectionMonitor(JdbcConnectionManager.java:495)
gemini-postgres: at com.techempower.data.jdbc.BasicConnectorFactory.getConnectionMonitor(BasicConnectorFactory.java:356)
gemini-postgres: at com.techempower.data.EntityGroup.get(EntityGroup.java:417)
gemini-postgres: at com.techempower.cache.EntityStore.get(EntityStore.java:775)
gemini-postgres: at hello.home.handler.HelloHandler.multipleQueries(HelloHandler.java:74)
gemini-postgres: at hello.home.handler.HelloHandlerMethodAccess.invoke(Unknown Source)
gemini-postgres: at com.techempower.gemini.path.MethodSegmentHandler.dispatchToAnnotatedMethod(MethodSegmentHandler.java:493)
gemini-postgres: at com.techempower.gemini.path.MethodSegmentHandler.handle(MethodSegmentHandler.java:388)
gemini-postgres: at com.techempower.gemini.path.PathDispatcher.dispatch(PathDispatcher.java:231)
gemini-postgres: at com.techempower.gemini.GeminiApplication.handleRequest(GeminiApplication.java:1092)
gemini-postgres: at com.techempower.gemini.GeminiApplication.doRequest(GeminiApplication.java:1036)
gemini-postgres: at com.techempower.gemini.transport.InfrastructureServlet.service(InfrastructureServlet.java:168)
gemini-postgres: at javax.servlet.http.HttpServlet.service(HttpServlet.java:97)
gemini-postgres: at com.caucho.server.dispatch.ServletFilterChain.doFilter(ServletFilterChain.java:109)
gemini-postgres: at com.caucho.server.webapp.WebAppFilterChain.doFilter(WebAppFilterChain.java:156)
gemini-postgres: at com.caucho.server.dispatch.ServletInvocation.service(ServletInvocation.java:290)
gemini-postgres: at com.caucho.server.http.HttpRequest.handleRequest(HttpRequest.java:838)
gemini-postgres: at com.caucho.network.listen.TcpSocketLink.dispatchRequest(TcpSocketLink.java:1362)
gemini-postgres: at com.caucho.network.listen.TcpSocketLink.handleRequest(TcpSocketLink.java:1318)
gemini-postgres: at com.caucho.network.listen.TcpSocketLink.handleRequestsImpl(TcpSocketLink.java:1302)
gemini-postgres: at com.caucho.network.listen.TcpSocketLink.handleRequests(TcpSocketLink.java:1210)
gemini-postgres: at com.caucho.network.listen.TcpSocketLink.handleAcceptTaskImpl(TcpSocketLink.java:1006)
gemini-postgres: at com.caucho.network.listen.ConnectionTask.runThread(ConnectionTask.java:117)
gemini-postgres: at com.caucho.network.listen.ConnectionTask.run(ConnectionTask.java:93)
gemini-postgres: at com.caucho.network.listen.SocketLinkThreadLauncher.handleTasks(SocketLinkThreadLauncher.java:169)
gemini-postgres: at com.caucho.network.listen.TcpSocketAcceptThread.run(TcpSocketAcceptThread.java:61)
gemini-postgres: at com.caucho.env.thread2.ResinThread2.runTasks(ResinThread2.java:173)
gemini-postgres: at com.caucho.env.thread2.ResinThread2.run(ResinThread2.java:118)
gemini-postgres: Caused by: java.net.ConnectException: Cannot assign requested address (connect failed)
gemini-postgres: at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
gemini-postgres: at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:400)
gemini-postgres: at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:243)
gemini-postgres: at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:225)
gemini-postgres: at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:402)
gemini-postgres: at java.base/java.net.Socket.connect(Socket.java:591)
gemini-postgres: at org.postgresql.core.PGStream.<init>(PGStream.java:70)
gemini-postgres: at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:91)
gemini-postgres: at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:192)
gemini-postgres: ... 39 more
This may be a problem with the Gemini tests. My concern before was that one framework's failures were causing another framework problems. I don't think that's the case here. I'll dig in a little more tomorrow.
@nbrady-techempower I've checked some of the dstat logs, particularly the TCP sockets in the TIME_WAIT state. There are too many of them in some test runs.
wget https://tfb-status.techempower.com/raw/results.2019-04-15-14-00-12-923.zip
unzip results.2019-04-15-14-00-12-923.zip
find -name "*.txt.json" | parallel --bar "jq '.[] | .[] | .\"tcp sockets\".tim'" {} > tim.out
cat tim.out | uniq | sort -rn | head -30
Sample result:
xxx@yyy:~/tmp/tfb-results$ cat tim.out | uniq | sort -rn | head -30
131433
131409
131303
131290
131264
131236
131192
131168
131154
131139
131129
131083
131077
131071
131068
131067
131065
131022
130890
130642
128423
128076
123837
123626
120629
120519
115872
110989
108184
95087
And a quick and dirty reverse check:
xxx@yyy:~/tmp/tfb-results$ grep "tim\": 131433" -R .
./results/20190412010730/aspnet-mono-ngx-my/db/stats.txt.json: "tim": 131433.0,
xxx@yyy:~/tmp/tfb-results$ grep "tim\": 95087" -R .
./results/20190412010730/rack-sequel-postgres-passenger-mri/json/stats.txt.json: "tim": 95087.0,
The corresponding raw logs look suspicious:
Before the Docker migration there were tweaks for the kernel TCP stack. I don't know where to check currently. Note that tcp_tw_recycle is not a valid option anymore (via some comment on StackOverflow).
Maybe it is a good idea to add an export of sysctl -a to the zip file with the results. That should make it easier to spot configuration differences.
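Something along these lines, purely as a sketch (the output path is a placeholder):
# Illustrative only: dump the kernel settings next to the other result files so runs can be diffed.
sysctl -a > ./results/<run-timestamp>/sysctl-app-server.txt 2>/dev/null
# The TCP-related subset is usually the interesting part for this issue:
sysctl net.ipv4.tcp_tw_reuse net.ipv4.tcp_fin_timeout net.ipv4.ip_local_port_range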
@msmith-techempower Mike, do you have any time to look at this, perhaps next week?
I have some time today; I'll see what I can do.
Reopening this issue since it was never resolved, as far as I know.
I looked into this a while ago and never figured it out. I had a hypothesis that I never tested that went like this: (Warning: wild speculation follows)
Since we use Docker's --network host mode, frameworks will consume system-level resources for connections that are visible when running lsof on the host. These connections stick around for some time after the Docker container that opened the connections is stopped. They sometimes stick around long enough that they're still there when the next framework is being tested. If there are enough connections like this, we hit some host-level maximum and start seeing errors like "cannot assign requested address". Eventually the OS cleans up these connections so that not every other framework is affected.
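One way to test that hypothesis (not something the toolset does today, just an illustration) would be to sample the host's socket table right after a container is stopped:
# Count host-level sockets per TCP state right after "docker stop";
# a large TIME_WAIT pile-up here would support the hypothesis above.
ss -tan | awk 'NR > 1 { counts[$1]++ } END { for (s in counts) print s, counts[s] }'
# Or just the leftovers still tied to the benchmark port (8080 is an example):
ss -H -tan state time-wait '( sport = :8080 or dport = :8080 )' | wc -l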
My hypothesis is that it is something about the kernel TCP stack settings. I've touched on that in my previous comment. Something similar to this. That's why I've suggested the sysctl -a export.
Here the container is not restarted, but the framework sometimes stops responding and sometimes responds (in bulk). Similar behaviour as described on ServerFault.
Another data point is the TCP sockets during the test runs, particularly in the plaintext test. All data is from the latest finished run: https://tfb-status.techempower.com/results/aa3d1cda-5cb6-489f-bd01-1e3718ca4d8e. The graphs are from here: https://lamada.eu/dstat-graph/.
ulib-stats.txt (label on the left) shows how it should be: two primer runs with four benchmark runs. The Sockets and TCP sockets stats agree with each other. We clearly see the 4096 and 16384 concurrency.
lib-reactor-stats.txt is also OK.
actix-raw-stats.txt shows it is not able to achieve 16384 concurrency for about half of the run. The Sockets and TCP sockets stats still agree.
asp-core-rhtx-stats.txt shows it is not able to achieve 16384 concurrency for about 3/4 of the run. Sockets and TCP sockets do NOT agree with each other.
asp-core-stats.txt shows it is achieving about 10-11K concurrency. Again the two socket stats do NOT agree with each other.
These are 6 frameworks from the TOP 10 in the plaintext test.
At the end are servlet-stats.txt and servet-lib-native-stats.txt. The first starts at 5300 and finishes at around 7400 concurrency. The second starts at 5200 and finishes at around 5400. If there were some limitation in the configuration of the framework, I would expect to see a flat line for this parameter for the majority of the time (after the initial rise). This is connected with #4808.
I've just remembered about this case with Servlet3 (Tomcat 9). It was at the beginning of April this year. We are interested in servlet3 and servlet3-sync.
The difference between the two is this change: tuning of the Tomcat HTTP connector. The new configuration (maxKeepAliveRequests="-1") stops the closing of the HTTP connection after 100 requests (or something similar, I don't remember the default value). But closing the HTTP connection should also close the TCP/IP connection. And after many closed TCP connections there comes a moment when no new connections can be established - 0 RPS in some cases.
@zloster Sorry, I totally forgot about this. I'm super busy today, but I'll try and make sure this is a priority before the next round.
Here is the sysctl -a output from the app server in the meantime. They should all be configured the same. Let me know if you spot something.
We looked into the "Cannot assign requested address" issue and found that the issue was caused by insufficient ephemeral ports, or too many connections, on the VMs when running certain frameworks. Since most frameworks do not have this issue and are behaving within spec as expected, the fix/mitigation we decided to implement was to add a 60-second wait time between each permutation to minimize the lingering impact of occupied ports from the previous permutation to the next.
60 seconds seems to be the amount of time for lingering TIME_WAIT connections to be timed out and closed, per https://github.com/torvalds/linux/blob/master/include/net/tcp.h#L121-L122 and our own observation.
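Illustratively, the pause between permutations can either be a flat sleep 60 or poll until the leftovers have expired; a sketch of the polling variant (not the actual toolset code):
# Wait up to 60 s (TCP_TIMEWAIT_LEN in the kernel) for lingering TIME_WAIT sockets to expire
# before the next framework permutation is started.
for i in $(seq 1 60); do
  [ "$(ss -H -tan state time-wait | wc -l)" -eq 0 ] && break
  sleep 1
done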
We think the maintainers of test implementations having this issue should treat it as a potential bug in that implementation and/or framework.
Looking at the results from a recent Citrine run (2020/04/01), only the following frameworks were having the "Cannot assign requested address" issue:
cowboy
gemini-mysql
gemini-postgres
php-raw7-tcp
spyne-raw
By explicitly specifying the Connection: keep-alive header, wrk should reuse the connections, and thus the number of open connections should stay at around the specified -c number for each test. And we are seeing 512 or so for each test (using undertow as our subject of a normally behaved framework), or slightly more amid lingering connections from the previous test. However, for cowboy, as an example, we're seeing upwards of 28,000+ connections, which hits the server's port limit. They stay in the TIME_WAIT state waiting to be closed.
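That 28,000+ figure lines up with the default ephemeral port range on Linux, which can be checked with:
# Default range is 32768-60999, i.e. roughly 28k usable client-side ports per destination;
# once they are all sitting in TIME_WAIT, new connects fail with "Cannot assign requested address".
sysctl net.ipv4.ip_local_port_range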
We tinkered around and found that setting net.ipv4.tcp_tw_reuse to 1 would prevent the error from happening. However, we worry changing this value would potentially mask real problems and make our benchmarking environment less realistic.
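For reference, the experiment amounts to toggling a single sysctl (shown only to illustrate what was tried, not as a recommended setting):
# Allow reuse of TIME_WAIT sockets for new outgoing connections.
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
# Revert afterwards (the default is 0 on older kernels, 2 on newer ones).
sudo sysctl -w net.ipv4.tcp_tw_reuse=0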
So in the end, we've decided to add a 60-second wait between each permutation to make sure the TIME_WAIT connections are closed, to minimize impacts on the next set of tests.
In the last run, a new framework with that problem is Phalcon-micro: https://tfb-status.techempower.com/unzip/results.2020-04-20-07-27-56-665.zip/results/20200415180849/phalcon-micro/run/phalcon-micro.log
Closing in favor of new issue.
We looked into the "Cannot assign requested address" issue and found that the issue was caused by insufficient ephemeral ports, or too many connections, on the VMs when running certain frameworks.
@jsongte Thank you for so easily dismissing someone's effort. It's not fair. You HAVE CONFIRMED that the issue IS CAUSED BY "ephemeral ports" - see my comment above where I've requested information and suggested what should be checked. For almost 6 (six) MONTHS the updates here WERE IGNORED by the TechEmpower team.
We tinkered around and found that setting net.ipv4.tcp_tw_reuse to 1 would prevent the error from happening. However we worry changing this value would potentially mask real problems and make our benchmarking environment less realistic.
Then there should be a VERY good explanation of why setting net.ipv4.tcp_tw_reuse to 1 was OK before the migration to Docker containers and isn't OK now.
@zloster I take full responsibility for not pulling other people into this issue sooner. When we finally got around to it, we took Citrine down for a week, solely for the purpose of working on this. We had internal discussions while we were trying several different approaches that were all related to the guidance you provided. If we gave the impression that your efforts weren't valuable, it was unintentional. The time and effort you put into these more difficult problems is truly appreciated by all of us!
As far as setting net.ipv4.tcp_tw_reuse to 1 before and after Docker goes, I'm not entirely sure whether it is OK or not OK. I think we're just trying our best to stick with defaults where we can, while making sure errors from frameworks aren't swallowed by configuration changes.
I've noticed that:
hexagon-undertow-mongodb
hexagon-undertow-postgresql
are having problems. The following message is visible in the raw.txt:
Here are some example logs from citrine:
The .dockerfiles are looking OK, i.e. no obvious differences.