Docker intercepts networking and actively blocks datagrams larger than roughly 1500 bytes, see https://github.com/docker/docker/issues/8357
What I can suggest is lowering net.bufsize, see http://knot-resolver.readthedocs.io/en/latest/daemon.html#c.net.bufsize (try something like 1400 B and check the Docker logs).
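For reference, a minimal sketch of that suggestion in the kresd config (the 1400 B value is just an example below a typical container MTU, tune it to your environment):

-- Lower the advertised EDNS buffer size below the container MTU so that
-- oversized UDP answers get truncated and retried over TCP instead of
-- being dropped by Docker's networking layer.
net.bufsize(1400)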
@vavrusa Hi Marek,
I would like to share that I found a workaround for the aforementioned issue.
I used to bind to 0.0.0.0, and in the strace log I noticed lags between reads from particular interfaces, e.g. lo (nothing for dozens of milliseconds), then eth0 (lag again), then ethwe (lag again). Note that this happened only in a Docker container, not on a normal VM.
What I tried was a Lua script that iterates over the available interfaces and picks just a single one, effectively binding to one and only one interface within the container. See the code below.
-- Pick the interface to bind to from the environment, defaulting to eth0.
nicname = env.SINKIT_KRESD_NIC
if nicname == nil or nicname == '' then
    nicname = "eth0"
end

-- Listen only on the selected interface instead of binding to 0.0.0.0.
for name, addr_list in pairs(net.interfaces()) do
    if name == nicname then
        print("Found interface " .. nicname)
        net.listen(addr_list)
    else
        print("Not using interface " .. name)
    end
end
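As a side note (not part of the original snippet, and assuming net.interfaces() returns an addr list per interface as in the kresd docs), you can also log the addresses of the chosen interface, which makes it easy to confirm from docker logs what kresd actually bound to:

-- Optional: print the addresses carried by the selected interface so the
-- container logs show exactly where kresd is listening.
local iface = net.interfaces()[nicname]
if iface and iface.addr then
    for _, a in ipairs(iface.addr) do
        print("Listening on " .. nicname .. " address " .. a)
    end
end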
The average resolution time per domain on Alexa Top 2000 with an empty cache went from ~1100ms to ~300ms.
To answer your obvious question: yes, I'm sure this patch to the config file is the only thing that changed between the perf tests.
Could you share your opinion on the issue? I can't explain exactly why it helped so tremendously, and I'm very curious about the technical details.
Cheers -K-
I don't know enough about virtualised networking in Docker (maybe you should ask there), but it has gotchas: the MTU bug, issues with IPv6, and possibly exposing interfaces that the application shouldn't bind to. I suspect the problem might also be with IPv6; you might want to disable it in the config if you're running inside a Docker instance. If you can document the gotchas and submit a PR to the documentation, it would be very helpful.
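If anyone wants to try the IPv6 suggestion, a one-line sketch for the kresd config (per the daemon documentation linked above):

-- Stop kresd from contacting upstream nameservers over IPv6; useful when
-- the Docker network has no working v6 connectivity.
net.ipv6 = false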
I can confirm that disabling IPv6 on its own doesn't help; limiting kresd to one and only one interface does, though. I'll prepare a PR for a documentation "tooltip".
Thanks!
This is already documented in http://knot-resolver.readthedocs.io/en/stable/daemon.html, closing.
Dear CZ-NIC fellows,
I'm writing to you on behalf of the Whalebone organization regarding a performance issue I have been experiencing with Knot Resolver in a Docker container.
Without Docker
If I clone and compile Knot Resolver's master on a 2-core Fedora 24 VM with the usual default -O2 and run it with the default config.personal, I can easily get an average resolution time per domain of around ~300 ms, starting with a cold cache and using a 250-record list of top Alexa domains.
With Docker
When I grab either your Docker image based on Alpine Linux or my own Docker image based on Fedora 24 and run it on the very same Fedora 24 VM with 2 cores (docker 1.10.3, build 19b5791/1.10.3), I cannot get under ~1100 ms average resolution time per domain (with the same list).
This is not any kind of weird stress test; I use the ancient namebench 1.3, where a single thread queries the resolver one record at a time.
Expected results and Unbound
Unbound resolver performs virtually the same, regardless of whether it's being run as a Docker container or a plain process on the same host.
Debugging
Neither CPU consumption nor memory comes into play; everything seems quiet. There is nothing evil going on in iotop, and although Valgrind/Callgrind shows hot spots in Knot related to ld and symbol lookups for Lua, that appears to have no connection to the problem at hand. I suspect that Knot treats sockets in a way that's hard to swallow for my kernel's (4.6.3-300.fc24.x86_64) networking stack when operating inside Docker namespaces. I tried both with and without various -f settings and SO_REUSEPORT; nothing seems to help Knot's performance in Docker.