dotwaffle opened this issue 5 years ago
Actually, I think this might be our old friend `MADV_FREE`, which is the default in Go 1.12 and later. The kernel will only free the memory under memory pressure. Unfortunately, that is screwing up my alerting...

I'm going to try setting `GODEBUG=madvdontneed=1` as an environment variable to see if that stops it. I'm still curious why the memory usage massively increased at those two times, though.
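(Side note, in case it helps tell the two cases apart: with `MADV_FREE` the pages Go has returned show up in its `HeapReleased` counter even though the kernel-reported RSS stays high, whereas a genuine leak grows `HeapAlloc`/`HeapInuse`. The sketch below is a standalone diagnostic, not gortr code; to be meaningful it would have to run inside the process being observed, and if the metrics endpoint already exposes the standard Go Prometheus collector the same counters are available as `go_memstats_*` gauges, which I haven't verified for gortr.)

```go
// memcompare.go: standalone sketch (not gortr code) that prints Go's heap
// accounting next to the RSS the kernel reports, every 30 seconds.
// Under MADV_FREE, HeapReleased climbs while VmRSS stays high until there is
// memory pressure; a real leak grows HeapAlloc/HeapInuse instead.
// To be useful this loop would need to run inside the process under observation.
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
	"time"
)

// vmRSS reads the VmRSS line from /proc/self/status (Linux only).
func vmRSS() string {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "VmRSS: unavailable"
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "VmRSS:") {
			return strings.TrimSpace(sc.Text())
		}
	}
	return "VmRSS: not found"
}

func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("HeapAlloc=%dMiB HeapInuse=%dMiB HeapIdle=%dMiB HeapReleased=%dMiB Sys=%dMiB | %s\n",
			m.HeapAlloc>>20, m.HeapInuse>>20, m.HeapIdle>>20,
			m.HeapReleased>>20, m.Sys>>20, vmRSS())
		time.Sleep(30 * time.Second)
	}
}
```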
Let me know what you find! Thanks for the investigation! I will take a look as well.
It's been 4 hours and memory usage is still reasonable. I'll keep it running for 24 hours to see whether it spikes overnight. Thanks!
Ahah! So, running overnight, memory usage was low and stable with `GODEBUG=madvdontneed=1` set. However, I stopped the RIPE NCC rpki-validator-3, and about 10 minutes later memory usage ballooned to nearly 4GB.

Here are the relevant logs:

```
time="2019-11-15T00:40:28Z" level=info msg="File http://ripe-rpkivalidator3:8080/api/export.json is identical to the previous version"
[... run docker service scale rpki_ripe-rpkivalidator3=0 rpki_cf-octorpki=0 ...]
time="2019-11-15T00:41:28Z" level=error msg="Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:41:28Z" level=error msg="Error updating: Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:42:28Z" level=error msg="Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:42:28Z" level=error msg="Error updating: Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:43:28Z" level=error msg="Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:43:28Z" level=error msg="Error updating: Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:44:28Z" level=error msg="Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:44:28Z" level=error msg="Error updating: Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:45:28Z" level=error msg="Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:45:28Z" level=error msg="Error updating: Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:46:28Z" level=error msg="Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:46:28Z" level=error msg="Error updating: Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:47:28Z" level=error msg="Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:47:28Z" level=error msg="Error updating: Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:48:28Z" level=error msg="Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:48:28Z" level=error msg="Error updating: Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
[... sometime after this point, but before 00:49:30, memory balloons ...]
time="2019-11-15T00:49:26Z" level=info msg="Accepted tcp connection from 10.255.0.3:44954 (1/0)"
time="2019-11-15T00:49:26Z" level=error msg="Error unexpected EOF"
time="2019-11-15T00:49:26Z" level=info msg="Disconnecting client 10.255.0.3:44954 (v0) / Serial: 0"
time="2019-11-15T00:49:26Z" level=info msg="Accepted tcp connection from 10.255.0.3:34520 (1/0)"
time="2019-11-15T00:49:26Z" level=error msg="Error unexpected EOF"
time="2019-11-15T00:49:26Z" level=info msg="Disconnecting client 10.255.0.3:34520 (v0) / Serial: 0"
time="2019-11-15T00:49:28Z" level=error msg="Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:49:28Z" level=error msg="Error updating: Get http://ripe-rpkivalidator3:8080/api/export.json: dial tcp: lookup ripe-rpkivalidator3 on 127.0.0.11:53: no such host"
time="2019-11-15T00:49:28Z" level=info msg="Accepted tcp connection from 10.255.0.3:49090 (1/0)"
time="2019-11-15T00:49:29Z" level=error msg="Error unexpected EOF"
time="2019-11-15T00:49:29Z" level=info msg="Disconnecting client 10.255.0.3:49090 (v0) / Serial: 0"
time="2019-11-15T00:49:30Z" level=info msg="Accepted tcp connection from 10.255.0.3:33408 (1/0)"
time="2019-11-15T00:49:31Z" level=error msg="Error unexpected EOF"
time="2019-11-15T00:49:31Z" level=info msg="Disconnecting client 10.255.0.3:33408 (v0) / Serial: 0"
time="2019-11-15T00:49:32Z" level=info msg="Accepted tcp connection from 10.255.0.3:40098 (1/0)"
The conditions seem to be (`verify=false`, `checktime=false`) -- stopping octorpki didn't show the same ballooning memory usage, and that instance has `verify=true`. This may be a false lead, though, as the gortr attached to octorpki does not have those "Disconnecting client" messages.

Memory is never released afterwards, even with `MADV_DONTNEED` instead of `MADV_FREE`, and in fact it creeps a little higher. IMO, there's definitely a pointer being held somewhere that is preventing the data from being freed.
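If it helps, the usual way I'd try to confirm that is a heap profile. Assuming gortr doesn't already expose pprof (I haven't checked), the wiring is just the stock `net/http/pprof` handler, roughly like this sketch:

```go
// pprof_sketch.go: minimal sketch of the generic Go pprof wiring, to be built
// into the binary under test; it is not something gortr necessarily ships.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the profiling endpoints on a loopback-only port so they are not
	// exposed through the published Docker ports.
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}
```

Diffing `go tool pprof -inuse_space http://127.0.0.1:6060/debug/pprof/heap` snapshots taken before stopping the validator and after the balloon should point at the allocation site that is still reachable.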
For reference, the memory ballooning instance has:
```yaml
ports:
  - target: 8081 # metrics
    published: 8181
    protocol: tcp
    mode: ingress
  - target: 8082 # tls
    published: 8182
    protocol: tcp
    mode: ingress
  - target: 8022 # ssh
    published: 8122
    protocol: tcp
    mode: ingress
  - target: 8023 # raw
    published: 8123
    protocol: tcp
    mode: ingress
```
whereas the non-ballooning gortr has:
```yaml
ports:
  - target: 8081 # metrics
    published: 8281
    protocol: tcp
    mode: ingress
  - target: 8082 # tls
    published: 8282
    protocol: tcp
    mode: ingress
  - target: 8022 # ssh
    published: 8222
    protocol: tcp
    mode: ingress
  - target: 8023 # raw
    published: 8223
    protocol: tcp
    mode: ingress
```
It's not clear from the logs which port is being connected to, but I imagine 8181 (metrics) and 8122 (ssh) are more port-scannable than the others. Unfortunately, the true source IPs are hidden because Docker Swarm proxies the connections, so they always look like they come from private address space :(
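To check whether these stray connections matter at all, I might try reproducing them deliberately while watching the memory graphs. A rough sketch (the target address and pacing are just placeholders for my setup) that connects to the published raw RTR port and hangs up without speaking the protocol:

```go
// probe_sim.go: rough sketch that imitates scanner-style connections against a
// gortr instance: connect, send a stray byte, and hang up without speaking RTR.
// The target address and counts are placeholders for my setup.
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	const target = "127.0.0.1:8123" // published "raw" port from the compose snippet above

	for i := 0; i < 100; i++ {
		conn, err := net.DialTimeout("tcp", target, 2*time.Second)
		if err != nil {
			log.Printf("dial %d: %v", i, err)
			continue
		}
		// Send a single garbage byte (less than an RTR PDU header) and close,
		// which should look like the truncated connections in the logs above.
		if _, err := conn.Write([]byte{0x00}); err != nil {
			log.Printf("write %d: %v", i, err)
		}
		conn.Close()
		time.Sleep(200 * time.Millisecond)
	}
}
```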
Let me know if there's anything else I can provide, including the `docker-compose.yml` stack file if you want to debug it yourself.
Thank you so much for all the details.

Do you mind giving me the `-version` output?
My Docker image is `sha256:a62e33b75ec47b4ca288dbd3e93d48307d1b0d03f9288220f30277b58f59cbb7` and I cannot fetch the other one:

```
$ docker run -ti cloudflare/gortr@sha256:694635c16932987185a3d8d1056ef5ae287e799e7d36981a573d8baa8fc1e752
Unable to find image 'cloudflare/gortr@sha256:694635c16932987185a3d8d1056ef5ae287e799e7d36981a573d8baa8fc1e752' locally
docker: Error response from daemon: received unexpected HTTP status: 500 Internal Server Error.
```
I have the same image as you; I just accidentally pasted the image ID rather than the repo digest ;)
I'm just doing some playing around in Docker, and I noticed what could be a memory leak.
It appears to be using 8.6GB of memory after running overnight!
No valid client has ever connected to the daemon, which is running in a Docker container, as is the validator. There have been some invalid connection attempts (presumably port scans):
These do not correlate with the times memory usage jumped: (times are UTC+11 in this graph, whereas all timestamps elsewhere are in UTC, sorry!)
The source json is 13MB big:
Last relevant logs:
Finally, other gortr instances running against public instances, using the same binary, seem to be operating fine:
Docker image: `sha256:694635c16932987185a3d8d1056ef5ae287e799e7d36981a573d8baa8fc1e752` (`cloudflare/gortr:latest`, which is 8 days old).

This is a toy implementation for my own research purposes, and I've not killed the daemon (yet), so if there's anything I can provide, please let me know. I figured it was worth letting you know!
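As a rough sanity check on scale (the source JSON is only ~13 MB while the process grew to 8.6 GB), this standalone sketch measures how much heap the decoded document actually occupies. It decodes into generic `interface{}` values rather than whatever structs gortr really uses, and the file path is a placeholder, so it only gives an order of magnitude:

```go
// jsonheap.go: rough sketch to estimate the in-memory footprint of a decoded
// export.json, for comparing the ~13 MB file against a multi-GB RSS.
// Generic interface{} decoding over-estimates versus typed structs; the file
// path is a placeholder.
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"runtime"
)

// heapAlloc returns live heap bytes after forcing a GC so the number
// reflects reachable data rather than decoding garbage.
func heapAlloc() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	data, err := ioutil.ReadFile("export.json") // placeholder path
	if err != nil {
		panic(err)
	}
	before := heapAlloc() // raw bytes loaded, nothing decoded yet

	var doc interface{}
	if err := json.Unmarshal(data, &doc); err != nil {
		panic(err)
	}
	after := heapAlloc() // decoded document still live

	fmt.Printf("file=%d MiB, decoded structure ~%d MiB extra\n",
		len(data)>>20, (after-before)>>20)

	runtime.KeepAlive(doc) // keep the decoded document live past the measurement
}
```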