7pps closed this issue 1 year ago
Could this be the cause of the memory leak? https://github.com/hashicorp/golang-lru/issues/107
Maybe the provided hint could be used to stabilize memory usage?
Can you provide some details on this memory leak? How do you determine that this is the case? Please provide any and all details you have.
Well, ctrld's memory usage starts growing (from ~6% of router memory initially to more than 20%) once you use it with Safari or Chrome. If you switch on "debug" mode, you will see many "HTTPS" record queries from the browsers in addition to "A" and "AAAA" queries in the log.
Just use "top" on the command line to check memory usage.
Without "HTTPS" record queries from browsers, ctrld appears to be quite stable, growing from ~6% to just ~7% of router memory over one day.
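A minimal sketch of how such readings could be logged over time instead of watching "top" by hand. It assumes a Linux system with /proc available and a Python interpreter on the device, neither of which is confirmed in this thread; the function names are illustrative only.

```python
import re
import time
from datetime import datetime


def parse_vm_rss_kib(status_text: str) -> int:
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS not found in status text")
    return int(match.group(1))


def sample_rss(pid: int, interval_s: float = 60.0, count: int = 10) -> None:
    """Print a timestamped RSS reading for `pid` every `interval_s` seconds."""
    for _ in range(count):
        with open(f"/proc/{pid}/status") as fh:
            rss_kib = parse_vm_rss_kib(fh.read())
        print(f"{datetime.now():%Y-%m-%d %H:%M} {pid} {rss_kib}")
        time.sleep(interval_s)
```

Calling `sample_rss(<pid of ctrld>)` would print one line per interval with the timestamp, PID and RSS in KiB.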
Can you please provide the exact outputs that you see? We need specific details.
When reloading a web page, the response times in the ctrld log show that for "HTTPS" there are no cache hits while for "A" and "AAAA" there are cache hits.
So, the cache does not seem to work with "HTTPS" queries.
Just test ctrld with Safari or Chrome. I do not have time to deliberately break it now, sorry...
We have, and we were unable to reproduce the issue, which is why I asked for details.
I have pointed my Mac mini (newest macOS, newest Safari, newest Chrome) to ctrld on an EdgeRouter X (newest firmware v2.0.9-hotfix.6) for DNS and tested by opening and refreshing news web pages: spiegel.de, focus.de, macrumors.com
ctrld config:
# AUTO-GENERATED VIA CD FLAG - DO NOT MODIFY
[service]
log_level = 'debug'
log_path = '/var/log/ctrld.log'
cache_enable = true
cache_size = 200
[listener.0]
ip = '127.0.0.1'
port = 53
[network.0]
name = 'Network 0'
cidrs = ['0.0.0.0/0']
[upstream.0]
name = 'P1 - Malware'
type = 'doh3'
endpoint = 'https://freedns.controld.com/p1'
timeout = 5000
[upstream.1]
name = 'P0 - Uncensored'
type = 'doh3'
endpoint = 'https://freedns.controld.com/p0'
timeout = 5000
[upstream.2]
name = 'P2 - Ads & Tracking'
type = 'doh3'
endpoint = 'https://freedns.controld.com/p2'
timeout = 5000
[upstream.3]
name = 'Family - Family Friendly'
type = 'doh3'
endpoint = 'https://freedns.controld.com/family'
timeout = 5000
Here are some memory values over approx. 10 minutes:
LOG TIMESTAMP USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
2023-06-27 10:06 root 3895 22.1 6.1 614292 15584 ? Ssl 10:06 0:05 /usr/sbin/ctrld run --router --iface=auto --homedir=/etc/controld --config=/etc/controld/ctrld.toml
2023-06-27 10:08 root 3895 9.8 6.7 614292 17072 ? Ssl 10:06 0:09 /usr/sbin/ctrld run --router --iface=auto --homedir=/etc/controld --config=/etc/controld/ctrld.toml
2023-06-27 10:11 root 3895 6.1 6.8 614292 17368 ? Ssl 10:06 0:17 /usr/sbin/ctrld run --router --iface=auto --homedir=/etc/controld --config=/etc/controld/ctrld.toml
2023-06-27 10:12 root 3895 5.7 6.9 614292 17612 ? Ssl 10:06 0:18 /usr/sbin/ctrld run --router --iface=auto --homedir=/etc/controld --config=/etc/controld/ctrld.toml
2023-06-27 10:17 root 3895 7.8 7.0 614292 17756 ? Ssl 10:06 0:51 /usr/sbin/ctrld run --router --iface=auto --homedir=/etc/controld --config=/etc/controld/ctrld.toml
2023-06-27 10:18 root 3895 11.1 7.3 614292 18516 ? Ssl 10:06 1:20 /usr/sbin/ctrld run --router --iface=auto --homedir=/etc/controld --config=/etc/controld/ctrld.toml
Memory usage starts with 15584 KiB and increases to 18516 KiB within 12 minutes of testing. The used memory continues to grow until the ctrld process crashes due to an "out of memory" error.
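The growth described above can be turned into an average rate with a few lines of arithmetic. This is only a sketch: the helper name is made up for illustration, and it uses just the first and last rows of the table, so it hides any flat stretches in between.

```python
from datetime import datetime


def growth_rate_kib_per_min(samples):
    """Average RSS growth between the first and last sample, in KiB per minute."""
    fmt = "%Y-%m-%d %H:%M"
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    elapsed = datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)
    minutes = elapsed.total_seconds() / 60
    if minutes <= 0:
        raise ValueError("samples must span a positive time interval")
    return (rss1 - rss0) / minutes


# First and last rows of the table above: (timestamp, RSS in KiB).
cached_run = [("2023-06-27 10:06", 15584), ("2023-06-27 10:18", 18516)]
rate = growth_rate_kib_per_min(cached_run)
print(f"{rate:.1f} KiB/min")  # 2932 KiB over 12 minutes -> 244.3 KiB/min
```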
In the debug log file ctrld.debug.log, you can see the mentioned HTTPS queries. There are a lot of "add cache" messages but only very few "hit cache" messages.
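To put a number on "a lot" versus "very few", the two messages could be counted directly. The sketch below assumes only that the phrases "add cache" and "hit cache" appear literally in the log lines, as quoted above; the rest of the log line format is not known here, so it uses plain substring matching.

```python
from collections import Counter


def cache_message_counts(lines):
    """Count lines mentioning "add cache" and "hit cache" (plain substring match)."""
    counts = Counter({"add cache": 0, "hit cache": 0})
    for line in lines:
        for marker in ("add cache", "hit cache"):
            if marker in line:
                counts[marker] += 1
    return counts


def hit_ratio(counts):
    """Fraction of cache messages that were hits; 0.0 when there are none."""
    total = counts["add cache"] + counts["hit cache"]
    return counts["hit cache"] / total if total else 0.0
```

Passing an open file handle for ctrld.debug.log to `cache_message_counts` and the result to `hit_ratio` would give the share of cache messages that were hits.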
Hope this helps!
@7pps is it reproducible if you don't use cache?
Here are some memory values without cache:
LOG TIMESTAMP USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
2023-06-27 11:35 root 7467 67.3 6.0 614292 15428 ? Ssl 11:35 0:04 /usr/sbin/ctrld run --router --iface=auto --homedir=/etc/controld --config=/etc/controld/ctrld.toml
2023-06-27 11:38 root 7467 10.6 6.6 614292 16864 ? Ssl 11:35 0:14 /usr/sbin/ctrld run --router --iface=auto --homedir=/etc/controld --config=/etc/controld/ctrld.toml
2023-06-27 11:39 root 7467 16.1 6.8 614292 17376 ? Ssl 11:35 0:37 /usr/sbin/ctrld run --router --iface=auto --homedir=/etc/controld --config=/etc/controld/ctrld.toml
2023-06-27 11:46 root 7467 14.8 6.8 614292 17376 ? Ssl 11:35 1:32 /usr/sbin/ctrld run --router --iface=auto --homedir=/etc/controld --config=/etc/controld/ctrld.toml
2023-06-27 11:50 root 7467 14.0 6.8 614292 17376 ? Ssl 11:35 2:00 /usr/sbin/ctrld run --router --iface=auto --homedir=/etc/controld --config=/etc/controld/ctrld.toml
After some growth within the first minutes, the used memory appears to settle at 17376 KiB.
Memory is leaking during network outages as well (ctrld on EdgeRouter X, see above). Here is some data from a network outage of ~35 sec at 04:00:
LOG TIMESTAMP PID %CPU %MEM RSS TIME
2023-07-02 02:00 23566 0.6 7.6 19268 00:06:42
2023-07-02 02:15 23566 0.6 7.6 19268 00:06:47
2023-07-02 02:30 23566 0.6 7.6 19268 00:06:53
2023-07-02 02:45 23566 0.6 7.6 19268 00:06:58
2023-07-02 03:00 23566 0.6 7.6 19268 00:07:04
2023-07-02 03:15 23566 0.6 7.6 19268 00:07:09
2023-07-02 03:30 23566 0.6 7.6 19268 00:07:15
2023-07-02 03:45 23566 0.6 7.6 19268 00:07:21
2023-07-02 04:00 23566 0.6 7.6 19268 00:07:26
2023-07-02 04:15 23566 0.7 21.2 53856 00:08:34
2023-07-02 04:30 23566 0.7 21.2 53856 00:08:38
2023-07-02 04:45 23566 0.7 21.2 53856 00:08:42
2023-07-02 05:00 23566 0.7 21.2 53856 00:08:46
2023-07-02 05:15 23566 0.7 21.2 53856 00:08:49
2023-07-02 05:30 23566 0.7 21.2 53856 00:08:53
2023-07-02 05:45 23566 0.7 21.2 53856 00:08:59
2023-07-02 06:00 23566 0.6 21.2 53856 00:09:03
Used memory jumps from 7.6% up to 21.2%.
It would be great if ctrld were better behaved with memory, because high memory usage slows down the whole router and leads to crashes of ctrld.
@7pps is memory back to normal after the internet connection is restored?
No, it remains large as you can see from the log. The network outage happens between 04:00 and 04:01 (times are hh:mm).
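The step in the table above can be located automatically by looking for the largest rise between neighbouring samples. This is a sketch with an illustrative helper name, fed with four rows copied from the table (time, RSS in KiB).

```python
def largest_jump(samples):
    """Return (from_label, to_label, delta) for the largest rise between neighbours."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    pairs = zip(samples, samples[1:])
    (t0, r0), (t1, r1) = max(pairs, key=lambda p: p[1][1] - p[0][1])
    return t0, t1, r1 - r0


# Rows around the outage, copied from the table above.
outage = [("03:45", 19268), ("04:00", 19268), ("04:15", 53856), ("04:30", 53856)]
start, end, delta = largest_jump(outage)
print(start, end, delta)  # 04:00 04:15 34588
```

The whole increase of 34588 KiB falls into the single 15-minute window that contains the outage.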
Maybe query errors lead to wasted memory?
But memory resizing could be one way of implementing it: trigger a memory purge/flush once it has grown too large, like a restart that keeps valid cache entries but otherwise frees up allocated memory.
Can you please try the latest release and see if the issue is resolved? https://github.com/Control-D-Inc/ctrld/releases/tag/v1.3.0
I have tried to use v1.3.0 but get the following errors:
root@ubnt:/etc/controld# sudo ctrld start
Aug 16 17:48:51.000 NTC Starting service
Aug 16 17:48:52.000 ERR ctrld service may not have started due to an error or misconfiguration, service log:
Aug 16 17:48:52.000 ??? ================================
Aug 16 17:48:52.000 ??? Aug 16 17:48:52.000 FTL listener.0 failed to listen: listen udp 127.0.0.1:53: bind: address already in use
listen tcp 127.0.0.1:53: bind: address already in use
Aug 16 17:48:52.000 ??? ================================
Aug 16 17:48:58.000 NTC Service uninstalled
root@ubnt:/etc/controld# sudo ctrld setup auto
Aug 16 17:52:46.000 NTC Starting service
Aug 16 17:52:49.000 ERR ctrld service may not have started due to an error or misconfiguration, service log:
Aug 16 17:52:49.000 ??? ================================
Aug 16 17:52:49.000 ??? Aug 16 17:52:49.000 FTL listener.0 failed to listen: listen udp 127.0.0.1:53: bind: address already in use
listen tcp 127.0.0.1:53: bind: address already in use
Aug 16 17:52:49.000 ??? ================================
Aug 16 17:52:54.000 NTC Service uninstalled
Aug 16 17:52:54.000 FTL exit status 1
I have noticed that the file /etc/dnsmasq.d/dnsmasq-zzz-ctrld.conf is not created.
Could you please check?
For the time being I have returned to v1.2.1.
The error is quite clear: you have something (dnsmasq probably) listening on 127.0.0.1:53 already. As you're running ctrld in custom config mode, you're responsible for avoiding listener collisions, as is the case with every single piece of server software.
You can either:
v1.2.1 may work for you because it does not follow standards. v1.3.0 does.
Well, dnsmasq is listening on port 53 before ctrld is started (which is correct for EdgeRouter without ctrld). v1.2.1 moves it to port 5354:
root@ubnt:/etc/dnsmasq.d# cat dnsmasq-zzz-ctrld.conf
# GENERATED BY ctrld - DO NOT MODIFY
no-resolv
server=127.0.0.1#5354
add-mac
This should also be done by v1.3.0 or do you want to introduce manual steps?
Yes, that is the non-standard behavior I mentioned. In local config mode, ctrld enforces the EXACT config you created, which is standard for any kind of server software. In v1.2.1 it did not enforce your exact config, which made it "work", but this is unpredictable and non-standard.
You need to do one of the 2 solutions I mentioned above.
If you edit your ctrld.toml config and choose any non-conflicting port except 53 (like 5354), it will work.
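Whether a candidate address and port are already taken can be checked before editing the config. The sketch below is not part of ctrld; it simply tries to bind UDP and TCP sockets the way a DNS listener would, using only the Python standard library.

```python
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if both a UDP and a TCP socket can be bound to host:port.

    Any OSError counts as "not free". That includes "address already in use",
    but also a permission error when binding a port below 1024 without root.
    """
    for kind in (socket.SOCK_DGRAM, socket.SOCK_STREAM):
        with socket.socket(socket.AF_INET, kind) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                return False
    return True
```

For example, `port_is_free("127.0.0.1", 53)` would be expected to return False while dnsmasq is listening there, and `port_is_free("127.0.0.1", 5354)` would show whether the alternative port is available.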
Ok, I have added port=5354 to /etc/dnsmasq.conf and now it starts fine.
I am using 3 subnets on different VLANs. With listener.0 configured to 127.0.0.1, ctrld chooses one of the subnet gateways as listener IP. Unfortunately, it chooses the one which I do not want to use: 10.10.30.1 instead of 10.10.20.1:
root@ubnt:~# lsof -i -P -n | grep 53
ctrld 31771 root 14u IPv4 197498 0t0 TCP 127.0.0.1:53 (LISTEN)
ctrld 31771 root 17u IPv4 198463 0t0 UDP 10.10.30.1:53
ctrld 31771 root 18u IPv4 197516 0t0 UDP 127.0.0.1:53
ctrld 31771 root 19u IPv4 197521 0t0 UDP *:5353
ctrld 31771 root 20u IPv4 198464 0t0 TCP 10.10.30.1:53 (LISTEN)
ctrld 31771 root 22u IPv4 197524 0t0 UDP *:5353
ctrld 31771 root 23u IPv4 199045 0t0 UDP *:5353
ctrld 31771 root 24u IPv4 199049 0t0 UDP *:5353
ctrld 31771 root 25u IPv4 197531 0t0 UDP *:5353
ctrld 31771 root 26u IPv4 200003 0t0 UDP *:5353
ctrld 31771 root 27u IPv4 200008 0t0 UDP *:5353
ctrld 31771 root 28u IPv4 200012 0t0 UDP *:5353
ctrld 31771 root 29u IPv4 199057 0t0 UDP *:5353
ctrld 31771 root 30u IPv4 200016 0t0 UDP *:5353
ctrld 31771 root 31u IPv4 200018 0t0 UDP *:5353
ctrld 31771 root 32u IPv4 197542 0t0 UDP *:5353
ctrld 31771 root 37u IPv6 197553 0t0 UDP *:43754
After configuring listener.0 to 10.10.20.1, it appears to work fine but there is no listener on 127.0.0.1 anymore:
root@ubnt:~# lsof -i -P -n | grep 53
ctrld 32005 root 16u IPv4 200381 0t0 UDP 10.10.20.1:53
ctrld 32005 root 17u IPv4 201992 0t0 TCP 10.10.20.1:53 (LISTEN)
ctrld 32005 root 19u IPv4 202991 0t0 UDP *:5353
ctrld 32005 root 20u IPv4 202016 0t0 UDP *:5353
ctrld 32005 root 21u IPv4 202020 0t0 UDP *:5353
ctrld 32005 root 22u IPv4 202025 0t0 UDP *:5353
ctrld 32005 root 23u IPv4 202029 0t0 UDP *:5353
ctrld 32005 root 24u IPv4 200918 0t0 UDP *:5353
ctrld 32005 root 25u IPv4 200388 0t0 UDP *:5353
ctrld 32005 root 26u IPv4 200392 0t0 UDP *:5353
ctrld 32005 root 27u IPv4 202996 0t0 UDP *:5353
ctrld 32005 root 28u IPv4 200929 0t0 UDP *:5353
ctrld 32005 root 29u IPv4 200924 0t0 UDP *:5353
ctrld 32005 root 30u IPv4 200934 0t0 UDP *:5353
ctrld 32005 root 33u IPv6 202103 0t0 UDP *:53028
Is there any need for listening on 127.0.0.1:53?
HTTPS queries are now answered with cache hits, which did not work in v1.2.1. I will let v1.3.0 run overnight and provide memory stats by tomorrow.
Picking of subnets/gateways is outside the scope of ctrld; this is controlled by you on your network. ctrld spawns DNS listeners and uses whatever the default gateway is from its perspective.
dnsmasq listens on 127.0.0.1:53, so ctrld can't listen on that, unless you kill dnsmasq.
I have moved dnsmasq to port 5354:
root@ubnt:~# host verify.controld.com 127.0.0.1
;; WARNING: response timeout for 127.0.0.1@53(UDP)
;; WARNING: response timeout for 127.0.0.1@53(UDP)
;; WARNING: failed to query server 127.0.0.1@53(UDP)
...
Is that an issue?
It would be nice if there were no potentially unwanted listener on a randomly chosen default gateway, but just the 127.0.0.1 listener for the default listener.0 setup on 127.0.0.1. Others can always be manually configured as listener.n.
After changing the port of listener.0 to 5354 in ctrld.toml and moving dnsmasq back to port 53, v1.3.0 behaves like v1.2.1, with no random listener.
Regarding memory, ctrld is in general still slowly growing and heavily expanding during internet outages:
LOG TIMESTAMP PID %MEM VSZ RSS
2023-08-18 10:57 12810 6.4 615264 16340
2023-08-18 10:59 12810 14.7 616032 37352
Used memory increases within 2 minutes to about twice the size.
Do you have caching enabled?
Yes, cache size 1000.
If you disable the cache, is the memory usage more static?
It is about the same without cache.
LOG TIMESTAMP PID %MEM VSZ RSS
2023-08-18 17:07 24574 6.6 615264 16844
2023-08-18 17:10 24574 29.1 616800 73716
Without cache, used memory increased more than 4x during 2 minutes of internet outage.
I have upgraded to ctrld 1.3.1 and the memory leak appears to be fixed. In addition, I have added cron jobs that renew the internet connection and restart ctrld during the night.
Thanks for the great work!
LOG TIMESTAMP PID %CPU %MEM RSS TIME
2023-11-06 17:00 10011 0.4 8.8 22284 00:03:47
2023-11-06 17:15 10011 0.4 9.0 22976 00:03:52
2023-11-06 17:30 10011 0.4 9.0 22996 00:04:01
2023-11-06 17:45 10011 0.4 9.0 22948 00:04:08
2023-11-06 18:00 10011 0.4 9.0 22956 00:04:15
2023-11-06 18:15 10011 0.4 9.2 23512 00:04:21
2023-11-06 18:30 10011 0.4 9.2 23396 00:04:27
2023-11-06 18:45 10011 0.4 9.1 23248 00:04:35
2023-11-06 19:00 10011 0.5 9.1 23176 00:04:41
2023-11-06 19:15 10011 0.5 9.1 23112 00:04:45
2023-11-06 19:30 10011 0.5 9.3 23680 00:04:51
2023-11-06 19:45 10011 0.5 9.0 23036 00:04:56
2023-11-06 20:00 10011 0.5 9.1 23152 00:05:03
2023-11-06 20:15 10011 0.5 9.2 23468 00:05:08
2023-11-06 20:30 10011 0.5 9.2 23320 00:05:13
2023-11-06 20:45 10011 0.5 9.1 23156 00:05:17
2023-11-06 21:00 10011 0.5 9.0 22932 00:05:27
2023-11-06 21:15 10011 0.5 9.0 22896 00:05:32
2023-11-06 21:30 10011 0.5 8.8 22388 00:05:35
2023-11-06 21:45 10011 0.5 8.9 22540 00:05:46
2023-11-06 22:00 10011 0.5 9.1 23192 00:05:51
2023-11-06 22:15 10011 0.5 8.8 22456 00:05:55
2023-11-06 22:30 10011 0.5 9.1 23044 00:06:02
2023-11-06 22:45 10011 0.5 8.9 22752 00:06:06
2023-11-06 23:00 10011 0.5 9.3 23576 00:06:18
2023-11-06 23:15 10011 0.5 8.7 22176 00:06:21
2023-11-06 23:30 10011 0.5 8.8 22452 00:06:24
2023-11-06 23:45 10011 0.5 8.9 22592 00:06:27
2023-11-07 00:00 10011 0.5 8.9 22640 00:06:30
2023-11-07 00:15 10011 0.5 9.0 22980 00:06:34
2023-11-07 00:30 10011 0.5 8.9 22656 00:06:36
2023-11-07 00:45 10011 0.5 8.7 22172 00:06:40
2023-11-07 01:00 10011 0.5 9.1 23132 00:06:43
2023-11-07 01:15 10011 0.5 9.1 23292 00:06:46
2023-11-07 01:30 10011 0.5 9.2 23448 00:06:50
2023-11-07 01:45 10011 0.5 9.0 22980 00:06:53
2023-11-07 02:00 10011 0.5 9.2 23428 00:06:56
2023-11-07 02:15 10011 0.5 8.8 22508 00:06:59
2023-11-07 02:30 10011 0.5 8.9 22596 00:07:02
2023-11-07 02:45 10011 0.5 8.9 22560 00:07:05
2023-11-07 03:00 10011 0.5 9.0 22804 00:07:08
2023-11-07 03:15 10011 0.5 8.7 22264 00:07:11
2023-11-07 03:30 16002 1.6 7.8 19900 00:00:04
2023-11-07 03:45 16002 0.6 8.3 21188 00:00:08
2023-11-07 04:00 16002 0.5 8.3 21096 00:00:10
2023-11-07 04:15 16002 0.4 8.5 21692 00:00:13
2023-11-07 04:30 16002 0.4 8.5 21632 00:00:16
2023-11-07 04:45 16002 0.4 8.5 21564 00:00:19
2023-11-07 05:00 16002 0.3 8.3 21092 00:00:22
2023-11-07 05:15 16002 0.3 8.3 21048 00:00:24
2023-11-07 05:30 16002 0.3 8.2 20924 00:00:27
2023-11-07 05:45 16002 0.3 8.8 22444 00:00:30
2023-11-07 06:00 16002 0.3 8.3 21220 00:00:33
2023-11-07 06:15 16002 0.3 8.6 21852 00:00:36
2023-11-07 06:30 16002 0.3 8.8 22324 00:00:39
2023-11-07 06:45 16002 0.3 8.7 22208 00:00:42
2023-11-07 07:00 16002 0.3 8.5 21544 00:00:49
2023-11-07 07:15 16002 0.3 8.3 21020 00:00:51
2023-11-07 07:30 16002 0.3 8.5 21768 00:00:55
2023-11-07 07:45 16002 0.3 8.4 21288 00:00:58
2023-11-07 08:00 16002 0.3 8.3 21156 00:01:01
2023-11-07 08:15 16002 0.3 8.4 21328 00:01:07
2023-11-07 08:30 16002 0.3 8.4 21364 00:01:10
2023-11-07 08:45 16002 0.3 8.5 21764 00:01:13
2023-11-07 09:00 16002 0.3 8.3 21140 00:01:16
2023-11-07 09:15 16002 0.3 8.5 21544 00:01:18
2023-11-07 09:30 16002 0.3 8.3 21100 00:01:21
2023-11-07 09:45 16002 0.3 8.9 22580 00:01:24
2023-11-07 10:00 16002 0.3 8.2 20988 00:01:27
2023-11-07 10:15 16002 0.3 8.6 21840 00:01:30
2023-11-07 10:30 16002 0.3 8.1 20752 00:01:33
2023-11-07 10:45 16002 0.3 8.5 21668 00:01:36
2023-11-07 11:00 16002 0.3 8.6 21840 00:01:38
2023-11-07 11:15 16002 0.3 8.4 21396 00:01:41
2023-11-07 11:30 16002 0.3 8.6 21964 00:01:44
2023-11-07 11:45 16002 0.3 8.6 21940 00:01:47
2023-11-07 12:00 16002 0.3 8.6 21956 00:01:49
2023-11-07 12:15 16002 0.3 8.3 21180 00:01:52
2023-11-07 12:30 16002 0.3 8.3 21152 00:01:55
2023-11-07 12:45 16002 0.3 8.5 21692 00:01:57
2023-11-07 13:00 16002 0.3 8.3 21196 00:02:00
2023-11-07 13:15 16002 0.3 8.3 21148 00:02:03
2023-11-07 13:30 16002 0.3 8.6 21892 00:02:06
2023-11-07 13:45 16002 0.3 8.4 21424 00:02:09
2023-11-07 14:00 16002 0.3 8.3 21224 00:02:12
2023-11-07 14:15 16002 0.3 8.5 21616 00:02:15
2023-11-07 14:30 16002 0.3 8.3 21220 00:02:19
2023-11-07 14:45 16002 0.3 8.6 21804 00:02:21
2023-11-07 15:00 16002 0.3 8.3 21100 00:02:24
2023-11-07 15:15 16002 0.3 9.0 22932 00:02:30
2023-11-07 15:30 16002 0.3 9.1 23148 00:02:37
2023-11-07 15:45 16002 0.3 9.0 22908 00:02:42
2023-11-07 16:00 16002 0.3 8.7 22084 00:02:47
2023-11-07 16:15 16002 0.3 8.9 22700 00:02:53
2023-11-07 16:30 16002 0.3 8.7 22060 00:03:03
2023-11-07 16:45 16002 0.3 8.7 22076 00:03:10
2023-11-07 17:00 16002 0.4 8.3 21192 00:03:16
@7pps How can you tell it's fixed if you're restarting it every day?
Well, as you can see from the log, PID 16002 started with an RSS of ~21,000 KiB and returns to ~21,000 KiB again and again. It also does so if it runs for more than 24h. In previous versions of ctrld, RSS always grew.
Renewing the internet connection every 24h is just common practice here.
Best Regards!
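Because the log above covers two processes (the PID changes from 10011 to 16002 at the nightly restart), a per-PID summary makes the "returns to the same level" argument easier to check. The sketch below uses an illustrative helper name and only a handful of rows copied from the log, not the full series.

```python
from collections import defaultdict


def rss_range_by_pid(rows):
    """Map each PID to (first, min, max, last) RSS in KiB, keeping row order."""
    by_pid = defaultdict(list)
    for pid, rss in rows:
        by_pid[pid].append(rss)
    return {pid: (v[0], min(v), max(v), v[-1]) for pid, v in by_pid.items()}


# (PID, RSS in KiB) for a few rows of the log above.
rows = [
    (10011, 22284),  # 2023-11-06 17:00
    (10011, 23680),  # 2023-11-06 19:30
    (10011, 22264),  # 2023-11-07 03:15
    (16002, 19900),  # 2023-11-07 03:30, first sample after the restart
    (16002, 23148),  # 2023-11-07 15:30
    (16002, 21192),  # 2023-11-07 17:00
]
summary = rss_range_by_pid(rows)
print(summary)
```

A leaking process would show a last value close to its maximum and well above its first value; in these rows the last value sits well below the maximum for both PIDs.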
I have ctrld installed on EdgeRouter X and it is running fine for "normal" DNS queries, e.g. A records via dig.
However, once you open some websites in browsers like Safari or Chrome, the memory of ctrld starts to grow. I have tried cache sizes 100, 300, 1000.
As far as I can see, this appears to be related to HTTPS (TYPE65) record queries. The used memory of ctrld may easily increase to more than 20% of router memory - until ctrld crashes. If the browsers are configured to use other DNS servers, ctrld runs fine with a very slow memory growth, e.g. DNS for Sonos, Printer, NAS.
Could HTTPS (TYPE65) records be excluded from caching or the cache be checked for memory leaks?
Please test it by pointing Chrome to ctrld for DNS.
Thanks & Best Regards!
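As an illustration of the request above, here is a minimal sketch of a size-bounded LRU cache that is keyed by name and record type and can skip selected types such as HTTPS (TYPE65). It is not ctrld's implementation and does not reflect how its cache works internally; the class and parameter names are hypothetical, and TTL handling is left out entirely.

```python
from collections import OrderedDict

TYPE_HTTPS = 65  # DNS record type number for HTTPS, as named in the report


class BoundedDnsCache:
    """LRU cache keyed by (name, qtype) that never stores excluded record types."""

    def __init__(self, max_entries, excluded_qtypes=()):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._entries = OrderedDict()
        self._max = max_entries
        self._excluded = frozenset(excluded_qtypes)

    def get(self, name, qtype):
        key = (name.lower(), qtype)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, name, qtype, answer):
        if qtype in self._excluded:
            return  # excluded types are never cached
        key = (name.lower(), qtype)
        self._entries[key] = answer
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used

    def __len__(self):
        return len(self._entries)
```

With `BoundedDnsCache(200, {TYPE_HTTPS})`, A and AAAA answers are stored and evicted in LRU order, while HTTPS answers always go to the upstream and the number of entries never exceeds 200.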