CZ-NIC / knot-resolver

Knot Resolver - resolve DNS names like it's 2024
https://www.knot-resolver.cz/

Cache fails with SIGBUS during prefill refresh #113

Open alfkonee opened 3 months ago

alfkonee commented 3 months ago

From my logs it would seem that during the prefill refresh, if the zone file download fails, it sometimes causes the cache FS to fail for some unknown reason; see the logs below:

Aug 02 19:18:05 HOSTNAME kresd[584737]: [prefil] root zone file valid for 13 seconds, reusing data from disk
Aug 02 19:18:05 HOSTNAME supervisord[584725]: captured stdio output from kresd1[584737] (stderr):
Aug 02 19:18:05 HOSTNAME kresd[584747]: [prefil] downloading root zone to file root.zone ...
Aug 02 19:18:05 HOSTNAME supervisord[584725]: captured stdio output from kresd1[584737] (stderr): SIGBUS received; this is most likely due to filling up the filesystem where cache resides.
Aug 02 19:18:05 HOSTNAME supervisord[584725]: captured stdio output from kresd1[584737] (stderr):
Aug 02 19:18:05 HOSTNAME supervisord[584725]: captured stdio output from kresd1[584737] (stderr):
Aug 02 19:18:05 HOSTNAME supervisord[584725]: captured stdio output from kresd1[584737] (stderr):
Aug 02 19:18:05 HOSTNAME supervisord[584725]: exited: kresd1 (exit status 121; not expected)
Aug 02 19:18:05 HOSTNAME aiohttp.server[584733]: Error handling request
                                                    Traceback (most recent call last):
                                                      File "/usr/lib64/python3.9/site-packages/aiohttp/web_protocol.py", line 452, in _handle_request
                                                        resp = await request_handler(request)
                                                      File "/usr/lib64/python3.9/site-packages/aiohttp/web_app.py", line 543, in _handle
                                                        resp = await handler(request)
                                                      File "/usr/lib64/python3.9/site-packages/aiohttp/web_middlewares.py", line 114, in impl
                                                        return await handler(request)
                                                      File "/usr/lib64/python3.9/site-packages/knot_resolver_manager/server.py", line 62, in error_handler
                                                        return await handler(request)
                                                      File "/usr/lib64/python3.9/site-packages/knot_resolver_manager/server.py", line 243, in _handler_metrics
                                                        body=await statistics.report_stats(),
                                                      File "/usr/lib64/python3.9/site-packages/knot_resolver_manager/statistics.py", line 338, in report_stats
                                                        await _resolver_collector.collect_kresd_stats()
                                                      File "/usr/lib64/python3.9/site-packages/knot_resolver_manager/statistics.py", line 96, in collect_kresd_stats
                                                        stats_raw = await command_registered_workers(cmd)
                                                      File "/usr/lib64/python3.9/site-packages/knot_resolver_manager/kresd_controller/registered_workers.py", line 34, in command_registered_workers
                                                        pairs = await asyncio.gather(*(single_pair(inst) for inst in _REGISTERED_WORKERS.values()))
                                                      File "/usr/lib64/python3.9/site-packages/knot_resolver_manager/kresd_controller/registered_workers.py", line 32, in single_pair
                                                        return sub.id, await sub.command(cmd)
                                                      File "/usr/lib64/python3.9/site-packages/knot_resolver_manager/kresd_controller/interface.py", line 173, in command
                                                        reader, writer = await asyncio.open_unix_connection(f"./control/{int(self.id)}")
                                                      File "/usr/lib64/python3.9/asyncio/streams.py", line 111, in open_unix_connection
                                                        transport, _ = await loop.create_unix_connection(
                                                      File "/usr/lib64/python3.9/asyncio/unix_events.py", line 244, in create_unix_connection
                                                        await self.sock_connect(sock, path)
                                                      File "/usr/lib64/python3.9/asyncio/selector_events.py", line 500, in sock_connect
                                                        return await fut
                                                      File "/usr/lib64/python3.9/asyncio/selector_events.py", line 505, in _sock_connect
                                                        sock.connect(address)
                                                    ConnectionRefusedError: [Errno 111] Connection refused
Aug 02 19:18:05 HOSTNAME aiohttp.access[584733]: 10.157.2.182 [02/Aug/2024:19:18:05 +0000] "GET /metrics HTTP/1.1" 500 244 "-" "Prometheus/2.53.1"
Aug 02 19:18:06 HOSTNAME supervisord[584725]: spawned: 'kresd1' with pid 596546
Aug 02 19:18:06 HOSTNAME kresd[596546]: [system] path = /run/knot-resolver/control/1
Aug 02 19:18:06 HOSTNAME kresd[596546]: [rules ] cleared 1 stale reader locks
Aug 02 19:18:06 HOSTNAME kresd[584748]: [prefil] root zone file valid for 09 minutes 55 seconds, reusing data from disk
Aug 02 19:18:06 HOSTNAME kresd[584748]: [prefil] empty zone file
Aug 02 19:18:06 HOSTNAME kresd[584748]: [prefil] error parsing zone file `root.zone`
Aug 02 19:18:06 HOSTNAME kresd[584748]: [prefil] root zone import failed (/usr/lib64/knot-resolver/kres_modules/prefill.lua:97: [prefil] zone import failed: cdata<const char *>: 0x7f11ac7c532d), retry in 09 minutes 52 seconds
Aug 02 19:18:08 HOSTNAME kresd[584746]: [rules ] cleared 1 stale reader locks
Aug 02 19:18:08 HOSTNAME kresd[584738]: [prefil] downloading root zone to file root.zone ...
Aug 02 19:18:08 HOSTNAME kresd[584753]: [system] assertion "env->is_cache" failed in cdb_write@../lib/cache/cdb_lmdb.c:689
Aug 02 19:18:08 HOSTNAME kresd[584753]: [cache ] LMDB refusing writes (ignored for 5-9s): Invalid argument
Aug 02 19:18:08 HOSTNAME kresd[596546]: [system] error while loading config: kresd1.conf:263: can't open cache path '/var/cache/knot-resolver'; working directory '/run/knot-resolver'; No space left on device (workdir '/run/knot-resolver')
Aug 02 19:18:08 HOSTNAME kresd[584753]: [system] assertion "env->is_cache" failed in cdb_write@../lib/cache/cdb_lmdb.c:689
Aug 02 19:18:08 HOSTNAME kresd[584753]: [tasign] signalling query triggered: _ta-4f66.
Aug 02 19:18:08 HOSTNAME kresd[584753]: [system] assertion "env->is_cache" failed in cdb_write@../lib/cache/cdb_lmdb.c:689
Aug 02 19:18:08 HOSTNAME kresd[584753]: [system] assertion "env->is_cache" failed in cdb_write@../lib/cache/cdb_lmdb.c:689
Aug 02 19:18:08 HOSTNAME kresd[584753]: [tasign] signalling query triggered: _ta-4f66.
Aug 02 19:18:08 HOSTNAME supervisord[584725]: exited: kresd1 (exit status 1; not expected)

Here is my current config:

rundir: /run/knot-resolver
workers: 16
cache:
  storage: /var/cache/knot-resolver
  size-max: 20G
  prefill:
    - origin: "."
      url: https://www.internic.net/domain/root.zone
      refresh-interval: 6h

logging:
  level: info
  groups:
    - system
    - net
    - dnssec
    - resolv
    - select
    - exterr
    - policy

network:
  listen:
    - interface: lo@53
      kind: dns
    - interface: ens192@53
      kind: dns
    - interface: ens192
      kind: dot

management:
  interface: 10.157.2.180@5000
  # unix-socket: /run/knot-resolver/manager.sock

and here is my filesystem layout:

devtmpfs                    4.0M     0  4.0M   0% /dev
tmpfs                        16G     0   16G   0% /dev/shm
tmpfs                       6.2G  696M  5.5G  12% /run
/dev/mapper/vg_oel-lv_root  246G  5.5G  228G   3% /
/dev/sda2                   960M  418M  543M  44% /boot
/dev/sda1                  1022M  6.3M 1016M   1% /boot/efi
/dev/mapper/vg_oel-lv_var   127G   29G   92G  24% /var
tmpfs                        27G   21G  7.0G  75% /var/cache/knot-resolver
tmpfs                       3.1G     0  3.1G   0% /run/user/1001  
vcunat commented 3 months ago

> this is most likely due to filling up the filesystem where cache resides.

You have 21G used by cache right now? I can't recall ever testing with such a large size. EDIT: oh yeah, 20G size in config.

vcunat commented 3 months ago

Still, the 7G free should be a sufficient cushion. Could it be that you were running out of RAM? (I think that can cause tmpfs to fail to fill a hole in the file and thus SIGBUS in LMDB.)
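
If that RAM-exhaustion theory holds, one possible mitigation is to keep the LMDB cache well below the tmpfs size, so memory pressure cannot leave tmpfs unable to back newly written pages. A minimal sketch based on the config above; the 8G figure is illustrative, not a recommendation from this thread:

cache:
  storage: /var/cache/knot-resolver   # still tmpfs-backed, as in the original setup
  size-max: 8G                        # illustrative only: leave headroom between cache size and tmpfs/RAM limits
  prefill:
    - origin: "."
      url: https://www.internic.net/domain/root.zone
      refresh-interval: 6h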