Open teunvink opened 2 years ago
My first thoughts here are:
Logs may be a bit misleading - I don't know if we log a request upon receiving or upon answering it (which may not be reached). We can definitely add logging at request receive time though.
I'll also add https://github.com/irrdnet/irrd/pull/721 to this project so we can get Python tracebacks while the process is running.
We're still experiencing this at least once or twice a month. Last occurence was today. CPU load graphs show 100% CPU utilisation for a long time before the process was killed:
More statistics can be found here: https://irrexplorer.nlnog.net/munin/localhost/localhost/index.html
A suspicion we have, is that although we see 'breaking' in the output, the hanging process isn't aborted.
@mxsasha: do you have any additional steps on how to investigate this better next time it occurs?
Last entries before the latest crash:
INFO: xx.xx.xx.xx:0 - "GET /api/sets/member-of/AS-ECXON HTTP/1.0" 200 OK
starting step 4 with 63 items to resolve, 51 already done
starting step 1 with 1 items to resolve, 1 already done
starting step 2 with 89 items to resolve, 1 already done
starting step 6 with 374 items to resolve, 1048 already done
breaking
No clear answer yet, but ct does seem to be somewhere in the set resolving. #220 adds a few extra debug lines, and also responding to SIGUSR1 by printing a traceback. So if you do that on a hanging process, it should tell us where the process is hanging. And the new lines logged may have answers about why.
INFO: a.b.c.d:0 - "GET /api/prefixes/asn/AS16839 HTTP/1.0" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/home/irrexplorer/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 980, in _wrap_create_connection
return await self._loop.create_connection(*args, **kwargs) # type: ignore[return-value] # noqa
File "uvloop/loop.pyx", line 2039, in create_connection
File "uvloop/loop.pyx", line 2016, in uvloop.loop.Loop.create_connection
ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/irrexplorer/venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 435, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/irrexplorer/venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
return await self.app(scope, receive, send)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
raise exc
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
await self.app(scope, receive, send)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__ [0/1913]
await self.app(scope, receive, send)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/routing.py", line 733, in __call__
await route.handle(scope, receive, send)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/routing.py", line 285, in handle
await self.app(scope, receive, send)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 57, in wrapped_app
raise exc
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 46, in wrapped_app
await app(scope, receive, sender)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/starlette/routing.py", line 69, in app
response = await func(request)
File "/home/irrexplorer/irrexplorer/irrexplorer/api/queries.py", line 98, in prefixes_asn
asn_prefixes = await PrefixCollector(request.app.state.database).asn_summary(
File "/home/irrexplorer/irrexplorer/irrexplorer/api/collectors.py", line 57, in asn_summary
aggregates = await self._collect_aggregate_prefixes_for_asn(asn)
File "/home/irrexplorer/irrexplorer/irrexplorer/api/collectors.py", line 104, in _collect_aggregate_prefixes_for_asn
routes_irrd, routes_bgp = await _execute_tasks(tasks)
File "/home/irrexplorer/irrexplorer/irrexplorer/api/collectors.py", line 254, in _execute_tasks
return await asyncio.gather(*tasks)
File "/home/irrexplorer/irrexplorer/irrexplorer/backends/irrd.py", line 147, in query_asn
result = await session.execute(GQL_QUERY_ASN, {"asn": asn})
File "/home/irrexplorer/venv/lib/python3.10/site-packages/gql/client.py", line 1629, in execute
result = await self._execute(
File "/home/irrexplorer/venv/lib/python3.10/site-packages/gql/client.py", line 1535, in _execute
result = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/home/irrexplorer/venv/lib/python3.10/site-packages/gql/transport/aiohttp.py", line 308, in execute
async with self.session.post(self.url, ssl=self.ssl, **post_args) as resp:
File "/home/irrexplorer/venv/lib/python3.10/site-packages/aiohttp/client.py", line 1167, in __aenter__
self._resp = await self._coro
File "/home/irrexplorer/venv/lib/python3.10/site-packages/aiohttp/client.py", line 562, in _request
conn = await self._connector.connect(
File "/home/irrexplorer/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 540, in connect
proto = await self._create_connection(req, traces, timeout)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 901, in _create_connection
_, proto = await self._create_direct_connection(req, traces, timeout)
File "/home/irrexplorer/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 1209, in _create_direct_connection
raise last_exc
File "/home/irrexplorer/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 1178, in _create_direct_connection
transp, proto = await self._wrap_create_connection(
File "/home/irrexplorer/venv/lib/python3.10/site-packages/aiohttp/connector.py", line 988, in _wrap_create_connection
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host localhost:4380 ssl:False [Connection refused]
It seems irrd had crashed:
root@irrexplorer0:~# ps aux|grep irrd
root 3966033 0.0 0.0 9080 2424 pts/0 S+ 19:01 0:00 grep --color=auto irrd
Restarting irrd made irrexplorer function again, so it seems in this case ('internal server error was served') the problem was irrd was unavailable.
Discussion: this is likely an issue that IRRexplorer does not handle IRRD query timeouts well and/or if IRRD closes the socket. IRRexplorer ends up looping forever on CPU.
Once in a while the IRR Explorer front end hangs, the only way to fix this is to kill the process and restart it. We need more information on what causes this (logs,stacktraces, memory/cpu usage, etc). We can collect the data in this issue, so we can find a cause (and solution).