Closed dagmoller closed 1 year ago
That is quite a few routers for one cache. The RPKI data set has grown quite big, and some of these routers do a full fetch every couple of updates rather than the much cheaper delta fetch.
I don’t think memory (disk or RAM) is an issue here; I rather suspect there is a lot of waiting for I/O going on. Can you perchance share the ratio of the different CPU states (i.e., user, system, I/O wait, etc.), just so we can get an idea of where the time is spent? It’s a bit tricky to set up a test system at your scale …
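For anyone wanting to collect the CPU state ratios asked for above, here is a minimal sketch in Python, assuming a Linux host where the first line of /proc/stat is the aggregate "cpu" line. The function names and the 5-second sampling interval are illustrative choices, not anything Routinator provides; tools such as top or vmstat report the same information.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Return {state: ticks} for a 'cpu ...' line taken from /proc/stat."""
    parts = line.split()
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def state_ratios(before, after):
    """Percentage of CPU time spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values()) or 1  # avoid division by zero
    return {k: 100.0 * d / total for k, d in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate CPU line (the first line of /proc/stat)."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = parse_cpu_line(read_cpu_line())
    time.sleep(5)  # sampling interval, chosen arbitrarily
    second = parse_cpu_line(read_cpu_line())
    for state, pct in state_ratios(first, second).items():
        print(f"{state:8s} {pct:5.1f}%")
```

A high iowait share would point at disk or network waits, while a high user share would suggest the time is spent in the process itself.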
After some time, less than 5 minutes, Routinator crashes:
ago 10 14:39:21 ********* routinator[3014]: [ERROR] Fatal error in RTR server 0.0.0.0:9001.
ago 10 14:39:21 ********* routinator[3014]: [ERROR] Fatal error. Exiting.
Thank you! This looks worrying.
Can you share a few more lines from the log file? Ideally from right before the crash but also from when the high consumption started.
If you don’t want to share the log publicly, you can also send it to martin@nlnetlabs.nl.
There's not much to see.
ii routinator 0.12.1-1bullseye amd64 An RPKI relying party software.
ago 10 14:36:08 ********** routinator[3014]: [WARN] rsync://rpki.ezdomain.ru/repo/: rsync: getaddrinfo: rpki.ezdomain.ru 873: Name or service not known
ago 10 14:36:08 ********** routinator[3014]: [WARN] rsync://rpki.ezdomain.ru/repo/: rsync error: error in socket IO (code 10) at clientserver.c(137) [Receiver=3.2.3]
ago 10 14:36:08 ********** routinator[3014]: [WARN] rsync://rpki.ezdomain.ru/repo/localname/1/1C36AAFD62454C67A3AE94C60A9AAE465A9CAFE2.mft: no valid manifest found.
ago 10 14:37:11 ********** routinator[3014]: [WARN] RRDP https://rpki.folf.systems/rrdp/notification.xml: error sending request for url (https://rpki.folf.systems/rrdp/notification.xml): error trying to connect: tcp connect error: Connection timed out (os error 110)
ago 10 14:37:26 ********** routinator[3014]: [WARN] rsync://rpki.folf.systems/repo/: rsync error: timeout waiting for daemon connection (code 35) at socket.c(278) [Receiver=3.2.3]
ago 10 14:37:26 ********** routinator[3014]: [WARN] rsync://rpki.folf.systems/repo/Folf-Systems/0/E883D1D2313A14E8659F604A65D65CE39A3F826B.mft: no valid manifest found.
ago 10 14:39:21 ********** routinator[3014]: [ERROR] Fatal error in RTR server 0.0.0.0:9001.
ago 10 14:39:21 ********** routinator[3014]: [ERROR] Fatal error. Exiting.
ago 10 14:39:21 ********** systemd[1]: routinator.service: Main process exited, code=exited, status=1/FAILURE
ago 10 14:39:21 ********** systemd[1]: routinator.service: Failed with result 'exit-code'.
ago 10 14:39:21 ********** systemd[1]: routinator.service: Consumed 6min 39.615s CPU time.
ago 10 14:39:21 ********** systemd[1]: routinator.service: Scheduled restart job, restart counter is at 3.
ago 10 14:39:21 ********** systemd[1]: Stopped Routinator 3000.
ago 10 14:39:21 ********** systemd[1]: routinator.service: Consumed 6min 39.615s CPU time.
ago 10 14:39:21 ********** systemd[1]: Starting Routinator 3000...
ago 10 14:39:21 ********** systemd[1]: Started Routinator 3000.
ago 10 14:39:36 ********** routinator[3163]: [WARN] RRDP https://rrdp.afrinic.net/notification.xml: Getting notification file failed with status 204 No Content
ago 10 14:39:49 ********** routinator[3163]: [WARN] RRDP https://repo.kagl.me/rpki/notification.xml: error sending request for url (https://repo.kagl.me/rpki/notification.xml): error trying to connect: invalid peer certificate contents: invalid peer certificate: CertExpired
I've been running some tests with rpki-rtr-client, and just before Routinator crashes, rpki-rtr-client keeps receiving data forever...
Forget all about that... all the routers were misconfigured with "response-time 30". Sorry!!! They are now set to "response-time 300", and the high CPU usage normalizes after some time...
I have high CPU usage (400%) when 900 routers (Cisco and Nokia) are sending requests to the RTR server; the network throughput is about 230 Mbit/s. My VM has 4 vCPUs (Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz) and 6 GB RAM. The cache is on a tmpfs of size 4 GB and is using 1.8 GB.
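As a rough back-of-the-envelope check on the figures in this thread, the sketch below estimates how much data each router could receive within its response-time window. It assumes the 230 Mbit/s is split evenly across all 900 routers and that "response-time" bounds how long a router waits for a complete response; both are assumptions for illustration, not something confirmed in the thread. The function name is hypothetical.

```python
def per_router_bytes(total_mbit_per_s, routers, window_s):
    """Bytes one router can receive within window_s, assuming an even split."""
    bytes_per_s = total_mbit_per_s * 1_000_000 / 8  # Mbit/s -> bytes/s
    return bytes_per_s / routers * window_s


if __name__ == "__main__":
    # Figures from the report: 230 Mbit/s total, 900 routers.
    for window in (30, 300):
        budget = per_router_bytes(230, 900, window)
        print(f"response-time {window:3d}: ~{budget / 1e6:.2f} MB per router")
```

Under these assumptions each router gets just under 1 MB in a 30-second window and about 9.6 MB in a 300-second window, which would be consistent with full fetches failing to finish in time at the shorter setting and being retried.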