The logs are accumulating indefinitely. They went back a month and were full of ~60 error logs per second like this:
[2024-08-26T21:48:43Z ERROR glados_audit::state] Error getting random state root. err=Failed to acquire connection from pool
I just deleted the logs to free up space and make deployments work again. I'm not sure this instance was ever working; it's my first time interacting with it.
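For the record, the space can usually be reclaimed without touching the containers themselves. A minimal sketch, assuming Docker's default json-file logging driver (which matches the log format below) and a hypothetical container name, which would need to be swapped for whatever docker ps actually shows:

# Hypothetical container name; substitute the real one from `docker ps`.
LOG=$(docker inspect --format '{{.LogPath}}' glados-audit)
# Zero the json-file log in place; the container keeps writing to the same file handle.
sudo truncate -s 0 "$LOG"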
The trin logs are also accumulating indefinitely, with a batch like this once every 30 seconds:
{"log":"2024-08-26T21:54:25.513513Z INFO trin_state: reports~ data: radius=0.0000% content=0.0/0mb #=0 disk=0.1mb; msgs: offers=0/0, accepts=0/0, validations=0/0\n","stream":"stdout","time":"2024-08-26T21:54:25.513802105Z"}
{"log":"2024-08-26T21:54:25.513522Z INFO trin_state: reports~ utp: (in/out): active=0 (0/0), success=0 (0/0), failed=0 (0/0) failed_connection=0 (0/0), failed_data_tx=0 (0/0), failed_shutdown=0 (0/0)\n","stream":"stdout","time":"2024-08-26T21:54:25.513806243Z"}
{"log":"2024-08-26T21:54:25.514674Z INFO trin_beacon: reports~ data: radius=0.0000% content=0.0/0mb #=0 disk=0.1mb; msgs: offers=0/0, accepts=0/0, validations=0/0\n","stream":"stdout","time":"2024-08-26T21:54:25.514888242Z"}
{"log":"2024-08-26T21:54:25.514693Z INFO trin_beacon: reports~ utp: (in/out): active=0 (0/0), success=0 (0/0), failed=0 (0/0) failed_connection=0 (0/0), failed_data_tx=0 (0/0), failed_shutdown=0 (0/0)\n","stream":"stdout","time":"2024-08-26T21:54:25.514921703Z"}
{"log":"2024-08-26T21:54:25.514736Z WARN portalnet::overlay::service: No nodes in routing table, find nodes query cannot proceed.\n","stream":"stdout","time":"2024-08-26T21:54:25.514927801Z"}
{"log":"2024-08-26T21:54:25.552064Z INFO trin_history: reports~ data: radius=0.0000% content=0.0/0mb #=0 disk=0.1mb; msgs: offers=0/0, accepts=0/0, validations=0/0\n","stream":"stdout","time":"2024-08-26T21:54:25.552324875Z"}
{"log":"2024-08-26T21:54:25.552092Z INFO trin_history: reports~ utp: (in/out): active=0 (0/0), success=0 (0/0), failed=0 (0/0) failed_connection=0 (0/0), failed_data_tx=0 (0/0), failed_shutdown=0 (0/0)\n","stream":"stdout","time":"2024-08-26T21:54:25.55237521Z"}
Hm, my sense of the chatter is that we do not really expect this glados instance to be working, so I'm not going to do any prevention work right now.
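If someone does pick up the prevention work later, the usual fix is to cap the json-file driver's log size. A rough sketch, assuming the default json-file driver and example limits (note that daemon-wide defaults only apply to containers created after the change):

# Per-container flags; the json-file driver supports max-size / max-file rotation.
docker run --log-opt max-size=50m --log-opt max-file=3 <image>
# Alternatively, set "log-opts" defaults in /etc/docker/daemon.json and restart dockerd;
# existing containers would need to be re-created to pick the defaults up.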
Todo:
First look:
So it looks like the docker containers for trin and glados-audit are eating up the majority of the space, especially glados-audit. For comparison, in the mainnet instance the largest docker container is only ~200MB, compared to 86GB on angelfood.
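For anyone repeating the comparison, a rough sketch of how to attribute disk usage to each container's log file, assuming the default json-file driver and a standard /var/lib/docker layout:

# Print each container's log size alongside its name, largest first.
for id in $(docker ps -aq); do
  printf '%s\t%s\n' "$(sudo du -h "$(docker inspect --format '{{.LogPath}}' "$id")" | cut -f1)" \
                    "$(docker inspect --format '{{.Name}}' "$id")"
done | sort -rh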