SiriDB / siridb-server

SiriDB is a highly scalable, robust and super fast time series database. Built from the ground up, SiriDB uses a unique mechanism to operate without a global index and allows server resources to be added on the fly. SiriDB's unique query language includes dynamic grouping of time series for easy analysis over large amounts of time series.
https://siridb.com
MIT License

Server stuck in libuv #156

Closed ubnt-michals closed 3 years ago

ubnt-michals commented 3 years ago

Some of our customers recently reported the SiriDB server as being stuck. I've managed to get hold of one of those servers and took a core dump.

It looks like the server is unable to accept any connections, and from the backtrace it appears to be stuck somewhere in libuv.

The situation is made more confusing by the fact that the health check at GET /status works fine, so the container is not restarted.

Docker image: https://hub.docker.com/layers/ubnt/unms-siridb/1.3.3/images/sha256-1f194131d97ae00595fccc0e212f0ef46a5b18a41ca530bc2c004b40879cf96f?context=explore
Core dump: https://drive.google.com/file/d/1L-TfgJe4rQImBz6nG4tHvkoc7ILDi4R3/view?usp=sharing
OS: Linux unms 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux
Docker: 20.10.0, build 7287ab3

Core was generated by `siridb-server'.
#0  0x00007f8f2b8a2516 in epoll_pwait () from /lib/ld-musl-x86_64.so.1
[Current thread is 1 (LWP 8)]

#0  epoll_pwait (fd=3, ev=ev@entry=0x7ffc3a951328, cnt=cnt@entry=1024, to=to@entry=215, sigs=sigs@entry=0x0) at ./arch/x86_64/syscall_arch.h:61
#1  0x00007f8f2b8a253d in epoll_wait (fd=<optimized out>, ev=ev@entry=0x7ffc3a951328, cnt=cnt@entry=1024, to=to@entry=215) at src/linux/epoll.c:36
#2  0x00007f8f2b8751fd in uv__io_poll (loop=loop@entry=0x564df7866d00, timeout=215) at src/unix/linux-core.c:307
#3  0x00007f8f2b867d7a in uv_run (loop=0x564df7866d00, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:385
#4  0x0000564df6069d5a in siri_start () at ../src/siri/siri.c:342
#5  0x0000564df601ed7c in main (argc=argc@entry=1, argv=argv@entry=0x7ffc3a954808) at ../main.c:76
#6  0x00007f8f2b89f1ef in libc_start_main_stage2 (main=0x564df601ec10 <main>, argc=1, argv=0x7ffc3a954808) at src/env/__libc_start_main.c:94
#7  0x0000564df601ee15 in _start ()
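
For reference, a backtrace like the one above can be pulled from a core dump with gdb. The paths below are illustrative, not the exact ones used here:

```shell
# Example gdb session against a core dump (adjust binary/core paths)
gdb /usr/bin/siridb-server /path/to/core
(gdb) bt                   # backtrace of the crashed/current thread
(gdb) info threads         # list all threads in the dump
(gdb) thread apply all bt  # backtraces for every thread
```

A backtrace ending in `epoll_pwait` inside `uv_run` by itself just means the libuv event loop is idle waiting for I/O; whether the process is genuinely stuck depends on what the other threads and pending handles look like.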
joente commented 3 years ago

@ubnt-michals , is it possible to reproduce this error? Did you check if the volume has enough free space?

ubnt-michals commented 3 years ago

@joente

is it possible to reproduce this error?

Currently no, I'm still looking for a reproduction, but it happens regularly. Usually the server gets stuck after 2-3 hours.

Did you check if the volume has enough free space?

Free disk space, CPU, and memory look ok.


EDIT:

I take back the part about memory. It looks like there may be a memory leak.

CONTAINER ID   NAME            CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
25a49fe19a9b   unms-siridb     1.56%     1.041GiB / 17.63GiB   5.90%     6.8GB / 2.28GB    1.08GB / 168GB    12

The server takes over 1GB of RAM.

joente commented 3 years ago

@ubnt-michals , do you know approximately how many databases, series and data points you had in SiriDB at the moment it was using 1 GB of RAM? Do you also know what the select queries look like? Are you using tags, groups, regular expressions or fixed names to select your series?

ubnt-michals commented 3 years ago

@joente Thanks. Please disregard the issue. It looks like the database is just very slow, 20+ seconds for a query. We feed it about 1.5-2k points a second from 40+ connections. Since it runs essentially single-threaded, there must be a long backlog, and with slow disk I/O the queries are just slow. It might also explain the gradual RAM increase (the backlog building up).

I'll try to tweak the way we feed the database. Maybe using fewer connections and sending larger chunks, or simply sending less data, will help.
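
For anyone attempting the same, the batching idea can be sketched like this. This is a hypothetical illustration: `chunk_points` and the `max_points` limit are made-up names, and the actual send call would be whatever insert method your SiriDB client library provides.

```python
# Hypothetical sketch: accumulate points per series and send them in
# larger batches over one connection, instead of many tiny writes
# spread across 40+ connections.
from typing import Dict, List, Tuple

Point = Tuple[int, float]  # (timestamp, value)

def chunk_points(series: Dict[str, List[Point]],
                 max_points: int) -> List[Dict[str, List[Point]]]:
    """Split a series -> points mapping into batches of at most
    max_points total points each, preserving point order per series."""
    batches: List[Dict[str, List[Point]]] = []
    current: Dict[str, List[Point]] = {}
    count = 0
    for name, points in series.items():
        for ts, val in points:
            current.setdefault(name, []).append((ts, val))
            count += 1
            if count >= max_points:
                batches.append(current)
                current, count = {}, 0
    if current:
        batches.append(current)
    return batches

# Each batch would then be handed to the client's insert call, e.g.:
#   for batch in chunk_points(pending, max_points=1000):
#       client.insert(batch)   # placeholder for your client's API
```

Larger batches amortize the per-request overhead in the single event-loop thread, at the cost of slightly higher latency for individual points.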

joente commented 3 years ago

@ubnt-michals , have you tried setting the SIRIDB_BUFFER_SYNC_INTERVAL environment variable to something like 500? It might help make the database faster.
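
For anyone landing here later, setting that variable on the container would look something like this. The image tag is the one from the report above, and 500 is the value suggested in this thread; check the SiriDB documentation for the exact semantics of the interval before relying on it:

```shell
# Run the reported image with the buffer sync interval from this thread.
docker run -d \
  -e SIRIDB_BUFFER_SYNC_INTERVAL=500 \
  ubnt/unms-siridb:1.3.3
```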