LINBIT / linstor-server

High Performance Software-Defined Block Storage for container, cloud and virtualisation. Fully integrated with Docker, Kubernetes, Openstack, Proxmox etc.
https://docs.linbit.com/docs/linstor-guide/
GNU General Public License v3.0
linstor-satellite restart leads to linstor-controller overutilization #391

Open ddpolyakov opened 9 months ago

ddpolyakov commented 9 months ago

Hi! I'm using Linstor version 1.25.1 with etcd on separate nodes as the database. There are around 100 diskless nodes and 10 storage nodes, with around 1.5K resources in total. Every time I restart a satellite (any of them), the linstor controller goes mad, eating every CPU possible via threads. Stracing the controller shows tons of futex calls across all the spawned threads:

[pid 1910062] futex(0x7f82495fd77c, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910061] futex(0x7f82495fa0c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910060] futex(0x7f82495f8678, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910059] futex(0x7f82495f6a68, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910058] futex(0x7f82495f4c98, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910057] futex(0x7f82495f2ed8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910056] futex(0x7f82495f12c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910055] futex(0x7f82495ef518, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910054] futex(0x7f82495ed908, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910053] futex(0x7f82495ebcf8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 1910052] futex(0x7f82495ea0e8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>

Attached (image): htop output on the controller server during a linstor-satellite restart.

ghernadi commented 9 months ago

Please upgrade to at least 1.26.1 and see if this issue persists. In that version we tried to fix a bug that could lead to this kind of behavior.

ddpolyakov commented 9 months ago

Updated to 1.26.2, and the behaviour is the same.

ddpolyakov commented 9 months ago

The same thing happens when using a Galera cluster (mysqld) as the database.

ddpolyakov commented 9 months ago

Switching back to H2 seems to resolve the problem.

ghernadi commented 9 months ago

If this is reproducible and you are willing to test this further, can you trigger the controller into this state and poke it a few times with kill -3 <pid_of_controller_java_process> and get me an SOS report? kill -3 causes the JVM to print a thread-dump to its stdout (which is usually captured by journalctl, which is then collected via LINSTOR's SOS report). If possible, run the kill -3 a few times, so we have a chance to see what the Threads are doing.
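The repeated kill -3 step could be scripted roughly as follows. This is only a sketch, not something provided by LINSTOR: the function name `request_thread_dumps`, the default count and the interval are all assumptions chosen for illustration. It sends SIGQUIT (the signal behind kill -3) to the given PID several times with a pause in between, so the JVM prints several thread dumps to its stdout.

```python
import os
import signal
import time
from typing import Callable


def request_thread_dumps(
    pid: int,
    count: int = 5,
    interval_s: float = 2.0,
    send: Callable[[int, int], None] = os.kill,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send SIGQUIT (what `kill -3` sends) to `pid`, `count` times.

    `send` and `sleep` are injectable only to make the helper easy to test;
    by default they are os.kill and time.sleep. Returns the number of
    signals sent. The dumps themselves appear on the JVM's stdout
    (e.g. in the journal), not in this script's output.
    """
    sent = 0
    for i in range(count):
        send(pid, signal.SIGQUIT)
        sent += 1
        # Pause between dumps so they capture different moments in time.
        if i < count - 1:
            sleep(interval_s)
    return sent


# Hypothetical usage, with the PID of the controller's java process:
#   request_thread_dumps(12345, count=5, interval_s=2.0)
```

Spacing the signals a couple of seconds apart makes it easier to tell threads that are permanently busy from threads that just happened to be running at one instant.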

Additionally, you could activate TRACE logging for the controller and then trigger this behavior. Feel free to send the resulting SOS report to the email address in my profile.

ddpolyakov commented 8 months ago

Here is my SOS report. I ran kill -3 a few times right after restarting all satellites. Same picture: the controller ate all the CPU.

amykhalskyi commented 5 months ago

After updating to 1.27.1 with a MariaDB backend, we still see this issue. Sometimes, after a restart of a linstor satellite or a crash of a node running a satellite, the linstor controller gets stuck with very high CPU consumption and doesn't respond to any command. I have attached a JVM thread dump and a screenshot of perf top: linstor_stuck_02072024_dump.gz, perf_top
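For anyone reading such a thread dump, a quick first pass is to tally how many threads are in each state. The sketch below is not part of LINSTOR; the helper name `count_thread_states` is made up for illustration. It assumes the usual HotSpot dump format, where each thread's stack is preceded by a `java.lang.Thread.State: ...` line.

```python
import re
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: WAITING (parking)"
# and captures only the state name (RUNNABLE, WAITING, TIMED_WAITING, ...).
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)", re.MULTILINE)


def count_thread_states(dump_text: str) -> Counter:
    """Return a Counter mapping thread state name to number of threads."""
    return Counter(STATE_RE.findall(dump_text))


# Hypothetical usage on an already decompressed dump file:
#   with open("thread_dump.txt", encoding="utf-8", errors="replace") as fh:
#       print(count_thread_states(fh.read()).most_common())
```

A large number of RUNNABLE threads that stays the same across several dumps would be consistent with the CPU saturation described above, whereas mostly WAITING or BLOCKED threads would point towards lock contention instead.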