cloudamqp / lavinmq

Lightweight and fast AMQP (0-9-1) server
https://lavinmq.com
Apache License 2.0

server locks up during throughput load test #271

Closed: dch closed this issue 3 years ago

dch commented 3 years ago

issue

environment

reproduce

# pkg bootstrap -f
# pkg install -y security/sudo
# pkg add https://pkg.cabal5.net/FreeBSD:13:amd64/All/avalanchemq-1.0.0.a.31.txz
### amend /usr/local/etc/avalanchemq/config.ini as required
# sudo -u avalanchemq avalanchemq --config=/usr/local/etc/avalanchemq/config.ini
### in a new terminal
# avalanchemqperf throughput
... wait a bit

Now try to control-C in both windows, stopping avalanchemqperf first. I regularly get hangs on either end.
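For anyone without avalanchemqperf to hand, the publisher side of the load can be approximated with the amqp-client shard. This is only a rough sketch (queue name and payload size are made up; it is not what the perf tool literally does):

require "amqp-client"

# Publish to a single queue as fast as possible, roughly the publisher
# half of `avalanchemqperf throughput`.
AMQP::Client.start("amqp://guest:guest@100.64.59.243/") do |conn|
  ch = conn.channel
  q = ch.queue("perf-test")   # hypothetical queue name
  payload = Bytes.new(16)     # hypothetical message size
  loop { q.publish(payload) }
end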

dtruss of processes shows:

2021-08-07 09:33:56 +00:00 [INFO] amqpserver: Updated permissions to vhost=/ for user=__direct
2021-08-07 09:33:56 +00:00 [WARN] : Certificate for AMQPS not configured
2021-08-07 09:33:56 +00:00 [INFO] httpserver: Bound to 100.64.59.243:15672
2021-08-07 09:33:56 +00:00 [INFO] httpserver: Bound to /var/run/avalanchemq/http.sock
2021-08-07 09:33:56 +00:00 [INFO] httpserver: Bound to /tmp/avalanchemq/http.sock
2021-08-07 09:33:56 +00:00 [INFO] amqpserver: Listening on 100.64.59.243:5672
2021-08-07 09:33:56 +00:00 [INFO] amqpserver: Listening on /var/run/avalanchemq/avalanchemq.sock
2021-08-07 09:34:03 +00:00 [INFO] vhost=/ client=100.64.59.243:55121: Connection (100.64.59.243:55121 -> 100.64.59.243:5672) established for user=guest
2021-08-07 09:34:03 +00:00 [INFO] vhost=/ client=100.64.59.243:55122: Connection (100.64.59.243:55122 -> 100.64.59.243:5672) established for user=guest
2021-08-07 09:34:58 +00:00 [INFO] vhost=/: GC segments, collecting sps used 0B memory
2021-08-07 09:34:59 +00:00 [INFO] vhost=/: GC segments, garbage collecting used 0B memory
2021-08-07 09:34:59 +00:00 [INFO] vhost=/: GC segments, garbage collecting took 417.123632 ms
2021-08-07 09:34:59 +00:00 [INFO] vhost=/: GC segments, compact internal queues used 0B memory
2021-08-07 09:34:59 +00:00 [INFO] vhost=/: GC segments, GC collect used 0B memory
load: 7.02  cmd: avalanchemq 77562 [running] 128.34r 69.19u 38.70s 100% 4470060k
sys_recvfrom+0x86 amd64_syscall+0x10c fast_syscall_common+0xf8 
GC Warning: Repeated allocation of very large block (appr. size 3221229568):
        May lead to memory leak and poor performance
2021-08-07 09:38:33 +00:00 [WARN] vhost=/ client=100.64.59.243:55122 channel=1: Error when publishing message #<OverflowError:Arithmetic overflow>
2021-08-07 09:38:33 +00:00 [ERROR] vhost=/ client=100.64.59.243:55122: Unexpected error, while reading: Arithmetic overflow (OverflowError)
  from ???
  from ???
  from ???
  from ???
  from ???
  from ???
  from ???
  from ???
  from ???

2021-08-07 09:38:33 +00:00 [INFO] vhost=/ client=100.64.59.243:55122: Closing, INTERNAL_ERROR - Arithmetic overflow
2021-08-07 09:38:33 +00:00 [INFO] vhost=/ client=100.64.59.243:55122: Connection=100.64.59.243:55122 -> 100.64.59.243:5672 disconnected
carlhoerberg commented 3 years ago

Thank you for reporting! Is this using crystal 1.1.1?

dch commented 3 years ago

On Wed, 11 Aug 2021, at 13:05, Carl Hörberg wrote:

Thank you for reporting! Is this using crystal 1.1.1?

yep, on both aarch64 & x86:

$ crystal --version
Crystal 1.1.1 ()

LLVM: 10.0.1
Default target: x86_64-portbld-freebsd13.0
...

Default target: aarch64-portbld-freebsd14.0

We are building with these shards options:

SHARDS_ENV= --release --production --local --no-color --progress --stats --static

I've just seen https://github.com/crystal-lang/crystal/issues/10084 and https://github.com/crystal-lang/crystal/issues/9645 so far, but I haven't worked out whether they have any implications here.
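For context on the error itself (this is general Crystal semantics, not necessarily the server's actual code path): Crystal uses checked integer arithmetic, so any operation that exceeds the target type's range raises OverflowError instead of silently wrapping, and the wrapping operators (&+, &-, &*) opt out of that check. A minimal illustration:

x = UInt32::MAX
puts(x &+ 1_u32)  # => 0, wrapping operator, no exception
begin
  x + 1_u32       # checked addition overflows UInt32
rescue ex : OverflowError
  puts ex.message # => "Arithmetic overflow"
end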

antondalgren commented 3 years ago

Hi @dch, I might have fixed this issue in #273. Would you mind checking whether you still see this error with the latest version on the main branch?

dch commented 3 years ago

:rocket: :rabbit2: Looks much better! The server doesn't hang and shuts down cleanly, but avalanchemqperf still gets wedged.

BTW, I just found the -z <seconds> flag, and this behaviour recurs even at just 5 seconds, using a tmpfs. Details below. If you want me to re-open this to dig into the avalanchemqperf problem, just say so.

/usr/local/bin/avalanchemqperf throughput --uri=amqp://guest:guest@100.64.217.185/
load: 6.11  cmd: avalanchemqperf 35703 [running] 140.40r 24.46u 76.71s 76% 8388k
tcp_output+0x638 tcp_usr_send+0x229 sosend_generic+0x633 sosend+0x50 kern_sendit+0x20e sendit+0x1db sys_sendto+0x4d amd64_syscall+0x10c fast_syscall_common+0xf8 
^C
^C^C
load: 8.92  cmd: avalanchemqperf 35703 [runnable] 376.92r 60.08u 205.40s 66% 8388k
mi_switch+0xc1 ipi_bitmap_handler+0x87 Xipi_intr_bitmap_handler+0xb3 
load: 8.92  cmd: avalanchemqperf 35703 [running] 380.06r 60.50u 207.12s 68% 8388k
__mtx_lock_sleep+0xe8 amd64_syscall+0x887 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [running] 381.05r 60.65u 207.69s 68% 8388k
__mtx_lock_sleep+0xe8 amd64_syscall+0x887 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [running] 381.45r 60.72u 207.93s 71% 8388k
__mtx_lock_sleep+0xe8 amd64_syscall+0x7d9 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [running] 381.65r 60.73u 208.09s 73% 8388k
__mtx_lock_sleep+0xe8 amd64_syscall+0x887 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [running] 381.84r 60.75u 208.20s 74% 8388k
__mtx_lock_sleep+0xe8 amd64_syscall+0x887 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [running] 382.01r 60.76u 208.31s 69% 8388k
systrace_probe+0xed amd64_syscall+0x860 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [running] 382.17r 60.82u 208.37s 70% 8388k
__mtx_lock_sleep+0xe8 amd64_syscall+0x7d9 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [running] 382.82r 60.88u 208.76s 74% 8388k
__mtx_lock_sleep+0xe8 amd64_syscall+0x7d9 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [runnable] 383.00r 60.89u 208.88s 69% 8388k
mi_switch+0xc1 spinlock_exit+0x3d tcp_usr_send+0x850 sosend_generic+0x633 sosend+0x50 kern_sendit+0x20e sendit+0x1db sys_sendto+0x4d amd64_syscall+0x755 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [running] 383.17r 60.91u 208.98s 70% 8388k
__mtx_lock_sleep+0xe8 amd64_syscall+0x7d9 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [running] 383.37r 60.93u 209.10s 71% 8388k
__mtx_lock_sleep+0xe8 amd64_syscall+0x887 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [running] 384.39r 61.04u 209.71s 72% 8388k
__mtx_lock_sleep+0xe8 amd64_syscall+0x887 fast_syscall_common+0xf8 
load: 8.93  cmd: avalanchemqperf 35703 [*process lock] 384.58r 61.05u 209.82s 73% 8388k
mi_switch+0xc1 __mtx_lock_sleep+0x17c amd64_syscall+0x887 fast_syscall_common+0xf8 

After the first ^C the server has decided the connection is closed, but avalanchemqperf is still trying to send traffic (from my reading of dtruss there).
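That reading is consistent with the sosend/tcp_usr_send frames in the output above: a writer whose peer keeps the connection open but stops draining it eventually stalls in send once both socket buffers fill. A hypothetical, self-contained Crystal sketch of that behaviour (nothing to do with the broker's own code):

require "socket"

# Accept a connection but never read from it, then write until the kernel
# buffers on both ends are full; the write loop then stalls, analogous to
# the blocked sends shown above.
server = TCPServer.new("127.0.0.1", 0)
client = TCPSocket.new("127.0.0.1", server.local_address.port)
sock = server.accept            # accepted, but we never call read on it
payload = Bytes.new(64 * 1024)
loop { client.write(payload) }  # stalls once both socket buffers are full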

The server reports:

FD limit: 65000
2021-08-26 16:03:50 +00:00 [INFO] amqpserver: Updated permissions to vhost=/ for user=__direct
2021-08-26 16:03:50 +00:00 [WARN] : Certificate for AMQPS not configured
2021-08-26 16:03:50 +00:00 [INFO] httpserver: Bound to 100.64.217.185:15672
2021-08-26 16:03:50 +00:00 [INFO] httpserver: Bound to /var/run/avalanchemq/http.sock
2021-08-26 16:03:50 +00:00 [INFO] httpserver: Bound to /tmp/avalanchemq/http.sock
2021-08-26 16:03:50 +00:00 [INFO] amqpserver: Listening on 100.64.217.185:5672
2021-08-26 16:03:50 +00:00 [INFO] amqpserver: Listening on /var/run/avalanchemq/amqp.sock
2021-08-26 16:05:07 +00:00 [INFO] vhost=/ client=100.64.217.185:55092: Connection (100.64.217.185:55092 -> 100.64.217.185:5672) established for user=guest
2021-08-26 16:05:07 +00:00 [INFO] vhost=/ client=100.64.217.185:55093: Connection (100.64.217.185:55093 -> 100.64.217.185:5672) established for user=guest
2021-08-26 16:05:44 +00:00 [INFO] vhost=/: GC segments, collecting sps used 0B memory
2021-08-26 16:05:45 +00:00 [INFO] vhost=/: GC segments, garbage collecting used 0B memory
2021-08-26 16:05:45 +00:00 [INFO] vhost=/: GC segments, garbage collecting took 396.121852 ms
2021-08-26 16:05:45 +00:00 [INFO] vhost=/: GC segments, compact internal queues used 0B memory
2021-08-26 16:05:45 +00:00 [INFO] vhost=/: GC segments, GC collect used 0B memory
GC Warning: Repeated allocation of very large block (appr. size 3221229568):
        May lead to memory leak and poor performance
2021-08-26 16:09:03 +00:00 [WARN] vhost=/ client=100.64.217.185:55093 channel=1: Error when publishing message #<OverflowError:Arithmetic overflow>
2021-08-26 16:09:03 +00:00 [ERROR] vhost=/ client=100.64.217.185:55093: Unexpected error, while reading: Arithmetic overflow (OverflowError)
  from ???
  from ???
  from ???
  from ???
  from ???
  from ???
  from ???
  from ???
  from ???

2021-08-26 16:09:03 +00:00 [INFO] vhost=/ client=100.64.217.185:55093: Closing, INTERNAL_ERROR - Arithmetic overflow
2021-08-26 16:09:03 +00:00 [INFO] vhost=/ client=100.64.217.185:55093: Connection=100.64.217.185:55093 -> 100.64.217.185:5672 disconnected

However, the server seems to have no trouble shutting down:

^CShutting down gracefully...
2021-08-26 16:13:48 +00:00 [INFO] vhost=/ client=100.64.217.185:55092: Closing, Broker shutdown
2021-08-26 16:13:48 +00:00 [INFO] vhost=/ client=100.64.217.185:55093: Closing, Broker shutdown
2021-08-26 16:13:58 +00:00 [INFO] vhost=/ client=100.64.217.185:55092: Connection (100.64.217.185:55092 -> 100.64.217.185:5672) disconnected for user=guest 
2021-08-26 16:13:58 +00:00 [INFO] vhost=/ client=100.64.217.185:55093: Connection (100.64.217.185:55093 -> 100.64.217.185:5672) disconnected for user=guest 
2021-08-26 16:14:00 +00:00 [INFO] vhost=/: Compacting definitions
Fibers: 
#<Fiber:0x800c5ef20: main>
#<Fiber:0x800c5ee70: Fiber Clean Loop>
#<Fiber:0x800c5edc0: Signal Loop>
#<Fiber:0x800c5ed10>
#<Fiber:0x800c5ea50: RoughTime>
#<Fiber:0x800c5e9a0>
#<Fiber:0x800d69370: VHost//#gc_segments_loop>
#<Fiber:0x863114160>
%