MongooseIM is Erlang Solutions' robust, scalable and efficient XMPP server, aimed at large installations. Specifically designed for enterprise purposes, it is fault-tolerant and can utilise the resources of multiple clustered machines.
GraphQL SSE crashes on large message stanza payload #4208
Context
I am currently running MongooseIM in a Kubernetes cluster (AWS EKS) using the official Helm chart from ESL/MongooseHelm to provide a chat feature for my application. The client binds to the chat over WebSocket. To make the chat interactive, a system bot interacts with the chat over a c2s XML stream, consuming and responding to user messages. Recently, I started migrating to the GraphQL API to unify the communication approach on both sides and to simplify the client implementation by separating message consumption from chat actions.
Issue
Since migrating to GraphQL, I have encountered a problem with the SSE stream in MongooseIM. It occurs when a large message stanza payload (~2000 characters) is broadcast via the GraphQL SSE stream: the SSE stream crashes and the EventSource is closed, but the user's session becomes stuck in a live state (it cannot be kicked) for an indefinite period. Subsequent reconnection attempts do not resolve the issue, and SSE does not work (no messages are broadcast) for the affected user until the server is restarted. Meanwhile, WS and c2s do not crash on such messages and continue to work while the SSE stream is broken.
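For reference, the failing subscription can be exercised with a plain SSE client. The sketch below is a minimal, hypothetical reproduction: the endpoint URL, port, basic-auth scheme, and the GET-with-query-parameter transport are deployment-specific assumptions, not details from this report; only the `stanza.subscribeForMessages` field and its selection set come from the attached log.

```python
# Minimal SSE-subscription sketch. ASSUMPTIONS: the GraphQL user API URL,
# port, and basic-auth scheme below are guesses for illustration only;
# adjust them to your deployment.
import base64
import urllib.parse
import urllib.request

GRAPHQL_SSE_ENDPOINT = "https://localhost:5561/api/graphql"  # assumption

# Subscription field taken from the attached log: stanza.subscribeForMessages
SUBSCRIPTION = (
    "subscription { stanza { subscribeForMessages "
    "{ timestamp stanza_id stanza sender } } }"
)

def sse_url(endpoint: str, query: str) -> str:
    """Encode the GraphQL subscription as a query parameter of a GET request."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})

def subscribe(user: str, password: str) -> None:
    creds = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        sse_url(GRAPHQL_SSE_ENDPOINT, SUBSCRIPTION),
        headers={"Accept": "text/event-stream",
                 "Authorization": f"Basic {creds}"},
    )
    # Server-Sent Events: keep the response open and read event lines as they arrive.
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.decode("utf-8").rstrip("\n")
            if line:
                print(line)  # stream dies once a ~2000-char body is broadcast
```

With a client like this subscribed, sending a groupchat message with a body of 2000+ characters reproduces the crash described above.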
Additional Information:
I intentionally use the latest MongooseIM version from the master branch (Docker image latest tag), because I was struggling with SSE connection timeouts, which were addressed by merging the following PR: https://github.com/esl/MongooseIM/pull/4196.
I rely on mod_muc with the mnesia backend and mod_mam with the rdbms backend.
The GraphQL SSE stream usually fails on message bodies of 1600+ characters, and always on 2000+.
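The threshold above was found by sweeping payload sizes. A sketch of how such test payloads can be generated follows; the helper names (`make_body`, `make_stanza`) are illustrative, and the stanza layout is modelled on the one in the attached log, not on any MongooseIM API.

```python
# Payload-sweep sketch. ASSUMPTION: helper names and the exact stanza layout
# are illustrative only, modelled on the stanza captured in mongooseim.json.1.
from xml.sax.saxutils import escape

def make_body(n: int, prefix: str = "test") -> str:
    """Build a message body of exactly n characters, e.g. 'test 1600 ttt...'."""
    head = f"{prefix} {n} "
    return head + "t" * (n - len(head))

def make_stanza(body: str, to: str, frm: str, msg_id: str) -> str:
    """Wrap a body in a groupchat <message/> stanza like the one in the log."""
    return (
        f"<message id='{msg_id}' to='{to}' from='{frm}' type='groupchat'>"
        f"<body>{escape(body)}</body></message>"
    )

# Reported behaviour: intermittent SSE failures from ~1600 chars, guaranteed at 2000+.
for size in (1500, 1600, 2000):
    stanza = make_stanza(make_body(size),
                         "main_thread_user+436656@conf.xmpp-dev",
                         "alice@xmpp-dev/resource",
                         "a99f7e0ee96f429bbeabb6e5b06fdcc9")
    assert len(make_body(size)) == size
```

Sending each generated stanza (via any XMPP client or the GraphQL message-sending mutation) while an SSE subscriber is connected shows the stream surviving the 1500-char body but failing at 2000.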
Attachments:
mongooseim.json.1
{"when":"2024-01-11T12:09:41.025376+00:00","what":"mongoose_graphql_sse_handler_failed","text":{"id":"28","data":["{\"data\":{\"stanza\":{\"subscribeForMessages\":{\"timestamp\":\"2024-01-11T12:09:41.025198Z\",\"stanza_id\":\"a99f7e0ee96f429bbeabb6e5b06fdcc9\",\"stanza\":\"<message id='a99f7e0ee96f429bbeabb6e5b06fdcc9' to='main_thread_user+436656@conf.xmpp-dev' from='alice@xmpp-dev/1704-974935-740798-752907a8916cb029' type='groupchat'><body>test 1600 ttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
ttttttttttttttttttttttttttttttttttttttttttttt</body></message>\",\"sender\":","\"main_thread_user+436656@conf.xmpp-dev/Alice\"}}}}"]},"reason":"badarg","meta":{"pid":"<0.1362.0>","mfa":"{mongoose_graphql_sse_handler,handle_error,3}","line":"94","file":"/home/circleci/project/src/graphql/mongoose_graphql_sse_handler.erl"},"level":"error"}
{"when":"2024-01-11T12:09:41.715586+00:00","report":[["{initial_call,{cowboy_stream_h,request_process,['Argument__1','Argument__2','Argument__3']}}","{pid,<0.1362.0>}","{registered_name,[]}","{error_info,{error,{badrecord,ok},[{lasse_handler,terminate,3,...},{cowboy_loop,call,...},{cowboy_stream_h,...},{...}|...]}}","{ancestors,[<0.1361.0>,<0.1110.0>,<0.1107.0>,<0.1106.0>,ranch_sup,<0.250.0>]}","{message_queue_len,0}","{messages,[]}","{links,[<0.1361.0>]}","{dictionary,[{rand_seed,{#{type => exsss,next => #Fun<rand.0.65977474>,bits => 58,...},[29127266007832741|184983864797656587]}}]}","{trap_exit,true}","{status,running}","{heap_size,4185}","{stack_size,28}","{reductions,64816}"],""],"meta":{"pid":"<0.1362.0>","mfa":"{proc_lib,crash_report,4}","logger_formatter":{"title":"CRASH REPORT"},"line":"584","file":"proc_lib.erl","error_logger":{"type":"crash_report","tag":"error_report"},"domain":["otp","sasl"]},"level":"error","label":"{proc_lib,crash}"}
{"when":"2024-01-11T12:09:41.716895+00:00","unstructured_log":"Ranch listener 'ejabberd_cowboy_0.0.0.0_8088', connection process <0.1361.0>, stream 1 had its request process <0.1362.0> exit with reason {badrecord,ok} and stacktrace [{lasse_handler,terminate,3,[{file,\"/home/circleci/project/_build/default/lib/lasse/src/lasse_handler.erl\"},{line,106}]},{cowboy_loop,call,5,[{file,\"/home/circleci/project/_build/default/lib/cowboy/src/cowboy_loop.erl\"},{line,81}]},{cowboy_stream_h,execute,3,[{file,\"/home/circleci/project/_build/default/lib/cowboy/src/cowboy_stream_h.erl\"},{line,306}]},{cowboy_stream_h,request_process,3,[{file,\"/home/circleci/project/_build/default/lib/cowboy/src/cowboy_stream_h.erl\"},{line,295}]},{proc_lib,init_p_do_apply,3,[{file,\"proc_lib.erl\"},{line,241}]}]\n","meta":{"pid":"<0.1361.0>","error_logger":{"tag":"error"}},"level":"error"}
error.log
2024-01-11 12:09:41.715 [error] <0.1362.0>@lasse_handler:terminate:106 CRASH REPORT Process <0.1362.0> with 0 neighbours crashed with reason: bad record ok in lasse_handler:terminate/3 line 106
2024-01-11 12:09:41.717 [error] <0.1362.0>@lasse_handler:terminate:106 Cowboy stream 1 with ranch listener 'ejabberd_cowboy_0.0.0.0_8088' and connection process <0.1361.0> had its request process exit with reason: bad record ok in lasse_handler:terminate/3 line 106
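For anyone triaging this, the structured JSON log above can be reduced to the crash site programmatically: the first failure is a `badarg` in `mongoose_graphql_sse_handler:handle_error/3`, which then cascades into the `{badrecord,ok}` crash in `lasse_handler:terminate/3`. A small sketch (the `crash_site` helper is hypothetical, and `LOG_LINE` is a trimmed copy of the first attached entry):

```python
# Log-triage sketch. ASSUMPTION: crash_site is a hypothetical helper;
# LOG_LINE is a trimmed copy of the first entry from mongooseim.json.1.
import json

LOG_LINE = (
    '{"when":"2024-01-11T12:09:41.025376+00:00",'
    '"what":"mongoose_graphql_sse_handler_failed",'
    '"reason":"badarg",'
    '"meta":{"pid":"<0.1362.0>",'
    '"mfa":"{mongoose_graphql_sse_handler,handle_error,3}",'
    '"line":"94",'
    '"file":"/home/circleci/project/src/graphql/mongoose_graphql_sse_handler.erl"},'
    '"level":"error"}'
)

def crash_site(line: str) -> tuple:
    """Return (what, reason, mfa) from one structured MongooseIM log entry."""
    entry = json.loads(line)
    return entry["what"], entry["reason"], entry["meta"]["mfa"]

what, reason, mfa = crash_site(LOG_LINE)
print(f"{what}: reason={reason} at {mfa}")
# → mongoose_graphql_sse_handler_failed: reason=badarg at {mongoose_graphql_sse_handler,handle_error,3}
```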
Thank you for reporting the issue.
It seems that this is a problem with the Cowboy library, but we'll investigate, as we've already encountered some problems with SSE configuration in the past.
MongooseIM version: 6.2.0-44-g07a168ed1
Installed from: Helm chart, using the mongooseim/mongooseim:latest image
Erlang/OTP version: Erlang/OTP 26 [erts-14.1.1] [source] [64-bit] [smp:8:1] [ds:8:1:10] [async-threads:1] [jit:ns]
A crash.log file is also attached.
Questions
Might something be wrong at the configuration level in my case? Are there any suggestions for mitigating this issue?
Thank you for your attention to this matter. Let me know if further clarification or details are needed.