erlyaws / yaws

Yaws webserver
https://erlyaws.github.io
BSD 3-Clause "New" or "Revised" License

yaws is stuck and cannot receive any new HTTP connections #492

Open yaocl opened 2 months ago

yaocl commented 2 months ago

We run the yaws server in embedded mode, and the server starts and handles requests. HAProxy is used in front to monitor the yaws cluster nodes and to receive/dispatch new HTTP webpage/websocket connections.

A yaws node will suddenly get stuck and cannot receive any new HTTP connections.

We used observer_cli to inspect the Erlang VM and found, on the stuck node, that:

  1. the yaws_server process is stuck with the message [{inet_async, #Port<0.14205>, 4242, {ok, #Port<0.17083>}}] sitting in its mailbox; the message is never processed or flushed
  2. the process was started in yaws_server:acceptor0/2 and its current Function is prim_inet:accept0/2

We have tested yaws versions 2.0.6 and 2.0.9 and tried Erlang/OTP versions 20 and 25. The problem always happens at random times.

Screenshot 2024-07-09 11 37 11

Screenshot 2024-07-09 11 38 22

Screenshot 2024-07-09 11 39 19

vinoski commented 2 months ago

First thing you might try is building from the yaws master branch to see if you can reproduce the problem with the latest code.

vinoski commented 2 months ago

For this observer image it would be good if you could provide all the process info you can get from observer for pid <0.767.0>. Undoubtedly that pid will vary for each server execution, but I'm interested in any stack trace you can get for the process that might help determine why it's not receiving the queued message.
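
For reference, roughly the following, run in an Erlang shell on the affected node, collects that kind of detail (a sketch only; substitute whatever pid observer reports, and note that pid/3 is an Erlang shell helper):

Pid = pid(0, 767, 0).
%% current_stacktrace, status and the pending messages are usually the
%% most useful items when a process appears wedged.
erlang:process_info(Pid, [current_function, current_stacktrace,
                          status, message_queue_len, messages]).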

vinoski commented 2 months ago

The same image referenced in the previous comment shows a {inet_async, ...} message. I don't believe yaws calls anything that would generate such a message. I therefore assume you're running your own code within the same virtual machine as yaws — can you explain what it does? Are you running some sort of distributed multi-node system?

yaocl commented 2 months ago

The 2nd snapshot is the process info of <0.767.0>. We use yaws together with our own code and start it via yaws:start_embedded. We have two VM nodes running the same code, and the mnesia tables are replicated and kept in sync between the nodes.
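
For context, a minimal sketch of what such an embedded startup can look like (the docroot, port, and servername below are placeholders, not the reporter's actual configuration):

-module(embedded_example).
-export([start/0]).

%% Start yaws in embedded mode with an illustrative configuration.
start() ->
    Docroot   = "/tmp/www",
    SconfList = [{port, 8080},
                 {listen, {0,0,0,0}},
                 {servername, "embedded_example"}],
    GconfList = [{id, "embedded_example"}],
    yaws:start_embedded(Docroot, SconfList, GconfList).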

I think the inet_async message is generated by the Erlang VM when the OS-level acceptor hands over a new TCP connection. yaws is stuck with this message in its mailbox, so the new TCP connection is never processed. But we have no idea why the yaws gen server gets stuck and cannot process new messages.

vinoski commented 2 months ago

For the second image it would be good to also see the Current Stack tab for that process.

yaocl commented 2 months ago

These are other details of process <0.767.0>:

Screenshot 2024-07-09 11 39 36

Screenshot 2024-07-09 11 39 48

vinoski commented 2 months ago

Thanks. Given that yaws doesn't make any calls itself that would result in an inet_async message reply, it's likely that your application is doing something within a yaws process that causes such a message to arrive in its mailbox. The fact that you're using yaws in embedded mode and that no other user has, as far as I can recall, ever reported a problem like this would seem to reinforce the idea that this is an application issue. Even so, I'm curious as to how this can happen, so I'm wondering if you can supply more details about what your application code is doing, especially within processes that are controlled by yaws and are dispatching into your code.

yaocl commented 1 month ago

Here is another case from today; screenshots are attached below.

I cannot provide the code since it's a company product. The application primarily uses websockets to process client data, and the websocket processes relay data between the connected clients. There are also some websocket connections from another Java application; those connections stay connected on standby to process data. In this case there are 44 standby websocket processes.

Two Erlang cluster nodes run yaws on two different machines. The websocket process Pids are stored in an mnesia table and synced between the nodes. When one websocket client needs to send data to other clients, we look up the target Pids in the mnesia table and send the data to those Pids.
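
Concretely, that lookup-and-send step might look roughly like the following (a sketch only; the ws_relay module, the ws_client table and record, and relay_to/2 are hypothetical names, not the actual code):

-module(ws_relay).                       %% hypothetical helper module
-export([relay_to/2]).

-record(ws_client, {client_id, pid}).    %% hypothetical mnesia table record

%% Look up the target websocket process pid in mnesia and push a text
%% frame to it via the yaws websocket API.
relay_to(ClientId, Data) ->
    case mnesia:dirty_read(ws_client, ClientId) of
        [#ws_client{pid = Pid}] ->
            yaws_api:websocket_send(Pid, {text, Data});
        [] ->
            {error, not_connected}
    end.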

HAProxy is in front of the yaws nodes. It is configured with "option httpchk *" to check yaws's online status every 5 seconds, so there are also regular HTTP connections coming from HAProxy.

Screenshot 2024-07-15 08 41 49

Screenshot 2024-07-15 08 42 11

Screenshot 2024-07-15 08 42 21

Screenshot 2024-07-15 08 42 35

vinoski commented 1 month ago

When your websocket code is finished, does it perform a socket close? If not, can you try adding a call to yaws_api:websocket_close/1 to your code?

yaocl commented 1 month ago

The server does not close the websocket connection itself; it always waits for the client to open or close the connection.

Based on the server log, when a websocket connection is closed by a client, the following websocket callback module functions are invoked:

handle_message({close, Status, Reason}, _)

terminate(Reason, State)

Should I call yaws_api:websocket_close/1 in terminate/2? terminate/2 seems to be the yaws websocket callback function invoked on close.

After testing over the last few days, we found that ClamAV's "clamscan" can cause more yaws_server:acceptor0/2 processes to end up with a pending inet_async message. When "clamscan" runs, CPU load and memory usage are extremely high. We have tried nice, ionice, and cpulimit to throttle clamscan and need a few more days to verify the result.

/usr/bin/nice -n 19 /usr/bin/ionice -c2 -n7 /usr/bin/clamscan -i -r /directory_to_scan >> /var/log/clamscan/clamscan_daily.log &
sleep 5
cpulimit -e clamscan --limit 10 --lazy &

"clamscan" is execused at 4:20AM in daily cronjob, but the yaws_server:acceptor0 process pending "inet_async" message still happened when we monitor the VM in the day time.

Sometimes the VM has 1 or 2 yaws_server:acceptor0 processes with a pending inet_async message while yaws still accepts new HTTP connections. Is it possible to monitor processes with such an incorrectly pending message and kill them?
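
One possible way to spot such processes, just as a sketch using standard process introspection (whether it is safe to kill a wedged acceptor, or better to restart the server, is a question for the yaws maintainers):

-module(acceptor_check).   %% hypothetical helper module
-export([stuck_acceptors/0]).

%% List processes currently sitting in prim_inet:accept0 with a
%% non-empty mailbox, matching the stuck acceptors described above.
stuck_acceptors() ->
    [P || P <- erlang:processes(), is_stuck_acceptor(P)].

is_stuck_acceptor(P) ->
    case erlang:process_info(P, [current_function, message_queue_len]) of
        [{current_function, {prim_inet, accept0, _}},
         {message_queue_len, N}] when N > 0 ->
            true;
        _ ->
            false
    end.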

vinoski commented 1 month ago

Apologies for not being clear. Rather than making an API call, if your websocket callback module receives any sort of message particular to your application in its handle_message/2 function indicating that the websocket connection is no longer needed, please return {close, normal} from handle_message/2. Also, if handle_message/2 receives a {close, Status, Reason} message, please return {close, Status}.

yaocl commented 1 month ago

It's written this way:

handle_message({close, Status, Reason}, _) ->
  lager:debug("Close connection: status=~p reason=~p", [Status, Reason]),
  {close, Status}.

vinoski commented 1 month ago

OK, that's good, but I assume the application protocol you've created for your application on top of websocket also has a shutdown or closing aspect, and the message you use to indicate that also needs to be handled with {close, normal}. See the basic callback example code for an example.
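
To make that concrete, a hedged sketch (the <<"bye">> frame is only a stand-in for whatever application-level shutdown message the protocol actually uses):

%% Application-level shutdown message: close the websocket normally.
handle_message({text, <<"bye">>}, _State) ->
    {close, normal};
%% Peer-initiated close, as in the snippet shown earlier in the thread.
handle_message({close, Status, _Reason}, _State) ->
    {close, Status}.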

yaocl commented 1 month ago

Our code follows basic_echo_callback_extended.

What's the difference between handle_message/1 and handle_message/2? Do we need to add a handle_message/1 to our basic_echo_callback_extended-based module?

vinoski commented 1 month ago

Ok, since you're using the extended example, do you have a handle_message/2 that returns a close tuple, similar to the one in the example?

The handle_message/1 function gets only a message as an argument, whereas handle_message/2 gets a message and the callback module state as arguments.
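
In other words, roughly the following (echo bodies are shown only for illustration; a callback module implements one form or the other):

%% Basic callback: only the decoded frame is passed in.
handle_message({text, Msg}) ->
    {reply, {text, Msg}}.

%% Extended callback: the frame plus the callback module state.
handle_message({text, Msg}, State) ->
    {reply, {text, Msg}, State}.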

yaocl commented 1 month ago

We found that the source of the problem is https://github.com/erlyaws/yaws/blob/0c1d25e59850e5cb7fcc3f9f058af3e3997c0703/src/yaws.erl#L2736

yaws tries to collect all the HTTP headers. It invokes do_recv with Count =:= 0, but do_recv never returns. It looks like the gen_tcp:recv call inside do_recv neither times out nor returns an error.

http_collect_headers(CliSock, Req, H, SSL, Count) when Count < 1000 ->
    setopts(CliSock, [{packet, httph}, {packet_size, 16#4000}], SSL),
    Recv = do_recv(CliSock, 0, SSL),

The problem is not related to websockets. It is triggered by HAProxy's regular httpchk probe that checks whether the server is alive, and it does not happen immediately after the server starts. We tried decreasing the httpchk interval from 5 seconds to 200 ms on two servers; sometimes it happens after 1 hour and sometimes after 2 or 3 hours.

Some posts suggest using prim_inet:recv, but the prim_* modules are internal to Erlang/OTP.
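
For what it's worth, the difference being described is essentially gen_tcp:recv/2 (which blocks until data arrives or the socket errors) versus gen_tcp:recv/3, which takes an explicit timeout, roughly:

%% Sketch only: recv/3 returns {error, timeout} instead of blocking
%% indefinitely when no data arrives within the given window.
recv_with_timeout(Sock) ->
    case gen_tcp:recv(Sock, 0, 30000) of
        {ok, Packet} ->
            {ok, Packet};
        {error, timeout} ->
            %% recv/2 would simply keep blocking at this point
            gen_tcp:close(Sock),
            {error, timeout};
        {error, Reason} ->
            {error, Reason}
    end.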

vinoski commented 1 month ago

But surely httpchk is sending a legal http request? It seems to send an OPTIONS request by default, which Yaws can handle. It would be interesting to enable Yaws traffic tracing to see what it's receiving. You can do that either by passing -t to yaws when you run it, or by configuring trace = traffic in the global part of your yaws.conf file. Traces will be sent to trace log files, siblings of other regular Yaws log files. Alternatively you can trace only HTTP traffic with -T or with trace = http.
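
For reference, the relevant lines in yaws.conf would look roughly like this (a sketch of the global section only):

# global part of yaws.conf -- enable traffic tracing
trace = traffic
# or, to trace only HTTP traffic:
# trace = http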