gen-smtp / gen_smtp

The extensible Erlang SMTP client and server library.
Other
683 stars 266 forks source link

Add better handling of crashes on socket closure by client #270

Open saleyn opened 3 years ago

saleyn commented 3 years ago

Add functionality to terminate gen_server gracefully in case of SSL handshake errors or premature closing of socket by an SMTP client.

seriyps commented 3 years ago

I wonder if it is really necessary to use funs in the scenarios where you use send_and_run? Maye it's possible to have just linear "happy path" and then signal errors with throw? Or I'm missing smth?

It just becomes a bit difficult to reason about the order in which code lines are executed looking at the functions using send_and_run

saleyn commented 3 years ago

The main issue is that when errors are thrown, the error log has a ton of details pertaining to the crashed gen_server, whereas in the majority of cases, especially when the server closes the socket after ELHO or similar scenarios, we don't really need to put the crash details in the log. Generally, I think the current implementation needs a major revision to clean up and simplify that logic, but this is merely a patch to make the existing code more usable.

seriyps commented 3 years ago

@saleyn the thing is that throw from gen_server's callback is kind of equivalent to just returning from the gen_server:

https://github.com/erlang/otp/blob/a83541527625fae5d7f849718a052cf6f2833809/lib/stdlib/src/gen_server.erl#L697-L698

so

handle_cast(..., State) ->
   ...
   {noreply, NewState}.

is equivalent of

handle_cast(..., State) ->
   ...
   throw({noreply, NewState}),
   ....

(same works for handle_call/handle_info and I think even init).

Also, if you throw({stop, normal, State}) it would stop the process, but normal exit reason (so, no error logs, no signal propagation).

When I implemented send/2 function making it throw {stop, {send_error, ...}, St} I might actually just change it to throw {stop, normal, St}, but decided to better let user-defined handle_error decide (handle_error callback could return {stop, normal, St} instead of returning ok or {stop, Error, St}), but I made it more explicit (by generating error log) by-default.

User can define

-behaviour(gen_smtp_server_session).

handle_error(send_error, Err, St) ->
  logger:debug("socket send error ~p", [Err]),
  {stop, normal, St};
handle_error(_, _, St) ->
  {ok, St}.

to suppress error_logger reports when gen_smtp_server_session:send/2 fails. It's probably should be better documented though...

saleyn commented 3 years ago

If my memory serves me, a while back when I was implementing an SMTP server based on gen_smtp, the issue was that there were many places were gen_tcp/ssl's send/2 function would return an error, and this would result either in a failed match ok = gen_tcp:send(...) or erlang:error/1 someplace, which obviously is not same thing as an erlang:throw/1 and would leak from the gen_server's exception handler, crashing the process.

I recall seeing this behavior on ELHO and QUIT messages, e.g. when the server was trying to send QUIT, by which time a client already closed the socket. Hence, I had to implement some more refined handing of those edge cases to eliminate useless log reports of process crashes.

seriyps commented 3 years ago

@saleyn

a while back when I was implementing an SMTP server based on gen_smtp

depends on how long ago, because I made 2 PRs where error handling have been noticeably improved relatively recently: #211 adds handle_error user-defined callback and #226 is exactly about handling send errors; both PRs were merged in mid-2020

saleyn commented 3 years ago

depends on how long ago, because I made 2 PRs where error handling have been noticeably improved relatively recently: #211 adds handle_error user-defined callback and #226 is exactly about handling send errors; both PRs were merged in mid-2020

I believe it was more or less at the end of 2020 using the HEAD of master branch.

seriyps commented 3 years ago

Oh, that's strange. Looking at the code, it seems the only place where we are not using gen_smtp_server_session:send/2 to send server's replies is init callback (becaue return tuple format there is slightly different I think). And all the send/2 errors whould be possible to suppress by returning {stop, normal, State} from handle_error callback of gen_smtp_server_session.

I would rally appreciate if you could provide some small example where we could reproduce it!

saleyn commented 3 years ago

I've had this patched code running in production for the good part of this year, and haven't seen these issues. So it would be hard to reproduce without rolling back to the unpatched version, which is not something I could do any time soon. So you can table this PR for now, I guess. If I have more updates on this, I'll let you know.

seriyps commented 3 years ago

@saleyn I also have gen_smtp master server running in production with the following customized handle_error/3 callback:

handle_error(Kind, [], State) when Kind == tcp_closed; Kind == ssl_closed ->
    %% It's normal; no need to log/metric
    {ok, State};
handle_error(out_of_order, <<"DATA">>, #state{orig_to = To, inbox = undefined} = State) when
    is_binary(To)
->
    %% RCPT TO and DATA are sent in pipeline, but rcpt failed due to recipient address not found
    {ok, State};
handle_error(Kind, Details, #state{peer = Ip} = State) ->
    tm_metrics:count_inc([smtp, error, total], 1, #{labels => [Kind]}),
    ?LOG_WARNING(
        #{
            tag => "Session error",
            kind => Kind,
            details => Details,
            state => State
        }
    ),
    case Kind of
        ssl_handshake_error ->
            tm_smtp_tls_blacklist:report(Ip),
            %% Force stop
            {stop, normal, State};
        ssl_error ->
            tm_smtp_tls_blacklist:report(Ip),
            % will stop anyway
            {ok, State};
        send_error when Details =:= closed ->
            %% seems lots of servers close the connection prematurely (without sending QUIT)
            {stop, normal, State};
        _ ->
            {ok, State}
    end.

and I indeed suppress send_error by returning normal as stop reason. But I did it after I made some analysis of the traffic I receive; I'm not sure we should suppress them by-default, because it may make debugging more difficult. Maybe we should add more tips to the smtp_server_example docstring suggesting which errors are a good candidate for being suppressed after initial analysis?

saleyn commented 3 years ago

It certainly would be helpful to include this in the example server, and mention your comments from this thread there.