TCP read out-of-sync after exception

stefanb2 commented 4 weeks ago

I'm trying to integrate the code into my tree as Modbus TCP server. I started with the minimal configuration, ie. I flagged out all except one function implementation. While trying to test the illegal function exception I noticed that TCP socket read gets out-of-sync.

That is caused by the fact that only the frame header is read when handle_req_fc() sends the exception response. But the last bytes of the frame are still in the socket read buffer, hence the next call to nmbs_server_poll() will then get stuck, because there is an invalid and incomplete message header in the buffer. When the client sends the next frame, then

connection is aborted with NMBS_ERROR_INVALID_TCP_MBAP
the connection is closed by the server
the client is aborts with messages like "unexpected end of socket"

Browsing through the code there are some other places where frame processing is aborted early. Those code paths will have the same problem.

I don't know if there is a specification how a Modbus TCP client is supposed to act in case of an exception. MUST he close the socket immediately and use a new one for the next request?

I'm not sure how this should be corrected, because recv_msg_header() doesn't store the frame length. How about this approach

store frame length to nbms (as far as I understand that would only be valid for TCP)
decrement frame length in recv() in the NBMS_ERROR_NONE code path
if there are bytes left in the frame then send_exception_msg() should do a platform.read() (similar to nmbs_server_poll() after the exception was successfully sent.

stefanb2 commented 4 weeks ago

As a side note: I think this code in nmbs_server_poll() may not work as expected, because send_exception_msg() returns the result from send_msg() and not the exception error code itself:

    err = handle_req_fc(nmbs);
    if (err != NMBS_ERROR_NONE && !nmbs_error_is_exception(err)) {
        ...
        return err;
    }

stefanb2 commented 4 weeks ago

The proposed PR fixes the problem for me, but I have client code disabled.

stefanb2 commented 3 weeks ago

Another side note: the following code in Linux TCP example platform code is incorrect:

        if (ret == 1) {
            ssize_t w = write(fd, buf + total, count);

it should of course be:

            ssize_t w = write(fd, buf + total, count - total);

pseudotronics commented 2 weeks ago

It seems like this only would be an issue when the client polling rate is higher than the server's internal polling rate. I think in most applications this isn't the case which is probably why I have not seen this problem. I also don't use POSIX sockets.

Flushing the remainder of the frame seems like the logical thing to do here though.

stefanb2 commented 2 weeks ago

Maybe my wording was imprecise, but this has nothing to do with the Socket API, just TCP. I have no idea what the sender transmit or recipient processing speed has to do with this.

TCP presents a bi-directional, unstructured stream of octets to the application, no matter what the actual API is. The structure on-top of that stream (AKA protocol) is the responsibility of the application.

That is the reason why the Modbus TCP PDU has a length field (RTU must have some other method). But that length field is not available to the low-level platform read function, so frame handling needs to be implemented by the nanoMODBUS code.

Sorry for sounding like a "Network 101" text book :smiley:

pseudotronics commented 2 weeks ago

I have no idea what the sender transmit or recipient processing speed has to do with this

I was just saying that subsequent calls to nmbs_server_poll would effectively 'flush' the socket. If the server polls significantly faster than the client is sending commands then there would never be an instance where the socket has two frames of data. If you only call nmbs_server_poll when you have a receive event, then this is definitely a problem.

I imagine in my application nmbs_server_poll is returning NMBS_ERROR_INVALID_TCP_MBAP when the function code is not recognized for a few subsequent calls to nmbs_server_poll. I just don't care what nmbs_server_poll returns.

nanoMODBUS doesn't control your connections so I was confused as to why:

connection is aborted with NMBS_ERROR_INVALID_TCP_MBAP

the connection is closed by the server

How you respond to errors returned by nmbs_server_poll is up to you.

Not trying to bash your issue, I was just offering up some speculation as to why this this problem has gone unnoticed.

stefanb2 commented 2 weeks ago

I was just saying that subsequent calls to nmbs_server_poll would effectively 'flush' the socket.

That's the point: it doesn't do that. After the first PDU parsing is aborted too early, the framing is lost and all subsequent PDUs can't be parsed correctly anymore. So the client gets incorrect responses to correct PDUs.

If the server polls significantly faster than the client is sending commands then there would never be an instance where the socket has two frames of data.

Once again: the processing speed doesn't have anything to do with this. On the TCP stream there are no "frames", just octets. It is up to the application to recognize the frames and read each of them completely.

How you respond to errors returned by nmbs_server_poll is up to you.

There is not enough information for the caller to handle the situation, i.e. the only thing he can do is to close the connection. But that means in the end that the client always looses the connection after it sends a PDU whose processing is aborted early due to some error condition.

Of course the simple solution is to say: any client that sends a PDU that the server rejects with an exception is broken. Not sure how far you can get with this in real life...

pseudotronics commented 2 weeks ago

I still don't think you understand what I'm trying to say. You seem to be taking it way too personal. So sure, whatever you say.

stefanb2 commented 2 weeks ago

It seems like this only would be an issue when the client polling rate is higher than the server's internal polling rate. I think in most applications this isn't the case which is probably why I have not seen this problem. I also don't use POSIX sockets.

I think I finally figured out the implicit assumption behind this statement. You are stating that the contract for the nmbs_server_poll() API includes the requirement that on return the caller needs to remove any left-over bytes from the platform transport before the next call to nmbs_server_poll().

I can't find anything about this in the nanoMODBUS documentation
the example code for a TCP server doesn't reflect this, i.e. bytes are only read from the platform transport in the context of a nmbs_server_poll() call.

debevv commented 2 weeks ago

Yes, the reason why the library behaves this way is basically RTU. As @pseudotronics explained, the expectation behind this design is that the user will repeatedly call nmbs_server_poll(), and eventually the server will be in sync again. This is the best that we can do (apart from strategic socket flushes) on RTU, where there isn't a length field, and we must defend from bad/random incoming data. Actually, the official RTU spec describes a system based on timers to space out bytes and messages, but from what I understood nobody actually implements it.

However, in TCP we do have a length field, and the server would be much more reliable if we didn't throw that info away and use it to try to read the whole message all at once. Besides, it feels a bit stupid to put ourselves in an out-of-sync situation by leaving data in the socket when responding whit an exception. It works anyway, but I think it's the time to make everything more solid.

I will have a look at it in the next days because I want to introduce the change more holistically. Thanks to both of you for the constructive discussion.

stefanb2 commented 2 weeks ago

Yes, the reason why the library behaves this way is basically RTU. As @pseudotronics explained, the expectation behind this design is that the user will repeatedly call nmbs_server_poll(), and eventually the server will be in sync again.

Thanks for the explanation. This clarifies

why the TCP server example takes the "do nothing" approach when nmbs_server_poll() returns an error, and
the timeout= and retries= parameters for the Modbus TCP client class of the Python package I used for my test script.

For my initial integration I followed the SOP for socket servers to drop the client connection when the application protocol layer reports an error. I have to admit that AFAIR Modbus is the first request/response protocol I encountered with this odd behavior, i.e. that on certain error conditions the server doesn't send a response, leaving the client to play a guessing game.

For the product I'm working on I will not have any control over the client, so I have to make sure that my server behaves as expected by any client out there.

debevv / nanoMODBUS

TCP read out-of-sync after exception #67