arduino / ArduinoCore-mbed

330 stars 195 forks source link

Socket Write and Close Hangs Indefinitely When There Is No Connectivity #905

Closed alvarolb closed 2 months ago

alvarolb commented 3 months ago

Hi,

I am not sure why this problem has not been reported before, but there is a serious issue when using TCP/TLS sockets. I think that the current write implementation is wrong. This is the related code from MbedClient.cpp:

size_t arduino::MbedClient::write(const uint8_t *buf, size_t size) {
  if (sock == nullptr)
    return 0;

  sock->set_blocking(true);
  sock->set_timeout(SOCKET_TIMEOUT);
  int ret = NSAPI_ERROR_WOULD_BLOCK;
  do {
    ret = sock->send(buf, size);
  } while ((ret != size && ret == NSAPI_ERROR_WOULD_BLOCK) && connected());
  configureSocket(sock);
  return size;
}

The problem is that if you turn off the WiFi router or remove the Ethernet cable while a socket is open, a normal write to this socket will hang indefinitely. In this situation, the send function will always report NSAPI_ERROR_WOULD_BLOCK, and the connected() method will always return true. As a result, the loop remains indefinitely, blocking the application. If the Ethernet cable is reconnected, the application resumes normal operation. However, this is not the desired behavior, as the application cannot detect disconnections.

I have slightly modified the method, and it seems to work:

size_t arduino::MbedClient::write(const uint8_t *buf, size_t size) {
  if (sock == nullptr)
    return 0;

  sock->set_blocking(true);
  sock->set_timeout(SOCKET_TIMEOUT);
  int ret = sock->send(buf, size);
  configureSocket(sock);
  if(ret==NSAPI_ERROR_WOULD_BLOCK) return 0;
  return size;
}

I'm not sure why the code included this while loop, as the socket is configured to block. If no bytes are written after the specified timeout, it will always return NSAPI_ERROR_WOULD_BLOCK. For more information, refer to the documentation: mbed-os TCP Socket.

By default, send blocks until all data is sent. If socket is set to non-blocking or times out, a partial amount can be written. NSAPI_ERROR_WOULD_BLOCK is returned if no data was written.

Additionally, I’m not sure why we have to call configureSocket(sock) repeatedly after each write. This could potentially impact performance.

void arduino::MbedClient::configureSocket(Socket *_s) {
  _s->set_timeout(_timeout);
  _s->set_blocking(false);
  _s->getpeername(&address);

  if (event == nullptr) {
    event = new rtos::EventFlags;
  }
  if (mutex == nullptr) {
    mutex = new rtos::Mutex;
  }
  mutex->lock();
  if (reader_th == nullptr) {
    reader_th = new rtos::Thread(osPriorityNormal - 2);
    reader_th->start(mbed::callback(this, &MbedClient::readSocket));
  }
  mutex->unlock();
  _s->sigio(mbed::callback(this, &MbedClient::getStatus));
  _status = true;
}

Moreover, if we detect a write failure in the application (via the modified write method) and try to close the socket, it also hangs indefinitely (only on TLS). I suspect that the current mBed configuration included in the Arduino build uses the default behavior when closing a TLS connection, which involves sending a close notify to the remote server. This close notify requires a connection to work, but since the connection is not available, it hangs forever, just like the previous write. This issue is also reported in this ARMmbed topic:

https://github.com/ARMmbed/mbed-os/issues/15209

This is the close method from TLSWrapper.cpp using the mbed_tls_close_notify:

nsapi_error_t TLSSocketWrapper::close()
{
    if (!_transport) {
        return NSAPI_ERROR_NO_SOCKET;
    }

    tr_info("Closing TLS");

    int ret = 0;
    if (_handshake_completed) {
        _transport->set_blocking(true);
        ret = mbedtls_ssl_close_notify(&_ssl);
        if (ret) {
            print_mbedtls_error("mbedtls_ssl_close_notify", ret);
        }
        _handshake_completed = false;
    }

    if (_close_transport) {
        int ret2 = _transport->close();
        if (!ret) {
            ret = ret2;
        }
    }

    _transport = NULL;

    return ret;
}

I would definitely recommend to skip the close notify, or at least configure a timeout.

domfie commented 2 months ago

Finished my application with the Opta, started testing for basic things like what happens if ethernet goes down due to switch failure and stumbled upon this issue. Huge disappointment. If arduino claims to do something industrial, it should definitely have been tested to work even if parts of external systems failing.

After applying your patch it no longer hangs during mqttclient.poll(). Thank you @alvarolb