OttoWinter / AsyncTCP

Async TCP Library for ESP32
GNU Lesser General Public License v3.0

Fix race condition causing 'ack timeout 4' disconnects #4

Closed: mmakaay closed this 3 years ago

mmakaay commented 3 years ago

A pull request, as requested by @OttoWinter in https://github.com/esphome/issues/issues/1921#issuecomment-817143522.

The AsyncClient::send() method sets a boolean to true after successfully pushing data over the TCP socket using tcp_output(). It also stores a timestamp to remember at what time the data was sent.

The AsyncClient::_sent() callback method reacts to ACKs coming from the connected client. This method sets the boolean to false.

In the AsyncClient::_poll() method, a check is done to see if the boolean is true ("I'm waiting for an ACK") and if the time at which the data was sent is too long ago (5000 ms). If this is the case, a connection issue with the connected client is assumed and the connection is forcibly closed by the server.
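
For reference, the check in _poll() looks roughly like this (a paraphrase built from the field names above; the exact library code may differ):

uint32_t now = millis();
// ACK timeout: we are still waiting for an ACK (_pcb_busy) and the data
// was sent more than _ack_timeout ms ago; assume the connection is broken.
if (_pcb_busy && _ack_timeout && (now - _pcb_sent_at) >= _ack_timeout) {
    log_w("ack timeout %d", pcb->state);
    if (_timeout_cb)
        _timeout_cb(_timeout_cb_arg, this, (now - _pcb_sent_at));
    return ERR_OK;
}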

The race condition occurs when these operations get interleaved because of multithreading. The _sent() method can be called during the execution of the send() method:

  1. send() sends out data using tcp_output()
  2. _sent() is called because an ACK is processed, sets boolean to false
  3. send() continues and sets boolean to true + timestamp to "now"

After this, the data exchange with the client was successful. Data were sent and the ACK was seen. However, the boolean ended up as true, making the _poll() method think that an ACK is still to be expected. As a result, 5000 ms later, the connection is dropped.

This commit fixes the code by first registering that an ACK is expected, before calling tcp_output(). This way, there is no race condition when the ACK is processed right after that call.

Additionally, I changed the boolean to an integer counter value. The server might send multiple messages to the client, resulting in multiple expected ACKs. A boolean does not cover this situation.
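
In outline, the fixed send() then looks something like this (a minimal sketch; the counter name _waiting_acks is mine, not necessarily the name used in the actual patch):

bool AsyncClient::send(){
    // Register the expected ACK *before* handing data to lwIP, so an
    // ACK processed by _sent() right after tcp_output() cannot be lost.
    _waiting_acks++;
    _pcb_sent_at = millis();
    if (_tcp_output(_pcb, _closed_slot) == ERR_OK) {
        return true;
    }
    // Nothing was sent, so no ACK will arrive for this call.
    _waiting_acks--;
    return false;
}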

Misiu commented 3 years ago

@mmakaay what about ESPAsyncTCP-esphome? AsyncTCP is used on ESP32, but ESPAsyncTCP is used on ESP8266 (ref: https://github.com/esphome/esphome/pull/1693/files#diff-1988fae5ec01f4c733d17749246f0facf70a29738259590d1bdf9225ad10cd3dR10-R15), that many of us still use. I'm not sure if this error occurs there (I don't have the knowledge to tell that), maybe not, but I thought it's worth asking :)

mmakaay commented 3 years ago

I looked at ESPAsyncTCP while I was tracking down this issue, and it does use the same construction to keep track of ACKs. I concluded that it might therefore suffer from the same kind of disconnection issues (does it?).

However, a lot of work has been done on that code, and I saw some very specific comments about possible crashes and how to prevent them. The two libraries looked so different in places that I could not decide right away whether the same fix was needed. If this type of disconnect error occurs on the ESP8266 too, then I'm pretty sure the same fix can be applied.

Somewhere in the back of my mind, I made a mental note to check out the actual differences between these two libraries, because while looking at them I wondered whether the ESP32 AsyncTCP library wouldn't benefit from some of the developments in the ESPAsyncTCP library.

mmakaay commented 3 years ago

Hey @OttoWinter, based on the revert of the AsyncTCP version bump by @glmnet: do you agree it would be best to revert the patch from this pull request? It's no good if there are devices that show a regression in the form of extra disconnects.

OttoWinter commented 3 years ago

@mmakaay Agree, though it would make sense to understand how that regression could be caused too. Maybe we're encountering another race condition now because some code is executed on another ESP32 core in parallel (or interleaved execution), so it might make sense to use mutexes in this code.

mmakaay commented 3 years ago

Yes, I think that mutexes might improve this. I have tried the code on all of my devices, and for me it improved their API connections, so I wasn't able to trigger the regression myself. What would be interesting to know is what exact messages are logged on the console at the time of the disconnects, since that would tell us whether the disconnects were related to the ACK handling.

What I haven't looked into further so far is how this setup would behave when ACKs actually go missing. Where are the retransmits for this arranged? Should lwIP handle them? The AsyncTCP ACK handling assumes a sunny-day scenario in which no ACKs get lost, and treats a lost ACK as a reason to kick the connection. That is not the best scenario to work with. Possibly the regression hits when actual packets are getting lost on the network? It shouldn't kill the connection in such a case though, since the whole reason for TCP is to tolerate such issues. From a TCP perspective, the connection might be considered sub-optimal, but very much usable nonetheless.

I can dive a bit deeper into the stack and see what lwIP actually does. If that layer does handle ACKs and retransmits, I'm not so sure that ACK handling even ought to be in AsyncTCP to begin with.

OttoWinter commented 3 years ago

Yes, lwIP does handle retransmits :)

IIRC the reason for it is cleanup purposes. With the ACK logic you know when it's safe to write to a buffer again (if the buffer wasn't copied to lwIP's internal one before). But I'm not 100% sure about that anymore.
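
For illustration, a minimal sketch of that cleanup use case, using the onAck() callback that AsyncTCP exposes (the buffer-management comment is an assumption, not something this PR changes):

#include <AsyncTCP.h>

// Hypothetical usage: get notified when the peer has ACKed data, so a
// buffer handed to add() without ASYNC_WRITE_FLAG_COPY can be reused.
void watch_acks(AsyncClient* client) {
    client->onAck([](void* arg, AsyncClient* c, size_t len, uint32_t rtt) {
        // `len` bytes are now acknowledged by the peer; reusing or
        // freeing the corresponding send buffer is safe from here on.
    }, nullptr);
}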

glmnet commented 3 years ago

I managed to reproduce this issue by introducing a delay in send():

bool AsyncClient::send()
{
    int8_t err = ERR_OK;
    err = _tcp_output(_pcb, _closed_slot);
    if (err == ERR_OK)
    {
        vTaskDelay(50 / portTICK_RATE_MS);
        // The 50 is to give the network plenty of time to send the ack,
        // I tried with 25 and there were no timeouts
        _pcb_busy = true;
        _pcb_sent_at = millis();
        return true;
    }
    return false;
}

And fixed it with this (based on @mmakaay's ideas):

bool AsyncClient::send()
{
    int8_t err = ERR_OK;
    // Back up the old values so they can be restored if sending fails.
    auto pcb_sent_at_back = _pcb_sent_at;
    auto pcb_busy_back = _pcb_busy;
    // Register the expected ACK *before* pushing the data to lwIP.
    _pcb_sent_at = millis();
    _pcb_busy = true;
    err = _tcp_output(_pcb, _closed_slot);
    if (err == ERR_OK)
    {
        return true;
    }
    // Nothing was sent: revert, so no spurious ACK timeout is armed.
    _pcb_busy = pcb_busy_back;
    _pcb_sent_at = pcb_sent_at_back;
    return false;
}

I also marked _pcb_busy and _pcb_sent_at as volatile:

  volatile bool _pcb_busy;
  volatile uint32_t _pcb_sent_at;

_pcb_busy is there just to tell us that _pcb_sent_at holds a valid value and that the ACK timeout should be observed, so there is no harm in backing both values up in case _tcp_output() fails. If a timeout is already pending and _tcp_output() fails, the timeout will be "delayed", but the values will be reverted, and thus the timeout will still happen even if send() is called again quickly.

mmakaay commented 3 years ago

Was making them volatile required to make it work correctly?

Last night, I came up with a slightly different and possibly cleaner fix. I will explain my code below.

Before I explain the code, here is a consideration that I included in this change: one thing I found weird is that the ACK check presumably checks for ACKs, but it uses _rx_last_packet for the check. That field is updated on both incoming data packets and incoming ACKs. Thus, we might still be waiting for an ACK, while an incoming data packet would clear the timeout. Likewise, the RX timeout would be cleared by an incoming ACK; this might be fine, though, if the check implements the use case "check if we see any traffic coming in from the client".

On to the code changes:

This makes the block of related fields in AsyncTCP.h look like this:

    bool _ack_pcb;
    uint32_t _tx_last_packet;
    uint32_t _rx_ack_len;
    uint32_t _rx_last_packet;
    uint32_t _rx_since_timeout;
    uint32_t _rx_last_ack;
    uint32_t _ack_timeout;
    uint16_t _connect_port;

A side remark: I removed all uses of _pcb_busy.

The code from AsyncTCP.cpp that I changed is below.

A much simplified send() method, which now registers the time of the last transmitted packet when a send succeeds:

bool AsyncClient::send(){
    // Back up the old timestamp so it can be restored if sending fails.
    auto backup = _tx_last_packet;
    _tx_last_packet = millis();
    if (_tcp_output(_pcb, _closed_slot) == ERR_OK) {
        return true;
    }
    _tx_last_packet = backup;
    return false;
}

The method that handles an incoming ACK now also updates _rx_last_ack:

int8_t AsyncClient::_sent(tcp_pcb* pcb, uint16_t len) {
    _rx_last_packet = millis();
    _rx_last_ack = millis();
    // ... snip
}

The logic for the ACK timeout now checks whether the time of the last incoming ACK is before the time of the last sent packet. That eliminates the need for a boolean to enable the ACK timeout check.

int8_t AsyncClient::_poll(tcp_pcb* pcb){
    // ... snip
    // ACK Timeout
    if(_ack_timeout && _tx_last_packet > _rx_last_ack && (now - _tx_last_packet) >= _ack_timeout){
        log_w("ack timeout %d", pcb->state);
        if(_timeout_cb)
            _timeout_cb(_timeout_cb_arg, this, (now - _tx_last_packet));
        return ERR_OK;
    }
    // ... snip
}

This code is giving me very stable connections. I have 8 OTA loggers and HA active, and they now run as stably for me as when I fully disable the ACK timeout checks.

Do you want me to bundle this up as a new pull request?

glmnet commented 3 years ago

The volatile was added just by instinct.

Regarding your code: you should check your timestamp comparison. Otherwise it looks good to me.

_tx_last_packet > _rx_last_ack

should check for rollovers, see https://arduino.stackexchange.com/questions/12587/how-can-i-handle-the-millis-rollover

mmakaay commented 3 years ago

Good point. I was wondering if it would be worth it, but let's prevent that possible once-per-50-days disconnect issue :+1: How about this?

    if(_ack_timeout){
        uint32_t one_day = 86400000;
        bool last_tx_is_after_last_ack = (_rx_last_ack - _tx_last_packet + one_day) < one_day;
        if(last_tx_is_after_last_ack && (now - _tx_last_packet) >= _ack_timeout) {
            log_w("ack timeout %d", pcb->state);
            if(_timeout_cb)
                _timeout_cb(_timeout_cb_arg, this, (now - _tx_last_packet));
            return ERR_OK;
        }
    }
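
As a standalone sanity check (my own test, not part of the PR), the one-day-window comparison behaves correctly across a millis() rollover:

#include <cstdint>
#include <cassert>

// Same comparison as above: true when the last TX is after the last ACK,
// even when millis() has wrapped between the two timestamps.
static bool tx_after_ack(uint32_t rx_last_ack, uint32_t tx_last_packet) {
    const uint32_t one_day = 86400000;
    return (rx_last_ack - tx_last_packet + one_day) < one_day;
}

int main() {
    assert(tx_after_ack(1000, 6000));       // TX 5 s after the last ACK
    assert(!tx_after_ack(6000, 1000));      // ACK after the last TX
    assert(tx_after_ack(0xFFFFFFF0u, 16));  // TX just after a rollover
    assert(!tx_after_ack(16, 0xFFFFFFF0u)); // ACK just after a rollover
    return 0;
}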

mmakaay commented 3 years ago

Okay, a follow-up pull request has been created: https://github.com/OttoWinter/AsyncTCP/pull/5

I did not include the volatile flags, because I do not think they would be useful. Reading up on the combination of volatile and threading behavior in this Stack Overflow post didn't convince me that they would be. If you do want those variables to be volatile anyhow, let me know in the new PR and I can update it.