256dpi / arduino-mqtt

MQTT library for Arduino
MIT License
1.01k stars 230 forks source link

Cant connect to broker when using esp8266 version 3.1.1 #292

Closed nusabot-iot closed 1 year ago

nusabot-iot commented 1 year ago

Using Arduino IDE, and install board version 3.1.1. and call connect API, always return false.

it works when using version 3.0.x

mp-se commented 1 year ago

I have the same issue when trying to upgrade to 3.1.1

[update] This sample code works on the previous version but return error -9 on the v3.1.1 version of the esp8266 arduino fwk.

WiFiClient _wifi; MQTTClient mqtt(128);

mqtt.begin("myhost", 1883, _wifi); mqtt.connect("test", "", "");

if (mqtt.publish("test/topic", "value")) { Serial.printf("MQTT publish successful" CR); } else { Serial.printf("MQTT publish failed with error %d" CR, mqtt.lastError()); }

mqtt.disconnect();

mp-se commented 1 year ago

Here are some dumps with debugging on for the two frameworks: It looks like it fails to get the ACK due to timeout when sending the initial data to the server (my first thoughts)

These were the flags I added. -D DEBUG_ESP_HTTP_CLIENT
-D DEBUG_ESP_HTTP_SERVER
-D DEBUG_ESP_PORT=Serial
-D DEBUG_ESP_WIFI -D DEBUG_ESP_SSL -D DEBUG_ESP_CORE

Working one (older fwk)

.---- start testing for mqtt bug [hostByName] Host: 192.168.1.16 is a IP! :ref 1 :wr 18 0 :wrc 18 18 0 :ack 18 :rn 4 :rd 1, 4, 0 :rdi 4, 1 :rd 1, 4, 1 :rdi 3, 1 :rd 2, 4, 2 :rdi 2, 2 :c0 2, 4 :wr 19 0 :wrc 19 19 0 MQTT publish successful :wr 2 0 :wrc 2 2 0 :ack 19 :ack 2 :rcl pb=0 sz=-1 :abort :ur 1 :dsrcv 0 :del ---- end testing for mqtt bug

Failing (3.1.1)

.---- start testing for mqtt bug [hostByName] Host: 192.168.1.16 is IP! :ref 1 :wr 18 0 :wrc 18 18 0 :wustmo :close MQTT publish failed with error -9 :ur 1 :dsrcv 0 :del ---- end testing for mqtt bug

home-s-dev commented 1 year ago

I had the same problem, just with esp8266 core 3.1.1 (and not with 3.1.0 or previous 3.0.x)

The problem seems related to the change in delay() coming with ESP8266 core ver 3.1.1 Core

I'm not 100% sure, but changing https://github.com/256dpi/arduino-mqtt/blob/master/src/MQTTClient.cpp#L72 from delay(0) to delay(1) can fix. No other occurrence of "delay" is present in the code...

I hope this can help.

mp-se commented 1 year ago

You were spot on. Changing delay(0) to delay(1) fixes the problem with running under 3.1.1. I will do some more tests and issue a PR for this later on.

mp-se commented 1 year ago

I did some more testing. Trying to use delayMicroseconds instead of delay but that didnt work. I'm wondering if the issue is connected to timing and reading the result from the initial connection.

Changing from delay(1) to yield() also works. I'm not sure which option is best in this case.

Changing a behaviors' in the arduino core like this would probably break more libraries, so I'm curious if this was intentional or they introduced a new bug.

home-s-dev commented 1 year ago

If I have pieced together correctly, old version of delay(0) results in calling __esp_delay(0)

extern "C" void __esp_delay(unsigned long ms) {
    if (ms) {
        os_timer_setfn(&delay_timer, (os_timer_func_t*)&delay_end, 0);
        os_timer_arm(&delay_timer, ms, ONCE);
    }
    else {
        esp_schedule();
    }
    esp_suspend();
    if (ms) {
        os_timer_disarm(&delay_timer);
    }
}

and eventually in esp_schedule() followed by esp_suspend().

On the contrary, new version of delay(0) requires more steps with overload versions of esp_delay and esp_try_delay. In this last function, the argument passed to esp_delay (alias __esp_delay) can be modified according to recurrent scheduled functions (compute_scheduled_recurrent_grain):

bool esp_try_delay(const uint32_t start_ms, const uint32_t timeout_ms, const uint32_t intvl_ms) {
    uint32_t expired = millis() - start_ms;
    if (expired >= timeout_ms) {
        return true; // expired
    }

    // compute greatest chunked delay with respect to scheduled recurrent functions
    uint32_t grain_ms = std::gcd(intvl_ms, compute_scheduled_recurrent_grain());

    // recurrent scheduled functions will be called from esp_delay()->esp_suspend()
    esp_delay(grain_ms > 0 ?
        std::min((timeout_ms - expired), grain_ms):
        (timeout_ms - expired));

    return false; // expiration must be checked again
}

This manipulation can lead to a final call of __esp_delay with a non-zero argument. In this scenario, esp_schedule is not triggered.

The delayMicroseconds function does not yield to other tasks, so I assume it can't be a valid option.

On the other hand, yield (alias __yield) seems very similar to former delay(0)

extern "C" void __yield() {
    if (cont_can_suspend(g_pcont)) {
        esp_schedule();
        esp_suspend_within_cont();
    }
    else {
        panic();
    }
}

Sorry for the long post.

mp-se commented 1 year ago

If I have pieced together correctly, old version of delay(0) results in calling __esp_delay(0)

extern "C" void __esp_delay(unsigned long ms) {
    if (ms) {
        os_timer_setfn(&delay_timer, (os_timer_func_t*)&delay_end, 0);
        os_timer_arm(&delay_timer, ms, ONCE);
    }
    else {
        esp_schedule();
    }
    esp_suspend();
    if (ms) {
        os_timer_disarm(&delay_timer);
    }
}

and eventually in esp_schedule() followed by esp_suspend().

On the contrary, new version of delay(0) requires more steps with overload versions of esp_delay and esp_try_delay. In this last function, the argument passed to esp_delay (alias __esp_delay) can be modified according to recurrent scheduled functions (compute_scheduled_recurrent_grain):

bool esp_try_delay(const uint32_t start_ms, const uint32_t timeout_ms, const uint32_t intvl_ms) {
    uint32_t expired = millis() - start_ms;
    if (expired >= timeout_ms) {
        return true; // expired
    }

    // compute greatest chunked delay with respect to scheduled recurrent functions
    uint32_t grain_ms = std::gcd(intvl_ms, compute_scheduled_recurrent_grain());

    // recurrent scheduled functions will be called from esp_delay()->esp_suspend()
    esp_delay(grain_ms > 0 ?
        std::min((timeout_ms - expired), grain_ms):
        (timeout_ms - expired));

    return false; // expiration must be checked again
}

This manipulation can lead to a final call of __esp_delay with a non-zero argument. In this scenario, esp_schedule is not triggered.

The delayMicroseconds function does not yield to other tasks, so I assume it can't be a valid option.

On the other hand, yield (alias __yield) seems very similar to former delay(0)

extern "C" void __yield() {
    if (cont_can_suspend(g_pcont)) {
        esp_schedule();
        esp_suspend_within_cont();
    }
    else {
        panic();
    }
}

Sorry for the long post.

Great investigation, you really traced the code. The question is if this change is a bug in the new arduino framework and it will be fixed in future versions or not. I would regard this as a bug in the arduino framework since it changes the behaviour and is not backwards compatible.

256dpi commented 1 year ago

Thank you all for the detailed investigation into this issue!

I just released a new version (2.5.1) of the library that increases the delay to 1ms which hopefully results in a proper scheduler yield on all RTOS platforms.

I refrained from adding compiler conditionals to check the existence of "yield()" to keep the code simple for the time being. But, if this does not fix the issue or causes others, we can look into adding those as necessary.

256dpi commented 1 year ago

I'm closing this. Please reopen if necessary.