arduino / ArduinoCore-mbed

345 stars 199 forks source link

TCP/IP subtly broken on Opta (and possibly other platforms) #978

Open fogzot opened 5 days ago

fogzot commented 5 days ago

Some changes done after 4.1.3 introduced a very subtle bug that garbles the data sent over TCP/IP, for example using HttpClient. When using the following code:

    httpClient->beginRequest();
    httpClient->get("/opta/002E003D3532511134313632/cfg");
    httpClient->sendHeader("X-DNDG-Stamp", String(84814));
    httpClient->sendHeader("X-DNDG-SerialNumber", "002E003D3532511134313632");
    httpClient->sendHeader("X-DNDG-Applicationname", "ParkinGreen");
    httpClient->sendHeader("X-DNDG-FirmwareVersion", String(2));
    httpClient->sendHeader("X-DNDG-PublicKey", "2nNuFMzeh+ixOWo8bD4qUTpIN4KRf1O+zcFXJaSTKkcShr/lJdgVTxugFiH1SBEAe0caSUb3jw1Cs1o1PhmtgA==");
    httpClient->sendHeader("X-DNDG-Signature", "vKFYrGgFdOP9KRQa7eYHEEa0DFl1tnIfIsYV/eFnCAh2U/JUjYDkUQspE6dUTkiPx+yM//sb5WV2ybp7kGlVzA==");
    httpClient->endRequest();

    Serial.print("Result: ");
    Serial.println(httpClient->responseStatusCode());

    httpClient->stop();

sometimes on the server side we receive partially corrupted content:

X-DNDG-Stamp: 84814
X-DNDG-SerialNumber: 002E003D3532511134313632
X-DNDG-FirmwareVersion2: 
X-DNDG-ApplicationName: ParkinGreen
X-DNDG-PublicKey: 2nNuFMzeh+ixOWo8bD4qUTpIN4KRf1O+zcFXJaSTKkcShr/lJdgVTxugFiH1SBEAe0caSUb3jw1Cs1o1PhmtgA==
X-DNDG-Signature: vKFYrGgFdOP9KRQa7eYHEEa0DFl1tnIfIsYV/eFnCAh2U/JUjYDkUQspE6dUTkiPx+yM//sb5WV2ybp7kGlVzA==

Note the X-DNDG-FirmwareVersion2: line, that should have been X-DNDG-FirmwareVersion: 2. This happens more frequently when using an encrypted connection but also happens when using a plain EthernetClient (so it is not caused by HttpClient). Also, this happens more frequently when sending "large" payloads (> 100 bytes) in a short period of time.

We think this broke when the sockes writes were switched from blocking to non-blocking to address the "ethernet disconnected" problem without explicitly checking for the EAGAIN result to correctly retry and serialize the writes.

JAndrassy commented 5 days ago

could you try this changes https://github.com/arduino/ArduinoCore-mbed/pull/912/files ?

pennam commented 5 days ago

looks strange to me that https://github.com/arduino/ArduinoCore-mbed/commit/f7d9db986ecd9e7323b2839dc61b9eab22239e22 can cause this results. From your log all data reaches the server but is scrambled in a weird way....

Did you try to revert https://github.com/arduino/ArduinoCore-mbed/commit/f7d9db986ecd9e7323b2839dc61b9eab22239e22 ? Do you have a sketch to reproduce the issue?

fogzot commented 5 days ago

looks strange to me that f7d9db9 can cause this results. From your log all data reaches the server but is scrambled in a weird way....

Sometime, other times some parts are just missing.

The code in the change you linked still has a sock->set_blocking(true); but the code that has the problem is missing that line:

https://github.com/arduino/ArduinoCore-mbed/blob/103d4abc261fda45956112425df66f22d6d69481/libraries/SocketWrapper/src/MbedClient.cpp#L208-L216

I don't know if that's related to the problem but not checking for specific non-blocking error codes seems suspect to me.

Did you try to revert f7d9db9 ?

No, I just tried older and older cores until I found one that didn't show the problem.

Do you have a sketch to reproduce the issue?

Sure. You will have to setup some kind of server (HTTPS is better because the problem is more evident) and dump everything received (I use Nginx to terminate the SSL and forward the connection to a netcat):

#include <Arduino.h>
#include <Ethernet.h>
#include <EthernetSSLClient.h>
#include <ArduinoHttpClient.h>

EthernetSSLClient eth;
HttpClient http = HttpClient(eth, "iot.dndg.io", 443);

void setup()
{
    Ethernet.begin();
}

void loop()
{
    delay(2000);
    getConfiguration(&http);
}

void getConfiguration(HttpClient *httpClient)
{
    httpClient->beginRequest();
    httpClient->get("/opta/002E003D3532511134313632/cfg");
    httpClient->sendHeader("X-DNDG-Stamp", String(84814));
    httpClient->sendHeader("X-DNDG-SerialNumber", "002E003D3532511134313632");
    httpClient->sendHeader("X-DNDG-Applicationname", "ParkinGreen");
    httpClient->sendHeader("X-DNDG-FirmwareVersion", String(2));
    httpClient->sendHeader("X-DNDG-PublicKey", "2nNuFMzeh+ixOWo8bD4qUTpIN4KRf1O+zcFXJaSTKkcShr/lJdgVTxugFiH1SBEAe0caSUb3jw1Cs1o1PhmtgA==");
    httpClient->sendHeader("X-DNDG-Signature", "vKFYrGgFdOP9KRQa7eYHEEa0DFl1tnIfIsYV/eFnCAh2U/JUjYDkUQspE6dUTkiPx+yM//sb5WV2ybp7kGlVzA==");
    httpClient->endRequest();

    Serial.print("Result: ");
    Serial.println(httpClient->responseStatusCode());

    httpClient->stop();
}
fogzot commented 5 days ago

could you try this changes https://github.com/arduino/ArduinoCore-mbed/pull/912/files ?

Sorry but I am in no condition to setup an mbed development environment and build the core from scratch right now.

JAndrassy commented 5 days ago

those changes are in the Arduino library

pennam commented 5 days ago

The code in the change you linked still has a sock->set_blocking(true); but the code that has the problem is missing that line:

got it. I've removed the set_blocking(true); call because immediatly aftersock->set_timeout(SOCKET_TIMEOUT); is called vanishing the effect of set_blocking(true);

https://github.com/arduino/mbed-os/blob/e959ae057b31849f779904ce97bfabc68658f53e/connectivity/netsocket/source/InternetSocket.cpp#L189-L206

I will try to setup a test and reproduce the issue

pennam commented 5 days ago

Is /opta/002E003D3532511134313632/cfg a file? which is its size? Are you reading it each time or just making the request? If you don't read it the issue cloud be related to this https://github.com/arduino/ArduinoCore-mbed/issues/937#issuecomment-2388153161

fogzot commented 4 days ago

Is /opta/002E003D3532511134313632/cfg a file? which is its size? Are you reading it each time or just making the request? If you don't read it the issue cloud be related to this #937 (comment)

The result of the get is a 200 bytes JSON file (generated on the fly by the server). When we received malformed headers the server just send back a "400 malformed request" error and in both cases (OK or error) we always consume the full body of the response. I don't think the problem is related to how we use the HTTP library because it sometimes happens on the first request after resetting the Opta. Also, we did some tests using a plain EthernetClient and we get the same garbled headers on the server. Here is the code:

    EthernetSSLClient eth;

    eth.connectSSL("iot.dndg.io", 443);

    size_t w = eth.println("GET /opta/002E003D3532511134313632/cfg HTTP/1.1");
    w += eth.println("User-Agent: Finder Opta");
    w += eth.println("Host: iot.dndg.io");
    w += eth.println("Connection: close");
    w += eth.println("X-DNDG-Stamp: 84814");
    w += eth.println("X-DNDG-SerialNumber: 002E003D3532511134313632");
    w += eth.println("X-DNDG-Applicationname: ParkinGreen");
    w += eth.println("X-DNDG-FirmwareVersion: 2");
    w += eth.println("X-DNDG-PublicKey: 2nNuFMzeh+ixOWo8bD4qUTpIN4KRf1O+zcFXJaSTKkcShr/lJdgVTxugFiH1SBEAe0caSUb3jw1Cs1o1PhmtgA==");
    w += eth.println("X-DNDG-Signature: vKFYrGgFdOP9KRQa7eYHEEa0DFl1tnIfIsYV/eFnCAh2U/JUjYDkUQspE6dUTkiPx+yM//sb5WV2ybp7kGlVzA==");
    w += eth.println();
    eth.flush();
    Serial.print("Write size is ");
    Serial.println(w);
    // Sometimes it writes 356, sometimes 462.
pennam commented 4 days ago

Yesterday I've made some test with the EthernetClient but I was not able to reproduce it. I will try again today doing multiple resets and adding the check on the write size.

pennam commented 1 day ago

@fogzot can you give a spin to this PR https://github.com/arduino/ArduinoCore-mbed/pull/983 ?

It includes an mbed-os patch but i've also included the libmbed.a rebuild commit

Thanks

jbfraser1 commented 1 day ago

Pretty sure I was experiencing this as well. Don't have easy access to the server to see what was actually happening, but five out of five opta's updated to 4.1.6 last week started failing about 50% of https POST calls, with server returning an error indicating the headers or data payload were not in-spec. (http 415 in my case.)

One device connected to a different network started experiencing >90% failure, POSTing the same payload. Success or failure seems random.

Downgrading to 4.1.3 gets at least one Opta back to 100% success. I'll try others in the next 24 hours.

fogzot commented 20 hours ago

@fogzot can you give a spin to this PR #983 ?

It includes an mbed-os patch but i've also included the libmbed.a rebuild commit

Can you tell me how am I supposed to test the PR? I don't have a clear idea about how to replace the current core as downloaded via arduino-cli. It is possible to just clone the repository and use it to build a test program from a sketch?

pennam commented 19 hours ago

Yes, you can do it following the instructions in the README https://github.com/arduino/ArduinoCore-mbed?tab=readme-ov-file#installation

After you should see an additional core listed in the IDE or arduino cli

$ arduino-cli core list
ID                       Installato Ultimo                   Nome
arduino-git:mbed         9.9.9      9.9.9                    Arduino Mbed OS Boards

You can use this commit https://github.com/arduino/ArduinoCore-mbed/commit/944572ffddc45a43b5ff2043d80c2bf378d2a1bc since I've force pushed the PR to get the patch merged