esp8266 / Arduino

ESP8266 core for Arduino
GNU Lesser General Public License v2.1
15.98k stars 13.33k forks source link

httpUpdate (ssl client) sometimes hangs #3908

Closed whyameye closed 6 years ago

whyameye commented 6 years ago

Basic Infos

Hardware

Hardware: ESP-12F Core Version: master commit 7b09ae5f69f8dccb5152d128c8484ec9a18adb11 Nov 2017

Description

Using ESPHTTPUpdate sometimes the update succeeds, sometimes it fails and reports back failure (OK) but worst is that sometimes it just hangs with a partial download, never returning from the update call. No watchdog reset. The only way to get the uC back is to manually reset it.

Digging into this with issue reports etc. I got lost quickly: is this master using lwip or lwip2? If it is using lwip should I try switching to lwip2 and if so how? Is there some sort of timeout or watchdog reset that could happen in this failure case and if so is there an easy way for me to trigger it? I saw a timeout set on the http connect to 8000ms (can't remember where I saw that) but it doesn't appear to be working. The server closes the connection with a partial download and the ESP8266 just sits there, no reset, no nothing, not executing my code not returning from the ESPhttpUpdate.update call.

Problem description

Settings in IDE

Module: Node 1.0 Flash Size: 4MB/1MB SPIFFS CPU Frequency: 160Mhz Flash Mode: DIO

d-a-v commented 6 years ago

Master uses lwip2 by default. It is selectable in menus (tools>lwIP variant: v2 or the 3 others (=v1.4)). You can try with "Prebuilt" which was the previous default lwIP variant.

whyameye commented 6 years ago

switching to "Prebuilt" solved this so I'll go w/ that for now. If I can help debug lwip2 I'm glad to. I'm under the gun right now but by the end of the week I should have some time to help solve this.

devyte commented 6 years ago

@whyameye please provide a MCVE sketch to reproduce the issue.

d-a-v commented 6 years ago

@whyameye Is it still relevant using latest master ?

whyameye commented 6 years ago

I'm traveling and don't have a device with me to test. I can test next week. Sorry for all the delays.

On Mon, Jan 8, 2018 at 4:47 AM, david gauchard notifications@github.com wrote:

@whyameye https://github.com/whyameye Is it still relevant using latest master ?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/esp8266/Arduino/issues/3908#issuecomment-355934005, or mute the thread https://github.com/notifications/unsubscribe-auth/ABoy5sya5R05vdVmyl8vcrLHDZDhMHWcks5tIfIogaJpZM4Q1AfM .

whyameye commented 6 years ago

using commit 8b87491700c3f666f92c2283c68bfd770e868a7c this does not seem to be an issue anymore for v2 MSS 536 but is still an issue for v2 MSS 1460. Since 1460 is already marked "unstable" maybe this isn't a huge deal now. I could still make an MCVE if you need.

devyte commented 6 years ago

Please provide an mcve sketch, it always helps.

whyameye commented 6 years ago

This test code is essentially identical to the supplied httpUpdate.ino example except that I added credentials for an https ssl/tls download on a server I run. The update downloaded by this test code is insignificant code that should be ignored.

The test code works fine with lwiP v1.4, using either "compile from source" or "higher bandwidth" options. With lwiP v2.0 both "higher bandwidth" and "Lower Memory" options fail with message HTTP_UPDATE_FAILD Error (6): Update error: ERROR[6]: Stream Read Timeout or they just hang indefinitely.

The test code was compiled using master branch commit bf5a0f24dc66f7a1f81cd4f53fcf1fdca50394c1 dated February 8.

d-a-v commented 6 years ago

@whyameye Can you re-upload your MCVE sketch, the above link is currently broken.

whyameye commented 6 years ago

sorry about that. Link is fixed.

d-a-v commented 6 years ago

Without sketch modification apart from AP, url and fingerprint,

So I don't have the same error as yours, but there's something wrong at the final updater stage. I will also try #4054.

Anyway, this shows that you must release a lot of ram before going to update through SSL. You should also be using the OOM debug option.

d-a-v commented 6 years ago

@igrr, I tried to download a ~350KB file (a firmware) using the example StreamHTTPClient.ino in which instead of printing data I print "total received" at each loop, with

http.begin("myserver",myport,"/updateTest.bin","","myfingerprint");

and I ran into Alert: invalid mac (and more, at the end). I could not see this message in the updater sketch above because of :oom with all debug messages enabled.

I reckon I'm far from understanding all ssl/tls internals.

@earlephilhower Should I also try with BearSSL ?


[...]

:rdi 133, 124
State:  receiving Certificate (11)
=== CERTIFICATE ISSUED TO ===
Common Name (CN):               myserver
Organization (O):               <Not Part Of Certificate>
Basic Constraints:              critical, CA:FALSE, pathlen:10000
Key Usage:                      critical, Digital Signature, Key Encipherment
Subject Alt Name:               myserver 
=== CERTIFICATE ISSUED BY ===
Common Name (CN):               Let's Encrypt Authority X3
Organization (O):               Let's Encrypt
Country (C):                    US
Not Before:                     Thu Mar  1 03:42:08 2018
Not After:                      Wed May 30 03:42:08 2018
RSA bitsize:                    4096
Sig Type:                       SHA256
=== CERTIFICATE ISSUED TO ===
Common Name (CN):               Let's Encrypt Authority X3
Organization (O):               Let's Encrypt
Country (C):                    US
Basic Constraints:              critical, CA:TRUE, pathlen:0
Key Usage:                      critical, Digital Signature, Key Cert Sign, CRL Sign
=== CERTIFICATE ISSUED BY ===
Common Name (CN):               DST Root CA X3
Organization (O):               Digital Signature Trust Co.
Not Before:                     Thu Mar 17 16:40:46 2016
Not After:                      Wed Mar 17 16:40:46 2021
RSA bitsize:                    2048
Sig Type:                       SHA256
:rd 5, 133, 124

[...]

339120 316544
339120 316672
339120 316800
339120 316928
339120 317056
339120 317184
339120 317312
:rd 5, 2144, 433
:rdi 103, 5
ssl->need_bytes=16448 > 16379
:rd 1706, 2144, 438
:rdi 98, 98
:c 98, 536, 2144
:rdi 536, 536
:c 536, 536, 1608
:rdi 536, 536
:c 536, 536, 1072
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 536, 536, 0
:rdi 536, 536
:c0 536, 536
:rn 536
:rd 270, 536, 0
:rdi 536, 270
Error: invalid mac
:wr 69 69 0
:wrc 69 69 0
Alert: bad record mac
:wr 69 69 0
:wrc 69 69 0
Alert: close notify
:rch 536, 536
:rch 1072, 536
:rch 1608, 536
:sent 138

[sits there, no more output]
earlephilhower commented 6 years ago

"mac" in this case is a id/sequence portion at the end(IIRC) of the TLS fragment, not the more common Ethernet address. Once one goes bad you're pretty much hosed on that connection, and indicates corruption either in transport or in RAM buffers. "close notify" means the SSL connection is dead even if the underlying TCP/IP socket is still viable.

@d-a-v I've not looked at the httpUpdate class at all, but BearSSL's at a state where is should be possible to give it a go if you really want. Because BearSSL is stricter the code needs to apply the fingerprint after object creation, before you connect. Or for testing just add a "client.setInsecure()" and it'll accept any certificate. The BearSSL fingerprint validator also takes the 20-byte fingerprint, not a string representation of hex digits, so it's slightly different there as well.

earlephilhower commented 6 years ago

Just the following hack to ESP8266HTTPClient.cpp and running the script @ filimin.xyz seems to work fine w/BearSSL with v2-high-BW. The code downloads from the https server and reboots as expected (it looks like the downloaded BIN simply connects to the HTTP at the same spot and re-downloads the same BIN but it prints out a "Firmware update loading from update URL http://..." line instead of the silence in the given .ino).

So it's OOM w/AXTLS somewhere, as you were seeing, @d-a-v . Not sure what the remediation should be.

class TLSTraits : public TransportTraits
{
public:
    TLSTraits(const String& fingerprint) :
        _fingerprint(fingerprint)
    {
    }

    std::unique_ptr<WiFiClient> create() override
    {
        WiFiClientBearSSL *client = new WiFiClientBearSSL();
        client->setInsecure(); // I'm lazy here, could parse _fingerprint and pass to cilent->setFingerprint(uint8_t[20])
        return std::unique_ptr<WiFiClient>(client);
    }

    bool verify(WiFiClient& client, const char* host) override
    {
        return true;
    }

protected:
    String _fingerprint;
};
d-a-v commented 6 years ago

@earlephilhower thanks. It works well with all three versions of lwip. I only tested with the example StreamHTTPClient.ino, not the uploader.

And I take this opportunity to thank you for all the ongoing BearSSL work. I don't know the plans for merging, it looks quite well polished :)

It seems to take more RAM, but if I understand well, your couple ::supportMaxFragmentLen() / ::setBufferSizes will help with that, can they do the job automatically ?

@whyameye Currently we are using the goodold axTLS. I get the same behavior as yours, sometimes it works, sometimes not. I tried to dig a bit with no luck, and luckily we have this huge work with BearSSL coming soon to us.

whyameye commented 6 years ago

I appreciate all the work on this. I'm excited for BearSSL! :+1:

It seemed like we were having OOM issues with axTLS and now I read BearSSL might take more memory but does also solve the issues we are seeing? I'm probably misunderstanding something.

earlephilhower commented 6 years ago

Maybe I'm missing something, but if I understand your comment @d-a-v then this is an OOM condition w/axtls, no?

If the RAM needs between 1456 and 576 MSS should only be 2-3KB different in this use case, and the larger MSS consistently fails, that means the smaller one is right on the edge anyway and heap fragmentation could make the difference between pass and fail, no?

If that's the case, is there anything we can really do? Unfortunately, axTLS doesn't check memory allocation success, and the effort needed to put it in seems large. Plus, you'd just end up with this failing and printing "OOM in axTLS" anyway...not success.

Even if BearSSL takes a little more memory, because the memory is allocated on connection start it either works, or you don't get an SSL connection. No fragmentation, predictable performance. The constant 4.5KB 2nd stack can even be hidden in "unused" SYS space to make it about equal as compared to axTLS.

Setbuffersizes lets you shrink the recv buffer if you know what you're doing, and lets the xmit buffer be as small as 512bytes+fluff. The MaxFragmentLen TLS extension lets you negotiate w/remote servers to legally shrink the recv buffer down to 512b-4kb, but until OpenSSL 1.1.1 is adopted by folks it's not going to be supported. Even w/o this, if you know individual packets are limited in size, you can shrink the buffers (i.e. MQTT where you say you won't accept messages >4KB in length which is enormous for that purpose, or HTTP RANGE GET requests).

earlephilhower commented 6 years ago

Your code will work as-is with the BearSSL commit just pushed as long as you change the fingerprint into a uint8_t[20] array (i.e. {0x4c, 0x90, ...});.

When I get the time I'll add real x509 certificate validation for this (just adding methods w/right signature to pass it down to the BearSSL object). SHA fingerprints are not secure and don't guarantee you're talking to who you think you are...

[HTTP-Client][begin] host: filimin.xyz port: 443 url: /updateTest.bin BearSSL-httpsFingerprint: 4c 90 ca eb c4 83 ab d7 3b 73 4d 97 29 8b 2c 24 d1 75 a6 4f
[HTTP-Client] connected to filimin.xyz:443
[HTTP-Client] sending request header
-----
GET /updateTest.bin HTTP/1.0
Host: filimin.xyz
User-Agent: ESP8266-http-Update
Connection: close
x-ESP8266-STA-MAC: 60:01:94:22:B2:F1
x-ESP8266-AP-MAC: 62:01:94:22:B2:F1
x-ESP8266-free-space: 2736128
x-ESP8266-sketch-size: 409536
x-ESP8266-sketch-md5: 928ac1a2270c7f1efa6394e5a4853508
x-ESP8266-chip-size: 4194304
x-ESP8266-sdk-version: 2.2.1(cfd48f3)
x-ESP8266-mode: sketch

-----
[HTTP-Client][handleHeaderResponse] RX: 'HTTP/1.1 200 OK'
[HTTP-Client][handleHeaderResponse] RX: 'Date: Fri, 09 Mar 2018 20:33:25 GMT'
[HTTP-Client][handleHeaderResponse] RX: 'Server: Apache/2.4.27 (Ubuntu)'
[HTTP-Client][handleHeaderResponse] RX: 'Last-Modified: Sun, 28 Jan 2018 20:06:35 GMT'
[HTTP-Client][handleHeaderResponse] RX: 'ETag: "43fd0-563dbaaa1083f"'
[HTTP-Client][handleHeaderResponse] RX: 'Accept-Ranges: bytes'
[HTTP-Client][handleHeaderResponse] RX: 'Content-Length: 278480'
[HTTP-Client][handleHeaderResponse] RX: 'Connection: close'
[HTTP-Client][handleHeaderResponse] RX: 'Content-Type: application/octet-stream'
[HTTP-Client][handleHeaderResponse] RX: ''
[HTTP-Client][handleHeaderResponse] code: 200
[HTTP-Client][handleHeaderResponse] size: 278480
sleep disable
pm open,type:0 0
[HTTP-Client][end] tcp is closed
[HTTP-Client][end] tcp is closed
HTTP_UPDATE_OK
d-a-v commented 6 years ago

About the OOM, yes it is unclear. Like @whyameye reported, axTLS and this big download sketch sometimes works and sometimes not. When It doesn't, It can be an OOM, or it can be a "tls-mac" problem. When it works, free-heap printed at each loop during the transfer is larger than with bearSSL. (edit: regardless lwip version)

earlephilhower commented 6 years ago

Are we measuring heap during operation, or just between axTLS calls? I've also not done extensive testing (gotta re-upload the original sketch after each successful reprogramming), so if it's really intermittent I may not have caught it.

My guess is that even though it starts with more free space on axTLS, due to there being mallocs() in the main decryption loop and fragment operations, the real needed space is > than available in some instances. With BearSSL, the design goal of the original author was "allocate mem once, before connection" so only stack space is used during normal operation. Not saying one is better than the other, but (assuming no TCP memory needs, etc.) if you had 0 bytes of heap available w/BSSL it would still work, whereas axTLS would crash hard.

How often is the failure? If I get, say, 10 successes in a row would that point to BSSL being "good," or is it only something like a 1 in 100 chance of failure?

d-a-v commented 6 years ago

In my tests I did measure the heap in the StreamHTTPClient.ino's main loop, that I used instead of the referenced updater, after I found that the problem occurred during download and was not particularly linked to the flashing process itself.

Sorry I was not clear enough: BearSSL was always working in my tests ! - I did not run it 100 times in a row yet, though I'm eager to get time and run my tests Thanks for the details. There's one more I'm curious about, asked in #4273.

Failures I described above occurred with axTLS.

whyameye commented 6 years ago

@earlephilhower using your PR #4273 and changing one line in your hack above from WiFiClientBearSSL *client = new WiFiClientBearSSL(); to BearSSL::WiFiClientSecure *client = new BearSSL::WiFiClientSecure();

I was able to get httpUpdate to work with BearSSL. I'm wondering though the "right" way to do this with the PR and using a certificate authority to validate a chain of certificates specifically for the update. (CA validation works well with BearSSL fetching URLs :+1: )

Thanks in advance for your help.

earlephilhower commented 6 years ago

@whyameye The hack above is no longer needed, there is a BearSSL enabled fingerprint validator already in the PR. Just pass in a uint8_t[20] array (i.e. parse the hex string into binary values) and it will make a BearSSL connection and use that fingerprint to validate things. For the httpUpdater, that's as good a validation as can be done with the current API.

Once the BearSSL PR is merged, we can see about using certs/etc. in the httpUpdater, but for now given the fact that it's already at 18K+ lines and 70 files changed, just doing the fingerprint validation (i.e. keeping httpUpdater API) is the best we can do.

devyte commented 6 years ago

BearSSL is merged in #4273 , with alternate BearSSL::WiFi* classes. Although axtls-based classes are still available and even the default, they are planned for deprecation and then retirement, hence won't be fixed. Any issues with BearSSL-based classes should be reported in new issues. Also, the bearssl version is verified to work in this case. Closing.