OTA update often fails, mainly on larger bin files

TD-er commented 2 months ago

Board

Any, mainly with larger flash sizes

Device Description

This is about OTA updates (and file uploads) failing. Larger flash sizes typically mean larger sketches are uploaded, so it happens more often on nodes with large flash. However, I do not think it is hardware related.

Hardware Configuration

See above

Version

latest master (checkout manually)

IDE Name

PlatformIO

Operating System

Windows 11

Flash frequency

Any

PSRAM enabled

yes

Upload speed

115200

Description

In this PR I already increased the timeout values back to 5 seconds, which makes OTA usable again. However, an OTA update still fails every now and then, especially on larger uploads (2 MB or more). Those updates fail on ESP32-classic, ESP32-C6 and ESP32-S3 with 8 or 16 MB flash (thus all available units with > 4 MB flash, as far as I know). However, it fails far more often on the C6 (probably because it is single core?).

When increasing these HTTP_MAX_xxxx_WAIT values to 10 or 15 seconds, OTA updates fail less often, with a noticeable improvement on the larger ones. However, I don't like to "just increase" the timeouts to whatever level 'feels' fine without knowing why it fails. So instead of just opening a PR to increase these timeouts, I created this issue :)

Sketch

Debug Message

Other Steps to Reproduce

No response

I have checked existing issues, online documentation and the Troubleshooting Guide

[X] I confirm I have checked existing issues, online documentation and Troubleshooting guide.

me-no-dev commented 2 months ago

It would be interesting to find out whether such issues exist with IDF's OTA facilities too. If not, maybe the blocking nature of Arduino is playing some role. As you know, this is hard to diagnose, even with tools like tcpdump, because of the large transfer size.

TD-er commented 2 months ago

I just talked to @tonhuisman about this and he mentioned his OTA failures are mainly on the C6. He also did most of the testing with different timeouts. Could it be that WiFi is interrupted more often on the C6 because flash writes take more time and everything runs on the same core, compared to the classic ESP32 and the S3?

TD-er commented 2 months ago

I was just thinking... Are the sectors erased while writing or do you call block erase (32k or 64k) before the actual writing to clear out the needed sectors?

Typical erase times:

Sector erase 4 kB: 65 ms (typical)
Block erase 32 kB and 64 kB: 150 ms and 240 ms (typical)
Full chip erase: 30 s (typical)

Write times:

Page program time: 0.4 ms (typical)

However, the worst-case timings are quite a bit longer (Infineon states up to 5 seconds for some chips).

So maybe a pre-erase could be useful here?
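As a rough back-of-the-envelope check of the typical figures above, the sketch below estimates the total erase time for one image when erased in 4 kB sectors versus 64 kB blocks. The 2 MB image size and the helper name `totalEraseMs` are assumptions made for illustration; the per-operation times are the typical values listed above, not measurements on any ESP32.

```cpp
#include <cstdint>

// Hypothetical helper: total erase time in ms for `imageBytes`, erased in
// units of `unitBytes` that each take `unitMs` (typical value, not worst case).
constexpr uint32_t totalEraseMs(uint32_t imageBytes, uint32_t unitBytes, uint32_t unitMs) {
    // Round up: a partially used unit still has to be erased completely.
    return ((imageBytes + unitBytes - 1u) / unitBytes) * unitMs;
}

constexpr uint32_t kImageBytes = 2u * 1024u * 1024u;  // assumed 2 MB firmware image

// 512 sectors of 4 kB at 65 ms each.
constexpr uint32_t kSectorEraseTotalMs = totalEraseMs(kImageBytes, 4u * 1024u, 65u);
// 32 blocks of 64 kB at 240 ms each.
constexpr uint32_t kBlockEraseTotalMs = totalEraseMs(kImageBytes, 64u * 1024u, 240u);
```

With these typical values, sector-by-sector erasing adds up to 33280 ms for the whole image and 64 kB block erasing to 7680 ms. Each individual erase stays far below a 5 second timeout, so only worst-case erases would be expected to stall a transfer that long.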

Jason2866 commented 2 months ago

The suggestion from me-no-dev is probably the way to find out where the underlying issue is coming from. First we need to know whether IDF behaves correctly or not.

me-no-dev commented 2 months ago

@TD-er we do not erase flash ourselves. This is done through IDF's partition API. IDF's OTA in general will do the exact same thing. The difference will come from the "Client" code that retrieves the file from the network. If the issue is only on single-core chips, then it is possible that WiFi is missing some packets to the point where TCP is not sufficient to overcome the problem. Looking at the packets at the end of transfer would be showing what is exactly happening.

TD-er commented 2 months ago

> looking at the packets at the end of transfer would be showing what is exactly happening

Not entirely sure what you mean here... "At the end", as in the moment when the OTA fails? And I assume you mean looking at some Wireshark session?

vortigont commented 2 months ago

> Are the sectors erased while writing or do you call block erase (32k or 64k) before the actual writing to clear out the needed sectors?

Writing is done per sector; erasing is done one block ahead on reaching each new block boundary.

IMHO we need to look at how the POST callbacks are handled, i.e. whether pages are fully written or partial sector writes happen because unaligned batches of data arrive from TCP packets in multiples of the MSS. I did some tests in this area when I implemented the esp32-flashz lib; there, writes are coalesced into 32k chunks of the gzip inflator buffer. I did not have any noticeable issues that I remember from that time (pre-C6 era). Maybe the OTA sketches could be optimized with some kind of sliding buffer that accumulates incoming data and does page-aligned writes.
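A minimal sketch of such an accumulating buffer, assuming nothing about how arduino-esp32 or esp32-flashz actually implement it: `AlignedWriteBuffer` and its `Sink` callback are hypothetical names, and the sink stands in for whatever performs the flash write (in a sketch that might be a wrapper around `Update.write()`). Incoming chunks of any size are collected and handed on only as full blocks, plus one final partial block on `flush()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper, not part of arduino-esp32.
class AlignedWriteBuffer {
public:
    // Receives one block of data; returns false if the write failed.
    using Sink = std::function<bool(const uint8_t* data, size_t len)>;

    AlignedWriteBuffer(size_t blockSize, Sink sink)
        : buf_(blockSize), sink_(std::move(sink)) {
        assert(blockSize > 0 && sink_);
    }

    // Append one incoming chunk; forwards a block each time the buffer fills.
    bool write(const uint8_t* data, size_t len) {
        while (len > 0) {
            const size_t n = std::min(len, buf_.size() - used_);
            std::memcpy(buf_.data() + used_, data, n);
            used_ += n;
            data += n;
            len -= n;
            if (used_ == buf_.size() && !emit()) return false;
        }
        return true;
    }

    // Call once at the end of the upload to forward the final partial block.
    bool flush() { return used_ == 0 || emit(); }

private:
    bool emit() {
        const bool ok = sink_(buf_.data(), used_);
        used_ = 0;
        return ok;
    }

    std::vector<uint8_t> buf_;  // one block worth of pending data
    Sink sink_;
    size_t used_ = 0;           // number of valid bytes in buf_
};
```

Fed with ten 1460-byte chunks and a block size of 4096, the sink sees three 4096-byte writes followed by one 2312-byte write on `flush()`. Whether this actually reduces OTA failures is untested here; it only shows the buffering idea.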

TD-er commented 2 months ago

For another project I have very recently been looking into timing aspects (and lifetime) of the flash chip types that are also used on ESP boards, and I came across this document written by Infineon: Understanding Typical and Maximum Program/Erase Performance

It states that the chips they tested could have a max. sector erase time of 5 seconds, and it seems entirely possible for these maximums to occur. See the bottom of page 2 in the PDF:

> Given this data, application engineers attempt to answer the following question:
>
> In this example, do all cells of the device take the maximum time to program/erase after one million cycles?
>
> In short, the answer to this question is no. Based on experiments, an absolute worst-case program specification is calculated to have approximately 10% of the words programming at the maximum time, while the other 90% program at the typical rate (also assuming 90° C, VCC = 1.65 volts after 1,000,000 cycles, in the case of the example from Table 1).

So maybe, regardless of implementation, the current timeout of 5 seconds is simply too short, as the erase alone can sometimes take longer.

N.B. this is what Infineon found; I do not know whether other vendors allow for longer max. erase times or whether this is defined in some kind of flash specification that all vendors should comply with.
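To put a number on that concern, the sketch below applies the quoted 10% / 90% split to sector erases. This is an assumption made purely for illustration: Infineon states the split for word programming, not for erasing, and `weightedEraseMs` is a hypothetical helper. It estimates how long erasing the 16 sectors of one 64 kB region could take on a heavily worn chip, using the 65 ms typical and 5 s maximum sector erase times mentioned in this thread.

```cpp
#include <cstdint>

// Hypothetical helper: erase time in ms for `sectors` sectors when
// `percentAtMax` percent of them take `maxMs` and the rest take `typicalMs`.
constexpr uint32_t weightedEraseMs(uint32_t sectors, uint32_t typicalMs,
                                   uint32_t maxMs, uint32_t percentAtMax) {
    return sectors * (percentAtMax * maxMs + (100u - percentAtMax) * typicalMs) / 100u;
}

constexpr uint32_t kTimeoutMs = 5000u;       // current timeout discussed in this issue
constexpr uint32_t kSectorsPerRegion = 16u;  // 64 kB / 4 kB

// Assumed worst case: 10% of the sectors at 5000 ms, 90% at 65 ms.
constexpr uint32_t kWorstCaseMs = weightedEraseMs(kSectorsPerRegion, 65u, 5000u, 10u);

static_assert(kWorstCaseMs > kTimeoutMs,
              "under these assumptions the erase alone exceeds the timeout");
```

Under these assumptions the estimate comes to 8936 ms, compared with 1040 ms when every sector erases at the typical rate. A fresh chip at room temperature would be expected to sit much closer to the typical figure.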

sblantipodi commented 2 months ago

Testing it on the S2: increasing the timeouts does not help here.

espressif / arduino-esp32