Open TD-er opened 2 months ago
It would be interesting to find out whether such issues exist with IDF's OTA facilities too. If not, maybe the blocking nature of Arduino is playing some role. As you know, this is hard to diagnose, even with things like tcpdump, because of the large transfer size.
I just talked to @tonhuisman about this and he mentioned his OTA failures are mainly on the C6. He did also do most of the testing with different timeouts. Could it be that the WiFi may be interrupted more often on the C6 due to flash writes taking more time and it all running on the same core compared to the classic ESP32 and the S3?
I was just thinking... Are the sectors erased while writing or do you call block erase (32k or 64k) before the actual writing to clear out the needed sectors?
Typical erase times:
Write times:
However, the worst-case timings are quite a bit longer (Infineon states up to 5 seconds for some chips).
So maybe a pre-erase could be useful here?
Probably the suggestion from me-no-dev is the way to find out where the underlying issue is coming from. First we need to know whether IDF behaves correctly or not.
@TD-er we do not erase flash ourselves. This is done through IDF's partition API. IDF's OTA in general will do the exact same thing. The difference will come with the "Client" code that retrieves the file from the network. If the issue is only on single-core chips, then it is possible that WiFi is missing some packets to the point where TCP is not sufficient to overcome the problem. Looking at the packets at the end of the transfer would show exactly what is happening.
looking at the packets at the end of transfer would be showing what is exactly happening
Not entirely sure what you mean here... "at the end" like the moment when the OTA fails? And I assume you mean some Wireshark session to look at?
Are the sectors erased while writing or do you call block erase (32k or 64k) before the actual writing to clear out the needed sectors?
Writing is done per sector; erasing is done one block ahead on reaching each new block boundary.
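To make that ordering concrete, here is a minimal sketch of one possible reading of "erase one block ahead". It only plans the sequence of operations and does not touch real flash. The names `FlashOp` and `planEraseAhead` and the 4k/64k sizes are assumptions for illustration, not the actual IDF or Arduino implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sizes for illustration; real values depend on the flash chip
// and on how the partition API is configured.
constexpr std::size_t kSectorSize = 4096;       // write granularity in this sketch
constexpr std::size_t kBlockSize  = 64 * 1024;  // erase granularity in this sketch

// One planned flash operation (recorded, not executed).
struct FlashOp {
    enum Kind { Erase, Write } kind;
    std::size_t offset;
    std::size_t length;
};

// Plans the operations for writing `total` bytes starting at offset 0.
// Block 0 is erased up front. Each time the writer reaches a block boundary,
// the following block is erased before writing continues, so the erase always
// stays one block ahead of the sector writes.
std::vector<FlashOp> planEraseAhead(std::size_t total) {
    assert(kBlockSize % kSectorSize == 0);
    std::vector<FlashOp> ops;
    if (total == 0) return ops;

    ops.push_back({FlashOp::Erase, 0, kBlockSize});
    for (std::size_t off = 0; off < total; off += kSectorSize) {
        if (off % kBlockSize == 0) {
            const std::size_t next = off + kBlockSize;
            if (next < total) ops.push_back({FlashOp::Erase, next, kBlockSize});
        }
        const std::size_t len = (total - off < kSectorSize) ? total - off : kSectorSize;
        ops.push_back({FlashOp::Write, off, len});
    }
    return ops;
}
```

In this reading, a slow erase stalls the writer exactly when it crosses a block boundary, which is where a long pause in reading from the socket would be expected to show up.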
IMHO we need to look at how callbacks with POST are handled, i.e. whether pages are fully written or partial sector writes happen due to unaligned batches of data arriving from TCP packets in multiples of the MSS. I did some tests in this area when I implemented the esp32-flashz lib; there, writes are coalesced into 32k chunks of the gzip inflator buffer. I did not have any noticeable issues that I remember from that time (pre-C6 era). Maybe OTA sketches could be optimized to have some kind of sliding buffer to accumulate incoming data and do aligned page writes.
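A rough sketch of such an accumulating buffer, to show the idea only. The class name `AlignedWriteBuffer`, the `Sink` callback, and the page size passed in are all hypothetical; this is not how esp32-flashz or the Arduino Update code is actually written, and the sink would have to be wired to whatever flash write call the sketch uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Accumulates arbitrarily sized chunks (e.g. TCP payloads) and hands only
// full, page-aligned buffers to the sink. The last partial page is written
// by finish().
class AlignedWriteBuffer {
public:
    // Hypothetical sink: receives the destination offset, the data, and its length.
    using Sink = std::function<void(std::size_t offset, const std::uint8_t* data, std::size_t len)>;

    AlignedWriteBuffer(std::size_t pageSize, Sink sink)
        : pageSize_(pageSize), sink_(std::move(sink)) {
        assert(pageSize_ > 0);
        pending_.reserve(pageSize_);
    }

    // Feed a chunk as it arrives from the network (any length).
    void feed(const std::uint8_t* data, std::size_t len) {
        while (len > 0) {
            const std::size_t room = pageSize_ - pending_.size();
            const std::size_t take = len < room ? len : room;
            pending_.insert(pending_.end(), data, data + take);
            data += take;
            len -= take;
            if (pending_.size() == pageSize_) flushPending();
        }
    }

    // Call once at the end of the transfer to write the final partial page, if any.
    void finish() {
        if (!pending_.empty()) flushPending();
    }

private:
    void flushPending() {
        sink_(offset_, pending_.data(), pending_.size());
        offset_ += pending_.size();
        pending_.clear();
    }

    std::size_t pageSize_;
    Sink sink_;
    std::size_t offset_ = 0;
    std::vector<std::uint8_t> pending_;
};
```

With this in place, every write except the last one starts on a page boundary and covers a whole page, no matter how the incoming packets are sized. The cost is one page of RAM and an extra copy per byte.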
For some other project I have very recently been looking into timing aspects (and lifetime) of the flash chip types which are also used on ESP boards and I came across this document written by Infineon: Understanding Typical and Maximum Program/Erase Performance
In it, it is stated that the chips they tested could have a max. sector erase time of 5 seconds. It seems absolutely possible that these maximums could occur. See the bottom of page 2 in the PDF:
Given this data, application engineers attempt to answer the following question:
In this example, do all cells of the device take the maximum time to program/erase after one million cycles?
- In short, the answer to this question is no. Based on experiments, an absolute worst-case program specification is calculated to have approximately 10% of the words programming at the maximum time, while the other 90% program at the typical rate (also assuming 90° C, VCC = 1.65 volts after 1,000,000 cycles, in the case of the example from Table 1).
So maybe regardless of implementation, the current timeout of 5 seconds could simply be too short as the erase alone can sometimes take longer.
N.B. this is what Infineon found, I do not know if other vendors may allow for longer max. erase times or whether it is defined in some kind of flash specification all should comply with.
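As a back-of-the-envelope aid, the 10% / 90% split from the quoted passage can be turned into a simple weighted estimate. This is only a sketch: the function name `estimateTotalSeconds` is made up, and the typical and maximum times have to come from the datasheet of the actual flash chip.

```cpp
#include <cassert>

// Estimated total time for `count` operations when a fraction `slowFraction`
// of them take `maxSeconds` and the rest take `typicalSeconds`.
// For the worst case described in the quote, slowFraction would be 0.10.
double estimateTotalSeconds(unsigned count, double typicalSeconds,
                            double maxSeconds, double slowFraction) {
    assert(slowFraction >= 0.0 && slowFraction <= 1.0);
    return count * (slowFraction * maxSeconds +
                    (1.0 - slowFraction) * typicalSeconds);
}
```

Note that this only gives an average over many operations. A single operation hitting the maximum would still block for the full maximum time, which is the case that matters for a fixed timeout.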
Testing it on the S2, increasing the timeouts does not help here.
Board
Any, mainly with larger flash sizes
Device Description
This is about OTA updates (and file uploads) failing. With larger flash sizes, you typically upload larger sketches, so it happens more often on nodes with large flash. However, I do not think it is hardware related.
Hardware Configuration
See above
Version
latest master (checkout manually)
IDE Name
PlatformIO
Operating System
Windows 11
Flash frequency
Any
PSRAM enabled
yes
Upload speed
115200
Description
In this PR I already increased the timeout values back to 5 seconds, which makes it usable again. However, an OTA update still fails every now and then, especially on larger (2MB or more) uploads. Those updates fail on ESP32-classic, ESP32-C6 and -S3 with 8 or 16M flash (thus all available units with > 4M flash, as far as I know). However, this fails far more often on the C6 (probably because it is a single core?).
When increasing these HTTP_MAX_xxxx_WAIT values to 10 or 15 seconds, these OTA updates fail less often, with a noticeable improvement on the larger ones. However, I don't like to "just increase" the timeouts to whatever level may 'feel' fine without knowing why it fails. So instead of just opening a PR to increase these timeouts, I created this issue :)
Sketch
Debug Message
Other Steps to Reproduce
No response
I have checked existing issues, online documentation and the Troubleshooting Guide