Investigate breaking the max sketch size limit

devyte commented 4 years ago

Basic Infos

[x] This issue complies with the issue POLICY doc.
[x] I have read the documentation at readthedocs and the issue is not addressed there.
[x] I have tested that the issue is present in current master branch (aka latest git).
[x] I have searched the issue tracker for a similar issue.
[x] I have filled out all fields below.

Platform

Hardware: all
Core Version: all relevant (2.5+)
Development Env: all
Operating System: all

Problem Description

This issue is long term.

The binary size resulting from a sketch build is currently limited to 1MB. This is a hardware limitation in the ESP because it is the max size of code mem that can be mapped from the flash.

Not only is the limitation set to 1MB, but the binary can't span a 1MB address boundary. In our case, our binary is built from two pieces: the bootloader and the sketch itself. The bootloader is loaded to address 0x0, and the sketch is loaded 4K after that (at the time of this writing), which means that our limit is actually 1MB-4KB.

The binary can actually be flashed anywhere, as long as it is fully contained within a 1MB section of the address space, so 0 to 1MB-1, 1MB to 2MB-1, etc. This works because there is a base address register that is loaded, which defines the execution space mapping.
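To make the boundary rule concrete, here is a minimal host-side sketch of the arithmetic. The helper names (`windowIndex`, `fitsInOneWindow`, `maxImageSize`) are made up for illustration and are not part of the core; the only inputs taken from the description above are the 1MB window size and the 4K sketch offset.

```cpp
#include <cstdint>

// Size of the flash window that can be mapped for code execution (1MB, per the text above).
constexpr uint32_t kWindowSize = 0x100000;

// Index of the 1MB window that contains a given flash address.
constexpr uint32_t windowIndex(uint32_t flashAddr) {
    return flashAddr / kWindowSize;
}

// True if an image of `size` bytes flashed at `start` lies entirely inside one 1MB window.
constexpr bool fitsInOneWindow(uint32_t start, uint32_t size) {
    return size > 0 && size <= kWindowSize &&
           windowIndex(start) == windowIndex(start + size - 1);
}

// Largest image that can be flashed at `start` without crossing the next 1MB boundary.
constexpr uint32_t maxImageSize(uint32_t start) {
    return kWindowSize - (start % kWindowSize);
}

// A sketch placed 4K after the bootloader is limited to 1MB-4KB.
static_assert(maxImageSize(0x1000) == 0x100000 - 0x1000, "sketch at 4K: limit is 1MB-4KB");
```

The same check passes for an image that starts exactly on a 1MB boundary and fills the whole window, which is the "1MB to 2MB-1" case.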

This issue is meant to track ideas that could work around the limitations. Some current wild ideas:

1. Investigate compiling with overlays. In theory, the linker supports overlays. I don't know what this means, or if it applies to our hardware, but it's the first thing that comes to mind when dealing with a program that is larger than available memory.
2. Investigate compiler support for bank switching. Compilers are known to support bank switching for RAM. Maybe it is possible to do the same for code space. If not, maybe there is something that can be done upstream on the gcc side.
3. Investigate multiple binaries. It may be possible to build multiple binaries, and have special mechanisms for calling functions in a different binary than the current one. Possibilities here include N-way calling (any binary to any binary), or hierarchical calling (one binary would be like the master and would call functions in other binaries, which could call into other lower binaries, etc.). In the simplest implementation, the top would be like a master with multiple isolated stand-alone slave binaries.

3a. Wrapper functions: Wrapper functions could switch the register to the base address of the binary to access, call the relevant function, then switch the base address back before returning. At that point execution would continue normally.

3b. GCC instrumentation hooks: GCC supposedly allows hooking in code that runs before a function gets called and code that runs after it returns, and that is configurable. I don't know the specifics.

3c. Manual switching: Place the onus of switching on the user, i.e.: the user needs to switch the base address, call a function, and switch back.

In all three cases above, some compiler/linker dark magic would be needed to have the function addresses of a different binary available.
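As a rough illustration of the wrapper idea in 3a, the sketch below simulates the base address register with a plain variable so it can run on a host. Everything here is hypothetical: `g_flashBase`, `setFlashBase`, `BankGuard`, and `sensorAverage` are invented names, the 0x100000 base is just the second 1MB window, and a real implementation would also have to live in IRAM and deal with the instruction cache.

```cpp
#include <cstdint>

// HYPOTHETICAL stand-in for the flash-mapping base register. On real hardware
// this would be a register write, not a global variable.
static uint32_t g_flashBase = 0;

static void setFlashBase(uint32_t base) { g_flashBase = base; }
static uint32_t getFlashBase() { return g_flashBase; }

// RAII guard: maps another binary's 1MB window on construction and restores
// the previous mapping on destruction, even on early return.
class BankGuard {
public:
    explicit BankGuard(uint32_t newBase) : saved_(getFlashBase()) { setFlashBase(newBase); }
    ~BankGuard() { setFlashBase(saved_); }
    BankGuard(const BankGuard&) = delete;
    BankGuard& operator=(const BankGuard&) = delete;
private:
    uint32_t saved_;
};

// Pretend this function lives in the second binary: it only "works" while
// that binary's window (assumed base 0x100000) is mapped.
static int sensorAverage(int a, int b) {
    return (getFlashBase() == 0x100000) ? (a + b) / 2 : -1;
}

// The wrapper from 3a: switch, call, switch back.
int callSensorAverage(int a, int b) {
    BankGuard guard(0x100000);
    return sensorAverage(a, b);
}
```

Calling `callSensorAverage(2, 4)` yields 3 and leaves the simulated base at 0 afterwards, while calling `sensorAverage` directly without the guard yields -1. The open problem noted above remains: the caller's binary still needs to know the callee's address.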

4. Multiple applications. The bootloader boots the sketch binary, and then the sketch is oblivious of the bootloader. It may be possible to have multiple applications that "boot each other", where each application is limited to 1MB. I think here the global RAM state would be lost when changing between them, but maybe that can be finessed somehow. Example:
- app1 measures a bunch of sensors and logs into FS. Once per day, it boots into app2.
- app2 looks at FS, starts wifi, connects to whatever, and uploads the contents of FS. Then it boots back to app1.

TD-er commented 4 years ago

I think the bootloader should be able to handle a "fallback" sketch, which we can set from our main sketch. Such a fallback sketch could also be combined with the multi-stage approach you already mentioned: the first sketch booted inspects some variables (RTC memory flags?) and then decides whether to boot another sketch.
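A minimal sketch of that decision logic, assuming a small request record kept in RTC memory. The record layout, the magic value, the slot numbering, and all function names are hypothetical; RTC memory is simulated with a global so the logic can be run on a host.

```cpp
#include <cstdint>

// HYPOTHETICAL boot request record that would be kept in RTC memory.
struct BootRequest {
    uint32_t magic;  // marks the record as valid
    uint32_t slot;   // which sketch to boot next
};

constexpr uint32_t kBootMagic = 0xB007F1A6;  // arbitrary marker, an assumption
constexpr uint32_t kSlotCount = 2;           // e.g. slot 0 = main sketch, slot 1 = special purpose
constexpr uint32_t kFallbackSlot = 0;

// Simulated RTC memory; on hardware this would survive a soft reset.
static BootRequest g_rtc = {0, 0};

// Called by the running sketch before it resets into another sketch.
void requestBoot(uint32_t slot) {
    g_rtc = BootRequest{kBootMagic, slot};
}

// Called early at boot. The request is consumed (cleared), so if the
// special-purpose sketch crashes or times out, the next boot falls back.
uint32_t selectBootSlot() {
    const BootRequest req = g_rtc;
    g_rtc = BootRequest{0, 0};
    if (req.magic != kBootMagic || req.slot >= kSlotCount) {
        return kFallbackSlot;
    }
    return req.slot;
}
```

Clearing the request on read is one possible design choice: a one-shot request means a stuck special-purpose sketch cannot be re-entered forever without the main sketch asking for it again.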

Multiple sketches also allow special purpose parts of the code to be moved into different sketches, for example WiFi setup, OTA, etc. These do not really need to be included in the main sketch for most purposes, as long as the "special purpose sketch" has a proper timeout and returns to the normal sketch.

One thing I don't see mentioned about the sketch size limit is handling of flash strings. I don't see why these have to be in the sketch. Why can't they be stored somewhere else in the flash and called from there? This doesn't have to be in a filesystem location, but it can be somewhere else. The same could be applicable for static data, which is somewhat similar to flashstrings. (fonts for displays etc.)

Maybe something similar to the "pre-cache" discussed in another issue here can also be used to collect program code from outside the 1M limit into memory?

devyte commented 4 years ago

handling of flash strings

Because strings used from code get compiled into the binary, from where they get accessed by instructions like any other array or variable. What you're describing, i.e.: having the strings elsewhere, is akin to using files. You can do it, but you, as the user, have to put them there, and create a read/write strategy. Think of putting all strings in a file: only you know which string starts where. In fact, you could put all of your strings in files on the FS, e.g.: one string per file, and get them via (short) filename. I saw that done somewhere for http error strings, i.e.: map error codes to error string.
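To illustrate the "only you know which string starts where" point, here is a small host-side sketch of a hand-maintained string table. The blob stands in for a region of flash or a file outside the sketch; `kBlob`, `kIndex`, `StringEntry`, and `lookupString` are illustrative names, and the offsets are maintained by hand, which is exactly the burden described above.

```cpp
#include <cstdint>
#include <string>

// Simulated storage outside the sketch: strings packed back to back,
// separated by NUL bytes.
static const char kBlob[] =
    "Not Found\0Internal Server Error\0Service Unavailable";

// The user-maintained index: which code maps to which offset and length.
struct StringEntry {
    uint16_t code;
    uint16_t offset;
    uint16_t length;
};

static const StringEntry kIndex[] = {
    {404, 0, 9},    // "Not Found"
    {500, 10, 21},  // "Internal Server Error"
    {503, 32, 19},  // "Service Unavailable"
};

// Returns the string for `code`, or an empty string if the code is unknown.
std::string lookupString(uint16_t code) {
    for (const StringEntry& e : kIndex) {
        if (e.code == code) {
            return std::string(kBlob + e.offset, e.length);
        }
    }
    return std::string();
}
```

On a real device the `kBlob + e.offset` read would become a flash or file read at that offset, and the index itself still has to live somewhere the code can reach.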

devyte commented 4 years ago

I've been investigating this a bit, and everything seems to point to Point 1 in my OP as the solution for supporting bins > 1MB: building with overlays plus an overlay manager.

At a glance, it appears to be possible to make the whole thing almost transparent to the coder. The "almost" is the user having to choose which overlay their functions go into, which I think would be done with function decorators much like the current ICACHE_RAM_ATTR/FLASH etc.

Given that in our architecture loading an overlay just means loading a register with the base address of the needed flash 1MB segment, I think it is reasonable to use overlays of the entire mappable address space of 1MB (i.e.: swapping an overlay is very little cost). I don't know if it's even viable to implement overlay sections of a different size than 1MB. I suspect it may be possible, but not worth it. Given that the current board with biggest flash size is the wemos d1 pro with 16MB, I think the worst case to reasonably consider would be 16 overlays of 1MB. I can't even begin to conceive an application that would require such a size, but... well, worst case.

The magic for specifying the overlay sections seems to happen within the ld linker scripts. From looking at the linker manual below, it seems that there are two symbols that get defined for every overlay, and those symbols need to be implemented as functions that do the actual switching. In other archs that would mean copying from storage to the limited mem. In our case it should just be changing the base address register for the flash mapping. It doesn't seem alien-level complex, but I'm afraid it's voodoo beyond my current level or time availability.
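For what the switching side might look like, here is a toy overlay manager that can be run on a host. It assumes whole-window overlays of 1MB and at most 16 of them, as discussed above; the class, its members, and the idea of tracking a switch count are all hypothetical, and the "register write" is just a member variable.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kOverlaySize = 0x100000;  // one overlay = one 1MB window
constexpr uint32_t kMaxOverlays = 16;        // worst case for a 16MB flash

// Toy overlay manager: "loading" an overlay only changes the base address.
class OverlayManager {
public:
    // Makes sure `overlay` is the mapped one. Returns true if a switch was
    // needed, false if it was already mapped.
    bool ensureMapped(uint32_t overlay) {
        assert(overlay < kMaxOverlays);
        if (overlay == current_) {
            return false;
        }
        current_ = overlay;
        base_ = overlay * kOverlaySize;  // stand-in for the base register write
        ++switches_;
        return true;
    }

    uint32_t current() const { return current_; }
    uint32_t base() const { return base_; }
    uint32_t switchCount() const { return switches_; }

private:
    uint32_t current_ = 0;
    uint32_t base_ = 0;
    uint32_t switches_ = 0;
};
```

Skipping the switch when the overlay is already mapped keeps calls within the same overlay free, so only cross-overlay calls pay for the register write and whatever cache handling the hardware needs.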

Some resources I found (unrelated to the ESP):
- https://en.wikipedia.org/wiki/Overlay_(programming)
- https://ftp.gnu.org/old-gnu/Manuals/ld-2.9.1/html_node/ld_22.html
- https://forums.parallax.com/discussion/163970/overlay-code-with-gcc
- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4234.html
- https://codingrelic.geekhold.com/2011/01/overlays-not-yet-extinct.html

This is about as far as I can take the idea. If anyone reading this knows about building with overlays, or has interest in pursuing further, help would be appreciated.

TD-er commented 4 years ago

Given that the current board with biggest flash size is the wemos d1 pro with 16MB, I think the worst case to reasonably consider would be 16 overlays of 1MB. I can't even begin to conceive an application that would require such a size, but... well, worst case.

Did you find anything on max. OTA size? Can these large binaries be upgraded via OTA?

devyte commented 4 years ago

Obviously the empty space area would need to be big enough to receive the OTA bin. The above worst case is for no OTA and no FS.

TD-er commented 4 years ago

That's clear for sure, but I was wondering if there could be any other issue handling such large binaries.

devyte commented 4 years ago

I don't think so, nothing comes to mind. In theory the Updater doesn't care about size except to know where to start writing the chunks, and eboot also shouldn't care about size during the copy process. That's as long as we're dealing with a single binary.

earlephilhower commented 4 years ago

Bank switching on micros isn't normally an all-or-nothing thing (i.e. you swap out 8K of ROM in your address space vs. the entire address space). Some sort of thunk layer in IRAM would be needed to bounce between banks, or you'd get some serious weirdness when the code at the PC you were executing was just swapped out. Not having some nice "common" area makes it more difficult.

I-cache would also need to be invalidated, obviously, on a bank swap.

We'd need to ensure that any calls/returns to the blob from Espressif always had a bank swap involved or that we have 2 copies of the blob code at the exact same spot in both banks.

If they could get Epyx Summer Games on the Atari 2600, I'm sure we can do this, too. But I fear it's gonna be a real pain...

earlephilhower commented 4 years ago

https://github.com/raburton/rboot

Contains code to handle swapping the cache between segments.

dirkmueller commented 4 years ago

For OTA I am using a 'temporary updater' sketch that is loaded by the primary sketch. This temporary sketch then installs the full sketch from SPIFFS and boots it.

Using this two-step update I can OTA sketches > 512 KB.
