devyte commented 7 years ago

Hardware

Hardware: all Core Version: git / 2.4-rc2

Problem Description

Summary

This issue is meant to track a RAM optimization effort. The goal is to increase available RAM as much as possible.

Details

The following are initial ideas to investigate.

Move all const char strings to flash, and use the the access macros to access them. Implement alternate methods/functions with the correct argument signatures where applicable. This covers both the core as well as the Arduino lib code.
Move const declarations to flash. Example: CHARS_PER_LINE in libb64/cencode.c.
Revise all singleton object declarations. Make sure that all are wrapped by #defines that allow compile-time configuration to not use the global singletons. Example: Use of the Serial, but not Serial1, frees up 256+ bytes of heap.
Revise all uses of constant size char [] arrays. Arrays like that should not be declared on the stack, but rather created dynamically, and only on demand. Example: 2K buffer in ESP8266Webserver. -Revise all uses of String. When concatenating, the internal managed array will grow, sometimes far beyond what is needed. It is sometimes better to reserve, if there is an estimate of how much space will be needed.
Revise object implementations. Make sure objects are passed by reference or pointer, and never by copy.

Pablo2048 commented 7 years ago

Actually I suggest to modify debug messages also to store text into PROGMEM. I've write macro for this reason: `#ifdef DEBUG_ESP_PORT

define DEBUG_MSG(_1, ...) DEBUG_ESP_PORT.printf_P(PSTR(_1), ##__VA_ARGS__)

else

define DEBUG_MSG(...)

endif`

d-a-v commented 7 years ago

There is this macro USE_OPTIMIZE_PRINTF in osapi.h which, if defined, makes all os_printf's fmt stored in flash.

Pablo2048 commented 7 years ago

Yes, but IMHO this is not used in debug messages inside libraries - for example look into ESP8266WiFiGeneric.h macro DEBUG_WIFI_GENERIC. Every debug macro inside library uses DEBUG_ESP_PORT.printf() so it uses Serial, or Serial1 object, not os_printf...

earlephilhower commented 7 years ago

You probably already know this, but the following command lines are very useful for finding stuff in RODATA that's eating up RAM:

For finding constant strings: objdump  -s -j .rodata /tmp/arduino_build_<code>/*.elf

For finding symbols(aka variables): readelf -a /tmp/arduino_build_<code>/*.elf| grep 3f

Some wasteful things that stick out for me:

ctype: 256 byte constant lookup table that appears in most apps using any string ops
dns_table: 1.1KB to hold DNS entries for only 4 hostnames. This needs LWIP recompiled with a smaller #define'd hostname length.

d-a-v commented 7 years ago

@earlephilhower about _dnstable, what would be reasonable? 4 hosts, 64 chars each ? (check here and there, knowing that lwip/256chars is RFC-compliant)

earlephilhower commented 7 years ago

@d-a-v well, that's a good question. I do know the RFC says 256 octets, but practically on a tiny ESP8266 you're likely only initiating connections to 1 or 2 servers with 32-64 byte hostnames at any time.

With no changes to code, I like your 64-byte x 4 suggestion. If small LWIP patches can be entertained, then I'd either submit something that only stored hashes of the hostname (and thus only requiring 64-bytes for any length hostname) or did a dynamic allocation like @devyte is considering.

AFAIK there are already memory allocations in LWIP, so this second option may not break any contracts (but I've only spent a little bit going through the code so could be wrong here).

And if we're going heroic on it, valid DNS hostnames only allow something like 40 valid characters so you could even do a packed 6-bit encoding to save 25% of any space without any compromise on lengths.

devyte commented 7 years ago

@earlephilhower thank you for that readelf/objdump suggestions, I've actually done that, but only once before and a looong time ago. It didn't even cross my mind to apply it here. Any further suggestions along those lines are welcome.

devyte commented 7 years ago

I seem to remember that at the nodemcu firmware repo they pulled some tricks in a 2-nd pass linker step, or post-link, or something along those lines. In essence, they post-processed the built binary. Does anyone know the details, and whether a similar approach would make sense here?

devyte commented 7 years ago

@d-a-v what do you think of compiling lwip2 with C++? I migrated a big C project (>1M lines of code) to C++ a few years back, and it wasn't too painful. There's a lot of potential for tweaks inside lwip, but doing them in C-world makes for fragile code (I speak from experience).

d-a-v commented 7 years ago

@earlephilhower updating dns names from 256 to 64 chars can be done at no cost (and I guess no harm), this is what lwipopts is intended for. But playing hero and changing string encoding to pack names is I think, going to kill lwip a little more than it already is because everything is subject to bugs and lwip is actually maintained elsewhere. My thinking is that the little esp82jewel already has enough bugs to cope with. I started to try and debug lwip once, like many others before me, but here, it is not the place to do that. That's why I tried, with #3362, to have a working-maintained-unmodified lwip. This implementation downloads and compiles original lwip with not any patch. Thus lwip bugs/improvements should be tracked there and not here.

So @devyte, we can try and compile lwip with C++, but I think we should submit all patches to lwip-maintainers, not trying to maintain them here.

I too have a proposal for fun: #3362 is an abstraction of lwip's netif (the link-layer) for any tcp stack on top of it. If you know about a smaller/lighter ip/tcp implementation (better ram+flash footprint), we can try and use it - WiFi* classes would however have to use this new stack.

earlephilhower commented 7 years ago

@d-a-v I saw you working on this when I was doing the HTTPS Server, but it didn't click that you were using an unmodified LWIP build. That sounds great!

I'd like to take a deeper look (should I ever get some time) at the LWIP repos and see if I can submit a PR that does at least the dynamic allocation of hostnames. That itself should be an almost trivial change and reduce memory dramatically in most cases (DNS for www.yahoo.com would take 16-bytes for the string vs. 256 w/a fully RFC-compliant configuration or even 64 with the reduced name length one). On LWIP init allocate 1-byte "\0" string pointers for the hostname. Then on any assignment w/the hostname on the LHS just do a "free(hostname[x]), hostname[x] = strdup(searchname)"...

@devyte - One additional tweak to the command lines I use can give you a sorted list by symbol size: readelf -a /tmp/arduino_build_*/*.elf| grep 3ff | sort -n -k 3

d-a-v commented 7 years ago

@earlephilhower it seems a good idea. Asking them if they would adopt such a patch would be the first harmless thing to do ? In the meantime what would all recommend for the static table ? 3 entries 48 chars each ? 4 entries 64 chars each ? (so I update defaults at least in lwip2)

d-a-v commented 7 years ago

@devyte in your migration process, did you just make your C project C++ compliant without further changing data structures ?

devyte commented 7 years ago

@d-a-v The (simplified) outline of what I did was as follows:

renamed all source files from .c to .cpp to get the C++ compiler to kick in
fix all of the C++ reserved keywords that happened to be used in C
remove all of the extern "C" wrappers
fix all of the symbol name clashes (globals, function names)
other remaining details, which were rather few

I didn't change any data structures., except due to reserved keywords.

earlephilhower commented 7 years ago

I think that'd kind of defeat @d-a-v 's hope of just pulling LWIP and compiling straight from the sources, no?

Are you just trying to get a deeper linting? Maybe a "-Wall -Wpedantic" to the boards.txt GCC command line would get you where you want to be without code changes and file renames? I'm not sure what running g++ instead of gcc is going to give other than stricter warnings...

dioguerra commented 7 years ago

I detected that the biggest HEAP consumption happened when moving from 2.3.0 to 2.4.0-RC1

With the same project i had 34020B Free RAM to just 25680B and 28344B (2.4.0-rc2)

And with ArduinoJson and SecureConnection my heap goes bye bye

devyte commented 7 years ago

@earlephilhower the C++ compiler is stricter, and there are differences in how code is optimized. In our project, speed went up by 5-7%, and out of about 200+ bugs, 110 were found straight up due to the C++ compiler complaining about something weird in the code. Of course, that was just one project, and it's entirely possible that the same bugs could have been found in other ways, but the experience was quite sobering overall. For the record, I do NOT expect to maintain patches here just for compiling with C++. Like @d-a-v said, any code changes for that should get submitted upstream to the lwip repo. What we gain out of the box is scopes and easier C++ hooks, in case we need them. BTW, I am not pushing for this as a must-have, but I do think that some experimentation is worth it.

penfold42 commented 7 years ago

@MrD2 have you tried the readelf to see where the heap went ?

dioguerra commented 7 years ago

@penfold42 I shall check

d-a-v commented 6 years ago

I would like to mention here #3978. @earlephilhower suggested a very cheap update for a huge win, the goo'old -DNDEBUG. Thanks

d-a-v commented 6 years ago

About dns host name storage in lwIP:

well, that's a good question. I do know the RFC says 256 octets, but practically on a tiny ESP8266 you're likely only initiating connections to 1 or 2 servers with 32-64 byte hostnames at any time.

It seems that the esp is actually dealing with long hostnames (issue)

earlephilhower commented 6 years ago

I'm always willing to be swayed by real data on the DNS stuff! One day I'll get around to adding the MEMPOOL_DNS and configuration options and submitting, but it's probably going to be a bit. If it's got to be 256, I guess we're going to have bite the bullet. Maybe only 2 or 3 and not 4 cached entries, though, instead of 4?

While we're undoing changes, I'd also suggest undoing the MSS shrink I think was done (to 576 or whatever the min TCP size is) as it wreaks havoc on streaming applications. An MP3 or AAC frame fits in 1 packet with normal ~1400 MSS...breaking each packet into 3 smaller ones kills real-time performance.

d-a-v commented 6 years ago

@earlephilhower

about DNS: why not try 3*128 ?

about MSS: I've had a hard time with lwip2 and 1460. My crash-tests just fail with 1460. I have not figured out why yet, I'm suspecting the link layer... But 768 is OK though which makes 2 packets instead of 3 for 1400 bytes. I'll try a higher one, I'll try to make it live configurable, or making it easy to change via menus?. You still can recompile and test with 1460.

devyte commented 6 years ago

@d-a-v about MSS, out of sheer curiosity: what happens with only slightly lower values, like 1459? 1448?

d-a-v commented 6 years ago

The tests I'm running are simple parallel tcp-echo services. With 536/768 I can run many of them for hours. With 1460 a single one only run for seconds before failing. I will bissect eventually. With lwip1 parallel tests fail whatever mss I use.

earlephilhower commented 6 years ago

@d-a-v Thanks for looking into this.

For DNS entries, that looks good but unfortunately I can't say if that's enough for AWS and Azure. Digging around at Amazon EC2, it seems like even hitting 70 characters will be hard (so 128 would be fine), but that's from my math and not any testing. Azure public hostnames are a mystery to me...

WRT MSS, I'll need to dig some more. Isn't the V1.x LWIP running with ~1500 now? The GIT revision I've got, for some reason, dies very quickly if I try the LWiP 2.0 option but I haven't had time to dig into it. Too many things in motion makes it hard to figure out my own (I hope!) memory corruption (mostly heap exhaustion or stack overflow), so I've kind of ignored v2.0 for now...

d-a-v commented 6 years ago

I spoke too fast. I recently introduced a lwip2 bug in my repo while fixing another. MSS=1460 is OK. It takes a lot of ram though. Ram-people or parallel-connections-people might like 536, while latency-people need 1460. The esp did not like it when I tried to make mss live-configurable.

We could have precompiled selectable libs (536 .. 1460), or selectable sizes live recompiled from menu (same maner as v1.4 opensource), or simply a cmdline walkthrough.

earlephilhower commented 6 years ago

@d-a-v How can we get your latest LWIP2 changes? Is it simply a matter of pulling the Arduino repo? I don't think I saw anything related to LWIP in a while in my pulls.

One other memory reduction suggestion: The SD library has a static, 512b entry for the last read sector. Even if you don't use anything at all in the SD library, if a single library has a #include of the file (in a class that's never instantiated in my case), you get that static buffer allocated on the heap. Turning that into a dynamic allocation on SD->begin() and making the static variable a simple * would save 512-4 = 508 bytes.

d-a-v commented 6 years ago

@earlephilhower lwip2 is a submodule which is precompiled. In order to try its master, you can:

cd tools/sdk/lwip2
make  # straightforward way to clone the repo (for now)
cd builder 
# submodule checks a previous version out
git checkout master
# find TCP_MSS and try another value, 1460 for your MP3 libs
vi glue-lwip/arduino/lwipopts.h
cd ..
make

then rebuild your sketch.

It has to be noted that I always worked with MSS=536 and lwip2 is currently using it.

I recently made tests with 1460 and it is not 100% stable. For testing this I shake the esp with 3 concurrent long-run tcp-echo tests along with 2 repeated short-run tests. This test has so far proven to be stable with MSS up to 896. Last try lasted ~15 hours, 18GBytes and 3.4 millions of connections (avg 2.8Mbits/s).

It is not stable with 1024 (data corruption), and not at all stable with 1460 (tcp eventually stopping). The reason is unknown yet, and of course the underlying closed-source link layer is suspected :) There is this unused lib tools/sdk/lib/liblwip_536.a which purpose is unclear apart from providing an implementation with 536. Since 1460 is default MSS for lwip-1.4 and most of sketches are running with it, you can of course try lwip2 with 1460.

@devyte and I suggest to propose 3 precompiled versions of lwip2, with 896 (default, stable, better perfs), 536 (stable, small footprint) and 1460 (for streaming latency, but instable) selectable in the current lwip ~flavour~ variant menu.

d-a-v commented 6 years ago

@earlephilhower could you please try latest git with lwip2 mss=1460, your suggested NDEBUG option and see if this all works with your low latency streaming requirements ? All is now selectable in menus

earlephilhower commented 6 years ago

@d-a-v Got it. Will check tomorrow and update. Thx!

earlephilhower commented 6 years ago

Happy New Year, all.

@d-a-v : I've done some testing this morning and here's what I've got: . Menus are very handy, thanks. . LWIP v2 @ 536 + NDEBUG gives ~1K less memory than LWIP V1.4 @ MSS1460, and not seeing any mem leaks (yay!) . LWIP v2 @ 536 + NDEBUG seems stable, but unusable for streaming as it breaks up horribly. Needs further investigation as to whether it's the TCP stack eating up way more time in each call or if it's buffer underflowing. . LWIP v2 @ 1460 + NDEBUG gives ~2K less than v1.4, pushing mp3 decoding into a spot where it's hard to say if any hangs/crashes are due to LWIP, the OS, or my own code hitting out of memory. . LWIP v2 @ 1460, when running, has no hiccups or data corruption I can detect (granted, I'm not doing a hard check here). . v2 @ 1460 has run for 6 minutes of audio as I write this with my app's free memory varying between 1K(!) and 2.5K. As long as no add'l connections are attempted by the ESP8266 or web requests sent to it, it seems as stable as v1.4 AFAICT.

Would it be worthwhile for me to build @ 896 and give it a go?

Thx -EFP3

d-a-v commented 6 years ago

896 has proven to be very stable (as stable as 536), where 1024 is not. With 896 you need two packets to make 1400 bytes, where you need 3 with 536. So yes, it's worth a try since you'll get more RAM left.

cd tools/sdk/lwip2
make download
vi builder/Makefile.arduino

change 536 to 896 (twice)

make install

You'll still see 536 in menu though, but you will have 896 instead.

earlephilhower commented 6 years ago

896 is definitely an improvement over 536, but it goes from about a 1/2 duty cycle of play/stutter to a 2/3 duty cycle which unfortunately still isn't usable. Memory usage seems to have gone up ~0.5K (not scientific as the measurement fluctuates during runtime).

d-a-v commented 6 years ago

So the conclusion is that v2@1460 works but takes a little more ram than v1@1460. I had to increase back dns cache name length to (256->48->)128 since 48 was too small for some IOT clouds. We are back to the OT's subject which is to improve available heap. I'll check once again lwipopt.h if there are other things we can modify. Feel free to report back if you find something.

earlephilhower commented 6 years ago

Additional dynamic heap savings could be had by rewriting the sprintf_P() function to actually use th PROGMEM format string directly from ROM, and not doing a malloc(), strcpy_P, then free(). Just ran into that in my own app where the index.html string is ~1.5K and the heap was too fragmented to find enough space. I ended up writing my own mysprintf_P(), although the workaround could also have been to sprintf_P() line-by-line.

This would also allow for %S == PSTR string formats, since you're probably stuck rewriting the whole vsprintf() function anyway.

d-a-v commented 6 years ago

I was thinking that os_printf_plus() indifferently allows char pointers to be in ram or flash. I did not check if there was a vsnprintf_plus() too.

I have a class made to behave like a char* but which is able to automatically handle data whether it is in ram, flash (or eeprom on avr or wherever data could be). By using it there is no need to make any distinction on data location. Memory footprint is at most a pointer size (32 bits on esp, 16 bits on avr), it cannot be used as an array (char by char loops are needed), it is not fast, but it helps a lot making simple code without needing to copy/transfer data or_P duplicate_P all_P functions_P. I was wondering if it could be useful for ssl libraries (lots of large flash->ram copies there) or, since we are now at rewriting vsnprintf() - which I did for this class - if it could help ESP saving the precious heap. link to ustr

If we do rewrite vsnprintf() please don't forget %i which I miss a lot.

devyte commented 6 years ago

String seems to currently always allocate up to 16 bytes more than requested, even when using reserve(): The impact is that an application that uses many String objects will waste a lot of heap. The calls to reserve() and changeBuffer() pass a size value as argument, which never takes into account the null term of a string (i.e.: strlen()). The only reason the code works is because of the extra allocated space. The correct behavior should be: -reserve() resizes to the requested size+1 to make the buffer tight (that's half of the point of the method) -other internal calls to reserve() or changeBuffer() should pass size+some extra value to handle the buffer growing

earlephilhower commented 6 years ago

@devyte About the String.resize(), isn't it just trying be allocating to the internal umm blocksize, and adding space for \0 in doing so?

IIRC, though, UMM is running on 8-byte chunks, not 16, so a better line would be: size_t newSize = (maxStrLen + 1 /* space for terminating \0 */ + 7) & (~0x7); see the umm readme.mdfile:

Each memory block holds 8 bytes, and there are up to 32767 blocks available, for about 256K of heap space. If that's not enough, you can always add more data bytes to the body of the memory block at the expense of free block size overhead.

There's no savings in allocating less than an 8-byte chunk since the remaining bytes are lost, anyway,

@d-a-v There are so many changes in 2.4 I can't be sure, but I think that they're going w/newlib's vs_sprintf code and not os_vsprintf. I'd rather not (poorly) roll my own sprintf() if we can take a nice piece of proven code and apply a minor tweak.

devyte commented 6 years ago

@earlephilhower the null term is implicit in how the calculation works. I suppose that's ok, given that if:

I reserve 1-15 bytes, the allocation is done for 16 bytes
I reserve 16-31 bytes, the allocation is done for 32 bytes So the null is silently added. It's the implicitness that rubbed me wrong. But alright.

About the 16 bytes at a time. I don't see anything about a block size in umm_malloc_cfg.h. A quick glance at README does show that a block has 8 bytes.

So, I guess the test here would be to reduce 16 to 8 in changeBuffer(), like you said.

earlephilhower commented 6 years ago

I've just sent 4 small pull requests to @igrr 's newlib repo which take care of the memory issues I noted above. Most apps should see around 500 bytes add'l static heap, and if the printf() patch is merged they'll need no add'l dynamic memory for the printf_P()s (actually, there's no need for printf_P() anyway).

@devyte 's note of String is not involved, since that's an Arduino core thing.

phoddie commented 6 years ago

Perhaps this is naive, but it looks possible to save 192 bytes by allowing dead stripping interrupt_handlers when attachInterrupt is unused by deferring ISR installation. Here's a rough attempt.

steminabox commented 6 years ago

I think this is the biggest issue in the 8266 library today.. I found this thread when a relatively small program of mine on the 8266 started acting strangely - yep, memory problems.. The problem? The ardunio library (2.4.1) is taking up too many resources! A compile of an 'empty' program is taking up .data 1276 .text 29147 .rodata 1944 .bss 29168

so going from https://tech.scargill.net/esp8266-ram/ .data + .rodata+.bss = 32388 which leaves about 50K free memory - which is what I get when I check. The problems is another limit - the .text section at 29147 is close to its 32k limit - I think is what killed my program (it started going flaky once .text was bigger than 32k...)

Am I an idiot or is this library now so big it doesn't leave scope for anyone to do anything with it that isn't trivial?

steminabox commented 6 years ago

.. and doesn't that .text somehow end up in ram too?

devyte commented 6 years ago

@phoddie all ideas are open for discussion, especially if somebody already has preliminary results. Can you make a pr with that rough attempt, so that the changes can easily be discussed?

devyte commented 6 years ago

@steminabox please don't confuse heap RAM with IRAM. Yes, the core libs have grown, because a lot of functionality has been added. Most recently, floating point support for printing to string. Some things have been movednto IRAM, some things to flash. However, the ESPs resources are very limited, so it's still very tight. Although the ESP has 80KB heap total, a lot of that is taken up by Espressif's sdk for the wifi, and then by lwip for the tcp/ip stack. When I picked up this core, an empty sketch showed just over 40KB of free heap. Now we're at 50KB, and there are still ideas to try for squeezing out more. With one idea currently being discussed, I think this firmware will be the one that offers the most free heap with an empty application. About IRAM, the ESP has 32KB available, and there is much code that currently can't be executed directly from flash. There are ideas to provide flexibility for where some code is put (flash/IRAM), and other ideas for relaxing some restrictions (ISRs). There is also the possibility of improving things with better optimizations by a newer gcc. If this is an issue for you, and/or you find it interesting, I suggest you research, pay attention to the repo discussions, inspect the repo code and binaries that result from the build, and then chime in if you have an idea for improvement, or if there is a task that you think you could handle. New ideas and contribution efforts are always welcome.

steminabox commented 6 years ago

yep, I've been reading more about it - and IRAM is obviously the problem. Once .text hits 31204 the program is stuffed... So free heap is almost irrelevant. With .text at with a completely empty program (and no optional libraries attached) the ardunio library has only left 2057 bytes for the user!! ie the library has used 93% of the IRAM! I can't get much bigger programs to run than I did on a UNO!

Why is the library doing that? ie how much of that IRAM really needs to be there? eg why on earth (using your above example) has floating point been put in as a default?

So instead of the end user having to go through a lot of work - and still not fitting into the remaining 7% of IRAM, it would be better to fix the library by taking out all the optional stuff by default, and allow users to shift it back in if needed... Otherwise this library is getting close to useless for anything needing more than a page or two of code... It would be much better to use 10k more heap (when needed) and 10K less IRAM! So yes, I can look at all the code and sort it out, I just have to decide if it is worth the effort to try and work around this limitation with this library (ie change bits or flog code to make a minimalist version) or do I swap to the esp32 (The user iram there seems to be about 130K)...

But that's just me - the bigger philosophical question is why is the 8266 ardunio library growing in this direction, ie becoming bloatware?

igrr commented 6 years ago

Although off-topic in this issue, one more idea to have more IRAM is to only use 16KB for the cache, instead of the 32KB used now. That would free up 16KB for the application.

Regarding IRAM usage by the core itself. Since user applications rarely need IRAM (interrupt handlers are the only case), IRAM usage was never really part of the optimization effort (until very recently). Instead of trying to free up IRAM I think it might be worth making a few changes to allow interrupt handlers to go into Flash.

steminabox commented 6 years ago

I got rid of my two small user interrupt routines (while trying to find where the space was going), so they aren't the problem. So something else must be using it.. How would I change the cache (like you mention above)?

steminabox commented 6 years ago

OK, I found the thread over here - https://github.com/esp8266/Arduino/issues/4551 - which is more discuss the major IRAM problem. It seems it's even worse if you take the changes since 2.4.1 (ie only 440 bytes available according to one post). This is pretty much a complete show stopper for the library, unless you have a trivial program that doesn't do much...

esp8266 / Arduino

Improve available heap #3740

Hardware

Problem Description

define DEBUG_MSG(_1, ...) DEBUG_ESP_PORT.printf_P(PSTR(_1), ##__VA_ARGS__)

else

define DEBUG_MSG(...)

endif`