SuperHouse / esp-open-rtos

Open source FreeRTOS-based ESP8266 software framework
BSD 3-Clause "New" or "Revised" License
1.52k stars 491 forks source link

Malloc regions #584

Open ourairquality opened 6 years ago

ourairquality commented 6 years ago

Add support for using the spare iram and optionally part of the icache as part of the malloc heap. This can add roughly 20k to the malloc heap, but with conditions.

To use the icache for the malloc heap, compile with -DESP8266_ENABLE_LOW_ICACHE There is obviously a trade off to be made, but if desperate for memory then this might be an option.

Newlib has been extended to be able to allocate from two separate regions. This is controlled per-thread by calling set_malloc_regions(). There are four options: MALLOC_MASK_DRAM and MALLOC_MASK_IRAM to allocate from only the dram or iram respectively, and MALLOC_MASK_PREFER_DRAM and MALLOC_MASK_PREFER_IRAM to prefer dram or iram respectively but to use the other region if necessary. The function set_malloc_regions() returns the prior option, so this can be used to make a change for some dynamic code block, for example if calls to malloc in a dynamic context need to be in a particular region. This is used in lwip and FreeRTOS where necessary to ensure buffers are in dram, so those are some example. The wificfg example has been updated to report both the dram and iram usage. The amount of free memory reported by mallinfo() also depends on the option set by set_malloc_regions() so if allocating to only dram or iram then it returns only the free memory in that region, and this is also thread local, and there is an example in lwip where it checks the available dram.

There are some hacks involved, particularly on the newlib side, it is not pretty with hard coded heuristics and addresses specific to esp-open-rtos, but it might be some time before that is cleaned up and it should not be broken, just not pretty. For example there is a heuristic to preference iram if the dram is getting low to try to ensure that there is room to allocate the lwip pbufs as the TX buffers appear to need to be in dram.

The defaults should be a good start, preferring dram but using iram when necessary.

Been testing it for a little while now, and it seems to be holding up. Testing and feedback welcomed.

ourairquality commented 6 years ago

Unfortunately this is still not quite right. Reports of wifi beacon timeouts when using the lower icache for the malloc heap. It appears to actually be the hit losing the extra icache, as the problem is reproducible even when the lower icache has no data in it. Still there is a small gain using the unused iram.

Zaltora commented 6 years ago

I am in for test ! Like you know i am desperate for more RAM. My app is hungry. My recent work are specific to GUI system called lvgl, a project open source to design advanced GUI for many different screens. I want add it as component to esp-open-rtos. It is work already. i use it with ssd1306. With your update, i will specify to lvgl to use I-RAM instead of dram.

Zaltora commented 6 years ago

why beacon is impacted by this change ? Maybe some wifi primitive function need to be put in IRAM to avoid to load it from flash than can be the reason of the extra time ?

ourairquality commented 6 years ago

@Zaltora Are you seeing the beacon timeout too? Perhaps moving some code to iram would solve this, but I have not been able to nail it down. I suspect it has something to do with the sdk libpp, and there is a NMI Wifi interrupt handler there that might be timing critical. I don't see the beacon timeout when just using the iram and not the icache. It would be great to unlock that extra 16k, and it seem so close.

Zaltora commented 6 years ago

I didn't remember to get problem when i manually disable ICACHE bank 0 with #510. I will do a text to confirm. I don't no how to enable icache with your pull request. i add this in my makefile: EXTRA_C_CXX_FLAGS = -DESP8266_ENABLE_LOW_ICACHE but seem do not work. what is the good way ?

First remark, i got more HEAP than with official RTOS (global stream update ?) I got ~2k more. That is good.

ourairquality commented 6 years ago

It needs to be given 0 to disable the low icache, and that should report an extra 16k. We can rework the options and interface based on feedback, if it proves useful.

EXTRA_C_CXX_FLAGS = -DESP8266_ENABLE_LOW_ICACHE=0

Zaltora commented 6 years ago

Ok that my output with icache:

MAIN: SDK version:0.9.9

SDK versionMAIN: DPORT DPORT0                3FF00000   00000001
MAIN: DPORT UNKNOW                3FF00004      00000005
MAIN: DPORT UNKNOW                3FF00008      0000080F
MAIN: DPORT SPI_READY             3FF0000C      04000102
MAIN: DPORT UNKNOW                3FF00010      00000000
MAIN: DPORT CPU_CLOCK             3FF00014      00000000
MAIN: DPORT CLOCKGATE_WATCHDOG    3FF00018      FFFF00FF
MAIN: DPORT UNKNOW                3FF0001C      00000000
MAIN: DPORT SPI_INT_STATUS        3FF00020      00000010
MAIN: DPORT SPI_CACHE_RAM         3FF00024      0000001A
MAIN: DPORT PERI_IO               3FF00028      00000000
MAIN: DPORT SLC_TX_DESC_DEBUG     3FF0002C      00000000
MAIN: DPORT UNKNOW                3FF00030      00004040
MAIN: DPORT UNKNOW                3FF00034      00000000
MAIN: DPORT UNKNOW                3FF00038      00000041
MAIN: DPORT UNKNOW                3FF0003C      00000000
MAIN: DPORT UNKNOW                3FF00040      00000000
MAIN: DPORT UNKNOW                3FF00044      00000000
MAIN: DPORT UNKNOW                3FF00048      00000000
MAIN: DPORT UNKNOW                3FF0004C      00000000
MAIN: DPORT OTP_MAC0              3FF00050      F67E0000
MAIN: DPORT OTP_MAC1              3FF00054      020002D9
MAIN: DPORT OTP_CHIPID            3FF00058      AC00B000
MAIN: DPORT OTP_MAC2              3FF0005C      00DC4F22
MAIN: Heap: 53636
MAIN: void size: 4
MAIN: template : 11111111,11111111
data  : 0x3ffe8000 ~ 0x3ffe8964, len: 2404
rodata: 0x3ffe8968 ~ 0x3ffe91f8, len: 2192
bss   : 0x3ffe91f8 ~ 0x3fff1638, len: 33856
heap  : 0x3fff1638 ~ 0x40000000, len: 59848
MAIN: var addr: 3FFFBFC4
MAIN: var2 addr: 3FFFBFC0
mode : softAP(de:4f:22:02:d9:f6)
add if1
bcn 100

without icache:

MAIN: SDK version:0.9.9
SDK versionMAIN: DPORT DPORT0                3FF00000   00000001
MAIN: DPORT UNKNOW                3FF00004      00000005
MAIN: DPORT UNKNOW                3FF00008      0000080F
MAIN: DPORT SPI_READY             3FF0000C      00000102
MAIN: DPORT UNKNOW                3FF00010      00000000
MAIN: DPORT CPU_CLOCK             3FF00014      00000000
MAIN: DPORT CLOCKGATE_WATCHDOG    3FF00018      FFFF00FF
MAIN: DPORT UNKNOW                3FF0001C      00000000
MAIN: DPORT SPI_INT_STATUS        3FF00020      00000010
MAIN: DPORT SPI_CACHE_RAM         3FF00024      0000000A
MAIN: DPORT PERI_IO               3FF00028      00000000
MAIN: DPORT SLC_TX_DESC_DEBUG     3FF0002C      00000000
MAIN: DPORT UNKNOW                3FF00030      00004040
MAIN: DPORT UNKNOW                3FF00034      00000000
MAIN: DPORT UNKNOW                3FF00038      00000041
MAIN: DPORT UNKNOW                3FF0003C      00000000
MAIN: DPORT UNKNOW                3FF00040      00000000
MAIN: DPORT UNKNOW                3FF00044      00000000
MAIN: DPORT UNKNOW                3FF00048      00000000
MAIN: DPORT UNKNOW                3FF0004C      00000000
MAIN: DPORT OTP_MAC0              3FF00050      F67E0000
MAIN: DPORT OTP_MAC1              3FF00054      020002D9
MAIN: DPORT OTP_CHIPID            3FF00058      AC00B000
MAIN: DPORT OTP_MAC2              3FF0005C      00DC4F22
MAIN: Heap: 70020
MAIN: void size: 4
MAIN: template : 11111111,11111111
data  : 0x3ffe8000 ~ 0x3ffe8964, len: 2404
rodata: 0x3ffe8968 ~ 0x3ffe91f8, len: 2192
bss   : 0x3ffe91f8 ~ 0x3fff1638, len: 33856
heap  : 0x3fff1638 ~ 0x40000000, len: 59848
MAIN: var addr: 3FFFBFC4
MAIN: var2 addr: 3FFFBFC0
mode : softAP(de:4f:22:02:d9:f6)
add if1
bcn 100

I didn't see visible beacon problem for now. I will try with my app. global stream rework gain is bigger than expected : 3,5k well well !!

Zaltora commented 6 years ago

Ok tested my final app. no problem on serial output. Trying to connect with Wifi, no problem detected It is work well too. i got 30k heap free instead of 10K before your update

What problem you got with beacon? you got a message on serial output ?

Zaltora commented 6 years ago

the problem is reproducible even when the lower icache has no data in it

i didn't use icache right now. Just enable the option in makefile and it is work. I guess by default, DRAM is used first.

ourairquality commented 6 years ago

Yes, the default is to use the DRAM first, but that can be changed. I've been able to reproduce wifi problems. Still not sure what the cause is. It might be that the app is overloading the icache and performance is falling off a cliff, or there might be bugs in there somewhere. Shall keep exploring. If you could give it a try and report any problems that might give more clues.

Zaltora commented 6 years ago

how work the system with: set_malloc_regions(). i need call it before call functions than will use dynamic memory ? That not a problem with concurrent system ?

If i do this, only this task will use IRAM ?

Task1()
{
   set_malloc_regions(MALLOC_MASK_PREFER_IRAM);
   while(1)
   {
     //Dynamic memory functions here
   }
}
ourairquality commented 6 years ago

Yes that usage of set_malloc_regions() looks good. It is thread local, so each thread can have a different setting and each thread can change the setting for a dynamic code context without affecting other threads. The function returns the prior setting to allow restoring the setting on exit from a dynamic code context.

Zaltora commented 6 years ago

Wooo, when i define a malloc regions in a task, it like define a new propriety to the task. Nice system. I am really curious how it is work when coding.

i put set_malloc_regions(MALLOC_MASK_PREFER_IRAM); in my user_init() and screen task to see what happens. No Wifi problem detected yet. What can i do to force the wifi problem problem ?. I manage socket with lwip in a particular way (non blocking API). the wifi is set in AP mode.

ourairquality commented 6 years ago

The setting is stored in the newlib reent structure which is swapped when freertos changes threads. There might have been other ways, and the newlib code is not pretty, but that seems a detail that can be cleaned up later if this works.

The http_get_mbedtls example gives a beacon timeout, but it might have a large code footprint and it works the cpu hard. I tried an example loop just accessing the dram hard and see problems too, so there might be deeper problems when using only 16k of icache. Might also be seeing problems with the spi access code, when an nmi wifi interrupt occurs. Let me keep working though things to try to narrow them down. I have not spotted any data corruption issues though. It might not be a show stopper for some apps if the wifi loses some packets when the cpu is working hard, but that needs to be understand, need to be sure there are not other bugs.

ourairquality commented 6 years ago

Testing of the flash code with buffers in iram or icache has shown no problems. Tested a range of small buffers sizes, buffer offsets, flash offsets, all worked ok. Looking at the source code also suggests it should not make a difference if the source and target buffers are in dram or iram, except for performance, because the code uses regular load and store instructions to copy between the flash i/o buffers.

I have no more clues about wifi stability, not even sure if there is an issue. I made some of my code more robust to unreliable wifi and it seems ok. Was able to get the mbedt_get_http example running using larger buffers and the extra icache memory, and also the aws example using larger buffers.

Zaltora commented 6 years ago

I was thinking about the wifi problem. If the example you tested make cpu work hard like "http_get_mbedtls". Maybe the problem is related to this because use IRAM intensifies the use of cpu. What effect got to overclock the cpu to 160MHz if it is not the case ? the problem can be priority on interrupt which is not respected ? (even if cpu is at 100% ) (lock system was reworked for libc ( for timer1 interupt ))

I got some library i want they use other RAM than the current task use (e.g: WIFI task use DRAM to avoid problem related to use IRAM like tx buffer. The following code is good ? :

Task1()
{
   set_malloc_regions(MALLOC_MASK_PREFER_IRAM);
   while(1)
   {
     //Dynamic memory functions in IRAM
      uint32_t malloc_mask = set_malloc_regions(MALLOC_MASK_DRAM);     
      //Dynamic memory functions in DRAM
      set_malloc_regions(malloc_mask);
   }
}

set_malloc_regions(malloc_mask); will restore the old setting with IRAM ? I suggest to add an exemple with different RAM scenario for final user.

ourairquality commented 6 years ago

@Zaltora Yes, that code looks good. There are some examples in the lwip code and FreeRTOS code. I have not spotted any problems with the task priorities.

ourairquality commented 6 years ago

A bug in the changes has been found and fixed. There was an error in alignment adjustments, the result could be that malloc used an extra word outside the region supplied, but it depended on the alignment of the end of the text segment which depended on the build. It only affected the object at the very end of the region, which might have lost one word. So it was not an easy one to narrow down. If people had problems with this change then it might be worth giving it another go.

Zaltora commented 6 years ago

I got no problem since i used this PR in my app. Ready to merge ?

Zaltora commented 6 years ago

Some news about this PR ?

UncleRus commented 6 years ago

Going to merge it if there are no objections

ourairquality commented 6 years ago

The SDK appears to have added re-entry support to the load-store exception handler, and perhaps that needs to be considered. I have assumed that this exception handler could not be interrupted by the NMI but perhaps it can and that would be a problem. Also perhaps let me rebase it and check it is all updated.

ourairquality commented 6 years ago

A fix for the load-store exception handler being re-entered via a NMI has been added to this PR, and it has been rebased and lwip also updated. Would be curious to know if this fix addresses any issues people were having with this, such as the beacon timeouts?

Zaltora commented 6 years ago

Hi, i currently test your PR. I use my product as "wifi access point" and i connect to it with putty. 2 port ( cmd/dbg), i send and receive data without problem. I will continue testing few days.

Zaltora commented 6 years ago

I got a very strange bug with this PR. When i program, after esptool HW reset, everything work fine. After the first reboot, my program boot (screen initialized and logo appear) but system seem frozen after that ( wifi not work / uart either / button interaction either ). Need do more investigations. I got a lot of thing in my programm ( spiffs, pwm, wifi, i2c, spi, ... ). I use a lot of malloc/free too. Maybe a thing don't like to be allocated in IRAM.

ourairquality commented 6 years ago

The recent change was to the load/store exception handler, and if it booted up, got past user_start then the stack for that would have been initialized. Could you try it with the last patch reverted, so revert 'Load/store exception handler: handle re-entry via a NMI'?

Zaltora commented 6 years ago

After some additional tests: the problem feel more random. my system can freeze after i push the first button. One time, the second system boot was good but not the third. Attention: when i said freeze at boot, it is for my app. Because esp8266 do a lot of thing from my app before block. (My SSD1306 screen is always initialized for example)

How i can revert the last patch ? When i log the branch, i didn't see the corresponding commit.

ourairquality commented 6 years ago

Rebased to make the recent change the last commit, so just 'git checkout cc4bd3c' to remove that last change.

Zaltora commented 6 years ago

No changes, problems still here with this commit.