PSRAM Cache Issue stills exist (IDFGH-31)

ThomasRogg commented 5 years ago

We stumbled upon the fact that cache issue with PSRAM still exist, even in the newest development environment. This can produce random crashes, even if the code is 100 % valid.

This very small example program reproduces the problem easily, at least if compiled with newest ESP-IDF and toolchain under Mac OS X (did not try other environments): https://github.com/neonious/memcrash-esp32/

(As a side note: We noticed this problem when we implemented the dlmalloc memory allocator in a fork of ESP-IDF. We worked around this problem (hopefully you can fix it correctly), and now have an ESP-IDF with far faster allocations. Take a look at the blog post here: https://www.neonious-basics.com/index.php/2018/12/27/faster-optimized-esp-idf-fork-psram-issues/ ).

Alvin1Zhang commented 5 years ago

@neoniousTR Hi, neoniousTR, thanks for reporting this, we will look into this and update if any feedbacks. Also there is a topic about the issue on our forum at http://bbs.esp32.com/viewtopic.php?f=13&t=8628&sid=1acc8bd897e72cf450ad9eb71491d732. Thanks.

ThomasRogg commented 5 years ago

We updated the example project at https://github.com/neonious/memcrash-esp32 It is now leaner, more to the point, and most importantly, compiles out of the box.

We think this problem is urgent to fix, as random crashes can occur to anyone using the PSRAM of the ESP32.

There only seems to be two workarounds:

Use only the first 2 MB of 4 MB of PSRAM (big penalty)
End every function which stores to PSRAM with a memw instruction (slow). nops do not help.

Please take a look at the project, and hopefully you have a better idea.

Spritetm commented 5 years ago

Fyi, we're working on this. For what it's worth, it seems to be caused by an interrupt (in your example, the FreeRTOS tick interrupt) firing while some cache activity is going on. We have our digital team running simulations to see what exactly is going on in the hardware; we hope to create a better workaround than the memw solution from that.

ThomasRogg commented 5 years ago

Good that you can reproduce this. Interrupts are a good explanation why this happens only randomly.. Hoping for the best.

markwj commented 5 years ago

We seem to be seeing this as a std::string memory corruption (all zeros, on a 4 byte boundary).

In our case, disabling the top 2MB of SPIRAM didn't seem to work. But pre-allocating 2MB (which we then never use) seemed to workaround the problem. Our code runs primarily on core #1.

This is impacting us quite badly. Lots of random corruptions and crashes with devices in the field.

ThomasRogg commented 5 years ago

Maybe whether the top or bottom of the RAM works depends on the core used.

dexterbg commented 5 years ago

Confirmed: running our test project from #3006 with that 2 MB allocation and also starting the test task on core 0 shows the corruptions again. It seems core 0 can only work reliably with the lower 2 MB and core 1 only with the higher 2 MB.

ThomasRogg commented 5 years ago

@Spritetm or @Alvin1Zhang As this issue does not happen in single core mode, do you know if the original PSRAM cache issue which is fixed with the flag and adds many nops and memws is also only in dual core mode?

If so, we will try to switch low.js to single core mode, this might even be faster at the end, because the JavaScript itself is single core anyhow and has the most load.

Also, how is the progress going? I'd think the chances are to get this fixed by modifying the interrupt handlers or the cache fetchers and savers (they are part of the ROM?).

xbary commented 5 years ago

Hello, I wanted to add to the subject the error I observed in my application using PSRAM. Random error while retrieving the amount of free PSRAM memory. In my application, I check the amount of free PSRAM memory in the main loop, and differently I received in reply that, for example, 16 bytes of free memory, but at the next check, it actually answered ~ 4mb.

In my opinion, there must have been an erroneous random reading from PSRAM.

ThomasRogg commented 5 years ago

Our current status:

Load to cache/Write from cache does not seem to be interrupt-based.. Might be 100% hardware-based?

Added memw to interrupt handlers does not change anything.

Currently we believe Dual Core + PSRAM is a broken combination.

So we will completly switch to Unicore now.

Please answer: Do you know if the original PSRAM cache issue also exist in unicore mode? Would be great if we can get rid of the nops and memws with this once and for all...

xbary commented 5 years ago

I made such an experience, I rewrote the String class from the arduino project to use the PSRAM memory. I changed the name and changed the realloc in the changebuffer function. Suddenly, it turned out that my application did not regularly show 0x00 in one cell. here is this changed String class: https://github.com/xbary/xb_StringPSRAM I would like you to be able to replace the class with StringSRAM as part of the tests, you may be able to reproduce the repeatability of the error.

ThomasRogg commented 5 years ago

I have to confirm that the original PSRAM workaround is still required in Unicore mode.

So Workaround + Unicore is the only combination which works reliably with PSRAM. If I am wrong, I hope somebody will post. Otherwise we have to take this as a fact ...

me21 commented 5 years ago

Can dual core chip be switched to unicore mode? Does this error manifest itself in Arduino framework? As far as I know, Arduino task is pinned to core 0, therefore, it can be effectively viewed as unicore. Am I right?

ThomasRogg commented 5 years ago

Yes the CPU can be configured for unicore.

Pinning to one core does not help, as in dual core mode the other 2 MB are handled by the other core.

Spritetm commented 5 years ago

FWIW, we have a tentative solution for this; the existing workaround solution does actually seem to work but doesn't take calls/returns into account properly. We'll ship a toolchain with improved workaround code soon, but we want to have this fairly well tested so we don't have any other edge cases sneaking past us. I'll see if I can post a preliminary patch as soon as I have something halfway stable,

markwj commented 5 years ago

Do have any idea of schedule for this, or an ability to get us a pre-release toolchain?

This is impacting us quite badly. The 2MB pre-allocation solves the problem for our code, but just shifts the problem to wifi running on the first core (which now experiences random errors and throughput problems).

xbary commented 5 years ago

I confirm, the error still occurs at random moments, even hangs completely.

markwj commented 5 years ago

Do have any idea of schedule for this, or an ability to get us a pre-release toolchain?

dexterbg commented 5 years ago

@Spritetm We do appreciate your efforts in making sure your patch is perfect. But meanwhile our system has to bear a huge performance hit by the workaround, while the stability is still impacted by the bug. We're more than willing to help you in beta testing your patch by using it on our project. Please do a pre-release or share some update on the status. Thanks!

Patrik-Berglund commented 5 years ago

Also think an update is in place, we are awaiting to see if you are able to fix this bug or if it makes the PSRAM feature unusable.

We need more RAM than internal available in the ESP32, so this is a deal breaker for our product.

negativekelvin commented 5 years ago

Just wondering why if the original workaround should work in this case that forcing nops does not resolve it.

400d4b3c:   1047a5          call8   400e4fb8 <crash_set_both>
400d4b3f:   f03d        nop.n
400d4b41:   f03d        nop.n
400d4b43:   f03d        nop.n
400d4b45:   f03d        nop.n
400d4b47:   0228        l32i.n  a2, a2, 0

400e4fb8 <crash_set_both>:
400e4fb8:   004136          entry   a1, 32
400e4fbb:   0249        s32i.n  a4, a2, 0
400e4fbd:   0349        s32i.n  a4, a3, 0
400e4fbf:   f03d        nop.n
400e4fc1:   f03d        nop.n
400e4fc3:   f03d        nop.n
400e4fc5:   f03d        nop.n
400e4fc7:   f01d        retw.n
400e4fc9:   000000          ill

Also I noticed the workaround will add the nops even when there is already a memw barrier.

400d4e86:   03a9        s32i.n  a10, a3, 0
400d4e88:   0020c0          memw
400d4e8b:   01a9        s32i.n  a10, a1, 0
400d4e8d:   f03d        nop.n
400d4e8f:   f03d        nop.n
400d4e91:   f03d        nop.n
400d4e93:   f03d        nop.n
400d4e95:   0020c0          memw
400d4e98:   0138        l32i.n  a3, a1, 0

Spritetm commented 5 years ago

From what I can see, the load-store inversion doesn't occur there exactly... the issue has something to do with a cache miss around that time that has delayed effects later. Because a cache miss takes a while to resolve, you can fix it with nops but you'd need to put a gazillion of them there.

I also noticed the nop/memw interaction... will see if I can get rid of that as well, inasfar gcc marks it. (As in: I can probably detect volatiles that cause an implicit memw, but a literal asm("memw") is harder to spot.)

negativekelvin commented 5 years ago

Ok thanks. Other things I noticed when playing with the memcrash-esp32 example:

If running on core 0 error will occur with mem2 (lower) in HIGH-LOW mode
If running on core 1 error will occur with mem1 (upper) in HIGH-LOW mode
If running on core 0 error will occur with both mem1 & mem2 even in EVEN-ODD mode
If running on core 1 error will occur with both mem1 & mem2 odd in EVEN-ODD mode
Error does not seem to happen in NORMAL mode on either core ( I don't know if this is supported or just giving a false result)

Assuming running NORMAL mode is valid workaround with a performance trade-off, how will it compare to performance of the planned workaround?

negativekelvin commented 5 years ago

More info: the memw in the example actually does not prevent the error when the routine is running in parallel on both cores. It is much more infrequent but still happens.

Normal mode does have a performance cost as it is around 24% fewer tries/ms with the example on both cores, but no errors.

ThomasRogg commented 5 years ago

Normal mode has cache coherency issues according to documentation, so not an option.

negativekelvin commented 5 years ago

Yes I see, makes sense. Back to the drawing board then.

https://github.com/espressif/esp-idf/blob/a62cbfec9ab681bddf8f8e45d9949b1b1f67a2ec/components/esp32/spiram_psram.h#L39-L41

Spritetm commented 5 years ago

Okay, I have a toolchain update that has a different psram workaround. You can grab the binaries for Linux, Windows or OSX. By default, it should use the workaround; switch to the new one by putting

CFLAGS+=-mfix-esp32-psram-cache-dupldst
CXXFLAGS+=-mfix-esp32-psram-cache-dupldst

into your Makefile (or do the equivalent for cmake), recompile and test.

Note that this is not an 100% fix as the binary libraries (wifi/bt/libstdc/libc++/libgcc) at the moment are not compiled with this workaround, but I don't think I have seen any issues that originated there yet, so as a first test this should cover most use cases.

ThomasRogg commented 5 years ago

Thank you. Can you explain the workaround? What does it do / what penalty does it have?

Spritetm commented 5 years ago

It's actualy almost shamefully trivial for how well it works... every store of a 32-bit value to a memory location that is not volatile, will have a load from that memory location appended to it. I haven't ran any benchmarks, but I don't think it affects speed more than the NOP solution. In theory, it may break storing a word to a peripheral (as the load after may trigger something), but as those should be declared volatile, I don't think that happens in practice.

It also includes a fix for a different issue, by the way: there's a workaround where a memw is inserted after an 8/16-bit store in some situations, but the workaround code didn't handle the end of functions correctly. That's been fixed as well.

ThomasRogg commented 5 years ago

Thank you.

Is there a schedule when the binaries will exist fixed? Is it realistic that all code in ROM will be patched, too (btw was this already done completly with the last psram workaround? This would be good because we should not expect more code to be patched, just a new patch, correct?).

Spritetm commented 5 years ago

We first need to know if this fix 1. actually fixes the issue and 2. doesn't introduce more problems than it fixes. Once that's established, we'll also recompile all binary libraries with it. The amount of fixed code should indeed be the same as the existing psram fix.

negativekelvin commented 5 years ago

The windows binaries have symlinks instead of copies.

dexterbg commented 5 years ago

The good news: using the workaround my test project from #3006 no longer produces the corruptions in the char buffer tests (run_test_buffer).

The bad news: the std::string tests have corruptions as before. Did you build the included libstdc++ using the workaround? As we mostly use std::string for buffers, we won't see much of an improvement without a fixed std::string class.

dexterbg commented 5 years ago

Update: looking into the toolchain I found a secondary build of the libs in xtensa-esp32-elf/xtensa-esp32-elf/sysroot/lib/esp32-psram, so I gave those linker precedence and voila: the std::string tests now also work without corruptions!

So it seems the -m flag misses setting that path. I've added this to my main/component.mk:

COMPONENT_ADD_LDFLAGS = -lmain -L$(HOME)/esp/xtensa-esp32-elf/xtensa-esp32-elf/sysroot/lib/esp32-psram

There surely is a better method to add this to the project Makefile I haven't found yet.

I'll now try building our main project with the workaround.

dexterbg commented 5 years ago

Not a single data corruption yet & no new issues so far, I'm publishing the build to our beta testers now.

negativekelvin commented 5 years ago

~Yeah it only happens on the order of 1 per million tries when hammering the cache from both cores so it may be very rare in a practical application~

@neoniousTR makes an interesting point about passing an extram pointer to a rom function which may perform a store and then return to user code which performs a load

My other question would be whether the compiler is aware of our extra load and since a store/load is the trigger can we guarantee that the register will maintain its value since it may be used again in subsequent instructions?

Spritetm commented 5 years ago

@negativekelvin That doesn't mesh with my tests. What setup do you use? I tried the sources at https://github.com/neonious/memcrash-esp32/ with only the Makefile modified to use the master esp-idf and to add the -mfix-esp32-psram-cache-dupldst flag. I've ran the test for a good million ticks now and haven't seen any errors.

Wrt ROM functions: All the relevant ROM functions (mostly Newlib) get placed in flash with the workaround compiled in when the PSRAM workaround is enabled. That's the case for the old workaround and will also be done for the new workaround as soon as I have an indication it works well enough and doesn't lead to other issues. Same goes for all other libraries.

negativekelvin commented 5 years ago

@Spritetm my version here https://github.com/negativekelvin/memcrash-esp32

Spritetm commented 5 years ago

Thanks, that one also fails for me. I'll look into what the issue is there...

Edit: Errm... Wouldn't your test fail anyway if both cores decide to write to the same random address, even if the hardware works OK? I think that's the issue, as if I modify the task on core 0 to only write to even words and core 1 to only write to odd words, I get no errors anymore.

negativekelvin commented 5 years ago

Ah ok I think you got me there, ...brain fart. I guess it is working then!

negativekelvin commented 5 years ago

Just want to mention I am still seeing an error in EVENODD mode on the order of 1 in 100 million tries. I don't have any particular need to use this mode though.

ThomasRogg commented 5 years ago

That EVENODD does not work 100% lets me think: Could the fix maybe not work if the address which you read from which is appended to the one you write in is in the other half of the memory? Right on the boundary?

negativekelvin commented 5 years ago

Not sure, there are definitely more boundaries in EVENODD but the extra generated load is from the same address as the store.

As of now I also have observed 11/3 errors respectively on each core out of 18 billion tries using LOWHIGH.

Anyone else can reproduce this on their hardware?

dexterbg commented 5 years ago

Test results so far: improved stability & network performance, no new issues. Looks very good.

Spritetm commented 5 years ago

@negativekelvin I can replicate the few-times-in-a-billion error with your code. It's a weird error: while the previous errors could be influenced by interrupts (as in: more interrupts = more errors), this error is impervious to interrupt flooding, which makes me think it's caused by something else.

Spritetm commented 5 years ago

New version of the toolchain is here: Linux MacOS WIndows (note the Windows toolchain probably still has symlinks instead of copies)

In this version, the load-after-store is the default psram fix; there's no need to use -mfix-esp32-psram-cache-dupldst in your Makefile anymore. The old fix still is available: specify -mfix-esp32-psram-cache-nops to use it. Because of this change, all the compiler libraries (libgcc, libstdc++ etc) should also be compiled with the new fix.

ThomasRogg commented 5 years ago

A few times in a billion IMHO is too much. Customers expect uptimes of years (our PC servers manage that easily), and if such an error propagates to the file system, data might even be corrupted. Have you tried turning down the Mhz? Maybe it is simply a bad signal problem.

ThomasRogg commented 5 years ago

Question 2: I think there are binary blobs in ESP-IDF, too, such as Wifi? If I am correct, how are these updated?

Spritetm commented 5 years ago

@neoniousTR Agreed entirely, and we're not done with this issue yet. However, seeing that this already is a pretty large improvement on the previous situation and I can see the remaining issue potentially taking a fair while to catch, I'd rather start giving out beta tests and updating early. WiFi and Newlib need to be updated in an esp-idf release; as soon as the gcc release is stable I'll see if I can get those recompiled as well.

ThomasRogg commented 5 years ago

@Spritetm: I just sent you a message via the contact form of your website SpritesMod.com, as I did not find any other way to contact you in private. Please look for this mail. Thank you!

espressif / esp-idf

PSRAM Cache Issue stills exist (IDFGH-31) #2892