Closed wuyuanyi135 closed 1 year ago
@wuyuanyi135 do you have any other tasks in your application which use floating-point arithmetic? If yes, do these tasks run on core 0 or core 1? And, just to confirm, you aren't doing any floating-point arithmetic in ISRs?
@igrr Thanks for your response. No, I do not have floating point arithmetic in ISRs. I do have many other tasks running and they do use the floating point computation. This process task has priority of 1 so I guess preemption would occur constantly during data processing. Most tasks were pinned to core 0. But some tasks may not have task affinity. Would this scenario require special care when using the FPU?
Ideally, it wouldn't require any extra steps: when any task first uses the FPU, FreeRTOS will pin it to the core it is currently running on. Then FreeRTOS will switch the FPU context between the tasks which use the FPU on the same core.
If the issue you are seeing is really caused by context switching, that would mean there is a bug in the context switching implementation somewhere. We do have some stress tests internally which verify that multiple tasks can do FPU calculations concurrently, but it is possible we haven't managed to reproduce the issue for some reason.
I think to make progress on this issue we would need to find a way to reproduce this at least somewhat reliably, then we can debug the issue.
@igrr Appreciate for your suggestion. I am trying to isolate the issue within the application scope but due to the randomness nature it takes long time to confirm if the stripped feature is problematic. Could you suggest some way to tell the state of the FPU or probe into the FPU context switching mechanism? Or could this be related to some hardware issue in my ESP32S3 module?
I tried to raise the priority to 99 on core 0 and I haven't observe any invalid result in 10 minutes. This might imply the context switch is to blame.
When the invalid value appears, after retry, the value becomes valid.
Because of the difficulty to produce a reliable minimum reproducible example and lack of related issue, I suspect this issue is likely specific to my application. I found that pinning the core was not sufficient to work around this issue. Instead, try increase the priority of the faulty thread, and the computation inconsistency will be gone.
Please observe that the problem is not related to scaler function at all. The root problem is "v" which is not computed correctly. As "v" is computed to "inf", then scaler(v) is inf.
Now, note that "recalculate" is "inf" too. Either the FPU failed 2 consecutive times to calculate "(float)el.value / (float)(1UL << el.gain)" or the compiler optimized the code and used "v" local variable instead on computing the same operation again.
It is interesting to check where "recalculate" comes from.
Hope this helps.
@eiffel31 Thanks for your suggestions! The recalculate
was just a repeated calculation of v
. I suspected that the computation error was random so I expect a repetition will give the correct answer. However, as seen in the output, the recalculated result was still wrong.
Actually, I observed both incorrect v
or incorrect scale
(as shown in the log) before, I am not quite sure if this problem was triggered by a specific arithmetic.
I understood that "recalculate" was an attempt for a repeated calculation of "v". But it is not because you wrote the same computation twice that the processor will compute it twice. The compiler may be smart enough to compute it only the first time and reuse the result for the second identical computation.
If the compiler optimized this way, then you failed to recompute and still have the same value. In order to know how code is generated, you probably have to look at the assembly generated code.
PS: In the log you provided, we can only observe incorrect "v". All other "inf" comes from the very same root: v is inf.
@eiffel31 Yes. I understand that the log was not showing all situations and I have to apologize for the very vague and verbose codes. This was because the problem was very hard to reproduce and didn't prepare all logs when posting.
Your suggestion of compiler optimization of the recalculate
was correct. It may not give sufficient proof to identify the issue. That's probably why, I found that if I put the calculation in a while-true loop for retrying, the problematic float arithcmetic was corrected after retry.
Great, but this may be a temporary hack until a new compiler release is smart enough to optimize out this calculation too.
You may try something else to check the optimization hypothesis: keep the code at it is in your bug report (I mean with the inf test and recompute), but declare el as volatile.
Another interesting test: capture el.value and el.gain in variable just before computing v. If v in inf, then show these values.
@eiffel31 Thanks for your suggestion. I also was suspecting if the memory accessing timing was to blame. However, a series stability issue in master
branch these days drove me to the latest release 5.0.1
, where I did not see the issue no matter what thread priority or core affinity were used. I realized that the master (5.1) and the release (5.0.1) are using different version of gcc (esp-12.2.0_20230208 and esp-2022r1-11.2.0 respectively), I am not sure if there is some issue in the new compiler toolchains.
This is really a good new, because a regression is by nature easier to fix that a very old problem that was never seen before. Nevertheless, is can be hard to fix!
For now, you have a workaround to run your code.
I have a quite similar issue recently with an assertion in fmtlib hits which checks for wrong signs of a float value. However in a previous function this value has been set with a non-negative value. And upon more checking it revealed that this negative value is the register content of another thread. This indicates that the FPU context switch is indeed not done correctly (in IDF v5.1-rc). @igrr @wuyuanyi135 @eiffel31
This issue is very hard to reproduce, sometimes I can trigger in 30s, sometimes repeated tests in 1 hour does not trigger this.
Right after one thread (R) prints:
R, 2219318006, 60:55:f9:f5:7f:b0:00:00, 38.438126
I process failed with an assertion that something is negative. However this value is from R thread's FPU register.
Partial log:
R, 2219316183, dc:54:75:c2:00:10:00:00, 45.34956
I, 2219316638, -488.4880, +867.1760, +2.1960, +0.00489, -0.00611, -0.00122
R, 2219318006, 60:55:f9:f5:7f:b0:00:00, 38.438126
abort() was called at PC 0x42029603 on core 0
0x42029603: __cxxabiv1::__terminate(void (*)()) at /Users/brnomac003/.gitlab-runner/builds/qR2TxTby/1/idf/crosstool-NG/.build/xtensa-esp32s3-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
Backtrace: 0x40375c2a:0x3fca2cd0 0x4037bf91:0x3fca2cf0 0x4038256e:0x3fca2d10 0x42029603:0x3fca2d80 0x42029638:0x3fca2da0 0x420085cf:0x3fca2dc0 0x4200e353:0x3fca2de0 0x42010cad:0x3fca2e90 0x420113c7:0x3fca3110 0x4201393b:0x3fca3280 0x4037e8cd:0x3fca3470
0x40375c2a: panic_abort at /Users/<redacted>/esp/esp-idf/components/esp_system/panic.c:452
0x4037bf91: esp_system_abort at /Users/<redacted>/esp/esp-idf/components/esp_system/port/esp_system_chip.c:84
0x4038256e: abort at /Users/<redacted>/esp/esp-idf/components/newlib/abort.c:38
0x42029603: __cxxabiv1::__terminate(void (*)()) at /Users/brnomac003/.gitlab-runner/builds/qR2TxTby/1/idf/crosstool-NG/.build/xtensa-esp32s3-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
0x42029638: std::terminate() at /Users/brnomac003/.gitlab-runner/builds/qR2TxTby/1/idf/crosstool-NG/.build/xtensa-esp32s3-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
0x420085cf: fmt::v10::detail::assert_fail(char const*, int, char const*) at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format-inl.h:36
0x4200e353: int fmt::v10::detail::format_float<double>(double, int, fmt::v10::detail::float_specs, fmt::v10::detail::buffer<char>&) at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format.h:3494 (discriminator 1)
0x42010cad: std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::write_float<char, std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, float>(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, float, fmt::v10::format_specs<char>, fmt::v10::detail::locale_ref) at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format.h:3861 (discriminator 1)
0x420113c7: std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::write<char, std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, float, 0>(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, float, fmt::v10::format_specs<char>, fmt::v10::detail::locale_ref) at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format.h:3874
(inlined by) decltype (({parm#2}.out)()) fmt::v10::formatter<float, char, void>::format<fmt::v10::basic_format_context<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, char> >(float const&, fmt::v10::basic_format_context<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, char>&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format.h:4723
(inlined by) decltype (({parm#2}.out)()) fmt::v10::formatter<fmt::v10::join_view<float*, float*, char>, char, void>::format<fmt::v10::basic_format_context<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, char> >(fmt::v10::join_view<float*, float*, char> const&, fmt::v10::basic_format_context<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, char>&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format.h:4476
0x4201393b: std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:273
(inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:287
(inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:288
(inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:288
(inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > > >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:288
(inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::field<char, long long, 0>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > > > >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:288
(inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::field<char, long long, 0>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > > > > >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:288
(inlined by) std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > fmt::v10::format<fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::field<char, long long, 0>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > > > > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char>, char, 0>(fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::field<char, long long, 0>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > > > > > const&, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:510
(inlined by) format<imu_task(void*)::<lambda()>::FMT_COMPILE_STRING, long long int, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> > at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:544
(inlined by) imu_task(void*) at /Users/<redacted>/project_fw/esp_fw/s3_fw/main/imu_task.cpp:258
0x4037e8cd: vPortTaskWrapper at /Users/<redacted>/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162
@NaivelyWritten Hi, thanks for report the similar issue. I closed this issue a while ago because I found that downgrading to 5.0.2 solved the issue. Recently, I gave it another try. The latest master
(5.2) did not reproduce the float point error anymore. However, my code had also changed quite a bit. I suspect this issue is mainly related to a corrupted stack or heap that caused the dirty value accidentally written by other tasks even though no concurrency exists. I enabled stack overflow detection and heap corruption detection (light impact) in my production environment since I didn't notice any performance issue at this moment.
Also, since this issue was raised, I have encountered many other weird crashes that pointed to nowhere helpful for debugging (like the memory allocator). They are also quite random (happens in minutes or days). Most probably, stack overflow or memory corruption are to blame, especially the code with heavy libraries such as fmtlib
being used.
Could you try enable the full memory protection and run your faulty code, and see if memory corruption was reported instead of the strange behaviors?
Hi @wuyuanyi135, I simply added a pair of vTaskSuspendAll()
and xTaskResumeAll()
calls and the issue is gone. I ran a stress test for 42 hours without any issue.
They are on the same core, so I still suspect bad context switch. I already tried (before posting) heap poisoning and stack canaries, nothing changed.
I am also seeing this exact issue on an ESP32-S3 with IDF Release v5.1. It occurs randomly as stated (my task where I last observed this happen was running on Core 0, but not pinned). I have multiple tasks running PID loops and reading analog I/Os. At first I assumed memory corruption and setup watchpoints before and after the double in RAM and haven't had them triggered while the issue occurred.
I ran additional tests over the last day or so:
The problem only occurs if I am using floats (hardware FPU). If I change all the floats to doubles (SW), the problem does NOT occur.
When running with floats, all the tasks using the float were automatically pinned to Core 0; as doubles, the tasks are now not pinned.
When configured as floats, even simple Float operations would randomly fail: ie. performing "isnan(somefloat)" could fail erroneously, and when performed immediately afterwards pass just fine.
@abellofiore Hello! Have you tried the master branch and see if the problem is fixed?
@abellofiore Hello! Have you tried the master branch and see if the problem is fixed?
No; the last time I upgraded to v5.0 it contained a bug that forced me to upgrade to v5.1; every time I upgrade I encounter a different issue.
@abellofiore This is fixed in issue #11690, try master
Answers checklist.
IDF version.
v5.1-dev-4692-g7bd0fd4abd
Operating System used.
Windows
How did you build your project?
Command line with Make
If you are using Windows, please specify command line type.
None
Development Kit.
ESP32-S3-WROOM-1-N16R2
Power Supply used.
USB
What is the expected behavior?
computing a valid float algorithm should result in a finite result.
What is the actual behavior?
I have a circular buffer containing data from an ADC and will process the buffered data in a separated thread.
However, the processing thread sometimes computed invalid
nan
orinf
result from the normal inputs.for example:
It triggers the errorous branch randomly, giving the following outputs
Note, the scaler is a linear mapping function adopted from etl. It computes the output by
float(((value - input_min_value) * multiplier)) + output_min_value;
The floating point arithmetics do not involve any invalid data or invalid operation, but somehow it mistakenly computed
inf
. Normally, the result should be equal toI made many attempts to find out the root of this random error but it turns out when creating this thread, if I pin the thread to core 0 by using
esp_pthread_cfg_t thread_cfg{4096, 1, true, "process", 0};
this problem will occur. If I pin the thread to core 1 byesp_pthread_cfg_t thread_cfg{4096, 1, true, "process", 1};
The problem is gone. Why would the core affinity affect the floating point calculation result? I know that the FPU is lazy switching, which potentially causes the inconsistency when dealing with threads. Is there any precaution I need to take to make sure the floating point result is valid and consistent?Steps to reproduce.
Debug Logs.
No response
More Information.
No response