espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.71k stars 7.3k forks source link

ESP32-S3 float point incorrectly computed `nan` or `inf` when pinned to a core (IDFGH-9930) #11225

Closed wuyuanyi135 closed 1 year ago

wuyuanyi135 commented 1 year ago

Answers checklist.

IDF version.

v5.1-dev-4692-g7bd0fd4abd

Operating System used.

Windows

How did you build your project?

Command line with Make

If you are using Windows, please specify command line type.

None

Development Kit.

ESP32-S3-WROOM-1-N16R2

Power Supply used.

USB

What is the expected behavior?

computing a valid float algorithm should result in a finite result.

What is the actual behavior?

I have a circular buffer containing data from an ADC and will process the buffered data in a separated thread.

However, the processing thread sometimes computed invalid nan or inf result from the normal inputs.

for example:

    for (int i = 0; i < n; ++i) {
        const auto& el = ads1247.data_buffer[i];
        auto v = (float)el.value / (float)(1UL << el.gain);
        data[i] = v;

        auto scaled = scaler(v) / (float)led_level; 
        if (!std::isfinite(scaled)) {
          float recalculate = (float)el.value / (float)(1UL << el.gain);
          LE("Invalid data @{}: scaled={}, v={}, value = {}, gain={}, recalculate={}", i, scaled, v, el.value, el.gain, recalculate);
          LE("data_buffer contents:");
          for (int j = 0; j < n; ++j) {
            const auto& d = ads1247.data_buffer[j];
            LE("Item [{}]: gain={}\tvalue={}\traw={}", j, d.gain, d.value, *reinterpret_cast<const int*>(&d.value));
          }
          LE("(float)el.value={}", (float)el.value);
          LE("(float)(1UL << el.gain)={}", (float)(1UL << el.gain));
          LE("v={}", (float)el.value / (float)(1UL << el.gain));
          LE("scaler(v)={}", scaler(v));
          LE("scaler(v) / (float)led_level={}", scaler(v) / (float)led_level);
          LE("scaler.multiplier={}", scaler.multiplier);
          LE("scaler.output_max_value={}", scaler.output_max_value);
          LE("scaler.output_min_value={}", scaler.output_min_value);
          LE("scaler.input_min_value={}", scaler.input_min_value);
          abort();
        } else if (i == n - 1) {
          LI("Good data: scaled={}, v={}, value = {}, gain={}, recalculate={}", scaled, v, el.value, el.gain, (float)el.value / (float)(1UL << el.gain));
        }
      }

It triggers the errorous branch randomly, giving the following outputs

E (14:55:28.576) turbidity: Invalid data @77: scaled=inf, v=inf, value = 5086020, gain=6, recalculate=inf
E (14:55:28.577) turbidity: data_buffer contents:
****Item output removed for simplicity*****
E (14:55:28.607) turbidity: (float)el.value=5086020
E (14:55:28.607) turbidity: (float)(1UL << el.gain)=64
E (14:55:28.607) turbidity: v=79469.06
E (14:55:28.607) turbidity: scaler(v)=inf
E (14:55:28.608) turbidity: scaler(v) / (float)led_level=inf
E (14:55:28.608) turbidity: scaler.multiplier=0.10319918
E (14:55:28.608) turbidity: scaler.output_max_value=100
E (14:55:28.609) turbidity: scaler.output_min_value=0
E (14:55:28.609) turbidity: scaler.input_min_value=500

Note, the scaler is a linear mapping function adopted from etl. It computes the output by float(((value - input_min_value) * multiplier)) + output_min_value;

The floating point arithmetics do not involve any invalid data or invalid operation, but somehow it mistakenly computed inf. Normally, the result should be equal to

v = 79469.06
scaler(v) = ((79469.06 - 500) * 0.10319918) + 0 = 8149.54223737

I made many attempts to find out the root of this random error but it turns out when creating this thread, if I pin the thread to core 0 by using esp_pthread_cfg_t thread_cfg{4096, 1, true, "process", 0}; this problem will occur. If I pin the thread to core 1 by esp_pthread_cfg_t thread_cfg{4096, 1, true, "process", 1}; The problem is gone. Why would the core affinity affect the floating point calculation result? I know that the FPU is lazy switching, which potentially causes the inconsistency when dealing with threads. Is there any precaution I need to take to make sure the floating point result is valid and consistent?

Steps to reproduce.

  1. The problem is a bit too random to reproduce

Debug Logs.

No response

More Information.

No response

igrr commented 1 year ago

@wuyuanyi135 do you have any other tasks in your application which use floating-point arithmetic? If yes, do these tasks run on core 0 or core 1? And, just to confirm, you aren't doing any floating-point arithmetic in ISRs?

wuyuanyi135 commented 1 year ago

@igrr Thanks for your response. No, I do not have floating point arithmetic in ISRs. I do have many other tasks running and they do use the floating point computation. This process task has priority of 1 so I guess preemption would occur constantly during data processing. Most tasks were pinned to core 0. But some tasks may not have task affinity. Would this scenario require special care when using the FPU?

igrr commented 1 year ago

Ideally, it wouldn't require any extra steps: when any task first uses the FPU, FreeRTOS will pin it to the core it is currently running on. Then FreeRTOS will switch the FPU context between the tasks which use the FPU on the same core.

If the issue you are seeing is really caused by context switching, that would mean there is a bug in the context switching implementation somewhere. We do have some stress tests internally which verify that multiple tasks can do FPU calculations concurrently, but it is possible we haven't managed to reproduce the issue for some reason.

I think to make progress on this issue we would need to find a way to reproduce this at least somewhat reliably, then we can debug the issue.

wuyuanyi135 commented 1 year ago

@igrr Appreciate for your suggestion. I am trying to isolate the issue within the application scope but due to the randomness nature it takes long time to confirm if the stripped feature is problematic. Could you suggest some way to tell the state of the FPU or probe into the FPU context switching mechanism? Or could this be related to some hardware issue in my ESP32S3 module?

wuyuanyi135 commented 1 year ago

I tried to raise the priority to 99 on core 0 and I haven't observe any invalid result in 10 minutes. This might imply the context switch is to blame.

wuyuanyi135 commented 1 year ago

When the invalid value appears, after retry, the value becomes valid.

wuyuanyi135 commented 1 year ago

Because of the difficulty to produce a reliable minimum reproducible example and lack of related issue, I suspect this issue is likely specific to my application. I found that pinning the core was not sufficient to work around this issue. Instead, try increase the priority of the faulty thread, and the computation inconsistency will be gone.

eiffel31 commented 1 year ago

Please observe that the problem is not related to scaler function at all. The root problem is "v" which is not computed correctly. As "v" is computed to "inf", then scaler(v) is inf.

Now, note that "recalculate" is "inf" too. Either the FPU failed 2 consecutive times to calculate "(float)el.value / (float)(1UL << el.gain)" or the compiler optimized the code and used "v" local variable instead on computing the same operation again.

It is interesting to check where "recalculate" comes from.

Hope this helps.

wuyuanyi135 commented 1 year ago

@eiffel31 Thanks for your suggestions! The recalculate was just a repeated calculation of v. I suspected that the computation error was random so I expect a repetition will give the correct answer. However, as seen in the output, the recalculated result was still wrong.

Actually, I observed both incorrect v or incorrect scale (as shown in the log) before, I am not quite sure if this problem was triggered by a specific arithmetic.

eiffel31 commented 1 year ago

I understood that "recalculate" was an attempt for a repeated calculation of "v". But it is not because you wrote the same computation twice that the processor will compute it twice. The compiler may be smart enough to compute it only the first time and reuse the result for the second identical computation.

If the compiler optimized this way, then you failed to recompute and still have the same value. In order to know how code is generated, you probably have to look at the assembly generated code.

PS: In the log you provided, we can only observe incorrect "v". All other "inf" comes from the very same root: v is inf.

wuyuanyi135 commented 1 year ago

@eiffel31 Yes. I understand that the log was not showing all situations and I have to apologize for the very vague and verbose codes. This was because the problem was very hard to reproduce and didn't prepare all logs when posting.

Your suggestion of compiler optimization of the recalculate was correct. It may not give sufficient proof to identify the issue. That's probably why, I found that if I put the calculation in a while-true loop for retrying, the problematic float arithcmetic was corrected after retry.

eiffel31 commented 1 year ago

Great, but this may be a temporary hack until a new compiler release is smart enough to optimize out this calculation too.

You may try something else to check the optimization hypothesis: keep the code at it is in your bug report (I mean with the inf test and recompute), but declare el as volatile.

Another interesting test: capture el.value and el.gain in variable just before computing v. If v in inf, then show these values.

wuyuanyi135 commented 1 year ago

@eiffel31 Thanks for your suggestion. I also was suspecting if the memory accessing timing was to blame. However, a series stability issue in master branch these days drove me to the latest release 5.0.1, where I did not see the issue no matter what thread priority or core affinity were used. I realized that the master (5.1) and the release (5.0.1) are using different version of gcc (esp-12.2.0_20230208 and esp-2022r1-11.2.0 respectively), I am not sure if there is some issue in the new compiler toolchains.

eiffel31 commented 1 year ago

This is really a good new, because a regression is by nature easier to fix that a very old problem that was never seen before. Nevertheless, is can be hard to fix!

For now, you have a workaround to run your code.

NaivelyWritten commented 1 year ago

I have a quite similar issue recently with an assertion in fmtlib hits which checks for wrong signs of a float value. However in a previous function this value has been set with a non-negative value. And upon more checking it revealed that this negative value is the register content of another thread. This indicates that the FPU context switch is indeed not done correctly (in IDF v5.1-rc). @igrr @wuyuanyi135 @eiffel31

This issue is very hard to reproduce, sometimes I can trigger in 30s, sometimes repeated tests in 1 hour does not trigger this.

Right after one thread (R) prints:

R, 2219318006, 60:55:f9:f5:7f:b0:00:00, 38.438126

I process failed with an assertion that something is negative. However this value is from R thread's FPU register.

Partial log:

R, 2219316183, dc:54:75:c2:00:10:00:00, 45.34956
I, 2219316638,    -488.4880,    +867.1760,      +2.1960,   +0.00489,   -0.00611,   -0.00122
R, 2219318006, 60:55:f9:f5:7f:b0:00:00, 38.438126

abort() was called at PC 0x42029603 on core 0
0x42029603: __cxxabiv1::__terminate(void (*)()) at /Users/brnomac003/.gitlab-runner/builds/qR2TxTby/1/idf/crosstool-NG/.build/xtensa-esp32s3-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48

Backtrace: 0x40375c2a:0x3fca2cd0 0x4037bf91:0x3fca2cf0 0x4038256e:0x3fca2d10 0x42029603:0x3fca2d80 0x42029638:0x3fca2da0 0x420085cf:0x3fca2dc0 0x4200e353:0x3fca2de0 0x42010cad:0x3fca2e90 0x420113c7:0x3fca3110 0x4201393b:0x3fca3280 0x4037e8cd:0x3fca3470
0x40375c2a: panic_abort at /Users/<redacted>/esp/esp-idf/components/esp_system/panic.c:452

0x4037bf91: esp_system_abort at /Users/<redacted>/esp/esp-idf/components/esp_system/port/esp_system_chip.c:84

0x4038256e: abort at /Users/<redacted>/esp/esp-idf/components/newlib/abort.c:38

0x42029603: __cxxabiv1::__terminate(void (*)()) at /Users/brnomac003/.gitlab-runner/builds/qR2TxTby/1/idf/crosstool-NG/.build/xtensa-esp32s3-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48

0x42029638: std::terminate() at /Users/brnomac003/.gitlab-runner/builds/qR2TxTby/1/idf/crosstool-NG/.build/xtensa-esp32s3-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58

0x420085cf: fmt::v10::detail::assert_fail(char const*, int, char const*) at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format-inl.h:36

0x4200e353: int fmt::v10::detail::format_float<double>(double, int, fmt::v10::detail::float_specs, fmt::v10::detail::buffer<char>&) at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format.h:3494 (discriminator 1)

0x42010cad: std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::write_float<char, std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, float>(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, float, fmt::v10::format_specs<char>, fmt::v10::detail::locale_ref) at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format.h:3861 (discriminator 1)

0x420113c7: std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::write<char, std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, float, 0>(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, float, fmt::v10::format_specs<char>, fmt::v10::detail::locale_ref) at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format.h:3874
 (inlined by) decltype (({parm#2}.out)()) fmt::v10::formatter<float, char, void>::format<fmt::v10::basic_format_context<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, char> >(float const&, fmt::v10::basic_format_context<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, char>&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format.h:4723
 (inlined by) decltype (({parm#2}.out)()) fmt::v10::formatter<fmt::v10::join_view<float*, float*, char>, char, void>::format<fmt::v10::basic_format_context<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, char> >(fmt::v10::join_view<float*, float*, char> const&, fmt::v10::basic_format_context<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, char>&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/format.h:4476

0x4201393b: std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:273
 (inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:287
 (inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:288
 (inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:288
 (inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > > >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:288
 (inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::field<char, long long, 0>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > > > >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:288
 (inlined by) std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::field<char, long long, 0>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > > > > >::format<std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> >(std::back_insert_iterator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) const at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:288
 (inlined by) std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > fmt::v10::format<fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::field<char, long long, 0>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > > > > >, long long, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char>, char, 0>(fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::field<char, long long, 0>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 1>, fmt::v10::detail::concat<fmt::v10::detail::text<char>, fmt::v10::detail::concat<fmt::v10::detail::spec_field<char, fmt::v10::join_view<float*, float*, char>, 2>, fmt::v10::detail::code_unit<char> > > > > > > const&, long long const&, fmt::v10::join_view<float*, float*, char> const&, fmt::v10::join_view<float*, float*, char> const&) at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:510
 (inlined by) format<imu_task(void*)::<lambda()>::FMT_COMPILE_STRING, long long int, fmt::v10::join_view<float*, float*, char>, fmt::v10::join_view<float*, float*, char> > at /Users/<redacted>/project_fw/esp_fw/s3_fw/cxxsp_component/fmtlib/fmt/include/fmt/compile.h:544
 (inlined by) imu_task(void*) at /Users/<redacted>/project_fw/esp_fw/s3_fw/main/imu_task.cpp:258

0x4037e8cd: vPortTaskWrapper at /Users/<redacted>/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162
wuyuanyi135 commented 1 year ago

@NaivelyWritten Hi, thanks for report the similar issue. I closed this issue a while ago because I found that downgrading to 5.0.2 solved the issue. Recently, I gave it another try. The latest master (5.2) did not reproduce the float point error anymore. However, my code had also changed quite a bit. I suspect this issue is mainly related to a corrupted stack or heap that caused the dirty value accidentally written by other tasks even though no concurrency exists. I enabled stack overflow detection and heap corruption detection (light impact) in my production environment since I didn't notice any performance issue at this moment.

Also, since this issue was raised, I have encountered many other weird crashes that pointed to nowhere helpful for debugging (like the memory allocator). They are also quite random (happens in minutes or days). Most probably, stack overflow or memory corruption are to blame, especially the code with heavy libraries such as fmtlib being used.

Could you try enable the full memory protection and run your faulty code, and see if memory corruption was reported instead of the strange behaviors?

NaivelyWritten commented 1 year ago

Hi @wuyuanyi135, I simply added a pair of vTaskSuspendAll() and xTaskResumeAll() calls and the issue is gone. I ran a stress test for 42 hours without any issue.

image
NaivelyWritten commented 1 year ago

They are on the same core, so I still suspect bad context switch. I already tried (before posting) heap poisoning and stack canaries, nothing changed.

abellofiore commented 1 year ago

I am also seeing this exact issue on an ESP32-S3 with IDF Release v5.1. It occurs randomly as stated (my task where I last observed this happen was running on Core 0, but not pinned). I have multiple tasks running PID loops and reading analog I/Os. At first I assumed memory corruption and setup watchpoints before and after the double in RAM and haven't had them triggered while the issue occurred.

abellofiore commented 1 year ago

I ran additional tests over the last day or so:

The problem only occurs if I am using floats (hardware FPU). If I change all the floats to doubles (SW), the problem does NOT occur.

When running with floats, all the tasks using the float were automatically pinned to Core 0; as doubles, the tasks are now not pinned.

When configured as floats, even simple Float operations would randomly fail: ie. performing "isnan(somefloat)" could fail erroneously, and when performed immediately afterwards pass just fine.

wuyuanyi135 commented 1 year ago

@abellofiore Hello! Have you tried the master branch and see if the problem is fixed?

abellofiore commented 1 year ago

@abellofiore Hello! Have you tried the master branch and see if the problem is fixed?

No; the last time I upgraded to v5.0 it contained a bug that forced me to upgrade to v5.1; every time I upgrade I encounter a different issue.

ProfFan commented 1 year ago

@abellofiore This is fixed in issue #11690, try master