[BUG] esp32 pid controller + psram issue

yamt commented 1 month ago

Description / Steps to reproduce the issue

see the following test code:

#include <nuttx/sched.h>

void print_stackinfo(const char *label, const struct stackinfo_s *p) {
  printf("%s: %p: %zu %p %p\n", label, p, p->adj_stack_size, p->stack_alloc_ptr,
         p->stack_base_ptr);
}

void test(void) {
  int i;

  struct stackinfo_s s;
  nxsched_get_stackinfo(0, &s);
  for (i = 0; i < 4; i++) {
    print_stackinfo("iram", &s);
  }

  struct stackinfo_s *psram = (void *)SOC_EXTRAM_DATA_LOW;
  nxsched_get_stackinfo(0, psram);
  for (i = 0;; i++) {
    print_stackinfo("psram", psram);
    if (psram->stack_alloc_ptr == s.stack_alloc_ptr) {
      break;
    }
  }
}

run it with a configuration where:

esp32 protected mode
userland pid is 5 (ie. CONFIG_ESP32_USER_DATA_EXTMEM is not set)
psram is initialized but not registered to any nuttx system heaps. (so that user apps can use them directly)

for some reasons, it prints garbage-looking values on the first run as the following.

nsh> esp32_pid
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(psram): 0x3f800000: 1145050453 0x5d755d51 0x3ffd9b54
print_stackinfo(psram): 0x3f800000: 1145050453 0x5d755d51 0x3ffd9b54
print_stackinfo(psram): 0x3f800000: 1145050453 0x5d755d51 0x3ffd9b54
print_stackinfo(psram): 0x3f800000: 1145050453 0x5d755d51 0x3ffd9b54
nsh> esp32_pid
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(psram): 0x3f800000: 2000 0x3ffc9b40 0x3ffc9b70
nsh> esp32_pid
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(iram): 0x3ffca290: 2000 0x3ffc9b40 0x3ffc9b70
print_stackinfo(psram): 0x3f800000: 2000 0x3ffc9b40 0x3ffc9b70
nsh>

my understanding is that

nxsched_get_stackinfo system call (the kernel, which runs with pid 0) wrote the values to the buffer (psram)
the thread returned to the userspace. (thus switch its pid back to 5)
the userspace read the buffer immediately and got the wrong values for some reasons
for some reasons, after a while (5th iteration of the loop in the following example), it started reading the correct values.

maybe pid controller takes longer to get ready than usual?

i suspect this can explain the problem which forced us to use pid 1 for CONFIG_ESP32_USER_DATA_EXTMEM in the first place. i wonder if there is a reasonable way to wait for the pid controller (or maybe mmu/cache? i dunno) ready after a pid switch.

On which OS does this issue occur?

[OS: Mac]

What is the version of your OS?

macos

NuttX Version

modified master

Issue Architecture

[Arch: xtensa]

Issue Area

[Area: Specific Peripheral]

Verification

[X] I have verified before submitting the report.

yamt commented 1 month ago

i wonder if there is a reasonable way to wait for the pid controller (or maybe mmu/cache? i dunno) ready after a pid switch.

@gustavonihei @tmedicci do you have any idea?

rftafas commented 1 month ago

ESP32 PID Controller is (or should be) experimental and must not be used for production.

You can only use PID1 and 0 only. They have the same privilege, so ESP32 doesn't actually provide any protection and it is just for 'testing' the approach of protected mode.

If you actually need this, you should use ESP32-S2 or, better, ESP32-S3.

Further insights:

To be more specific, there are 8 PIDs: Privileged (0-1) and Unprivileged (2-7).
On the same CPU, you can't switch between these 2 groups.
PROCPU always starts with PID0, so it ends being forced to never run unprivileged PIDs.
To use PIDs2-7, you will have to make sure APP CPU is ALWAYS running PIDs 2-7.

IMO, I would test this as it is (experimental, PID0-1) and if necessary to use on a product, move to ESP32S3.

yamt commented 1 month ago

ESP32 PID Controller is (or should be) experimental and must not be used for production.

why? is there a known hardware bug or something? do you have a reference?

You can only use PID1 and 0 only. They have the same privilege, so ESP32 doesn't actually provide any protection and it is just for 'testing' the approach of protected mode.

If you actually need this, you should use ESP32-S2 or, better, ESP32-S3.

Further insights:
* To be more specific, there are 8 PIDs: Privileged (0-1) and Unprivileged (2-7).

* On the same CPU, you can't switch between these 2 groups.

why not?

* PROCPU always starts with PID0, so it ends being forced to never run unprivileged PIDs.

* To use PIDs2-7, you will have to make sure APP CPU is ALWAYS running PIDs 2-7.
IMO, I would test this as it is (experimental, PID0-1) and if necessary to use on a product, move to ESP32S3.

rftafas commented 1 month ago

why?

First answer is that it is not implemented. Second is that we won't work on it past experimental. You'll be on your own if you really need it on ESP32. It is not a bug, it was not implemented.

is there a known hardware bug or something?

Not a bug, but its implementations is more suitable for AMP scenarios and/or a bare metal hypervisor approach. For regular, single kernel on multicore approach, it becomes quite hard to use it.

do you have a reference?

Datasheet. My opinion also comes from evaluating it as a Project: too big, little to gain, something only valid for ESP32 as others are different.

why not?

Ok, I oversimplified. The point is you can't simply switch PIDs. You must follow the steps on how to make it work (on datasheet). It is currently not implemented and there are some caveats on how NuttX is and what would need change to make that work. It get more complex when you throw multicore SMP in the game. Cherry on top is the IRQ control and all the changes it will require.

yamt commented 1 month ago

why?

First answer is that it is not implemented. Second is that we won't work on it past experimental. You'll be on your own if you really need it on ESP32. It is not a bug, it was not implemented.

do you mean not implemented in nuttx? i'm interested in implementing it. i'm not asking you to implement it. :-)

is there a known hardware bug or something?

Not a bug, but its implementations is more suitable for AMP scenarios and/or a bare metal hypervisor approach. For regular, single kernel on multicore approach, it becomes quite hard to use it.

i wanted to use it for something specific to my applications. it can be different from generic nuttx protected model semantics.

do you have a reference?

Datasheet. My opinion also comes from evaluating it as a Project: too big, little to gain, something only valid for ESP32 as others are different.

do you mean the TRM? otherwise, can you give me an URL?

why not?

Ok, I oversimplified. The point is you can't simply switch PIDs. You must follow the steps on how to make it work (on datasheet). It is currently not implemented and there are some caveats on how NuttX is and what would need change to make that work. It get more complex when you throw multicore SMP in the game. Cherry on top is the IRQ control and all the changes it will require.

i wanted to know the steps to follow. however, i couldn't find it in the TRM. that's why i asked it here and there.

rftafas commented 1 month ago

ESP32 TRM v5.2 chap 28.3.3

yamt commented 1 month ago

ESP32 TRM v5.2 chap 28.3.3

thank you. isn't it basically what's already implemented in get_prev_pid/set_next_pid and exception_entry_hook/exception_exit_hook?

rftafas commented 1 month ago

thank you. isn't it basically what's already implemented in get_prev_pid/set_next_pid and exception_entry_hook/exception_exit_hook?

Clearly something is missing, as it is not working outside the scope I mentioned. You should check chapter 27, as PID controller also affects memory.

Unfortunatelly, I can't be of more help than this and sorry for not pointing you to a better solution. But if I can offer one last advice is to evaluate the implications of chapter 28. By doing it we concluded the effort was too big. Good luck!

apache / nuttx