esphome / issues

Issue Tracker for ESPHome
https://esphome.io/

Possible race in scheduler #4921

Open lunde42 opened 1 year ago

lunde42 commented 1 year ago

The problem

I am seeing random freezes on my ESP32 running ESPHome.

Symptoms: The problem occurs occasionally, resulting in the device no longer responding to MQTT requests. However, a ping to the device is still answered. After ~2 hours, the device resets itself, reboots, and resumes operation. As visible in the log below, the scheduler issues the "Incrementing scheduler major" message, which should not happen here: millis() is a 32-bit counter that only wraps after 2^32 ms ≈ 49.7 days, and the uptime was far below that. I also found this issue, which may be related: https://github.com/esphome/issues/issues/2632

Possible cause: Looking at the scheduler, I see a possible race condition in Scheduler::millis_():

uint32_t Scheduler::millis_() {
  // Task 1 is executing this function...
  const uint32_t now = millis();
                          // ...when task 2 preempts it and runs the same code:
                          const uint32_t now = millis();
                          if (now < this->last_millis_) {
                            ESP_LOGD(TAG, "Incrementing scheduler major");
                            this->millis_major_++;
                          }
                          this->last_millis_ = now;  // last_millis_ is updated to task 2's (later) value
                          return now;
  // Back in task 1:
  if (now < this->last_millis_) {  // true, because task 2 advanced last_millis_ past task 1's now
    ESP_LOGD(TAG, "Incrementing scheduler major");
    this->millis_major_++;  // millis_major_ is erroneously incremented
  }
  this->last_millis_ = now;
  return now;
}

Proposed solution: The code in Scheduler::millis_() should be guarded by a mutex. I would suggest adding an extra lock to the class only for this purpose. As millis_() is never called with the existing lock_ held, this should not create any deadlock situations.
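A minimal sketch of what I have in mind (the millis_lock_ member is my own naming, not existing ESPHome code):

#include <mutex>

uint32_t Scheduler::millis_() {
  // Serialize all access to last_millis_ and millis_major_ so only one
  // task at a time can observe the wrap-around and bump the major counter.
  std::lock_guard<std::mutex> guard(this->millis_lock_);  // hypothetical std::mutex member
  const uint32_t now = millis();
  if (now < this->last_millis_) {
    // millis() genuinely wrapped its 32-bit range; increment the high word.
    ESP_LOGD(TAG, "Incrementing scheduler major");
    this->millis_major_++;
  }
  this->last_millis_ = now;
  return now;
}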

Question: Is a standard mutex safe for use in this situation? The millis_() function may run in interrupt context. Should I instead use the FreeRTOS xSemaphoreTakeFromISR() function? If so, what happens if xSemaphoreTakeFromISR() is called in non-ISR context? Is that possible, or is that another case that needs to be handled?
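For reference, one pattern I have seen on ESP32 is to branch on the execution context (a sketch only, untested here; xPortInIsrContext() is ESP-IDF-specific, and all names below are made up). Note that a FreeRTOS mutex must not be used from an ISR, so this uses a binary semaphore instead:

#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"

static SemaphoreHandle_t s_millis_sem;

static void millis_lock_init() {
  s_millis_sem = xSemaphoreCreateBinary();
  xSemaphoreGive(s_millis_sem);  // binary semaphores are created empty
}

static bool millis_lock_take() {
  if (xPortInIsrContext()) {
    // ISR context: the FromISR variant never blocks; it simply fails
    // if the semaphore is currently held.
    BaseType_t higher_prio_woken = pdFALSE;
    bool ok = xSemaphoreTakeFromISR(s_millis_sem, &higher_prio_woken) == pdTRUE;
    if (higher_prio_woken == pdTRUE)
      portYIELD_FROM_ISR();
    return ok;
  }
  // Task context: block until the semaphore becomes available.
  return xSemaphoreTake(s_millis_sem, portMAX_DELAY) == pdTRUE;
}

static void millis_lock_give() {
  if (xPortInIsrContext()) {
    BaseType_t higher_prio_woken = pdFALSE;
    xSemaphoreGiveFromISR(s_millis_sem, &higher_prio_woken);
    if (higher_prio_woken == pdTRUE)
      portYIELD_FROM_ISR();
  } else {
    xSemaphoreGive(s_millis_sem);
  }
}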

Looking forward to your comments.

T.

Which version of ESPHome has the issue?

2023.4.0

What type of installation are you using?

Docker

Which version of Home Assistant has the issue?

No response

What platform are you using?

ESP32

Board

Lolin32-lite

Component causing the issue

scheduler

Example YAML snippet

esphome:
  name: boiler
  on_boot:
    priority: -100
    then:
      - script.execute:
          id: boiler_set_level
          lvl: '0'
      - switch.turn_off: wdt
      - delay: 200ms
      - switch.turn_on: wdt

esp32:
  board: lolin_d32
  framework:
    type: arduino

# Enable logging
logger:
  level: DEBUG

globals:
  - id: g_level_cur
    type: int
    initial_value: '0'
    restore_value: no
  - id: g_level_last
    type: int
    initial_value: '0'
    restore_value: no
  - id: g_t_lim
    type: float
    initial_value: '50.0'
    restore_value: yes
  - id: g_t_cur
    type: float
    initial_value: '50.0'
    restore_value: yes

script:
  # dump internal values
  - id: dumpvals
    then:
      - mqtt.publish:
          topic: boiler/level/current
          payload: !lambda 'return std::to_string(id(g_level_cur));'
      - mqtt.publish:
          topic: boiler/level/lastset
          payload: !lambda 'return std::to_string(id(g_level_last));'
      - mqtt.publish:
          topic: boiler/temp/current
          payload: !lambda 'return std::to_string(id(g_t_cur));'
      - mqtt.publish:
          topic: boiler/temp/limit
          payload: !lambda 'return std::to_string(id(g_t_lim));'

  # set heater level
  - id: boiler_set_level
    mode: queued
    max_runs: 3
    parameters: 
      lvl: std::string
    then:
      - lambda: |-
          char* endp;
          long level;
          if (lvl == "") {
            // empty param -> keep last
            level = -1;
          } else {
            level = strtol(lvl.c_str(), &endp, 10);
            if (*endp) {
              // conversion failed -> keep last
              level = -1;
            }
          }
          if ((level > 4) || (level < -1)) {
            // invalid number -> keep last
            level = -1;
          }
          if (level >= 0) {
            // valid setting, remember it
            id(g_level_last) = level;
          } else {
            // keep last, retrieve current level
            level = id(g_level_cur);
          }
          if (id(g_t_cur) >= id(g_t_lim)) {
            // temp limit reached, switch off
            id(mqtt_client).publish("boiler/msg", "WARNING! Temperature limit reached. Switching off.");
            level = 0;
          }

          // remember new level
          id(g_level_cur) = level;
          id(mqtt_client).publish("boiler/level/current", std::to_string(level));

          switch(level) {
          case 0:
            // off
            id(ch_1l).turn_off();
            id(ch_2l).turn_off();
            id(ch_3l).turn_off();
            id(ch_3n).turn_off();
            id(ch_n).turn_off();
            break;
          case 1:
            // 350W
            id(ch_2l).turn_off();
            id(ch_3l).turn_off();
            id(ch_n).turn_off();
            usleep(500000);
            id(ch_3n).turn_on();
            id(ch_1l).turn_on();
            break;
          case 2:
            // 700W
            id(ch_2l).turn_off();
            id(ch_3l).turn_off();
            id(ch_n).turn_on();
            id(ch_3n).turn_off();
            id(ch_1l).turn_on();
            break;
          case 3:
            // 1400W
            id(ch_3l).turn_off();
            id(ch_n).turn_on();
            id(ch_3n).turn_off();
            id(ch_1l).turn_on();
            id(ch_2l).turn_on();
            break;
          case 4:
            // 2000W
            id(ch_n).turn_on();
            id(ch_3n).turn_off();
            usleep(500000);
            id(ch_1l).turn_on();
            id(ch_2l).turn_on();
            id(ch_3l).turn_on();
            break;
          default:
            break;
          }

      - script.execute: boiler_wdt
      # Retrigger h/w WDT
      - switch.turn_off: wdt
      - delay: 200ms
      - switch.turn_on: wdt
      # - script.execute: dumpvals

  - id: boiler_wdt
    mode: restart
    then:
      - delay: 60s
      - mqtt.publish:
          topic: boiler/msg
          payload: "ERROR! Communication timeout. Switched heater off."
      - script.execute:
          id: boiler_set_level
          lvl: '0'

mqtt:
  id: mqtt_client
  broker: zaphod.disc.world

  on_message:
    - topic: boiler/level/set
      then:
        - lambda: |-
            id(boiler_set_level)->execute(x);

    - topic: boiler/temp/limit/set
      then:
        - globals.set:
            id: g_t_lim
            value: !lambda |-
              return (std::strtol(x.c_str(), NULL, 10));
        - lambda: |-
            id(boiler_set_level)->execute("");

    - topic: boiler/temp/current/set
      then:
        - globals.set:
            id: g_t_cur
            value: !lambda |-
              return (std::strtol(x.c_str(), NULL, 10));
        - lambda: |-
            id(boiler_set_level)->execute("");

    - topic: boiler/dumpvals
      then:
        - script.execute:
            id: dumpvals

# Enable Home Assistant API
#api:
#  password: ""

ota:
  password: ""

wifi:
  ssid: "discworld"
  password: "***************"

  # Enable fallback hotspot (captive portal) in case wifi connection fails
  ap:
    ssid: "Receiver Fallback Hotspot"
    password: "*************"

captive_portal:

web_server:
  local: true

switch:
  - platform: gpio
    pin: GPIO27
    id: ch_1l

  - platform: gpio
    pin: GPIO4
    id: ch_2l

  - platform: gpio
    pin: GPIO18
    id: ch_3l
    interlock: [ch_3n]
    interlock_wait_time: 500ms

  - platform: gpio
    pin: GPIO23
    id: ch_3n
    interlock: [ch_3l]
    interlock_wait_time: 500ms

  - platform: gpio
    pin: GPIO19
    id: ch_n

  - platform: gpio
    pin: GPIO16
    id: wdt

Anything in the logs that might be useful for us?

Sep 22 16:50:08 boiler/debug [D][switch:012]: 'wdt' Turning ON.
Sep 22 16:50:08 boiler/debug [D][switch:055]: 'wdt': Sending state ON
Sep 22 16:50:17 boiler/level/set 0
Sep 22 16:50:17 boiler/debug [D][scheduler:328]: Incrementing scheduler major
Sep 22 16:50:17 boiler/level/current 0
Sep 22 16:50:17 boiler/debug [D][switch:016]: 'ch_1l' Turning OFF.
Sep 22 16:50:17 boiler/debug [D][switch:016]: 'ch_2l' Turning OFF.
Sep 22 16:50:17 boiler/debug [D][switch:016]: 'ch_3l' Turning OFF.
Sep 22 16:50:17 boiler/debug [D][switch:016]: 'ch_3n' Turning OFF.
Sep 22 16:50:17 boiler/debug [D][switch:016]: 'ch_n' Turning OFF.
Sep 22 16:50:17 boiler/debug [D][script:077]: Script 'boiler_wdt' restarting (mode: restart)
Sep 22 16:50:17 boiler/debug [D][switch:016]: 'wdt' Turning OFF.
Sep 22 16:50:17 boiler/debug [D][switch:055]: 'wdt': Sending state OFF
Sep 22 16:50:27 boiler/level/set 0
Sep 22 16:50:37 boiler/level/set 0

Additional information

No response

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

DAVe3283 commented 2 weeks ago

I agree that there appears to be a race condition around the millis() overflow, but I am not able to answer your question about which mutex type is best/safe here.

I wonder if this is related to #5375 as well? Devices appear to sometimes crash right after "Incrementing scheduler major".