esphome / issues

Issue Tracker for ESPHome
https://esphome.io/
290 stars 35 forks source link

remote_transmitter crashes with cc1101 #2811

Closed dbuezas closed 2 years ago

dbuezas commented 2 years ago

The problem

If a signal is received during a transmission, the board crashes and restarts. This is particularly likely in my setup since I'm sending and receiving through the same pin (so the board listens to its own transmission).

The instability goes away if I add a lambda with noInterrupts(); before transmitting and then reenable them interrupts(); after transmitting, therefore:

IMPORTANT PART:

I'm quite convinced the issues come from the removal of the InterruptLock lock; line 2 versions ago here: https://github.com/esphome/esphome/pull/2476/files#diff-52dbad1a2913b8597dec15f3f5da39c37d486d94149220801ef3cca11b41638fL71

Which version of ESPHome has the issue?

2021.12.0

What type of installation are you using?

Home Assistant Add-on

Which version of Home Assistant has the issue?

2021.11

What platform are you using?

ESP8266

Board

d1-mini

Component causing the issue

remote_transmitter

Example YAML snippet

custom_component:
  - lambda: |-
      // CC1101Transciver(int SCK, int MISO, int MOSI, int CSN, int GDO0, float bandwidth, float freq)
      auto transciver_433_92 = new CC1101Transciver(D5, D6, D7, D3, D1, 158.03, 433.92);
      return {transciver_433_92};
    components:
      - id: transciver_433_92
remote_transmitter:
  - id: transmitter_rc
    pin: D1
    carrier_duty_percent: 100%

remote_receiver:
  - id: receiver_rc
    pin:
      number: D1
    dump:
      - raw
    tolerance: 50%
    filter: 35us
    idle: 2ms
    buffer_size: 1kb

binary_sensor:
  ########## rc1 ###########
  - platform: remote_receiver
    receiver_id: receiver_rc
    id: rc1_on
    name: "rc1 button on"
    filters:
      - delayed_off: 100ms
    on_release:
      - switch.turn_on: rc1_switch
    raw:
      code: [166,-500,...,-166,166]

  - platform: remote_receiver
    receiver_id: receiver_rc
    id: rc1_off
    name: "rc1 button off"
    filters:
      - delayed_off: 100ms
    on_release:
      - switch.turn_off: rc1_switch
    raw:
      code: [166,-500,...,-500,166]

switch:
  - platform: template
    name: rc1 Switch
    id: rc1_switch
    optimistic: true
    turn_on_action:
      - lambda: |-
          ((CC1101Transciver *)transciver_433_92)->beginTransmission();
      - switch.turn_on: s_led
      - remote_transmitter.transmit_raw:
          transmitter_id: transmitter_rc
          code: [166,-500,...,-500,166]
          repeat:
            times: 10
            wait_time: 4.733ms
      - lambda: |-
          ((CC1101Transciver *)transciver_433_92)->endTransmission();
      - switch.turn_off: s_led
    turn_off_action:
      - lambda: |-
          ((CC1101Transciver *)transciver_433_92)->beginTransmission();
      - switch.turn_on: s_led
      - remote_transmitter.transmit_raw:
          transmitter_id: transmitter_rc
          code: [166,-500,...,-500,166]
          repeat:
            times: 10
            wait_time: 4.733ms
      - lambda: |-
          ((CC1101Transciver *)transciver_433_92)->endTransmission();
      - switch.turn_off: s_led

Anything in the logs that might be useful for us?

No response

Additional information

~An easy way to replicate this without hardware should be:~ ~1. remote_receiver on any pin~ ~2. remote_transmitter on any other~ ~3. short those pins~ ~4. transmit a couple of times.~ ~It should crash and reset~ (update: this doesn't reproduce the issue)

my own setup:

I'm doing this: https://github.com/esphome/feature-requests/issues/59#issuecomment-974388062

BTW: I'd love to contribute with a component specific to this chip, (it allows to change the frequency and it has good range), but I didn't find a way to use the remote_receiver and remote_transmitter internally yet. If somebody has time to guide me a bit, I'll be glad to give it a try.

dbuezas commented 2 years ago

@CarlosGS, since this is your PR, I think you'll be interested in this. The new timings are super accurate btw, great work!, I had indeed accounted for the wrong timings in my codes.

CarlosGS commented 2 years ago

Hi David, thanks for letting me know. Not sure of the main cause though:

An easy way to replicate this without hardware should be:

  1. remote_receiver on any pin
  2. remote_transmitter on any other
  3. short those pins
  4. transmit a couple of times. It should crash and reset

I did test for this pretty thoroughly (see the 2 test codes in https://github.com/esphome/esphome/pull/2476#issuecomment-952188362), so the freeze cannot -should not- be replicated with this simple example. However, please try for yourself so we can be sure this is not the cause.

It could also be the use of the same GPIO pin for a receiver and a transmitter. If you try the above example please make sure to test this specific case.

Also, not sure of the workings of the CC1101Transciver custom component. So if you can find a minimal example that does replicate the bug that would help a lot. Cheers!

dbuezas commented 2 years ago

I just tried the two minimal examples I proposed earlier (receive pin = transmit pin and bridge pins externally) and you are right: it does not crash as I expected.

The custom component just wraps a library and keeps track of some variables. Here's what I call before and after each transmission:

  void beginTransmission(){
    // noInterrupts(); <--- with this it works ok
    ELECHOUSE_cc1101.setModul(_moduleNumber);
    ELECHOUSE_cc1101.SetTx();
    pinMode(_GDO0, OUTPUT);
  }
  void endTransmission(){
    pinMode(_GDO0, INPUT);
    ELECHOUSE_cc1101.setModul(_moduleNumber);
    ELECHOUSE_cc1101.SetRx();
    ELECHOUSE_cc1101.SetRx(); // yes, twice
    // interrupts(); <--- with this it works ok
  }

In any case, if I add the disable/enable interrupt lines it works perfectly so I assume the removed InterruptLock lock; here is what broke it for me. Was there any specific reason to remove it?

CarlosGS commented 2 years ago

Thanks for the update, let's see: the interrupt lock was in place for the old remote_transmitter implementation, because it used raw delays without considering the execution of the code itself. Hence interrupts would cause inconsistent waits providing less accuracy. However, the new approach uses relative timing, so interleaved interrupts don't affect it, and the lock is not needed anymore.

It is better to keep interrupts active because there are other ESPHome modules that use them, and they all need to work side by side. For example, PWM stops working while interrupts are disabled, and also sensors that use gpio interruptions can miss events and have less accuracy, etc. There's also ESP32 which doesn't block while sending, so the receive-while-send behavior is consistent.

I've looked into your custom component & the library, I can only see a possible problem with recursion in that you are receiving a command, that triggers a new sending, that can then trigger again, etc. Other than that, I can't find anything obvious. If you have the time, it would be nice to methodically remove parts to see what is the minimal code that can replicate the problem. Cheers! :ok_hand:

dbuezas commented 2 years ago

I understand now why it is gone, thanks for the detailed explanation. BTW, with the pins bridged, one can see that the reception timings are not only very precise (±1 μs), but also that the new implementation is able to receive while transmitting with no sweat. Really solid 👍

Regarding the recursion issue, yes it got that. I had actually done this on purpose so that the module acts as a relay of the normal remotes. But now, it listens to itself. I did add a guard against that (a boolean to filter out reception while busy transmitting) and that fixed the "echo" but not the crash.

I'll definitely follow your advice and reduce the configuration to the minimum with which I can reproduce the issue. I have 3 boards at 3 different frequencies working on the same board and around 30 codes used for both reception and transmission, there's a lot going on in the little board.

I'll report back with my findings when I have some time to play around with this again. It doesn't seem to be a bug from your improved code, but rather me hitting some limit due to the fact that the board can now do more in parallel. Luckily my contraption still works if I disable interrupts while transmitting myself, but I'm slightly sad b/c I intended to make a PR with a proper component out of this board at some point.

Cheers, thanks, and congrats again for your solid contribution!

dbuezas commented 2 years ago

Ok, I gave it a go, it looks like the crash happens when there is frequent SPI communication (in this case, the library uses SPI internally, NOT through esphome) Do you know how to generate an elf file so I can run the ESP exception decoder?

dbuezas commented 2 years ago

A detail that I forgot to mention, is that since the last version of ESPHome, I need to add SPI to the required libraries. I wonder if there is any incompatibility with that.

CarlosGS commented 2 years ago

Sorry I have not used the exception decoder nor SPI yet. Can't think of anything missing, but looking forward to learning what causes this ^^

dbuezas commented 2 years ago

Somebody helped me in discord and I finally managed to get the exception decoded. It is consistently the same one:


User exception (panic/abort/assert)
--------------- CUT HERE FOR EXCEPTION DECODER ---------------
Abort called

>>>stack>>>
WARNING Found stack trace! Trying to decode it

ctx: cont
sp: 3ffffe80 end: 3fffffc0 offset: 0000
WARNING Decoded 0x402204a2: raise_exception at core_esp8266_postmortem.cpp
WARNING Decoded 0x4024d6b7: wifi_station_get_connect_status
WARNING Decoded 0x402204b4: __assert_func
WARNING Decoded 0x4020d3c4: esphome::wifi::WiFiComponent::wifi_sta_connect_status_()
WARNING Decoded 0x4022ad5d: operator new(unsigned int) at /workdir/repo/gcc/libstdc++-v3/libsupc++/new_op.cc:57
WARNING Decoded 0x40208c04: esphome::remote_receiver::RemoteReceiverComponent::loop()
WARNING Decoded 0x40223134: esphome::Component::call_loop()
WARNING Decoded 0x402231dd: esphome::Component::call()
WARNING Decoded 0x4020dee5: esphome::Application::loop()
WARNING Decoded 0x40211714: loop
WARNING Decoded 0x4022005c: loop_wrapper() at core_esp8266_main.cpp
WARNING Decoded 0x401006e5: cont_wrapper
<<<stack<<<

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

 ets Jan  8 2013,rst cause:2, boot mode:(3,6)

load 0x4010f000, len 3584, room 16
tail 0
chksum 0xb0
csum 0xb0
v2843a5ac
~ld```

Does this tell you anything?
dbuezas commented 2 years ago

Also played a bit with when I disable interrupts and the problem is not the SPI communication. It works just fine if I disable interrupts just before doing the transmission:

Relevant yaml:

switch:
  - platform: template
    ...
    turn_on_action:
      - lambda: get_cc1101(transciver_433_92).beginTransmission();
      - remote_transmitter.transmit_raw:
          transmitter_id: transmitter_rc
          code: [.....]
      - lambda: get_cc1101(transciver_433_92).endTransmission();

Relevant part of the custom component:

void beginTransmission()  {
    ELECHOUSE_cc1101.setModul(_moduleNumber);
    ELECHOUSE_cc1101.SetTx();
    pinMode(_GDO0, OUTPUT);
    noInterrupts(); // <--- here, right before transmission
  }
  void endTransmission()  {
    interrupts(); // <--- here, right after transmission
    pinMode(_GDO0, INPUT);
    ELECHOUSE_cc1101.setModul(_moduleNumber);
    ELECHOUSE_cc1101.SetRx();
    ELECHOUSE_cc1101.SetRx(); // yes, twice
  }
dbuezas commented 2 years ago

This is quite weird, now I'm polling the RSSI of the chip via SPI 10 times a second and I see no crashes anymore. Here's the RSSI output plotted in real time (via home assistant) while I click some remote control buttons.

Kapture 2021-12-24 at 17 02 25 .

dbuezas commented 2 years ago

What i found now:

I'm starting to think that I am just getting out of RAM. I have all the codes in raw mode, those are probably kept in ram at all times for the receiver, right?

dbuezas commented 2 years ago

Mystery almost solved

It looks like it is RAM. I switched some of the raw codes to rc_switch_raw (way less RAM usage) and it doesn't crash anymore even without turning interrupts off during transmission! It still crashes due to the recursion thing causing a watchdog exception, which makes sense to some extent.

[14:39:45][D][remote_transmitter:074]: Sending remote code...
[14:39:54]
[14:39:54] ets Jan  8 2013,rst cause:4, boot mode:(3,6)
[14:39:54]
[14:39:54]wdt reset
[14:39:54]load 0x4010f000, len 3584, room 16
[14:39:54]tail 0
[14:39:54]chksum 0xb0
[14:39:54]csum 0xb0
[14:39:54]v2843a5ac
[14:39:54]~ld

Let me know if you think all of this makes sense and I think we can close the issue. It does surprise me a bit that the RAM usage would be so close to the limit.

CarlosGS commented 2 years ago

Very nice find David, these issues are so difficult to pinpoint/debug, would have never thought of RAM limitations. Thank you for sharing all the progress, sorry I wasn't able to help! I try to understand the platform in detail but there are always surprises with embedded development. I still can't wrap my mind around it entirely but your conclusion looks very reasonable.

And nice work with the multi receiver-transmitter approach, it can be really useful for many setups. Happy Holidays too! :D

dbuezas commented 2 years ago

Final Experiments

Thanks Carlos! I made a couple of experiments and it is definitely a RAM problem. I can reproduce the problem also without the CC1101 custom component if I have enough codes and receivers.

Random variations in my configuration would bring it to crash or keep it running so I made some experiments while logging the available RAM inside a lambda. The conclusion is that I am right at the limit, so any change makes it crash just out of chance (or rather what the compiler decides to optimise).

I made a bunch of runs and this is what I got: image

Then I did a linear regression by hand, which as you can see in the right-most column is quite precise.

Ergo: It looks like I'll have to put this thing inside an ESP32 if I want to listen to all remotes and door sensors I have.

Do you see any possibility to reduce the receivers memory footprint?

--

Anyway, thanks for keeping up with me Carlos! I've learned a lot through these experiments! 🎉 🎉

dbuezas commented 2 years ago

@CarlosGS I just realised that reducing the buffer size seems to have 4x the effect on available RAM:

I peeked into the receiver code, and the buffer size is actually the length of the timings buffer, which is uint32_t, so 4 bytes per element, makes total sense now.

This gives me the following idea: why not make the buffer store uint16_t and save half the space? 2^16 us is 65ms, which is a lot for a pulse.

It should only require making sure the time distance computations rollover correctly, and that longer pulses get discarded.

CarlosGS commented 2 years ago

That's also a very interesting find, in esp32 the limit is already 15 bit (https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/peripherals/rmt.html#transmit-data) so it would make sense to save that space in ESP8266 too. If you can achieve it that would be amazing :D The docs could be improved here as well; stating the buffer_size in kB is very misleading at the moment. Nice job!!

dbuezas commented 2 years ago

Then it looks like this may become my forst PR :) If I have enough time the coming days, I may even try a native integration of the CC1101 module.

Would you mind pointing me to the right link to setup the environment? I've been using the CLI up to now.

dbuezas commented 2 years ago

Mmm I have another enquiry... According to my tests, each code takes around 233 bytes. My codes are 50 integers long, and assuming they are stored as uint32 again, that's 4bytesx50=200 bytes, matching the experiment reasonably well. Looking at the generated main.cpp, the codes are actually stored as PROGMEM, so they shouldn't actually take up RAM. But they do! Am I missing something? There's also the extra binary sensor, but 200bytes for bookkeeping of a boolean seems excessive. Could it be that there's an oversight and the arrays are copied to RAM at some point? If I'm right and this is fixable, it would extend the maximum possible count of codes by a factor of a hundred. That would be nice...

CarlosGS commented 2 years ago

I don't have an answer to that second part either, but any memory improvements would be awesome.

Would you mind pointing me to the right link to setup the environment?

You can find the documentation here: https://esphome.io/guides/contributing.html#setting-up-development-environment There's also the discord server for ESPHome where there is a lot of activity and support from more experienced devs. Cheers!