Koenkk / zigbee-herdsman

A Node.js Zigbee library
MIT License
456 stars 277 forks source link

Adapters hardware flow control issues #943

Open Nerivec opened 2 months ago

Nerivec commented 2 months ago

This issue seems to affect all adapters supporting hardware flow control, even though the below uses the SkyConnect as example.

The SkyConnect has been exhibiting strange behaviors regarding hardware flow control (with both ezsp and ember), and a few issues have been raised about it. This issue regroups previously started discussions in one thread. I dug out a couple of reports I remembered, but more are buried in other issues...

Currently reported behaviors:

1) SkyConnect triggers RESET_WATCHDOG error on first contact (c20203d20a7e), then resets properly if a second attempt is made immediately after. 2) SkyConnect does not start if rtscts is enabled (same as above, but crashes soon after).

The issue is present in firmware as early as 7.3.2.0 build 212.

NOTE: With ezsp driver, 1.35.2 changes the behavior from having both rtscts & xon/xoff enabled at the same time (if rtscts enabled) to just rtscts if enabled, else xon/xoff. While the change fixed issues for some devices; it also brought out the issue with the SkyConnect, that does not seem to want to work without xon/xoff. before / after


@Rookman2022 ezsp / Home Assistant link

I have tested again Version 135.1-1 also runs with "rtscts: true" after the update to 135.3-1 the error appears, see log. I have installed the firmware from Nabu Casa on my SkyConnect V7.3.2.0 build 212

Logs ```logs Zigbee2MQTT:info 2024-02-20 11:14:35: Logging to console and directory: '/config/zigbee2mqtt/log/2024-02-20.11-14-35' filename: log.txt\ Zigbee2MQTT:info 2024-02-20 11:14:35: Starting Zigbee2MQTT version 1.35.3 (commit #unknown)\ Zigbee2MQTT:info 2024-02-20 11:14:35: Starting zigbee-herdsman (0.33.8)\ 2024-02-20T10:14:36.313Z zigbee-herdsman:adapter Failed to validate path: 'Error: spawn udevadm ENOENT'\ 2024-02-20T10:14:36.323Z zigbee-herdsman:adapter:ezsp:debg Adapter concurrent: 8\ 2024-02-20T10:14:36.326Z zigbee-herdsman:controller:log Starting with options '\{"network":\{"networkKeyDistribute":false,"networkKey\'93:[x,x,x,x,x,xx,xxx,,xxx,x,x,x,x,],\'93panID":49141,"extendedPanID\'93:[x,x,x,x,x,x,],\'93channelList":[20]\},"serialPort":\{"baudRate":115200,"rtscts":true,"path":"/dev/ttyUSB0","adapter":"ezsp"\},"databasePath":"/config/zigbee2mqtt/database.db","databaseBackupPath":"/config/zigbee2mqtt/database.db.backup","backupPath":"/config/zigbee2mqtt/coordinator_backup.json","adapter":\{"disableLED":true,"concurrent":null,"delay":null\}\}'\ 2024-02-20T10:14:36.341Z zigbee-herdsman:adapter:ezsp:uart Opening SerialPort with \{"path":"/dev/ttyUSB0","baudRate":115200,"rtscts":true,"autoOpen":false,"parity":"none","stopBits":1,"xon":false,"xoff":false\}\ 2024-02-20T10:14:36.385Z zigbee-herdsman:adapter:ezsp:uart Serialport opened\ 2024-02-20T10:14:36.387Z zigbee-herdsman:adapter:ezsp:uart Uart reseting\ 2024-02-20T10:14:36.389Z zigbee-herdsman:adapter:ezsp:uart --> Write reset\ 2024-02-20T10:14:36.393Z zigbee-herdsman:adapter:ezsp:uart --> [1ac038bc7e]\ 2024-02-20T10:14:36.397Z zigbee-herdsman:adapter:ezsp:uart -?- waiting reset\ 2024-02-20T10:14:37.678Z zigbee-herdsman:adapter:ezsp:uart <-- [c20203d20a7e]\ 2024-02-20T10:14:37.682Z zigbee-herdsman:adapter:ezsp:uart --> parsed c20203d20a7e\ 2024-02-20T10:14:37.685Z zigbee-herdsman:adapter:ezsp:uart <-- Error c20203d20a7e\ 2024-02-20T10:14:37.685Z zigbee-herdsman:adapter:ezsp:uart Uart reseting\ 2024-02-20T10:14:37.685Z zigbee-herdsman:adapter:ezsp:uart --> Write reset\ 2024-02-20T10:14:37.687Z zigbee-herdsman:adapter:ezsp:uart --> [1ac038bc7e]\ 2024-02-20T10:14:37.688Z zigbee-herdsman:adapter:ezsp:uart -?- waiting reset\ 2024-02-20T10:14:38.762Z zigbee-herdsman:adapter:ezsp:uart <-- [c1020b0a527e]\ 2024-02-20T10:14:38.762Z zigbee-herdsman:adapter:ezsp:uart --> parsed c1020b0a527e\ 2024-02-20T10:14:38.763Z zigbee-herdsman:adapter:ezsp:uart <-- RSTACK c1020b0a527e\ 2024-02-20T10:14:38.763Z zigbee-herdsman:adapter:ezsp:uart RSTACK Version: 2 Reason: RESET_SOFTWARE frame: c1020b0a527e\ 2024-02-20T10:14:38.766Z zigbee-herdsman:adapter:ezsp:uart -+- waiting reset success\ 2024-02-20T10:14:38.766Z zigbee-herdsman:adapter:ezsp:uart -+- waiting reset success\ ```

@newlund ezsp / Home Assistant link

I had a lot of problems starting zigbee2mqtt with my skyconnect (fw 7.3.2.0 build 212). Now I disabled RTS / CTS and it works like a charm every time.

Logs ```logs Zigbee2MQTT:info 2024-02-05 18:45:05: Starting Zigbee2MQTT version 1.35.2 (commit #unknown) Zigbee2MQTT:info 2024-02-05 18:45:05: Starting zigbee-herdsman (0.33.5) Zigbee2MQTT:debug 2024-02-05 18:45:05: Using zigbee-herdsman with settings: '{"adapter":{"concurrent":null,"delay":null,"disableLED":false},"backupPath":"/config/zigbee2mqtt/coordinator_backup.json","databaseBackupPath":"/config/zigbee2mqtt/database.db.backup","databasePath":"/config/zigbee2mqtt/database.db","network":{"channelList":[20],"extendedPanID":[221,221,221,221,221,221,221,221],"networkKey":"HIDDEN","panID":6754},"serialPort":{"adapter":"ezsp","baudRate":115200,"path":"/dev/serial/by-id/usb-Nabu_Casa_SkyConnect_v1.0_c8ec6f61dd93ed1181f07ff23b20a988-if00-port0","rtscts":true}}' 2024-02-05T17:45:05.585Z zigbee-herdsman:adapter Failed to validate path: 'Error: spawn udevadm ENOENT' 2024-02-05T17:45:05.586Z zigbee-herdsman:adapter:ezsp:debg Adapter concurrent: 8 2024-02-05T17:45:05.587Z zigbee-herdsman:controller:log Starting with options '{"network":{"networkKeyDistribute":false,"networkKey":[1,3,5,7,9,11,7,15,0,2,7,4,8,7,4,13],"panID":6754,"extendedPanID":[221,221,221,221,221,221,221,221],"channelList":[20]},"serialPort":{"baudRate":115200,"rtscts":true,"path":"/dev/serial/by-id/usb-Nabu_Casa_SkyConnect_v1.0_c8ec6f61dd93ed1181f07ff23b20a988-if00-port0","adapter":"ezsp"},"databasePath":"/config/zigbee2mqtt/database.db","databaseBackupPath":"/config/zigbee2mqtt/database.db.backup","backupPath":"/config/zigbee2mqtt/coordinator_backup.json","adapter":{"disableLED":false,"concurrent":null,"delay":null}}' 2024-02-05T17:45:05.590Z zigbee-herdsman:adapter:ezsp:uart Opening SerialPort with {"path":"/dev/serial/by-id/usb-Nabu_Casa_SkyConnect_v1.0_c8ec6f61dd93ed1181f07ff23b20a988-if00-port0","baudRate":115200,"rtscts":true,"autoOpen":false,"parity":"none","stopBits":1,"xon":false,"xoff":false} 2024-02-05T17:45:05.616Z zigbee-herdsman:adapter:ezsp:uart Serialport opened 2024-02-05T17:45:05.616Z zigbee-herdsman:adapter:ezsp:uart Uart reseting 2024-02-05T17:45:05.617Z zigbee-herdsman:adapter:ezsp:uart --> Write reset 2024-02-05T17:45:05.618Z zigbee-herdsman:adapter:ezsp:uart --> [1ac038bc7e] 2024-02-05T17:45:05.620Z zigbee-herdsman:adapter:ezsp:uart -?- waiting reset 2024-02-05T17:45:05.622Z zigbee-herdsman:adapter:ezsp:uart <-- [c20203d20a7e] 2024-02-05T17:45:05.623Z zigbee-herdsman:adapter:ezsp:uart --> parsed c20203d20a7e 2024-02-05T17:45:05.623Z zigbee-herdsman:adapter:ezsp:uart <-- Error c20203d20a7e 2024-02-05T17:45:05.623Z zigbee-herdsman:adapter:ezsp:uart Uart reseting 2024-02-05T17:45:05.624Z zigbee-herdsman:adapter:ezsp:uart --> Write reset 2024-02-05T17:45:05.624Z zigbee-herdsman:adapter:ezsp:uart --> [1ac038bc7e] 2024-02-05T17:45:05.625Z zigbee-herdsman:adapter:ezsp:uart -?- waiting reset 2024-02-05T17:45:06.724Z zigbee-herdsman:adapter:ezsp:uart <-- [c1020b0a527e] 2024-02-05T17:45:06.725Z zigbee-herdsman:adapter:ezsp:uart --> parsed c1020b0a527e 2024-02-05T17:45:06.725Z zigbee-herdsman:adapter:ezsp:uart <-- RSTACK c1020b0a527e 2024-02-05T17:45:06.725Z zigbee-herdsman:adapter:ezsp:uart RSTACK Version: 2 Reason: RESET_SOFTWARE frame: c1020b0a527e 2024-02-05T17:45:06.726Z zigbee-herdsman:adapter:ezsp:uart -+- waiting reset success 2024-02-05T17:45:06.726Z zigbee-herdsman:adapter:ezsp:uart -+- waiting reset success ```

@Mar1usW3 ember / Home Assistant link

Here, issue seems present with or without rtscts enabled... but appears only randomly on start/restart of Z2M (i.e. sometimes it works, sometimes it doesn't).

See https://github.com/Koenkk/zigbee2mqtt/discussions/21462#discussioncomment-8644908 for more detailed logs on testing custom inverted RTS/CTS firmware vs regular firmware.

Logs ```logs 2024-02-29T06:44:43.579Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 300/2500 2024-02-29T06:44:43.682Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 400/2500 2024-02-29T06:44:43.786Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 500/2500 2024-02-29T06:44:43.891Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 600/2500 2024-02-29T06:44:43.993Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 700/2500 2024-02-29T06:44:44.098Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 800/2500 2024-02-29T06:44:44.206Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 900/2500 2024-02-29T06:44:44.310Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 1000/2500 2024-02-29T06:44:44.412Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 1100/2500 2024-02-29T06:44:44.485Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=1ac20203d20a7e] 2024-02-29T06:44:44.487Z zigbee-herdsman:adapter:ember:uart:ash <--- [FRAME type=ERROR] Received ERROR from NCP while connecting, with code=RESET_WATCHDOG. ASH disconnected | NCP status: ASH_NCP_FATAL_ERROR 2024-02-29T06:44:44.487Z zigbee-herdsman:adapter:ember:uart:ash Error while parsing received frame in NOT_CONNECTED state (flags=0), status=ASH_NCP_FATAL_ERROR. ```

@Nerivec

This behavior seems a bit too familiar to the SLZB-07, and you mentioned the SkyConnect is similar firmware-wise. Any chance something weird is going on here too? Same watchdog reset, but this time quite randomly on start it seems... I've seen several people mention recently, that either their SkyConnect won't work with RTS/CTS, or the exact opposite, that it won't work without it (that one being the expected behavior). That would suggest some batches of the SkyConnect may not have the same hardware/wiring..? Any chance you had any feedback on this, or noticed this behavior too?

@darkxst

I have my suspicions that there maybe deeper issues with hw flow control in the lastest Gecko SDK's. From my testing with SLZB-07 everything was working with the previous rts/cts config, and its still working now I swapped the config to match schematics. I would expect the incorrect rts/cts config to throw hard errors, but only error observed is the WATCHDOG Error, else where ZHA, Multiprotocol etc seems to work fine with both configs without errors.

@Nerivec

The issue seems present in 7.3.x too though, according to the reports I have seen (https://github.com/Koenkk/zigbee2mqtt/issues/20868#issuecomment-1923283088). I suppose the inconsistencies could also come from either the drivers (i.e. potentially including subtle differences from system to system, explaining the perceived randomness), or the serial library serialport (but that one would affect everyone the same I would think). With NCP reset retry logic "hiding" that kind of error, as long as it eventually starts (only visible if you look closely at the logs... both Z2M & ZHA), it might have been happening for a while now on the affected setups...

@darkxst

Yeh definitely strange, I wonder if similar issues are happening on ZHA. @puddly have you seen any similar issues on ZHA? @Nerivec So I guess one hypothesis is that the watchdog is firing because there is no data flowing, and then after reset it somehow falls back to no flow control. If you can think of a way to validate that from the ASH/serial side and confirm if hardware flow control is actually working, that would be a start! I will later see if I can hookup my debugger to SLZB-07 for some further investigation, but wont be until next week probably.

@puddly

There has always been only a single board revision for the SkyConnect and it has hardware flow control: https://github.com/NabuCasa/silabs-firmware-builder/blob/dfbe532bb7533b5fac67c16a350558e805d485dc/manifests/skyconnect_ncp-uart-hw.yaml

@darkxst

The core issue seems to be this WATCHDOG error, this log is from SLZB-07 but apparently the same error is occurring on SkyConnect (both when using the config you linked above): https://github.com/Koenkk/zigbee2mqtt/discussions/21462#discussioncomment-8579822 I had thought the SLZB-07 has same pinout as SkyConnect, but when I checked SLZB schematics CTS/RTS were reversed compared to the SkyConnect config. So I rebuilt SLZB-07 with these pins swapped and the errors disappeared.


A few TODOs that I can think off...:

If anyone with the issue is able to test any of this, feedback is appreciated. I can also supply custom code on ember to try to eliminate a few possible causes (find me on Discord).

puddly commented 2 months ago

I've re-checked the SkyConnect PCB and all past firmwares. The RTS and CTS pins are correctly hooked up to the CP2102N, I do not believe they are swapped. All known firmwares (including the version the SkyConnect ships with) use the same pin configuration.

Similarly, I am not able to reproduce this problem with ZHA:

2024-03-01 15:03:56.364 ubuntu zigpy.serial DEBUG Opening a serial connection to '/dev/ttyUSB2' (115200 baudrate)
2024-03-01 15:03:56.369 ubuntu bellows.ezsp DEBUG Resetting EZSP
2024-03-01 15:03:56.369 ubuntu bellows.uart DEBUG Resetting ASH
2024-03-01 15:03:56.369 ubuntu bellows.uart DEBUG Sending: b'1ac038bc7e'
2024-03-01 15:03:57.458 ubuntu bellows.uart DEBUG RSTACK Version: 2 Reason: RESET_SOFTWARE frame: b'c1020b0a527e'

I am able to trigger a RESET_WATCHDOG once during firmware startup but every subsequent connect is RESET_SOFTWARE.

gam-st commented 2 months ago

The RTS and CTS pins are correctly hooked up to the CP2102N

what does "correctly" mean? Is it:

CTS -> PB00
RTS -> PB01

?

Nerivec commented 2 months ago

I am able to trigger a RESET_WATCHDOG once during firmware startup but every subsequent connect is RESET_SOFTWARE.

So the first contact (RST from a cold start) also triggers a RESET_WATCHDOG ERROR frame, then it starts without problem (as in, retry logic kicks in, clears the NCP failed state, then proper RSTACK is received)? What do you mean exactly by every subsequent connect?

It is very strange that for @Mar1usW3 the firmware with swapped pins immediately makes the error disappear, and as soon as a regular firmware is introduced the error reappears...

The issue seems predominant in "HA+Z2M addon", but that could just be because most users use that combo... Still, worth mentioning...

I suppose someone will also have to go through the serial libraries (JS & python...), to make sure everything's alright in there, nothing swapped, nothing improperly set...

puddly commented 2 months ago

what does "correctly" mean?

Correctly here I think means CP2102N CTS -> EFR32MG21 RTS (PB00) and CP2102N RTS -> EFR32MG21 CTS (PB01).

I can double check with a logic analyzer when I have a bit more time.

What do you mean exactly by every subsequent connect?

I am testing with zigpy-cli, which provides command line access to the zigpy libraries. It would be identical to restarting HA Core. If I connect after just plugging in the stick, the watchdog error is seen in logs. If I close the connection and re-connect after that, the watchdog error never appears again.

gam-st commented 2 months ago

what does "correctly" mean?

Correctly here I think means CP2102N CTS -> EFR32MG21 RTS (PB00) and CP2102N RTS -> EFR32MG21 CTS (PB01).

But linked in config different https://github.com/NabuCasa/silabs-firmware-builder/blob/dfbe532bb7533b5fac67c16a350558e805d485dc/manifests/skyconnect_ncp-uart-hw.yaml

  SL_IOSTREAM_USART_VCOM_CTS_PORT: gpioPortB
  SL_IOSTREAM_USART_VCOM_CTS_PIN:  1

  SL_IOSTREAM_USART_VCOM_RTS_PORT: gpioPortB
  SL_IOSTREAM_USART_VCOM_RTS_PIN:  0
Nerivec commented 2 months ago

Seems sky-swap firmware is indeed fixing the RESET_WATCHDOG init error... I've had several reports of it now.

@darkxst People are reporting the same issue with the HA Yellow... Found logs as old as 2022 that have the same error (just searched the web with c20203d20a7e... link, link), even if the retry logic bypasses it (most of the time anyway). Looks like this issue has been hiding for a while... (only found it with ember driver because I had initially disabled the init retry logic, to ensure hard fails during testing...)

gam-st commented 2 months ago

Looks like this issue has been hiding for a while...

Gentlemen, I don't understand why this has become a problem - isn't it a matter of settings in fw? E.g. these ones:

SL_IOSTREAM_USART_VCOM_CTS_PORT: gpioPortB
 SL_IOSTREAM_USART_VCOM_CTS_PIN:  1

 SL_IOSTREAM_USART_VCOM_RTS_PORT: gpioPortB
 SL_IOSTREAM_USART_VCOM_RTS_PIN:  0

so just swap them to

SL_IOSTREAM_USART_VCOM_CTS_PORT: gpioPortB
 SL_IOSTREAM_USART_VCOM_CTS_PIN:  0

 SL_IOSTREAM_USART_VCOM_RTS_PORT: gpioPortB
 SL_IOSTREAM_USART_VCOM_RTS_PIN:  1

or is something else behind this topic?

puddly commented 2 months ago

The fact that it eventually works with either configuration (and no configuration!) makes it sound like this is a software/kernel problem, not a hardware/firmware issue. The wiring diagram I've posted above is correct as far as I can tell. RTS and CTS are wired to the opposite pin, same as TX and RX (as in the datasheet):

image

Similarly, from the same CP2102N datasheet:

RTS, or Ready To Send, is an active-low output from the CP2102N and indicates to the external UART device that the CP2102N’s UART RX FIFO has not reached the FLOW OFF watermark level of 448 bytes and is ready to accept more data. When the amount of data in the RX FIFO reaches the watermark, the CP2102N pulls RTS high to indicate to the external UART device to stop sending data.

So as long as the CP2102N is able to receive data, it keeps the CP2102N RTS pin low. When it can't, it pulls it high. The CP2102N RTS pin is wired to the EFR32's CTS pin: the EFR32 assumes it is "clear to send" as long as the CP2102N is signaling "ready to send".


The watermark level being 448 bytes by default makes it sound like hardware flow control will never actually be used (i.e. the RTS and CTS pins will never toggle). The underlying problem is probably a CP2102N pin state configuration issue with the kernel driver or possibly a firmware bug keeping RTS pin in the wrong state until a reset. If that is the case, running with no flow control (i.e. no XONXOFF and no RTSCTS) should work fine.

I'll try it out on an oscilloscope in a few days.

agners commented 2 months ago

@Nerivec your description of the problem somehow triggered some neurons here: I've seen similar issues with ZHA/bellows too, very early on. Those got fixed with https://github.com/zigpy/bellows/pull/459.

From what I understand, the EmberZNet firmware by default sends out a reset reason. However, on most devices this reset reason just get sent into nirvana, since the other side is not (yet) listening at all (and there is no flow control, so the firmware just starts sending). For USB devices this essentially might be a race condition, e.g. the SiLabs chip starts quicker than the CP2102N can be enumerated an initialized by the operating system.

With Yellows/SkyConnects firmware/pin settings, the flow control is by default set to "not ready". Only when the host system actually opens the serial port, the flow control indicates "ready" such that bytes flow. That means you get this very first communication from the EmberZNet firmware (which is the reset reason).

Swapping the pins likely lead to a configuration where flow control is set to "read" from a firmware perspective. So the reset reason ended up in nirvana (and Z2M subsequently working).

Nerivec commented 2 months ago

That's not my area of expertise at all (I just gathered everything I could so someone else could figure this one out 😁) but my guess is, @puddly is right, it's OS-level stuff that's messing things up (even though the retry logic in ZHA/Z2M bypasses it most of the time). All I can say is, it seems very inconsistent; for some users, it never happens, for some, it always happens, for some, it happens sometimes, for some it happens but gets bypassed by retry logic, and for some it happens and retry logic sends the adapter into a tailspin. Seeing an ERROR frame in response to RST, definitely does not seem like normal firmware behavior though. If there was a "default" reset reason automatically sent by firmware even before the host sends the RST, I'd expect it to be another RSTACK?

The one thing that definitely seems to remove the problem entirely at the moment -no more retries at all- (confirmed on SkyConnect, Yellow, and SLZB-07) is firmware with software flow control only...

agners commented 2 months ago

Seeing an ERROR frame in response to RST, definitely does not seem like normal firmware behavior though.

From what I remember, this is not in response to RST. I think this is sent before you even send a RST. Just open the port, and you'll see it. Now if you send a RST directly on open this might "look" like it is the respond to RST. But in the end that is the "beauty" of UART - universal asynchronous receiver / transmitter :sweat_smile: :see_no_evil:

Now if that RSTACK frame is "normal firmware behavior" we can debate over, we'd probably have to ask SiLabs. From Silicon Labs own UG101: UART-EZSP Gateway Protocol Reference, it seems so:

  1. CONNECTED State The Host initializes the ASH protocol by resetting the NCP through either the nRESET pin or sending the RST frame. When the NCP is fully booted and ready to interact with the Host, the NCP will send the RSTACK frame to the Host.

Silicon Labs essentially gives two reset option, either through the hardware nRESET pin or the RST frame. It seems that in both cases the firmware is expected to send a RSTACK frame ("the NCP will send the RSTACK frame to the Host").

Now when you plug-in or (boot-up) a computer with a SkyConnect, on the SkyConnect the nRESET pin is essentially asserted. So inside the SkyConnect, the NCP firmware boots up an sends the RSTACK frame. Now since SkyConnect has proper hardware flow control, that RSTACK frame actually waits on the transmit buffer until the host side (Linux, Z2M/bellows or whoever is going to interact with the stick) opens the serial port. So the first thing you'll see is that RSTACK frame.

That initial RSTACK "error" frame is just the firmware telling you that it got freshly out of reset.

In the end Z2M (just like bellows) should properly process that frame, even when it is the first thing it sees. Ideally probably it should skip sending a RST in that case (as the frame is telling the host that the firmware just got booted, no need to send another reset). But then, just issuing another RST certainly also doesn't hurt... It just will slow down initialization by $firmware-statup-time...

But yeah, during regular operation a RSTACK frame should be consider problematic, especially if RESET_WATCHDOG is the reason, as it would mean the firmware really crashed for some reason.

The one thing that definitely seems to remove the problem entirely at the moment -no more retries at all- (confirmed on SkyConnect, Yellow, and SLZB-07) is firmware with software flow control only...

Afaik, software flow control starts with XON by default, meaning the flow is enabled. So this means the firmware on startup just starts to send things (like this RSTACK) even though the host is not listening yet. Which in the end is why you don't see the frame on startup. No big deal, as you'll send a RST and therfore reinitialize communication. But that explains why software flow control seems to fix the problem for you.

but my guess is, @puddly is right, it's OS-level stuff that's messing things up

Not sure what you refer to here exactly, but as I've noted in the linked PR, the OS can influence behavior indeed, quote from https://github.com/zigpy/bellows/pull/459:

Note: With hardware flow control this is only the case if the tty "hupcl" option is set. The option is set by default, but cleared by tools like GNU screen. This option makes sure that the RTS signal is deasserted while the port is closed. Pyserial/bellows does not change the state of that option.

In other words, the OS might have left the serial port in a state where "ready" is signaled through flow control. In that case you'll not get the initial RSTACK frame. So depending on what touched the serial port before, YMMV!

But ideally, any software communicating with the EmberZNet firmware through Linux tty layer/serial port should not make assumption about how the serial port was/is configured.

Nerivec commented 2 months ago

But that explains why software flow control seems to fix the problem for you.

Just to be clear (for the below), I don't have a device with hw flow control at the moment (that sure is not making it easy to test any of this, but one is in the mail!)... These are only reports I got from other users, that I relayed to firmware builders (who certainly know more on the subject than I currently do 😄)

Just open the port, and you'll see it.

I can only refer to the logs for the moment... Can you confirm this on your end? Assuming the serialport logic is doing its job, I don't think it would be the case here. (In ember) The port is opened, then flushed, then the RST is sent, and only several ms later we get a response from the NCP that it entered the failed state (even though the serial parser has been registered from the moment the port was opened):

2024-02-29T06:44:43.579Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 300/2500
2024-02-29T06:44:43.682Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 400/2500
2024-02-29T06:44:43.786Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 500/2500
2024-02-29T06:44:43.891Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 600/2500
2024-02-29T06:44:43.993Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 700/2500
2024-02-29T06:44:44.098Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 800/2500
2024-02-29T06:44:44.206Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 900/2500
2024-02-29T06:44:44.310Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 1000/2500
2024-02-29T06:44:44.412Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 1100/2500
2024-02-29T06:44:44.485Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=1ac20203d20a7e]
2024-02-29T06:44:44.487Z zigbee-herdsman:adapter:ember:uart:ash <--- [FRAME type=ERROR]
Received ERROR from NCP while connecting, with code=RESET_WATCHDOG.

That initial RSTACK "error" frame is just the firmware telling you that it got freshly out of reset.

What doesn't look right to me, is that it's not coming as an RSTACK (0xC1) at all, it's an ERROR frame (0xC2), which means the NCP is in failed state at that point...

Nerivec commented 2 months ago

@darkxst Any chance you had some time to debug on the SLZB-07?

darkxst commented 2 months ago

@darkxst Any chance you had some time to debug on the SLZB-07?

No not yet, been a bit busy with other projects this week.

What doesn't look right to me, is that it's not coming as an RSTACK (0xC1) at all, it's an ERROR frame (0xC2), which means the NCP is in failed state at that point...

From the UG101 doc linked above, it sounds like its ok to throw away these initial errors, until a valid RSTACK frame is received.

Due to possible I/O buffering, it is important to note that the Host could receive several valid or invalid frames after triggering a reset of the NCP. The Host must discard all frames and errors until a valid RSTACK frame is received. The Host must also allow a certain amount of time to receive the RSTACK frame; the parameter T_RSTACK_MAX defines this timeout. If the Host does not receive the RSTACK frame within the timeout period, the Host should retry the reset process up to five times in case of noise interfering with the RST or RSTACK frames.

agners commented 2 months ago

But that explains why software flow control seems to fix the problem for you.

Just to be clear (for the below), I don't have a device with hw flow control at the moment (that sure is not making it easy to test any of this, but one is in the mail!)... These are only reports I got from other users, that I relayed to firmware builders (who certainly know more on the subject than I currently do 😄)

Kk, I see, so substitute "fix the problem for you" with "fix the problem for some folks" :sweat_smile:

Just open the port, and you'll see it.

I can only refer to the logs for the moment... Can you confirm this on your end?

I tried quickly with a SkyConnect and it's vanilla firmware 7.1.1.0, and actually was not able to reproduce:

stty -F /dev/ttyUSB1 115200 raw crtscts min 0
cat /dev/ttyUSB1 | hexdump -C

My thoughts are mostly from memory + what I've documented in those issue reports. Looking at those I am pretty sure that I did all those tests on Yellow back then. I am guessing SkyConnect could really behave different here.

Assuming the serialport logic is doing its job, I don't think it would be the case here. (In ember) The port is opened, then flushed

This flushes the local buffers. If the port just got opened, (and with that hardware flow control released), it might be that those bytes come in after the flush....

then the RST is sent, and only several ms later we get a response from the NCP that it entered the failed state

At a baudrate of 115200 a character is sent every ~0.1ms (approximately, 8 bits + start/stop bits). So such an error frame of 6 bytes length should be transferred well within 1ms. If the response is "several milliseconds later", this feels like the failed state is in response to the reset, I agree. Especially since I can't see those bytes on SkyConnect.

What doesn't look right to me, is that it's not coming as an RSTACK (0xC1) at all, it's an ERROR frame (0xC2), which means the NCP is in failed state at that point...

Uh, I see, this is also different from my observations in https://github.com/zigpy/bellows/pull/459#issuecomment-1122715083, where the raw frame was c102029b7b7e :thinking:

So really looks like it is something different what is observed on SkyConnect.

Nerivec commented 2 months ago

"fix the problem for some folks"

That's it 😁 By the way, re-reading this thread for the nth time, I noticed I forgot to mention you in the first post, sorry about that. I was sure I did... neurons of my own... misfiring...

I ran some tests on the Dongle-E 7.4.1 (out of curiosity, to see exactly what happens with xon/xoff only). I can see a RSTACK(RESET_POWERON) consistently when plugged in, which, seems to be proper (of course, by the time Z2M/ZHA starts up, it has been thrown into oblivion). I never was able to see anything but that RSTACK though, even after pilling up all the usual "don't-do-that" interferences I could, to make the connection "bad", see how much "noise" I could get. 😅

If the response is "several milliseconds later"

Definitely, from the two dozen or so logs I've gone through on this, the fastest I've seen it coming in was approx. 600ms post-RST, most times it's about 1sec (which is about the time it takes for a successful RSTACK to come in too).

From my observations, most times, the retry logic takes care of sending another RST upon receipt of the ERROR(RESET_WATCHDOG) to get out of failed state, then we get a proper RSTACK, so I suppose that would be somewhat alright (that's what ZHA/Z2M currently do), but every so often, that dog seems to be playing tricks, because we get (without closing/re-opening the port between retries):

Yellow with `rtscts: false`, `xon: true`, `xoff: true` ``` ======== EZSP starting ======== ======== ASH NCP reset ======== 2024-03-03T02:23:13.958Z zigbee-herdsman:adapter:ember:uart:ash RTS/CTS config is off, enabling software flow control. 2024-03-03T02:23:13.959Z zigbee-herdsman:adapter:ember:uart:ash Opening serial port with {"path":"/dev/ttyAMA1","baudRate":115200,"rtscts":false,"autoOpen":false,"parity":"none","stopBits":1,"xon":true,"xoff":true} 2024-03-03T02:23:13.981Z zigbee-herdsman:adapter:ember:uart:ash Serial port opened ======== ASH starting ======== 2024-03-03T02:23:13.986Z zigbee-herdsman:adapter:ember:uart:ash ---> [FRAME type=RST] 2024-03-03T02:23:13.988Z zigbee-herdsman:adapter:ember:uart:ash:writer >>>> [FRAME raw=1ac038bc7e] 2024-03-03T02:23:13.992Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 0/2500 2024-03-03T02:23:14.096Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 100/2500 2024-03-03T02:23:14.197Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 200/2500 2024-03-03T02:23:14.298Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 300/2500 2024-03-03T02:23:14.399Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 400/2500 2024-03-03T02:23:14.501Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 500/2500 2024-03-03T02:23:14.601Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 600/2500 2024-03-03T02:23:14.702Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 700/2500 2024-03-03T02:23:14.803Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 800/2500 2024-03-03T02:23:14.904Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 900/2500 2024-03-03T02:23:14.925Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=1ac20203d20a7e] 2024-03-03T02:23:14.928Z zigbee-herdsman:adapter:ember:uart:ash <--- [FRAME type=ERROR] Received ERROR from NCP while connecting, with code=RESET_WATCHDOG. ASH disconnected | NCP status: ASH_NCP_FATAL_ERROR 2024-03-03T02:23:14.930Z zigbee-herdsman:adapter:ember:uart:ash Error while parsing received frame in NOT_CONNECTED state (flags=0), status=ASH_NCP_FATAL_ERROR. 2024-03-03T02:23:15.006Z zigbee-herdsman:adapter:ember:uart:ash ---> [FRAME type=RST] 2024-03-03T02:23:15.007Z zigbee-herdsman:adapter:ember:uart:ash:writer >>>> [FRAME raw=1ac038bc7e] 2024-03-03T02:23:15.007Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 0/2500 ======== ASH NCP reset ======== ======== ASH starting ======== 2024-03-03T02:23:15.109Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 100/2500 2024-03-03T02:23:15.210Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 200/2500 2024-03-03T02:23:15.311Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 300/2500 2024-03-03T02:23:15.413Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 400/2500 2024-03-03T02:23:15.514Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 500/2500 2024-03-03T02:23:15.615Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 600/2500 2024-03-03T02:23:15.716Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 700/2500 2024-03-03T02:23:15.817Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 800/2500 2024-03-03T02:23:15.918Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 900/2500 2024-03-03T02:23:16.013Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=1ac1020b0a527e] 2024-03-03T02:23:16.014Z zigbee-herdsman:adapter:ember:uart:ash <--- [FRAME type=RSTACK] ======== ASH connected ======== ======== ASH started ======== ======== EZSP started ======== 2024-03-03T02:23:16.024Z zigbee-herdsman:adapter:ember:ezsp ===> [FRAME: ID=0:"VERSION" Seq=0 Len=4] 2024-03-03T02:23:16.026Z zigbee-herdsman:adapter:ember:uart:ash ---> [FRAME type=ACK frmRx=0] 2024-03-03T02:23:16.028Z zigbee-herdsman:adapter:ember:uart:ash ---> [FRAME type=DATA frmTx=0 frmRx=0] 2024-03-03T02:23:16.029Z zigbee-herdsman:adapter:ember:uart:ash:writer >>>> [FRAME raw=8070787e004221a8597c057e] 2024-03-03T02:23:17.104Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=1ac1020b0a527e] Frame(s) in progress cancelled in [1ac1020b0a527e] 2024-03-03T02:23:17.107Z zigbee-herdsman:adapter:ember:uart:ash <--- [FRAME type=RSTACK] Received unexpected reset from NCP, with reason=RESET_SOFTWARE. ASH disconnected: ASH_ERROR_NCP_RESET | NCP status: ASH_NCP_FATAL_ERROR 2024-03-03T02:23:17.109Z zigbee-herdsman:adapter:ember:uart:ash Error while parsing received frame in NOT_CONNECTED state (flags=0), status=HOST_FATAL_ERROR. 2024-03-03T02:23:30.431Z zigbee-herdsman:adapter:ember:ezsp =x=> [FRAME: ID=0:"VERSION" Seq=0 Len=4] Error: Error: timed out after 14400ms ERROR Transaction failure; status=undefined. Last: [FRAME: ID=0:"VERSION" Seq=0 Len=4]. !!! NCP FATAL ERROR reason=undefined. ATTEMPTING RESET... !!! ┌──────────────┬──────────┬─────────────┐ │ (index) │ Received │ Transmitted │ ├──────────────┼──────────┼─────────────┤ │ Total frames │ 2 │ 3 │ │ Cancelled │ 1 │ 0 │ │ DATA frames │ 0 │ 1 │ │ DATA bytes │ 0 │ 4 │ │ Retry frames │ 0 │ 0 │ │ ACK frames │ 0 │ 1 │ │ NAK frames │ 0 │ 0 │ │ nRdy frames │ 0 │ 0 │ └──────────────┴──────────┴─────────────┘ ┌──────────────────┬──────────┐ │ (index) │ Received │ ├──────────────────┼──────────┤ │ CRC errors │ 0 │ │ Comm errors │ 0 │ │ Length < minimum │ 0 │ │ Length > maximum │ 0 │ │ Bad controls │ 0 │ │ Bad lengths │ 0 │ │ Bad ACK numbers │ 0 │ │ Out of buffers │ 0 │ │ Retry dupes │ 0 │ │ Out of sequence │ 0 │ │ ACK timeouts │ 0 │ └──────────────────┴──────────┘ Zigbee2MQTT:error 2024-03-02 20:23:30: Error while starting zigbee-herdsman Zigbee2MQTT:error 2024-03-02 20:23:30: Failed to start zigbee Zigbee2MQTT:error 2024-03-02 20:23:30: Check https://www.zigbee2mqtt.io/guide/installation/20_zigbee2mqtt-fails-to-start.html for possible solutions Zigbee2MQTT:error 2024-03-02 20:23:30: Exiting... Zigbee2MQTT:error 2024-03-02 20:23:30: Error: at Ezsp.ezspVersion (/app/node_modules/zigbee-herdsman/src/adapter/ember/ezsp/ezsp.ts:1306:19) at EmberAdapter.emberVersion (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:1746:66) at EmberAdapter.initEzsp (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:789:9) at EmberAdapter.start (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:2719:24) at Controller.start (/app/node_modules/zigbee-herdsman/src/controller/controller.ts:132:29) at Zigbee.start (/app/lib/zigbee.ts:62:27) at Controller.start (/app/lib/controller.ts:109:27) at start (/app/index.js:107:5) Port closed. Error? no 2024-03-03T02:23:30.776Z zigbee-herdsman:adapter:ember:uart:ash Serial port closed. ======== ASH stopped ======== ======== EZSP stopped ======== ======== Ember Adapter Stopped ======== ```
Yellow with `rtscts: true`, `xon: false`, `xoff: false` ``` ======== EZSP starting ======== ======== ASH NCP reset ======== 2024-03-03T17:53:42.097Z zigbee-herdsman:adapter:ember:uart:ash Opening serial port with {"path":"/dev/ttyAMA1","baudRate":115200,"rtscts":true,"autoOpen":false,"parity":"none","stopBits":1,"xon":false,"xoff":false} 2024-03-03T17:53:42.126Z zigbee-herdsman:adapter:ember:uart:ash Serial port opened ======== ASH starting ======== 2024-03-03T17:53:42.134Z zigbee-herdsman:adapter:ember:uart:ash ---> [FRAME type=RST] 2024-03-03T17:53:42.138Z zigbee-herdsman:adapter:ember:uart:ash:writer >>>> [FRAME raw=1ac038bc7e] 2024-03-03T17:53:42.144Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 0/2500 2024-03-03T17:53:42.248Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 100/2500 2024-03-03T17:53:42.350Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 200/2500 2024-03-03T17:53:42.451Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 300/2500 2024-03-03T17:53:42.553Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 400/2500 2024-03-03T17:53:42.654Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 500/2500 2024-03-03T17:53:42.755Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 600/2500 2024-03-03T17:53:42.856Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 700/2500 2024-03-03T17:53:42.874Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=1ac20203d20a7e] 2024-03-03T17:53:42.878Z zigbee-herdsman:adapter:ember:uart:ash <--- [FRAME type=ERROR] Received ERROR from NCP while connecting, with code=RESET_WATCHDOG. ASH disconnected | NCP status: ASH_NCP_FATAL_ERROR 2024-03-03T17:53:42.880Z zigbee-herdsman:adapter:ember:uart:ash Error while parsing received frame in NOT_CONNECTED state (flags=0), status=ASH_NCP_FATAL_ERROR. ======== ASH NCP reset ======== ======== ASH starting ======== 2024-03-03T17:53:42.960Z zigbee-herdsman:adapter:ember:uart:ash ---> [FRAME type=RST] 2024-03-03T17:53:42.961Z zigbee-herdsman:adapter:ember:uart:ash:writer >>>> [FRAME raw=1ac038bc7e] 2024-03-03T17:53:42.961Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 0/2500 2024-03-03T17:53:43.062Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 100/2500 2024-03-03T17:53:43.163Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 200/2500 2024-03-03T17:53:43.265Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 300/2500 2024-03-03T17:53:43.366Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 400/2500 2024-03-03T17:53:43.467Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 500/2500 2024-03-03T17:53:43.569Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 600/2500 2024-03-03T17:53:43.669Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 700/2500 2024-03-03T17:53:43.768Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 800/2500 2024-03-03T17:53:43.869Z zigbee-herdsman:adapter:ember:uart:ash Waiting for RSTACK... 900/2500 2024-03-03T17:53:43.962Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=1ac1020b0a527e] 2024-03-03T17:53:43.962Z zigbee-herdsman:adapter:ember:uart:ash <--- [FRAME type=RSTACK] ======== ASH connected ======== ======== ASH started ======== ======== EZSP started ======== 2024-03-03T17:53:43.973Z zigbee-herdsman:adapter:ember:ezsp ===> [FRAME: ID=0:"VERSION" Seq=0 Len=4] 2024-03-03T17:53:43.975Z zigbee-herdsman:adapter:ember:uart:ash ---> [FRAME type=ACK frmRx=0] 2024-03-03T17:53:43.976Z zigbee-herdsman:adapter:ember:uart:ash ---> [FRAME type=DATA frmTx=0 frmRx=0] 2024-03-03T17:53:43.977Z zigbee-herdsman:adapter:ember:uart:ash:writer >>>> [FRAME raw=8070787e004221a8597c057e] 2024-03-03T17:53:45.053Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=1ac1020b0a527e] Frame(s) in progress cancelled in [1ac1020b0a527e] 2024-03-03T17:53:45.055Z zigbee-herdsman:adapter:ember:uart:ash <--- [FRAME type=RSTACK] Received unexpected reset from NCP, with reason=RESET_SOFTWARE. ASH disconnected: ASH_ERROR_NCP_RESET | NCP status: ASH_NCP_FATAL_ERROR 2024-03-03T17:53:45.057Z zigbee-herdsman:adapter:ember:uart:ash Error while parsing received frame in NOT_CONNECTED state (flags=0), status=HOST_FATAL_ERROR. 2024-03-03T17:53:45.060Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=0142a1a8592805c6a8777e] 2024-03-03T17:53:45.060Z zigbee-herdsman:adapter:ember:uart:ash Error while parsing received frame in NOT_CONNECTED state (flags=0), status=HOST_FATAL_ERROR. 2024-03-03T17:53:46.650Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=0942a1a8592805c6b6ad7e] 2024-03-03T17:53:46.651Z zigbee-herdsman:adapter:ember:uart:ash Error while parsing received frame in NOT_CONNECTED state (flags=0), status=HOST_FATAL_ERROR. 2024-03-03T17:53:49.839Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=0942a1a8592805c6b6ad7e] 2024-03-03T17:53:49.840Z zigbee-herdsman:adapter:ember:uart:ash Error while parsing received frame in NOT_CONNECTED state (flags=0), status=HOST_FATAL_ERROR. 2024-03-03T17:53:53.026Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=0942a1a8592805c6b6ad7e] 2024-03-03T17:53:53.026Z zigbee-herdsman:adapter:ember:uart:ash Error while parsing received frame in NOT_CONNECTED state (flags=0), status=HOST_FATAL_ERROR. 2024-03-03T17:53:56.213Z zigbee-herdsman:adapter:ember:uart:ash:parser <<<< [FRAME raw=c20251a8bd7e] 2024-03-03T17:53:56.213Z zigbee-herdsman:adapter:ember:uart:ash Error while parsing received frame in NOT_CONNECTED state (flags=0), status=HOST_FATAL_ERROR. 2024-03-03T17:53:58.379Z zigbee-herdsman:adapter:ember:ezsp =x=> [FRAME: ID=0:"VERSION" Seq=0 Len=4] Error: Error: timed out after 14400ms ERROR Transaction failure; status=undefined. Last: [FRAME: ID=0:"VERSION" Seq=0 Len=4]. !!! NCP FATAL ERROR reason=undefined. ATTEMPTING RESET... !!! ┌──────────────┬──────────┬─────────────┐ │ (index) │ Received │ Transmitted │ ├──────────────┼──────────┼─────────────┤ │ Total frames │ 2 │ 3 │ │ Cancelled │ 1 │ 0 │ │ DATA frames │ 0 │ 1 │ │ DATA bytes │ 0 │ 4 │ │ Retry frames │ 0 │ 0 │ │ ACK frames │ 0 │ 1 │ │ NAK frames │ 0 │ 0 │ │ nRdy frames │ 0 │ 0 │ └──────────────┴──────────┴─────────────┘ ┌──────────────────┬──────────┐ │ (index) │ Received │ ├──────────────────┼──────────┤ │ CRC errors │ 0 │ │ Comm errors │ 0 │ │ Length < minimum │ 0 │ │ Length > maximum │ 0 │ │ Bad controls │ 0 │ │ Bad lengths │ 0 │ │ Bad ACK numbers │ 0 │ │ Out of buffers │ 0 │ │ Retry dupes │ 0 │ │ Out of sequence │ 0 │ │ ACK timeouts │ 0 │ └──────────────────┴──────────┘ Zigbee2MQTT:error 2024-03-03 11:53:58: Error while starting zigbee-herdsman Zigbee2MQTT:error 2024-03-03 11:53:58: Failed to start zigbee Zigbee2MQTT:error 2024-03-03 11:53:58: Check https://www.zigbee2mqtt.io/guide/installation/20_zigbee2mqtt-fails-to-start.html for possible solutions Zigbee2MQTT:error 2024-03-03 11:53:58: Exiting... Zigbee2MQTT:error 2024-03-03 11:53:58: Error: at Ezsp.ezspVersion (/app/node_modules/zigbee-herdsman/src/adapter/ember/ezsp/ezsp.ts:1306:19) at EmberAdapter.emberVersion (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:1746:66) at EmberAdapter.initEzsp (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:789:9) at EmberAdapter.start (/app/node_modules/zigbee-herdsman/src/adapter/ember/adapter/emberAdapter.ts:2719:24) at Controller.start (/app/node_modules/zigbee-herdsman/src/controller/controller.ts:132:29) at Zigbee.start (/app/lib/zigbee.ts:62:27) at Controller.start (/app/lib/controller.ts:109:27) at start (/app/index.js:107:5) Port closed. Error? no 2024-03-03T17:53:58.769Z zigbee-herdsman:adapter:ember:uart:ash Serial port closed. ======== ASH stopped ======== ======== EZSP stopped ======== ======== Ember Adapter Stopped ======== ```
puddly commented 1 week ago

@Nerivec could you make some reproducible test script with herdsman that gets stuck during startup like this? I've yet to be able to reproduce this issue a single time on any platform I use.

Nerivec commented 1 week ago

The "getting stuck" seems to be affecting the Yellow the most for some reason (see this recent one), some kind of "panic" around the third reset attempt... I've moved the ACKing a bit earlier to avoid version being triggered before that happens, it should be able to try again now, hopefully getting rid of the "panic". There is a dozen or so older logs in here too. Doesn't seem to be a problem is non-virtualized environments (at least I haven't heard of one yet), and also no error at all (immediate clean start) with hw flow control disabled in firmware (definitely that ERROR(RESET_WATCHDOG) linked to hw flow control that's making a mess).

I pushed several fixes to attempt to get around it so previous logs won't match current logic in dev (don't have a Yellow, can't confirm yet, but automated tests say it should work...). Doesn't solve the problem, but hopefully works around it until then.

Have you been able to identify what's triggering that watchdog ERROR frame in the firmware?

puddly commented 1 week ago

Doesn't seem to be a problem is non-virtualized environments

HA OS isn't virtualized. Once the device is passed through to the Docker container, it should behave identically to one accessed from outside the container (since you're passing the device file itself through, there's no passthrough USB or serial driver involved like with a VM). Home Assistant within HA OS runs in a container just like other addons so there should be no difference in functionality.

Regarding that linked issue, the poster said this in a comment:

If I change the adapter to "ezsp" z2m starts fine.

It sounds to me like something in the new ember driver's startup sequence has changed and differs from both ezsp and bellows.

Have you been able to identify what's triggering that watchdog ERROR frame in the firmware?

Not without being able to replicate the issue, unfortunately. Once I can get a hold of a reproducible test case I can start digging into it with a development kit or with a SWD debugger attached but I'd like to avoid doing so if a fix can be done in software 😄.

Nerivec commented 1 week ago

I am able to trigger a RESET_WATCHDOG once during firmware startup but every subsequent connect is RESET_SOFTWARE.

https://github.com/Koenkk/zigbee-herdsman/issues/943#issuecomment-1973866728 This initial "failed state" (since it's an ERROR frame) is happening in ZHA and Z2M (even if there are workarounds used to bypass it during startup)... This firmware codepath does not happen in sw flow control, since the only frame you receive (if you listen right away ofc) is RSTACK(RESET_POWERON). Could be a simple case of wrong variables used I guess?

Regarding containers, there must be a difference somewhere in the handing over, because from all the tests I've ran (no matter the adapter, or hw flow control setting), the startup sequence is almost always "messier" (reset almost never succeeds on first try -especially with Z2M as HA add-on, on first start of add-on-). That never happens in bare setup (I've never been able to trigger that even once on bare debian or even windows).

The startup sequence is entirely different in ember since it's modeled from silabs' own not from bellows (like ezsp was). Several issues should now be fixed to allow the reset to retry in this scenario, but I hate piling on workarounds when a fix for the root cause would entirely eliminate the problem (and make for clean starts everywhere 😄).

puddly commented 1 week ago

Just to make sure we're on the same page here, here are the only two startup sequences I'm aware of for newer firmwares:

Sequence 1, fresh connect (1.3 seconds):

$ zigpy -vvv radio --baudrate 115200 ezsp /dev/serial/by-id/usb-Nabu_Casa_Home_Assistant_Connect_ZBT-1_9e2adbd75b8beb119fe564a0f320645d-if00-port0  info
2024-05-09 17:57:01.693 homeassistant zigpy.serial DEBUG Opening a serial connection to '/dev/serial/by-id/usb-Nabu_Casa_Home_Assistant_Connect_ZBT-1_9e2adbd75b8beb119fe564a0f320645d-if00-port0' (115200 baudrate)
2024-05-09 17:57:01.697 homeassistant bellows.ezsp DEBUG Resetting EZSP
2024-05-09 17:57:01.698 homeassistant bellows.uart DEBUG Resetting ASH
2024-05-09 17:57:01.698 homeassistant bellows.ash DEBUG Sending frame CANCEL + RstFrame() + FLAG
2024-05-09 17:57:01.698 homeassistant bellows.ash DEBUG Sending data  1ac038bc7e
2024-05-09 17:57:01.867 homeassistant bellows.ash DEBUG Received data 1ac20203d20a7e
2024-05-09 17:57:01.868 homeassistant bellows.ash DEBUG Received cancel byte, clearing buffer
2024-05-09 17:57:01.869 homeassistant bellows.ash DEBUG Received frame ErrorFrame(version=2, reset_code=<NcpResetCode.RESET_WATCHDOG: 3>)
2024-05-09 17:57:01.869 homeassistant bellows.ash DEBUG NCP has entered failed state: NcpResetCode.RESET_WATCHDOG
2024-05-09 17:57:01.870 homeassistant bellows.ezsp INFO NCP entered failed state. No application handler registered, ignoring...
2024-05-09 17:57:02.940 homeassistant bellows.ash DEBUG Received data 1ac1020b0a527e
2024-05-09 17:57:02.941 homeassistant bellows.ash DEBUG Received cancel byte, clearing buffer
2024-05-09 17:57:02.942 homeassistant bellows.ash DEBUG Received frame RStackFrame(version=2, reset_code=<NcpResetCode.RESET_SOFTWARE: 11>)
2024-05-09 17:57:02.944 homeassistant bellows.ezsp DEBUG Switching to EZSP protocol version 4
2024-05-09 17:57:02.945 homeassistant bellows.ezsp.protocol DEBUG Sending command  version: (4,)
2024-05-09 17:57:02.946 homeassistant bellows.ash DEBUG Sending frame DataFrame(frm_num=0, re_tx=False, ack_num=0, ezsp_frame=b'\x00\x00\x00\x04') + FLAG

Sequence 2, subsequent reconnect (1.0 seconds):

$ zigpy -vvv radio --baudrate 115200 ezsp /dev/serial/by-id/usb-Nabu_Casa_Home_Assistant_Connect_ZBT-1_9e2adbd75b8beb119fe564a0f320645d-if00-port0  info
2024-05-09 18:03:41.979 homeassistant zigpy.serial DEBUG Opening a serial connection to '/dev/serial/by-id/usb-Nabu_Casa_Home_Assistant_Connect_ZBT-1_9e2adbd75b8beb119fe564a0f320645d-if00-port0' (115200 baudrate)
2024-05-09 18:03:41.983 homeassistant bellows.ezsp DEBUG Resetting EZSP
2024-05-09 18:03:41.984 homeassistant bellows.uart DEBUG Resetting ASH
2024-05-09 18:03:41.984 homeassistant bellows.ash DEBUG Sending frame CANCEL + RstFrame() + FLAG
2024-05-09 18:03:41.985 homeassistant bellows.ash DEBUG Sending data  1ac038bc7e
2024-05-09 18:03:43.062 homeassistant bellows.ash DEBUG Received data 1ac1020b0a527e
2024-05-09 18:03:43.063 homeassistant bellows.ash DEBUG Received cancel byte, clearing buffer
2024-05-09 18:03:43.063 homeassistant bellows.ash DEBUG Received frame RStackFrame(version=2, reset_code=<NcpResetCode.RESET_SOFTWARE: 11>)
2024-05-09 18:03:43.065 homeassistant bellows.ezsp DEBUG Switching to EZSP protocol version 4
2024-05-09 18:03:43.066 homeassistant bellows.ezsp.protocol DEBUG Sending command  version: (4,)
2024-05-09 18:03:43.067 homeassistant bellows.ash DEBUG Sending frame DataFrame(frm_num=0, re_tx=False, ack_num=0, ezsp_frame=b'\x00\x00\x00\x04') + FLAG

Is there an unrecoverable third case? I see identical behavior with and without RTS/CTS flow control on both the SkyConnect and the Yellow, no matter what platform I use.

Nerivec commented 1 week ago

As far as I can tell, with Z2M add-on on HAOS, it goes like this:

Yellow with hardware flow control firmware, Z2M add-on newly started (manual or update):

rst > ERROR(RESET_WATCHDOG) > rst > RSTACK(RESET_SOFTWARE) > ack > RSTACK(RESET_SOFTWARE) (unexpected) > rst > RSTACK(RESET_SOFTWARE) > ack > started

(You can see the mocked debug log used to test against here, ignore the rtscts setting, it's just from a mocked port)


Yellow with software flow control firmware, any scenario:

rst > RSTACK(RESET_SOFTWARE) > ack > started


Note: Just to be clear, lowercase is sent to the adapter, uppercase is received from adapter

puddly commented 1 week ago

Do RStack() frames receive ACKs? I don't think it's expected to respond to anything but Data() frames with ACKs.

Nerivec commented 1 week ago

https://github.com/SiliconLabs/gecko_sdk/blob/d95099f4c05b3c40cece93d8a261dd471e642086/protocol/zigbee/app/ezsp-host/ash/ash-host.c#L352

According to Silabs' code yes (it sends on next send, before whatever is supposed to be sent, so here, right before version).

puddly commented 1 week ago

I'm trying this with exclusively SiLabs code (build/debug/ZigbeeMinimalHost -b115200 -fr -rr -t1 -x1 -pttyUSB0) running inside of a Docker container on a Yellow, connected to a SkyConnect running the latest EmberZNet. Just to be 100% sure I'm seeing exactly what's happening, I am bypassing the application entirely and just tracing syscalls directly with strace.

I've set up two ASH parsers, one for the RX side and one for the TX side, to see what serial traffic is being sent.

I've tried dozens of times to get anything to break and this is as broken as it gets:

OPENED SERIAL PORT
 ===> TX: 2024-05-09 18:40:01,427 macbook bellows.ash[1138] DEBUG Received data 1ac038bc7e
 ===> TX: 2024-05-09 18:40:01,427 macbook bellows.ash[1138] DEBUG Received cancel byte, clearing buffer
 ===> TX: 2024-05-09 18:40:01,427 macbook bellows.ash[1138] DEBUG Received frame RstFrame()

<===  RX: 2024-05-09 18:40:01,427 macbook bellows.ash[1138] DEBUG Received data 1ac20203d20a7e
<===  RX: 2024-05-09 18:40:01,427 macbook bellows.ash[1138] DEBUG Received cancel byte, clearing buffer
<===  RX: 2024-05-09 18:40:01,427 macbook bellows.ash[1138] DEBUG Received frame ErrorFrame(version=2, reset_code=<NcpResetCode.RESET_WATCHDOG: 3>)
<===  RX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG NCP has entered failed state: NcpResetCode.RESET_WATCHDOG

CLOSED SERIAL PORT
OPENED SERIAL PORT
 ===> TX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received data 1ac038bc7e
 ===> TX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received cancel byte, clearing buffer
 ===> TX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received frame RstFrame()

<===  RX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received data 1ac1020b0a527e
<===  RX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received cancel byte, clearing buffer
<===  RX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received frame RStackFrame(version=2, reset_code=<NcpResetCode.RESET_SOFTWARE: 11>)

 ===> TX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received data 8070787e004221a8597c057e
 ===> TX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received frame AckFrame(res=0, ncp_ready=0, ack_num=0)
 ===> TX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received frame DataFrame(frm_num=0, re_tx=0, ack_num=0, ezsp_frame=b'\x00\x00\x00\r')

 ===> TX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received data 1a084221a8597d5e287e
 ===> TX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received cancel byte, clearing buffer
 ===> TX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received frame DataFrame(frm_num=0, re_tx=1, ack_num=0, ezsp_frame=b'\x00\x00\x00\r')

<===  RX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received data 1ac1020b0a527e0142a1a8592835c6ade27e8160597e
<===  RX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received cancel byte, clearing buffer
<===  RX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received frame RStackFrame(version=2, reset_code=<NcpResetCode.RESET_SOFTWARE: 11>)
<===  RX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received frame DataFrame(frm_num=0, re_tx=0, ack_num=1, ezsp_frame=b'\x00\x80\x00\r\x02 t')
<===  RX: 2024-05-09 18:40:01,428 macbook bellows.ash[1138] DEBUG Received frame AckFrame(res=0, ncp_ready=0, ack_num=1)

Here is what happens on a freshly-connected SkyConnect:

 ===> TX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received data 1ac038bc7e
 ===> TX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received cancel byte, clearing buffer
 ===> TX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received frame RstFrame()

<===  RX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received data 1ac102029b7b7e
<===  RX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received cancel byte, clearing buffer
<===  RX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received frame RStackFrame(version=2, reset_code=<NcpResetCode.RESET_POWER_ON: 2>)

<===  RX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received data 1ac1020b0a527e
<===  RX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received cancel byte, clearing buffer
<===  RX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received frame RStackFrame(version=2, reset_code=<NcpResetCode.RESET_SOFTWARE: 11>)

 ===> TX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received data 8070787e004221a8597c057e
 ===> TX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received frame AckFrame(res=0, ncp_ready=0, ack_num=0)
 ===> TX: 2024-05-09 18:25:22,765 macbook bellows.ash[936] DEBUG Received frame DataFrame(frm_num=0, re_tx=0, ack_num=0, ezsp_frame=b'\x00\x00\x00\r')

<===  RX: 2024-05-09 18:25:22,766 macbook bellows.ash[936] DEBUG Received data 0142a1a8592835c6ade27e
<===  RX: 2024-05-09 18:25:22,766 macbook bellows.ash[936] DEBUG Received frame DataFrame(frm_num=0, re_tx=0, ack_num=1, ezsp_frame=b'\x00\x80\x00\r\x02 t')

 ===> TX: 2024-05-09 18:25:22,766 macbook bellows.ash[936] DEBUG Received data 8160597e7d314321a9fe2a04303a7e
 ===> TX: 2024-05-09 18:25:22,766 macbook bellows.ash[936] DEBUG Received frame AckFrame(res=0, ncp_ready=0, ack_num=1)
 ===> TX: 2024-05-09 18:25:22,766 macbook bellows.ash[936] DEBUG Received frame DataFrame(frm_num=1, re_tx=0, ack_num=1, ezsp_frame=b'\x01\x00\x01\xaa\x00\x11')

I've yet to get it into an unrecoverable state, however, or one that isn't fixed by the host sending a normal RstFrame(). I'll try with the Yellow alone tomorrow.

Nerivec commented 1 week ago

I'll let you know in the next few days if I still see reports of "unrecoverable state" with the new fixes. Hopefully not.

As for the watchdog... so sometimes it's ERROR(RESET_WATCHDOG) and sometimes it's the proper RSTACK(RESET_POWERON) (I say "proper" because that's what's seen on sw flow control -if you listen immediately on connect-, and, well, it makes sense). Did you notice a particular factor that triggers the ERROR one? If we can get rid of that, that'd solve everything I'm pretty sure 😄

zalatnaicsongor commented 5 days ago

Hi, just chiming in. Even with the latest zigbee-herdsman I still cannot start the ember driver with my ser2net (network) attached EFR32MG21 that's running the latest multi-protocol firmware (https://github.com/SiliconLabs/gecko_sdk/tree/v4.4.2).

Logs ``` [2024-05-14 15:47:21] info: z2m: Logging to console, file (filename: log.log) [2024-05-14 15:47:21] debug: z2m: Loaded state from file /config/zigbee2mqtt/state.json [2024-05-14 15:47:21] info: z2m: Starting Zigbee2MQTT version 1.37.1-dev (commit #d39baa6) [2024-05-14 15:47:21] info: z2m: Starting zigbee-herdsman (0.47.1) [2024-05-14 15:47:21] debug: z2m: Using zigbee-herdsman with settings: '{"adapter":{"concurrent":null,"delay":null,"disableLED":false},"backupPath":"/config/zigbee2mqtt/coordinator_backup.json","databaseBackupPath":"/config/zigbee2mqtt/database.db.backup","databasePath":"/config/zigbee2mqtt/database.db","network":{"channelList":[25],"extendedPanID":[102,209,175,253,160,186,38,57],"networkKey":"HIDDEN","panID":57484},"serialPort":{"adapter":"ember","path":"tcp://local-silabs-multiprotocol:9999","rtscts":true}}' [2024-05-14 15:47:21] debug: zh:ember: Using delay=5. [2024-05-14 15:47:21] debug: zh:controller: Starting with options '{"network":{"networkKeyDistribute":false,"networkKey":[190,37,51,65,237,144,64,223,60,174,114,3,141,117,84,153],"panID":57484,"extendedPanID":[102,209,175,253,160,186,38,57],"channelList":[25]},"serialPort":{"rtscts":true,"path":"tcp://local-silabs-multiprotocol:9999","adapter":"ember"},"databasePath":"/config/zigbee2mqtt/database.db","databaseBackupPath":"/config/zigbee2mqtt/database.db.backup","backupPath":"/config/zigbee2mqtt/coordinator_backup.json","adapter":{"disableLED":false,"concurrent":null,"delay":null}}' [2024-05-14 15:47:46] debug: zh:ember: Starting EZSP with stack configuration: "default". [2024-05-14 15:47:46] info: zh:ember:ezsp: ======== EZSP starting ======== [2024-05-14 15:47:46] info: zh:ember:uart:ash: ======== ASH NCP reset ======== [2024-05-14 15:47:46] debug: zh:ember:uart:ash: Opening TCP socket with local-silabs-multiprotocol:9999 [2024-05-14 15:47:46] debug: zh:ember:uart:ash: Socket connected [2024-05-14 15:47:46] info: zh:ember:uart:ash: Socket ready [2024-05-14 15:47:46] info: zh:ember:uart:ash: ======== ASH starting ======== [2024-05-14 15:47:46] debug: zh:ember:uart:ash: ---> [FRAME type=RST] [2024-05-14 15:47:46] debug: zh:ember:uart:ash: Waiting for RSTACK... 0/2500 [2024-05-14 15:47:46] debug: zh:ember:uart:ash: Waiting for RSTACK... 100/2500 [2024-05-14 15:47:47] debug: zh:ember:uart:ash: Waiting for RSTACK... 200/2500 [2024-05-14 15:47:47] debug: zh:ember:uart:ash: Waiting for RSTACK... 300/2500 [2024-05-14 15:47:47] debug: zh:ember:uart:ash: Waiting for RSTACK... 400/2500 [2024-05-14 15:47:47] debug: zh:ember:uart:ash: Waiting for RSTACK... 500/2500 [2024-05-14 15:47:47] debug: zh:ember:uart:ash: Waiting for RSTACK... 600/2500 [2024-05-14 15:47:47] debug: zh:ember:uart:ash: Waiting for RSTACK... 700/2500 [2024-05-14 15:47:47] debug: zh:ember:uart:ash: Waiting for RSTACK... 800/2500 [2024-05-14 15:47:47] debug: zh:ember:uart:ash: Waiting for RSTACK... 900/2500 [2024-05-14 15:47:47] debug: zh:ember:uart:ash: Waiting for RSTACK... 1000/2500 [2024-05-14 15:47:47] debug: zh:ember:uart:ash: Waiting for RSTACK... 1100/2500 [2024-05-14 15:47:48] debug: zh:ember:uart:ash: Waiting for RSTACK... 1200/2500 [2024-05-14 15:47:48] debug: zh:ember:uart:ash: Waiting for RSTACK... 1300/2500 [2024-05-14 15:47:48] debug: zh:ember:uart:ash: Waiting for RSTACK... 1400/2500 [2024-05-14 15:47:48] debug: zh:ember:uart:ash: Waiting for RSTACK... 1500/2500 [2024-05-14 15:47:48] debug: zh:ember:uart:ash: Waiting for RSTACK... 1600/2500 [2024-05-14 15:47:48] debug: zh:ember:uart:ash: Waiting for RSTACK... 1700/2500 [2024-05-14 15:47:48] debug: zh:ember:uart:ash: Waiting for RSTACK... 1800/2500 [2024-05-14 15:47:48] debug: zh:ember:uart:ash: Waiting for RSTACK... 1900/2500 [2024-05-14 15:47:48] debug: zh:ember:uart:ash: Waiting for RSTACK... 2000/2500 [2024-05-14 15:47:48] debug: zh:ember:uart:ash: Waiting for RSTACK... 2100/2500 [2024-05-14 15:47:49] debug: zh:ember:uart:ash: Waiting for RSTACK... 2200/2500 [2024-05-14 15:47:49] debug: zh:ember:uart:ash: Waiting for RSTACK... 2300/2500 [2024-05-14 15:47:49] debug: zh:ember:uart:ash: Waiting for RSTACK... 2400/2500 [2024-05-14 15:47:49] info: zh:ember:uart:ash: ======== ASH NCP reset ======== [2024-05-14 15:47:49] info: zh:ember:uart:ash: ======== ASH starting ======== [2024-05-14 15:47:49] debug: zh:ember:uart:ash: ---> [FRAME type=RST] [2024-05-14 15:47:49] debug: zh:ember:uart:ash: Waiting for RSTACK... 0/2500 [2024-05-14 15:47:49] debug: zh:ember:uart:ash: Waiting for RSTACK... 100/2500 [2024-05-14 15:47:49] debug: zh:ember:uart:ash: Waiting for RSTACK... 200/2500 [2024-05-14 15:47:49] debug: zh:ember:uart:ash: Waiting for RSTACK... 300/2500 [2024-05-14 15:47:49] debug: zh:ember:uart:ash: Waiting for RSTACK... 400/2500 [2024-05-14 15:47:49] debug: zh:ember:uart:ash: Waiting for RSTACK... 500/2500 [2024-05-14 15:47:49] debug: zh:ember:uart:ash: Waiting for RSTACK... 600/2500 [2024-05-14 15:47:50] debug: zh:ember:uart:ash: Waiting for RSTACK... 700/2500 [2024-05-14 15:47:50] debug: zh:ember:uart:ash: Waiting for RSTACK... 800/2500 [2024-05-14 15:47:50] debug: zh:ember:uart:ash: Waiting for RSTACK... 900/2500 [2024-05-14 15:47:50] debug: zh:ember:uart:ash: Waiting for RSTACK... 1000/2500 [2024-05-14 15:47:50] debug: zh:ember:uart:ash: Waiting for RSTACK... 1100/2500 [2024-05-14 15:47:50] debug: zh:ember:uart:ash: Waiting for RSTACK... 1200/2500 [2024-05-14 15:47:50] debug: zh:ember:uart:ash: Waiting for RSTACK... 1300/2500 [2024-05-14 15:47:50] debug: zh:ember:uart:ash: Waiting for RSTACK... 1400/2500 [2024-05-14 15:47:50] debug: zh:ember:uart:ash: Waiting for RSTACK... 1500/2500 [2024-05-14 15:47:50] debug: zh:ember:uart:ash: Waiting for RSTACK... 1600/2500 [2024-05-14 15:47:51] debug: zh:ember:uart:ash: Waiting for RSTACK... 1700/2500 [2024-05-14 15:47:51] debug: zh:ember:uart:ash: Waiting for RSTACK... 1800/2500 [2024-05-14 15:47:51] debug: zh:ember:uart:ash: Waiting for RSTACK... 1900/2500 [2024-05-14 15:47:51] debug: zh:ember:uart:ash: Waiting for RSTACK... 2000/2500 [2024-05-14 15:47:51] debug: zh:ember:uart:ash: Waiting for RSTACK... 2100/2500 [2024-05-14 15:47:51] debug: zh:ember:uart:ash: Waiting for RSTACK... 2200/2500 [2024-05-14 15:47:51] debug: zh:ember:uart:ash: Waiting for RSTACK... 2300/2500 [2024-05-14 15:47:51] debug: zh:ember:uart:ash: Waiting for RSTACK... 2400/2500 [2024-05-14 15:47:51] debug: zh:ember:uart:ash: <--- [FRAME type=RSTACK] [2024-05-14 15:47:51] info: zh:ember:uart:ash: ======== ASH connected ======== [2024-05-14 15:47:51] info: zh:ember:uart:ash: ======== ASH NCP reset ======== [2024-05-14 15:47:51] info: zh:ember:uart:ash: ======== ASH starting ======== [2024-05-14 15:47:51] debug: zh:ember:uart:ash: ---> [FRAME type=RST] [2024-05-14 15:47:51] debug: zh:ember:uart:ash: Waiting for RSTACK... 0/2500 [2024-05-14 15:47:51] debug: zh:ember:uart:ash: Waiting for RSTACK... 100/2500 [2024-05-14 15:47:52] debug: zh:ember:uart:ash: Waiting for RSTACK... 200/2500 [2024-05-14 15:47:52] debug: zh:ember:uart:ash: Waiting for RSTACK... 300/2500 [2024-05-14 15:47:52] debug: zh:ember:uart:ash: Waiting for RSTACK... 400/2500 [2024-05-14 15:47:52] debug: zh:ember:uart:ash: Waiting for RSTACK... 500/2500 [2024-05-14 15:47:52] debug: zh:ember:uart:ash: Waiting for RSTACK... 600/2500 [2024-05-14 15:47:52] debug: zh:ember:uart:ash: Waiting for RSTACK... 700/2500 [2024-05-14 15:47:52] debug: zh:ember:uart:ash: Waiting for RSTACK... 800/2500 [2024-05-14 15:47:52] debug: zh:ember:uart:ash: Waiting for RSTACK... 900/2500 [2024-05-14 15:47:52] debug: zh:ember:uart:ash: Waiting for RSTACK... 1000/2500 [2024-05-14 15:47:52] debug: zh:ember:uart:ash: Waiting for RSTACK... 1100/2500 [2024-05-14 15:47:53] debug: zh:ember:uart:ash: Waiting for RSTACK... 1200/2500 [2024-05-14 15:47:53] debug: zh:ember:uart:ash: Waiting for RSTACK... 1300/2500 [2024-05-14 15:47:53] debug: zh:ember:uart:ash: Waiting for RSTACK... 1400/2500 [2024-05-14 15:47:53] debug: zh:ember:uart:ash: Waiting for RSTACK... 1500/2500 [2024-05-14 15:47:53] debug: zh:ember:uart:ash: Waiting for RSTACK... 1600/2500 [2024-05-14 15:47:53] debug: zh:ember:uart:ash: Waiting for RSTACK... 1700/2500 [2024-05-14 15:47:53] debug: zh:ember:uart:ash: Waiting for RSTACK... 1800/2500 [2024-05-14 15:47:53] debug: zh:ember:uart:ash: Waiting for RSTACK... 1900/2500 [2024-05-14 15:47:53] debug: zh:ember:uart:ash: Waiting for RSTACK... 2000/2500 [2024-05-14 15:47:53] debug: zh:ember:uart:ash: Waiting for RSTACK... 2100/2500 [2024-05-14 15:47:54] debug: zh:ember:uart:ash: Waiting for RSTACK... 2200/2500 [2024-05-14 15:47:54] debug: zh:ember:uart:ash: Waiting for RSTACK... 2300/2500 [2024-05-14 15:47:54] debug: zh:ember:uart:ash: Waiting for RSTACK... 2400/2500 [2024-05-14 15:47:54] info: zh:ember:uart:ash: ======== ASH NCP reset ======== [2024-05-14 15:47:54] info: zh:ember:uart:ash: ======== ASH starting ======== [2024-05-14 15:47:54] debug: zh:ember:uart:ash: ---> [FRAME type=RST] [2024-05-14 15:47:54] debug: zh:ember:uart:ash: Waiting for RSTACK... 0/2500 [2024-05-14 15:47:54] debug: zh:ember:uart:ash: Waiting for RSTACK... 100/2500 [2024-05-14 15:47:54] debug: zh:ember:uart:ash: Waiting for RSTACK... 200/2500 [2024-05-14 15:47:54] debug: zh:ember:uart:ash: Waiting for RSTACK... 300/2500 [2024-05-14 15:47:54] debug: zh:ember:uart:ash: Waiting for RSTACK... 400/2500 [2024-05-14 15:47:54] debug: zh:ember:uart:ash: Waiting for RSTACK... 500/2500 [2024-05-14 15:47:54] debug: zh:ember:uart:ash: Waiting for RSTACK... 600/2500 [2024-05-14 15:47:55] debug: zh:ember:uart:ash: Waiting for RSTACK... 700/2500 [2024-05-14 15:47:55] debug: zh:ember:uart:ash: Waiting for RSTACK... 800/2500 [2024-05-14 15:47:55] debug: zh:ember:uart:ash: Waiting for RSTACK... 900/2500 [2024-05-14 15:47:55] debug: zh:ember:uart:ash: Waiting for RSTACK... 1000/2500 [2024-05-14 15:47:55] debug: zh:ember:uart:ash: Waiting for RSTACK... 1100/2500 [2024-05-14 15:47:55] debug: zh:ember:uart:ash: Waiting for RSTACK... 1200/2500 [2024-05-14 15:47:55] debug: zh:ember:uart:ash: Waiting for RSTACK... 1300/2500 [2024-05-14 15:47:55] debug: zh:ember:uart:ash: Waiting for RSTACK... 1400/2500 [2024-05-14 15:47:55] debug: zh:ember:uart:ash: Waiting for RSTACK... 1500/2500 [2024-05-14 15:47:56] debug: zh:ember:uart:ash: Waiting for RSTACK... 1600/2500 [2024-05-14 15:47:56] debug: zh:ember:uart:ash: Waiting for RSTACK... 1700/2500 [2024-05-14 15:47:56] debug: zh:ember:uart:ash: Waiting for RSTACK... 1800/2500 [2024-05-14 15:47:56] debug: zh:ember:uart:ash: Waiting for RSTACK... 1900/2500 [2024-05-14 15:47:56] debug: zh:ember:uart:ash: Waiting for RSTACK... 2000/2500 [2024-05-14 15:47:56] debug: zh:ember:uart:ash: Waiting for RSTACK... 2100/2500 [2024-05-14 15:47:56] debug: zh:ember:uart:ash: Waiting for RSTACK... 2200/2500 [2024-05-14 15:47:56] debug: zh:ember:uart:ash: Waiting for RSTACK... 2300/2500 [2024-05-14 15:47:56] debug: zh:ember:uart:ash: Waiting for RSTACK... 2400/2500 [2024-05-14 15:47:56] debug: zh:ember:uart:ash: <--- [FRAME type=RSTACK] [2024-05-14 15:47:56] info: zh:ember:uart:ash: ======== ASH connected ======== [2024-05-14 15:47:56] info: zh:ember:uart:ash: ======== ASH NCP reset ======== [2024-05-14 15:47:56] info: zh:ember:uart:ash: ======== ASH starting ======== [2024-05-14 15:47:56] debug: zh:ember:uart:ash: ---> [FRAME type=RST] [2024-05-14 15:47:56] debug: zh:ember:uart:ash: Waiting for RSTACK... 0/2500 [2024-05-14 15:47:57] debug: zh:ember:uart:ash: Waiting for RSTACK... 100/2500 [2024-05-14 15:47:57] debug: zh:ember:uart:ash: Waiting for RSTACK... 200/2500 [2024-05-14 15:47:57] debug: zh:ember:uart:ash: Waiting for RSTACK... 300/2500 [2024-05-14 15:47:57] debug: zh:ember:uart:ash: Waiting for RSTACK... 400/2500 [2024-05-14 15:47:57] debug: zh:ember:uart:ash: Waiting for RSTACK... 500/2500 [2024-05-14 15:47:57] debug: zh:ember:uart:ash: Waiting for RSTACK... 600/2500 [2024-05-14 15:47:57] debug: zh:ember:uart:ash: Waiting for RSTACK... 700/2500 [2024-05-14 15:47:57] debug: zh:ember:uart:ash: Waiting for RSTACK... 800/2500 [2024-05-14 15:47:57] debug: zh:ember:uart:ash: Waiting for RSTACK... 900/2500 [2024-05-14 15:47:57] debug: zh:ember:uart:ash: Waiting for RSTACK... 1000/2500 [2024-05-14 15:47:58] debug: zh:ember:uart:ash: Waiting for RSTACK... 1100/2500 [2024-05-14 15:47:58] debug: zh:ember:uart:ash: Waiting for RSTACK... 1200/2500 [2024-05-14 15:47:58] debug: zh:ember:uart:ash: Waiting for RSTACK... 1300/2500 [2024-05-14 15:47:58] debug: zh:ember:uart:ash: Waiting for RSTACK... 1400/2500 [2024-05-14 15:47:58] debug: zh:ember:uart:ash: Waiting for RSTACK... 1500/2500 [2024-05-14 15:47:58] debug: zh:ember:uart:ash: Waiting for RSTACK... 1600/2500 [2024-05-14 15:47:58] debug: zh:ember:uart:ash: Waiting for RSTACK... 1700/2500 [2024-05-14 15:47:58] debug: zh:ember:uart:ash: Waiting for RSTACK... 1800/2500 [2024-05-14 15:47:58] debug: zh:ember:uart:ash: Waiting for RSTACK... 1900/2500 [2024-05-14 15:47:58] debug: zh:ember:uart:ash: Waiting for RSTACK... 2000/2500 [2024-05-14 15:47:59] debug: zh:ember:uart:ash: Waiting for RSTACK... 2100/2500 [2024-05-14 15:47:59] debug: zh:ember:uart:ash: Waiting for RSTACK... 2200/2500 [2024-05-14 15:47:59] debug: zh:ember:uart:ash: Waiting for RSTACK... 2300/2500 [2024-05-14 15:47:59] debug: zh:ember:uart:ash: Waiting for RSTACK... 2400/2500 [2024-05-14 15:47:59] error: zh:ember: Failed to reset and init NCP. Error: Failed to start EZSP layer with status=HOST_FATAL_ERROR. [2024-05-14 15:48:01] debug: zh:ember:uart:ash: <--- [FRAME type=RSTACK] [2024-05-14 15:48:01] info: zh:ember:uart:ash: ======== ASH connected ======== [2024-05-14 15:48:06] warning: zh:ember:uart:ash: Frame(s) in progress cancelled in [1ac1020b0a527e] [2024-05-14 15:48:06] debug: zh:ember:uart:ash: <--- [FRAME type=RSTACK] [2024-05-14 15:48:06] error: zh:ember:uart:ash: Received unexpected reset from NCP, with reason=RESET_SOFTWARE. [2024-05-14 15:48:06] error: zh:ember:uart:ash: ASH disconnected: ASH_ERROR_NCP_RESET | NCP status: ASH_NCP_FATAL_ERROR [2024-05-14 15:48:06] error: zh:ember:uart:ash: Error while parsing received frame, status=HOST_FATAL_ERROR. ```

Let me know if anything I can do to give more information. Everything works perfectly with the ezsp driver.

Looking at the EZSP logs, it seems that recieveng an RSTACK takes 5 seconds after the RST request.

Nerivec commented 5 days ago

@zalatnaicsongor As mentioned in various places, multiprotocol is no longer supported. https://www.zigbee2mqtt.io/guide/adapters/emberznet.html


@puddly Seems 1.37.1 is able to work around the starting issue that resulted in failure (see example log below, for some reason, it's sending a NAK right after the RSTACK...). So, I guess, back to the original issue here, getting rid of that ERROR(RESET_WATCHDOG), so that hardware flow control can have as clean a start as software flow control.

Yellow log (omitted some "waiting" lines for brevity") ```logs [2024-05-10 13:04:13] info: zh:ember: ======== Ember Adapter Starting ======== [2024-05-10 13:04:13] debug: zh:ember: Starting EZSP with stack configuration: "default". [2024-05-10 13:04:13] info: zh:ember:ezsp: ======== EZSP starting ======== [2024-05-10 13:04:13] info: zh:ember:uart:ash: ======== ASH NCP reset ======== [2024-05-10 13:04:13] debug: zh:ember:uart:ash: Opening serial port with {"path":"/dev/ttyAMA1","baudRate":115200,"rtscts":true,"autoOpen":false,"parity":"none","stopBits":1,"xon":false,"xoff":false} [2024-05-10 13:04:13] info: zh:ember:uart:ash: Serial port opened [2024-05-10 13:04:13] info: zh:ember:uart:ash: ======== ASH starting ======== [2024-05-10 13:04:13] debug: zh:ember:uart:ash: ---> [FRAME type=RST] [2024-05-10 13:04:13] debug: zh:ember:uart:ash: Waiting for RSTACK... 0/2500 [2024-05-10 13:04:14] debug: zh:ember:uart:ash: Waiting for RSTACK... 500/2500 [2024-05-10 13:04:14] debug: zh:ember:uart:ash: <--- [FRAME type=ERROR] [2024-05-10 13:04:14] error: zh:ember:uart:ash: Received ERROR from NCP while connecting, with code=RESET_WATCHDOG. [2024-05-10 13:04:14] error: zh:ember:uart:ash: ASH disconnected | NCP status: ASH_NCP_FATAL_ERROR [2024-05-10 13:04:14] error: zh:ember:uart:ash: Error while parsing received frame, status=ASH_NCP_FATAL_ERROR. [2024-05-10 13:04:14] info: zh:ember:uart:ash: ======== ASH NCP reset ======== [2024-05-10 13:04:14] info: zh:ember:uart:ash: ======== ASH starting ======== [2024-05-10 13:04:14] debug: zh:ember:uart:ash: ---> [FRAME type=RST] [2024-05-10 13:04:14] debug: zh:ember:uart:ash: Waiting for RSTACK... 0/2500 [2024-05-10 13:04:15] debug: zh:ember:uart:ash: Waiting for RSTACK... 1000/2500 [2024-05-10 13:04:15] debug: zh:ember:uart:ash: <--- [FRAME type=RSTACK] [2024-05-10 13:04:15] info: zh:ember:uart:ash: ======== ASH connected ======== [2024-05-10 13:04:15] debug: zh:ember:uart:ash: <--- [FRAME type=NAK] [2024-05-10 13:04:15] debug: zh:ember:uart:ash: <--- [FRAME type=NAK ackNum=0] [2024-05-10 13:04:15] debug: zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=0] [2024-05-10 13:04:15] info: zh:ember:uart:ash: ======== ASH started ======== [2024-05-10 13:04:15] info: zh:ember:ezsp: ======== EZSP started ======== [2024-05-10 13:04:15] debug: zh:ember:ezsp: ===> [FRAME: ID=0:"VERSION" Seq=0 Len=4] [2024-05-10 13:04:15] debug: zh:ember:uart:ash: ---> [FRAME type=DATA frmTx=0 frmRx=0] [2024-05-10 13:04:15] debug: zh:ember:uart:ash: <--- [FRAME type=DATA] [2024-05-10 13:04:15] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1] [2024-05-10 13:04:15] debug: zh:ember:uart:ash: <--- [FRAME type=DATA ackNum=1 frmNum=0] Added to rxQueue [2024-05-10 13:04:15] debug: zh:ember:uart:ash: ---> [FRAME type=ACK frmRx=1] [2024-05-10 13:04:15] debug: zh:ember:ezsp: <=== [FRAME: ID=0:"VERSION" Seq=0 Len=7] ```
zalatnaicsongor commented 4 days ago

@zalatnaicsongor As mentioned in various places, multiprotocol is no longer supported. https://www.zigbee2mqtt.io/guide/adapters/emberznet.html

Is there any documentation of the issues with this? Zigbeed should implement the EZSP protocol. I'd like to take a look at fixing the issues if the code's open to contributions.

puddly commented 4 days ago

The issues aren't with the EZSP side, unfortunately. Zigbeed is built from the Silicon Labs Gecko SDK and the vast majority of the RCP firmware, the Zigbee stack itself, and host-side software are pre-compiled and linked straight from the SDK.

There are a few outstanding bugs that make the firmware unstable (see the Home Assistant addon issue tracker). Even with the addition of firmware-level watchdogs and disabling buggy features (multi-channel and ZGP), it still is extremely unstable for many users and crashes nearly constantly. We've made SiLabs aware of the issues and they're working through them but there's no timeframe for when all of the critical bugs will be fixed.