Hurricos / realtek-poe


`ubus call poe info` status is randomly stuck on (cold/re)boot to 'unknown' for all ports #10

Open walterav1984 opened 2 years ago

walterav1984 commented 2 years ago

ZyXEL GS1900-24HPv1 realtek-poe CI22 version

Sometimes it shows correct realtime status info for `ubus call poe info` after a (cold/re)boot, and sometimes it's stuck in the 'unknown' state for all ports, although they are actually delivering power fine with PoE devices powered on. A /etc/init.d/poe restart is the only remedy to make the status correct again, with the downside of temporarily disabling PoE.

mrnuke commented 2 years ago

I'm going to abstain from speculating until we can get some debug logs (realtek-poe -d).

walterav1984 commented 2 years ago

Trying to redirect the output with realtek-poe -d > output.log failed. Therefore I tried realtek-poe -d 2>&1 | grep X > realtek-poe-CI22-p2-24even.log, and that logs the output.

The laptop was connected to port 3 the whole time, with one ssh session for logging and a second ssh session for ubus call poe info. Then a single PoE device was connected and disconnected until PoE came on, for every even port starting at port 2 and finishing at 24. After port 14 the ubus call poe info status hung: it reported that port 14 was delivering power and ports 16-24 kept Searching... yet ports 16-24 did power on the PoE device, even though their status remained Searching. After disconnecting the last PoE device at port 24 I Ctrl-C'd the daemon. I did the same test for the odd ports 1-23 (laptop on port 2); funny thing is, this time the ubus call poe info reporting hung after port 5, but as before, ports 7-23 powered PoE fine while their status kept Searching and port 5 stayed on Delivering Power.

even p2-p24 https://pastebin.com/j0TSm7Hx

odd p1-p23, split in two due to the pastebin upload size limit: part 1 https://pastebin.com/wfWiht19 (through line ~11676), part 2 https://pastebin.com/xyz2Tkjg (through line ~16526)

mrnuke commented 2 years ago

Analysis

There's your problem:

TX -> 26 17 11 ff ff ff ff ff ff ff ff 46
RX <- fe ff ff ff ff ff ff ff ff ff ff f4
TX -> 30 18 11 ff ff ff ff ff ff ff ff 51
RX <- fe ff ff ff ff ff ff ff ff ff ff f4

The MCU responds with packet fe - "incorrect checksum". There could be a number of reasons for that. Without getting into them, the issue is that we're neither handling nor logging such errors. I've mentioned before that realtek-poe is propped up by 2x4s and held together by duct tape. This is a good example of that statement.
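
For reference, the frames above are 12 bytes, and the trailing byte is consistent with a checksum equal to the low byte of the sum of the first 11 bytes (0x26+0x17+0x11 plus 8×0xff sums to 0x846, giving 0x46; 0xfe plus 10×0xff sums to 0xaf4, giving 0xf4). A minimal sketch assuming that layout; the helper name is hypothetical, not the daemon's actual API:

```c
#include <stdint.h>
#include <stddef.h>

/* Low byte of the sum of the first 11 bytes of a 12-byte frame.
 * Matches the dumps above: 26 17 11 ff*8 -> 46, fe ff*10 -> f4. */
static uint8_t poe_checksum(const uint8_t frame[12])
{
    unsigned sum = 0;

    for (size_t i = 0; i < 11; i++)
        sum += frame[i];

    return sum & 0xff;
}
```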

Further down:

TX -> 26 ef 0c ff ff ff ff ff ff ff ff 19
RX <- ff ff ff ff ff ff ff ff ff f4 26 ef
TX -> 30 f0 0c ff ff ff ff ff ff ff ff 24
RX <- 0c 03 01 4d 02 0c ff ff ff 7d 30 f0

Here we become de-synchronized. Our view of what we received is likely not what the MCU sent. There are probably a bunch of lost error messages about bad and mismatched packets. You might see this if you run:

# logread -te realtek-poe
Tue Jun 28 18:17:00 2022 [1656440220.499] daemon.err realtek-poe: received unsolicited reply

I've seen this before when two instances of realtek-poe were running simultaneously. I doubt that's the case here, since we're more than 3000 packets in before this happens.
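
For context, "received unsolicited reply" comes from a reply frame that doesn't correspond to the request in flight: a normal reply echoes the command ID and sequence byte of the request (you can see the echoed 30 f0 shifted into the tail of the desynchronized RX lines above). A hedged sketch of that kind of match check, with hypothetical names; the fe/fd error frames are a separate case:

```c
#include <stdbool.h>
#include <stdint.h>

/* A well-formed reply echoes the command ID and sequence number of the
 * request it answers; anything else would be logged as an "unsolicited
 * reply". Error frames (fe/fd) are handled separately. */
static bool reply_matches_request(const uint8_t req[12],
                                  const uint8_t reply[12])
{
    return reply[0] == req[0] && reply[1] == req[1];
}
```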

Next steps

Not much we can do but implement better error logging, and see what that turns up.

walterav1984 commented 2 years ago

Capturing the real 'unknown' status for all ports that happened before is kinda hard if you don't start the realtek-poe service at boot. Therefore I just started with the even/odd logs in the previous post.

However, all the previous logs were taken with the realtek-poe service disabled and starting from a cold boot, so no extra instance was running.

This time, however, I do have a log of the funny sticky disabled ports 4 & 7 (possibly github issue #9) after a reboot. So this time the output is after a reboot, not a cold boot. PoE power comes back on after the reboot even though no realtek-poe service is started, so there is no ubus output, but after starting realtek-poe -d this was the output.

ubus call poe info #after reboot PoE power comes on automatically without the realtek-poe service running
#but as soon as realtek-poe -d is started (to get ubus) it resets all PoE power, and this was the output
{
    "firmware": "v17.1",
    "mcu": "ST Micro ST32F100 Microcontroller",
    "budget": 170.000000,
    "consumption": 0.000000,
    "ports": {
        "lan1": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan2": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan3": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan4": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Disabled"   <--- Not disabled but PoE device won't power on again on this port!
        },
        "lan5": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan6": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan7": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Disabled"   <--- Not disabled but PoE device won't power on again on this port!
        },
        "lan8": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan9": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan10": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan11": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan12": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan13": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan14": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan15": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan16": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan17": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan18": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan19": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan20": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan21": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan22": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan23": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        },
        "lan24": {
            "priority": 2,
            "mode": "PoE+",
            "status": "Searching"
        }
    }
}

logread -te realtek-poe
Tue Jun 28 04:49:25 2022 [1656391765.496] daemon.err realtek-poe: received unsolicited reply
Tue Jun 28 04:49:25 2022 [1656391765.512] daemon.err realtek-poe: received unsolicited reply

https://pastebin.com/htF751nx

walterav1984 commented 2 years ago

Still on CI22: when doing ~10 warm reboots and starting realtek-poe -d with logging automatically from /etc/rc.local, I haven't had any unknown ports, so sadly no log file. However, when starting it normally via the /etc/init.d/poe service, I already encountered the unknown ports state within ~3 warm reboots. Is this maybe because init.d services start earlier than rc.local?

Maybe it's related to boot timing, or it's just a timing issue altogether, as mentioned in that issue suggesting that resending failed commands may be good enough.

walterav1984 commented 2 years ago

Just tried CI29, and after the first warm reboot all ports show status unknown, although PoE is being delivered to (re)connected devices. A /etc/init.d/poe reload doesn't fix the status.

However, /etc/init.d/poe restart does fix the status, this time without the already-attached PoE devices losing power. That it won't drop power is new but favorable behavior :+1:!

ubus call poe info
{
    "firmware": "v17.1",
    "mcu": "ST Micro ST32F100 Microcontroller",
    "budget": 170.000000,
    "consumption": 0.000000,
    "ports": {
        "lan1": {
            "priority": 2,
            "status": "unknown"
        },
        "lan2": {
            "priority": 2,
            "status": "unknown"
        },
        "lan3": {
            "priority": 2,
            "status": "unknown"
        },
        "lan4": {
            "priority": 2,
            "status": "unknown"
        },
        "lan5": {
            "priority": 2,
            "status": "unknown"
        },
        "lan6": {
            "priority": 2,
            "status": "unknown"
        },
        "lan7": {
            "priority": 2,
            "status": "unknown"
        },
        "lan8": {
            "priority": 2,
            "status": "unknown"
        },
        "lan9": {
            "priority": 2,
            "status": "unknown"
        },
        "lan10": {
            "priority": 2,
            "status": "unknown"
        },
        "lan11": {
            "priority": 2,
            "status": "unknown"
        },
        "lan12": {
            "priority": 2,
            "status": "unknown"
        },
        "lan13": {
            "priority": 2,
            "status": "unknown"
        },
        "lan14": {
            "priority": 2,
            "status": "unknown"
        },
        "lan15": {
            "priority": 2,
            "status": "unknown"
        },
        "lan16": {
            "priority": 2,
            "status": "unknown"
        },
        "lan17": {
            "priority": 2,
            "status": "unknown"
        },
        "lan18": {
            "priority": 2,
            "status": "unknown"
        },
        "lan19": {
            "priority": 2,
            "status": "unknown"
        },
        "lan20": {
            "priority": 2,
            "status": "unknown"
        },
        "lan21": {
            "priority": 2,
            "status": "unknown"
        },
        "lan22": {
            "priority": 2,
            "status": "unknown"
        },
        "lan23": {
            "priority": 2,
            "status": "unknown"
        },
        "lan24": {
            "priority": 2,
            "status": "unknown"
        }
    }
}
walterav1984 commented 2 years ago

The first cold boot with CI29 also shows all ports as unknown. It's still random, however: sometimes it initializes the status for all ports correctly and sometimes it does not.

Any idea how to log daemon output from /etc/init.d/poe?

mrnuke commented 2 years ago

Any idea how to log daemon output from /etc/init.d/poe?

logread -te tek-poe
walterav1984 commented 2 years ago
...snip
Mon Jul  4 05:09:54 2022 [1656911394.676] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.692] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.713] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.730] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.747] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.763] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.779] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.797] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.814] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.831] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.848] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.864] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.880] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.897] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.913] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.931] daemon.warn realtek-poe: received reply with bad checksum
Mon Jul  4 05:09:54 2022 [1656911394.948] daemon.warn realtek-poe: received reply with bad checksum
root@OpenWrt:~# logread -te tek-poe | wc
      836     10868     83600

A lot of lines with errors, which keep coming until I /etc/init.d/poe restart. But I guess we need the -d debug output; how do I catch that? This is my 3rd cold boot where all ports show unknown; maybe delay the service start?

mrnuke commented 2 years ago

Ugh! Error logging still sucks. Look higher up. There should be a different message that causes it all.

I'm looking for stuff like:

Mon Jul  4 04:34:55 2022 [] daemon.notice realtek-poe: MCU didn't like our command because incomplete request
Mon Jul  4 04:34:55 2022 [] daemon.notice realtek-poe:    CMD:    13 2c 05 02 ff ff ff ff ff ff ff 3f
Mon Jul  4 04:34:55 2022 [] daemon.notice realtek-poe:    reply:  fd ff ff ff ff ff ff ff ff ff ff f3
walterav1984 commented 2 years ago

First I had to increase the logd size in /etc/config/system to save all the messages that the daemon creates, but I haven't encountered anything other than a single instance of something like a wrong command ID; the rest is bad checksum. I will try to log the bad command ID soon, but as far as I remember there wasn't a reference to a number or hex string.
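
For anyone wanting to do the same: the logd buffer size is set in /etc/config/system, along these lines (the size value below is only an example, in KiB):

```
config system
    option log_size '512'
```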

mrnuke commented 2 years ago

I had to increase the logd size in /etc/config/system to save all messages that the daemon creates

Seems like upgrading that message to a warning worked a lot better in my head than in practice. Got the message back down in priority with CI#34.

I'm still scratching my head over how to fix these 'bad checksum' packets.

walterav1984 commented 2 years ago

Just tried CI34, but this package is broken: it won't start the init.d service, so there is no ubus call poe info, and realtek-poe -d errors out with:

realtek-poe -d #CI32,CI33,CI34
realtek-poe: TX -> 20 01 ff ff ff ff ff ff ff ff ff 18
realtek-poe: RX <- 20 01 02 18 00 e1 11 11 00 01 01 40
Bus error

Tue Jul  5 02:24:30 2022 kern.info kernel: [    0.491720] realtek_gpio_probe probing RTL GPIO
Tue Jul  5 02:24:30 2022 kern.info kernel: [    0.507005] gpio-464 (poe_enable): hogged as output/high
Tue Jul  5 02:25:08 2022 daemon.info procd: Instance poe::instance1 s in a crash loop 6 crashes, 0 seconds since last crash
Tue Jul  5 02:27:53 2022 daemon.debug realtek-poe: TX -> 20 01 ff ff ff ff ff ff ff ff ff 18
Tue Jul  5 02:27:53 2022 daemon.debug realtek-poe: RX <- 20 01 02 18 00 e1 11 11 00 01 01 40

EDIT1: Versions CI32-CI33-CI34 give the same error.

mrnuke commented 2 years ago

Bus error

I'm sorry. I introduced a double-free() when rebasing and didn't catch it. Addressed that snafu in CI35.

walterav1984 commented 2 years ago

Some early testing of the CI41 package already shows the unknown port status for all ports just a couple of minutes after installing the realtek-poe package. This never happened with earlier versions: the status would only hang (sometimes, for all ports) after (cold) (re)boots, not while already booted and in operation.

PoE port 1 comes up with the default /etc/config/poe. Then I reloaded the service to apply the 24-port /etc/config/poe config file: PoE stays up on port 1 and it starts powering PoE on port 24, then after a few ubus call poe info calls all ports start to show unknown and the service crashes.

logread -te realtek-poe
Tue Jul  5 02:42:56 2022 [1656988976.376] daemon.notice realtek-poe: MCU didn't like our command because request-bad-checksum
Tue Jul  5 02:42:56 2022 [1656988976.377] daemon.notice realtek-poe:    CMD:    30 19 15 ff ff ff ff ff ff ff ff 56
Tue Jul  5 02:42:56 2022 [1656988976.377] daemon.notice realtek-poe:    reply:  fe ff ff ff ff ff ff ff ff ff ff f4
Tue Jul  5 02:42:56 2022 [1656988976.408] kern.info kernel: [  326.849351] do_page_fault(): sending SIGSEGV to realtek-poe for invalid write access to 00000004
Tue Jul  5 02:42:56 2022 [1656988976.408] kern.info kernel: [  326.859340] epc = 00401c11 in realtek-poe[400000+3000]
Tue Jul  5 02:42:56 2022 [1656988976.408] kern.info kernel: [  326.865308] ra  = 00401bf5 in realtek-poe[400000+3000]
Tue Jul  5 02:43:11 2022 [1656988991.117] daemon.notice realtek-poe: MCU didn't like our command because request-bad-checksum
Tue Jul  5 02:43:11 2022 [1656988991.117] daemon.notice realtek-poe:    CMD:    26 b1 0d ff ff ff ff ff ff ff ff dc
Tue Jul  5 02:43:11 2022 [1656988991.117] daemon.notice realtek-poe:    reply:  fe ff ff ff ff ff ff ff ff ff ff f4
Tue Jul  5 02:43:11 2022 [1656988991.158] kern.info kernel: [  341.591542] do_page_fault(): sending SIGSEGV to realtek-poe for invalid write access to 00000004
Tue Jul  5 02:43:11 2022 [1656988991.159] kern.info kernel: [  341.601601] epc = 00401c11 in realtek-poe[400000+3000]
Tue Jul  5 02:43:11 2022 [1656988991.159] kern.info kernel: [  341.607574] ra  = 00401bf5 in realtek-poe[400000+3000]
Tue Jul  5 02:43:17 2022 [1656988997.788] daemon.notice realtek-poe: MCU didn't like our command because request-bad-checksum
Tue Jul  5 02:43:17 2022 [1656988997.788] daemon.notice realtek-poe:    CMD:    13 4e 09 02 ff ff ff ff ff ff ff 65
Tue Jul  5 02:43:17 2022 [1656988997.788] daemon.notice realtek-poe:    reply:  fe ff ff ff ff ff ff ff ff ff ff f4
Tue Jul  5 02:43:17 2022 [1656988997.812] kern.info kernel: [  348.252526] do_page_fault(): sending SIGSEGV to realtek-poe for invalid write access to 00000004
Tue Jul  5 02:43:17 2022 [1656988997.812] kern.info kernel: [  348.262565] epc = 00401c11 in realtek-poe[400000+3000]
Tue Jul  5 02:43:17 2022 [1656988997.812] kern.info kernel: [  348.268530] ra  = 00401bf5 in realtek-poe[400000+3000]
Tue Jul  5 02:43:26 2022 [1656989006.641] daemon.notice realtek-poe: MCU didn't like our command because request-bad-checksum
Tue Jul  5 02:43:26 2022 [1656989006.641] daemon.notice realtek-poe:    CMD:    30 bc 00 ff ff ff ff ff ff ff ff e4
Tue Jul  5 02:43:26 2022 [1656989006.641] daemon.notice realtek-poe:    reply:  fe ff ff ff ff ff ff ff ff ff ff f4
Tue Jul  5 02:43:26 2022 [1656989006.664] kern.info kernel: [  357.105607] do_page_fault(): sending SIGSEGV to realtek-poe for invalid write access to 00000004
Tue Jul  5 02:43:26 2022 [1656989006.664] kern.info kernel: [  357.115625] epc = 00401c11 in realtek-poe[400000+3000]
Tue Jul  5 02:43:26 2022 [1656989006.664] kern.info kernel: [  357.121476] ra  = 00401bf5 in realtek-poe[400000+3000]
Tue Jul  5 02:43:34 2022 [1656989014.069] daemon.notice realtek-poe: MCU didn't like our command because request-bad-checksum
Tue Jul  5 02:43:34 2022 [1656989014.069] daemon.notice realtek-poe:    CMD:    10 73 0e 03 ff ff ff ff ff ff ff 8d
Tue Jul  5 02:43:34 2022 [1656989014.069] daemon.notice realtek-poe:    reply:  fe ff ff ff ff ff ff ff ff ff ff f4
Tue Jul  5 02:43:34 2022 [1656989014.070] kern.info kernel: [  364.510667] do_page_fault(): sending SIGSEGV to realtek-poe for invalid write access to 00000004
Tue Jul  5 02:43:34 2022 [1656989014.070] kern.info kernel: [  364.520721] epc = 00401c11 in realtek-poe[400000+3000]
Tue Jul  5 02:43:34 2022 [1656989014.070] kern.info kernel: [  364.526690] ra  = 00401bf5 in realtek-poe[400000+3000]

Although the service crashed, the PoE mechanism is still active and delivering PoE to all ports.

walterav1984 commented 2 years ago

Although the status hangs on unknown for all ports, the daemon still applies the disabled/enabled settings from /etc/config/poe: disabling or enabling a PoE port in the config and doing a /etc/init.d/poe reload will still power a PoE device on that port off or on. Only the displayed status stays frozen for all those ports until /etc/init.d/poe restart.

mrnuke commented 2 years ago

Hey, @walterav1984. I'm sorry about the crashes in CI41. I had split the logging out into a separate PR, and when I rebased the retry mechanism, I accidentally introduced a use-after-free.

The retry mechanism is still in "draft" because there are some issues with this solution. I have another solution in mind (#19) that I'm hoping to have ready for testing soon. No point in testing it yet, since the CI build doesn't include logging of error packets.

To address your log and why ports worked:

MCU didn't like our command because request-bad-checksum
CMD:    30 19 15 ff ff ff ff ff ff ff ff 56

This is the "port power stats" command. If we're sending this, then most likely we've already configured the ports, so I would expect the ports to work even though the status isn't updated.

mrnuke commented 2 years ago

@walterav1984, can you please give CI #44 a spin? It should have both error logging and a reduced setup packet count.

walterav1984 commented 2 years ago

Tried CI44 and did around ~30 PoE device port swaps without a problem: the device status (searching/requesting/delivering power) was correct throughout. However, already at the first reboot all ports were stuck at unknown status.

Stupidly enough I did not log the errors and went straight to CI45, which contains the CI44 fix if I'm not mistaken.

After 2 fine reboots with CI45, on the 3rd reboot all ports are stuck again at status unknown; see the log.

#OPENWRT_RELEASE="OpenWrt 22.03.0-rc4 r19426-2b1941e47d"
$ logread -te realtek-poe
Tue Jul 19 20:56:59 2022 [1658264219.873] daemon.notice realtek-poe: MCU rejected command: request-bad-checksum
Tue Jul 19 20:56:59 2022 [1658264219.874] daemon.notice realtek-poe:    CMD:    18 08 04 06 a4 00 aa ff ff ff ff 74
Tue Jul 19 20:56:59 2022 [1658264219.874] daemon.notice realtek-poe:    reply:  fe ff ff ff ff ff ff ff ff ff ff f4
Tue Jul 19 20:57:01 2022 [1658264221.938] daemon.err realtek-poe: received unsolicited reply

After 2 fine reboots with CI46 as well, on the 3rd reboot all ports are stuck again at status unknown; see the log.

#OPENWRT_RELEASE="OpenWrt 22.03.0-rc5 r19523-bfd070e7fa"
$ logread -te realtek-poe
Tue Jul 19 21:04:57 2022 [1658264697.251] daemon.notice realtek-poe: MCU didn't like our command because request-bad-checksum
Tue Jul 19 21:04:57 2022 [1658264697.251] daemon.notice realtek-poe:    CMD:    15 25 14 01 15 01 16 01 17 01 ff 93
Tue Jul 19 21:04:57 2022 [1658264697.251] daemon.notice realtek-poe:    reply:  fe ff ff ff ff ff ff ff ff ff ff f4
Tue Jul 19 21:05:00 2022 [1658264700.105] daemon.err realtek-poe: received unsolicited reply
Tue Jul 19 21:05:00 2022 [1658264700.105] daemon.err realtek-poe: received unsolicited reply
Tue Jul 19 21:05:00 2022 [1658264700.111] daemon.err realtek-poe: received unsolicited reply

Doing a /etc/init.d/poe restart instead of reload fixes the status again, although some errors still show up in the log afterwards when doing some PoE device swaps.

Tue Jul 19 21:05:00 2022 [1658264700.105] daemon.err realtek-poe: received unsolicited reply
Tue Jul 19 21:05:00 2022 [1658264700.105] daemon.err realtek-poe: received unsolicited reply
Tue Jul 19 21:05:00 2022 [1658264700.111] daemon.err realtek-poe: received unsolicited reply
Tue Jul 19 21:08:18 2022 [1658264898.417] daemon.notice realtek-poe: MCU didn't like our command because request-bad-checksum
Tue Jul 19 21:08:18 2022 [1658264898.417] daemon.notice realtek-poe:    CMD:    30 47 09 ff ff ff ff ff ff ff ff 78
Tue Jul 19 21:08:18 2022 [1658264898.417] daemon.notice realtek-poe:    reply:  fe ff ff ff ff ff ff ff ff ff ff f4
Tue Jul 19 21:11:49 2022 [1658265109.092] daemon.err realtek-poe: Aborting request (30) after 11 attempts
Tue Jul 19 21:11:49 2022 [1658265109.092] daemon.notice realtek-poe: MCU didn't like our command because request-bad-checksum
Tue Jul 19 21:11:49 2022 [1658265109.092] daemon.notice realtek-poe:    CMD:    26 ed 15 ff ff ff ff ff ff ff ff 20
Tue Jul 19 21:11:49 2022 [1658265109.092] daemon.notice realtek-poe:    reply:  fe ff ff ff ff ff ff ff ff ff ff f4
Tue Jul 19 21:13:46 2022 [1658265226.361] daemon.err realtek-poe: Aborting request (28) after 11 attempts
Tue Jul 19 21:13:54 2022 [1658265234.320] daemon.notice realtek-poe: MCU didn't like our command because request-bad-checksum
Tue Jul 19 21:13:54 2022 [1658265234.320] daemon.notice realtek-poe:    CMD:    28 4b 14 01 15 01 16 01 17 01 ff cc
Tue Jul 19 21:13:54 2022 [1658265234.320] daemon.notice realtek-poe:    reply:  fe ff ff ff ff ff ff ff ff ff ff f4
Tue Jul 19 21:14:11 2022 [1658265251.101] daemon.err realtek-poe: Aborting request (26) after 11 attempts
Tue Jul 19 21:14:11 2022 [1658265251.135] daemon.notice realtek-poe: MCU didn't like our command because request-bad-checksum
Tue Jul 19 21:14:11 2022 [1658265251.135] daemon.notice realtek-poe:    CMD:    30 29 12 ff ff ff ff ff ff ff ff 63
Tue Jul 19 21:14:11 2022 [1658265251.135] daemon.notice realtek-poe:    reply:  fe ff ff ff ff ff ff ff ff ff ff f4

EDIT: Altering the START=80 value in /etc/init.d/poe to START=99 also doesn't affect the status; maybe there is another way to delay it?

mrnuke commented 2 years ago

CI#44 is the same as CI#45: when the pull request was merged, it triggered a new CI run, which became CI#45.

I don't think the START= value in /etc/init.d/poe has anything to do with this issue. realtek-poe spends most of its time sleeping, waiting on slow serial IO. Even if the system is dead busy, I can't imagine any delays being significant in terms of how far apart in time the packets are spaced.

I also think this is more likely to occur on a cold boot. We're not sending a "global disable" command, so if ports have already been configured, there's less work for the MCU to do. This would be consistent with the hypothesis that the MCU is overloaded.

Could this also be a completely random transmission failure? A slow optoisolator design? Possible, but unlikely. It might be a good idea to read out the serial port error counters, if they exist.
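
As for the retry mechanism visible in the logs above ("Aborting request (30) after 11 attempts"), the idea is simply a bounded resend loop. A minimal sketch, with hypothetical names standing in for the daemon's actual serial I/O path:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_ATTEMPTS 11

/* Stand-in for the daemon's send-and-wait-for-reply path; assumed to
 * return true when the MCU accepts the command. Stubbed for illustration. */
static bool poe_cmd_try(const uint8_t cmd[12])
{
    (void)cmd;
    return false;
}

/* Resend a rejected command a bounded number of times before giving up,
 * mirroring the "Aborting request (xx) after 11 attempts" log lines. */
static bool poe_cmd_send_retry(const uint8_t cmd[12])
{
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
        if (poe_cmd_try(cmd))
            return true;
    }

    fprintf(stderr, "Aborting request (%02x) after %d attempts\n",
            cmd[0], MAX_ATTEMPTS);
    return false;
}
```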

Hurricos commented 2 years ago

Even if the system is dead busy, I can't imagine any delays being significant in terms of how far in time the packets are spaced.

I can test this in practice by reading the serial line, though I'm wondering whether something more sensitive than a Saleae Logic clone would make sense here: perhaps when the system is busy, other data lines near UART1 are causing noise to spread.

A similar idea is that the RTL838x might not be that well-constructed, or that this implementation of it (the GS1900-24HPv1 board) is not ideal for operation while the system is busy doing other things. Note that the SoC package can optionally switch UART1 from pins 116 and 117 to a separate SPI line (based on the GMII_INTF_SET register). I have no doubt that this is set correctly; I'm just making the point that it might not be the cleanest line implementation on the board, and the operation of other peripherals might be harming communication between the MCU and ttyS1.

It's clear that delaying when realtek-poe starts lowers the rate of problems; we can reproduce that. How it does so is a different question, and how we fix it is yet another (given the existence of #15 anyhow).

mrnuke commented 2 years ago

A similar idea is that the RTL838x might not be that well-constructed,

If we're looking for signal integrity issues, we need an oscilloscope, not a logic analyzer. Does the signal reach the correct voltage level? Does it cross the isolation barrier cleanly? We don't see these issues on the vendor firmware, right? What do you think the chances are of this being an analog problem at a rate of 19200 baud?

mrnuke commented 2 years ago

Does this still happen with CI57?

mrnuke commented 1 year ago

Given that we haven't seen any complaints about this in the past two weeks, I am closing this as resolved.

walterav1984 commented 1 year ago

Is it still useful to test some of these late CI builds on 22.03-rc, since the official 22.03 and master snapshot packages contain most of these upstreamed patches, if I'm not mistaken?

mrnuke commented 1 year ago

Only if you see issues with the package in 22.03.

walterav1984 commented 1 year ago

Just tested 22.03 with the realtek-poe package from snapshot, since I didn't find one in the 22.03 release yet.

After the first cold boot, all ports are stuck on status 'unknown' again, and PoE power output is not working for the configured ports. The usual /etc/init.d/poe restart fixes not only the status but also the actual PoE power output for the configured ports.

Did patch CI57 already land in realtek-poe_382c60e7-1_mips_4kec.ipk, or is it still useful to test?

A workaround for me would probably be a couple seconds of sleep followed by a poe restart in /etc/rc.local.

mrnuke commented 1 year ago

I see. I need to update the packages feed. CI57 should fix this in the meantime. Please make a note of it here if it doesn't.

mrnuke commented 1 year ago

Actually, CI 58 is the latest and greatest (realtek-poe v1.0)

walterav1984 commented 1 year ago

Just tried CI58, which still manifests the unknown port status bug, already after the first reboot:

ubus call poe info
{
    "firmware": "v17.1",
    "mcu": "ST Micro ST32F100 Microcontroller",
    "budget": 170.000000,
    "consumption": 0.000000,
    "ports": {
        "lan1": {
            "priority": 2,
            "status": "unknown"
        },
        "lan2": {
            "priority": 2,
            "status": "unknown"
        },
        "lan3": {
            "priority": 2,
            "status": "unknown"
        },
        "lan4": {
            "priority": 2,
            "status": "unknown"
        },
        "lan5": {
            "priority": 2,
            "status": "unknown"
        },
        "lan6": {
            "priority": 2,
            "status": "unknown"
        },
        "lan7": {
            "priority": 2,
            "status": "unknown"
        },
        "lan8": {
            "priority": 2,
            "status": "unknown"
        },
        "lan9": {
            "priority": 2,
            "status": "unknown"
        },
        "lan17": {
            "priority": 2,
            "status": "unknown"
        }
    }
}

Is there a way to start /etc/init.d/poe with the debug parameter, so I can show debug info?

mrnuke commented 1 year ago
logread -te tek-poe
walterav1984 commented 1 year ago

After the first cold boot, this is the output:

logread -te poe
Sat Sep  3 03:53:14 2022 [1662177194.317] kern.info kernel: [    0.511414] gpio-464 (poe_enable): hogged as output/high
Sat Sep  3 03:53:28 2022 [1662177208.826] daemon.err realtek-poe: received unsolicited reply

Don't mind the poe config I used; I just enabled ports 1-8 plus 9 and 17, so nothing strange there...

mrnuke commented 1 year ago
logread -te poe
Sat Sep  3 03:53:14 2022 [1662177194.317] kern.info kernel: [    0.511414] gpio-464 (poe_enable): hogged as output/high
Sat Sep  3 03:53:28 2022 [1662177208.826] daemon.err realtek-poe: received unsolicited reply

Looks like I need to add a command to dump the state machine status. Meanwhile, you can edit /etc/init.d/poe to add the -d argument to realtek-poe -- but don't make a habit out of it.

walterav1984 commented 1 year ago

I did some tests, in the past and recently, with realtek-poe -d, but I couldn't get it to work from /etc/init.d/ because it may need some stdout/stderr redirect foo?

mrnuke commented 1 year ago

Even simpler: edit /etc/init.d/poe, reboot and do logread -te tek-poe when you see the issue on boot.

walterav1984 commented 1 year ago

I did try that earlier with single/double quotes, backticks, and even rewriting it as a variable with $(), but realtek-poe won't start at all from /etc/init.d/poe as soon as a parameter/argument is added to the program in that script.

Starting it from /etc/rc.local is possible; however, that runs later during boot, so it didn't trigger the unknown port issue.

mrnuke commented 1 year ago

Does this not work? procd_set_param command /usr/bin/realtek-poe -d

walterav1984 commented 1 year ago

:man_facepalming:

It does indeed work with debug output, see the pastebin. This is after the first reboot with no PoE devices connected, only management on eth3.

mrnuke commented 1 year ago

Gah! pastebin is a minefield of advertisements and trackers.

TLDR

This will take a bit to enginerd a code fix. A temporary workaround is to make realtek-poe start up later in the boot process.

Analysis

root@OpenWrt:~# logread -te poe
Sat Sep  3 04:06:34 2022 [1662177994.933] daemon.debug realtek-poe: TX -> 30 04 01 ff ff ff ff ff ff ff ff 2d
Sat Sep  3 04:06:34 2022 [1662177994.954] daemon.debug realtek-poe: RX <- ff ff 7b 30 04 01 00 00 00 00 00 cf
Sat Sep  3 04:06:34 2022 [1662177994.954] daemon.debug realtek-poe: received reply with bad checksum

That's problematic because the actual packet the MCU sent is 30 04 01 00 .... It's interpreted as ff ff 7b 30 ... because there are some extra bytes in the RX buffer from earlier on. Unfortunately, the log is truncated, and we don't see the original cause.

What's happening is an off-by-n error in interpreting packets from the stream. It's very likely the MCU was still transmitting something when realtek-poe started, and that threw us off. If we had a serial trace, we'd likely see the bootloader send commands, then realtek-poe start up while the MCU is still replying.

The fix is a better mechanism for handling checksum errors. It needs to account for both extra bytes in the stream (the current shituation) and missing bytes. The latter requires discarding the entire packet, which affects the reply handler as well.
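
A minimal sketch of what the extra-bytes case could look like: on a bad checksum, slide a 12-byte window along the RX buffer, discarding leading bytes until a frame's checksum verifies. This is an illustration with hypothetical names, not the daemon's actual code:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Resync for the "extra bytes in the stream" case: slide a 12-byte
 * window over the RX buffer, dropping leading bytes until the checksum
 * (low byte of the sum of the first 11 bytes) matches the 12th byte.
 * Returns the number of bytes discarded, or -1 if no aligned frame
 * was found. */
static int poe_resync(uint8_t *buf, size_t *len)
{
    for (size_t skip = 0; *len - skip >= 12; skip++) {
        unsigned sum = 0;

        for (size_t i = 0; i < 11; i++)
            sum += buf[skip + i];

        if ((sum & 0xff) == buf[skip + 11]) {
            /* Realign the buffer on the verified frame. */
            memmove(buf, buf + skip, *len - skip);
            *len -= skip;
            return (int)skip;
        }
    }

    return -1;
}
```

The missing-bytes case is the harder one: a short frame can only be recognized by a timeout, at which point the partial frame has to be thrown away and the in-flight request failed, which is why it touches the reply handler too.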

walterav1984 commented 1 year ago

Unfortunately, the log is truncated, and we don't see the original cause.

:man_facepalming:

To quote myself... and repeat the past!

First I had to increase the logd size in /etc/config/system to save all messages

Here a full log from coldboot with unknown status: https://github.com/walterav1984/realtek-poe/commit/e1fed0ff0ff7ac7ec4ecec6eb94787ad440c6a8e

mrnuke commented 1 year ago

Thank you! Actually, the full log is even more interesting.

Initialization starts off and proceeds fine:

Sat Sep  3 04:12:05 2022 daemon.debug realtek-poe: TX -> 18 09 05 06 a4 00 aa ff ff ff ff 76
Sat Sep  3 04:12:05 2022 daemon.debug realtek-poe: RX <- 18 09 05 00 ff ff ff ff ff ff ff 1f

And then the MCU issues an error:

Sat Sep  3 04:12:06 2022 daemon.notice realtek-poe: MCU rejected command: request-bad-checksum
Sat Sep  3 04:12:06 2022 daemon.notice realtek-poe:     CMD:    18 0a 06 06 a4 00 aa ff ff ff ff 78
Sat Sep  3 04:12:06 2022 daemon.notice realtek-poe:     reply:  fe ff ff ff ff ff ff ff ff ff ff f4

So far so good. That packet is something realtek-poe can handle. However, our problems start right after that:

Sat Sep  3 04:12:06 2022 daemon.debug realtek-poe: RX <- fe 00 ff ff ff ff ff ff ff ff ff ff
Sat Sep  3 04:12:06 2022 daemon.debug realtek-poe: received reply with bad checksum
Sat Sep  3 04:12:06 2022 daemon.debug realtek-poe: RX <- fe ff ff ff ff ff ff ff ff f4 fe ff
Sat Sep  3 04:12:06 2022 daemon.debug realtek-poe: received reply with bad checksum

Here, bytes start going missing. I'm not sure about the 00 byte; that could be a break condition.

The fix is likely what I already discussed. I have some ideas. Reopening the issue.
The fix is likely what I already discussed. I have some ideas. Reopening issue.