Hurricos / realtek-poe

5 stars 10 forks source link

realtek-poe almost always fails to configure final ports due to busy system at early boot #13

Closed Hurricos closed 1 year ago

Hurricos commented 2 years ago

Issue identification

On reboot, I pretty much always see:

root@OpenWrt:~# ubus call poe info | grep Disabled -A 1 -B 3
        "lan15": {
            "priority": 2,
            "mode": "PoE",
            "status": "Disabled"
        },
--
        "lan18": {
            "priority": 2,
            "mode": "PoE",
            "status": "Disabled"
        },
        "lan19": {
            "priority": 2,
            "mode": "PoE",
            "status": "Disabled"
        },
        "lan20": {
            "priority": 2,
            "mode": "PoE",
            "status": "Disabled"
        },
        "lan21": {
            "priority": 2,
            "mode": "PoE",
            "status": "Disabled"
        },
        "lan22": {
            "priority": 2,
            "mode": "PoE",
            "status": "Disabled"
        },
        "lan24": {
            "priority": 2,
            "mode": "PoE",
            "status": "Disabled"
        }
Hurricos commented 2 years ago

Debugging

I suspect this is because attempts to write out to /dev/ttyS1 are failing because a lot is going on at this point in the boot process.

Ultimately: it seems this is because the MCU just isn't ready at the given point in boot (see below).

Solutions

mrnuke commented 2 years ago

I can repro the issue without needing to reboot. It appears problem is related to sending too many packets too fast to the MCU. Eventually it replies with "not ready".

I've been experimenting with a mechanism to re-queue failed commands. So far, that's the most promising lead.

mrnuke commented 2 years ago

No PR exists yet for this code, but to demo the issue:

root@engenius-2910:~# /tmp/realtek-poe 
realtek-poe: MCU didn't like our command because incomplete request
realtek-poe:    CMD:    13 2c 05 02 ff ff ff ff ff ff ff 3f
realtek-poe:    reply:  fd ff ff ff ff ff ff ff ff ff ff f3
realtek-poe: Ate 1 replies
realtek-poe: Ate 0 replies
realtek-poe: Chomp/chomp/chomp
realtek-poe: RX <- fd ff ff ff ff ff ff ff ff ff ff f3
realtek-poe: MCU didn't like our command because incomplete request
realtek-poe:    CMD:    13 2c 05 02 ff ff ff ff ff ff ff 3f
realtek-poe:    reply:  fd ff ff ff ff ff ff ff ff ff ff f3
realtek-poe: Ate 1 replies
realtek-poe: Chomp/chomp/chomp
realtek-poe: RX <- fd ff ff ff ff ff ff ff ff ff ff f3
realtek-poe: MCU didn't like our command because incomplete request
realtek-poe:    CMD:    13 2c 05 02 ff ff ff ff ff ff ff 3f
realtek-poe:    reply:  fd ff ff ff ff ff ff ff ff ff ff f3
realtek-poe: Ate 1 replies
realtek-poe: TIMEOUT rearmed (73 pending)
realtek-poe: TIMEOUT rearmed (128 pending)
realtek-poe: TIMEOUT rearmed (183 pending)

After a number of commands, the reply becomes:

 fd ff ff ff ff ff ff ff ff ff ff f3

Without us sending any more packets, the MCU keeps sending this packet a few times. There isn't any indication that it's ready other than no longer sending this packet. there's no "Okay, I'm ready now message". It's what makes this ordeal so much more fun.

mrnuke commented 2 years ago

With #19 in, I suspect the issue of ports not being correctly configured is gone.

Quick explanation:

I suspect that the MCU has some sort of internal queue for processing requests. We used to send 7 or 8 setup packets for each port. I think we were spamming that queue with setup requests. This wasn't an issue for 8 ports, but when we tripled that to 24, the MCU couldn't handle it.

We got the number of packets down significantly. From ~170 to ~60 packets on a 24-port. And I think the MCU can handle this more reasonably.

Other failures:

If the log shows failures for CMD: 23, 26, 28, or 30, then the ports have been correctly configured. These latter commands are used for querying the status. If we reach these safely, then things are configured.

If the only failures we see in the log are from commands 23, 26, 28, or 30, then this issue can be considered fixed.

mrnuke commented 2 years ago

Does this still happen with CI57 ?

mrnuke commented 1 year ago

Given we haven't seen any complaints about this in the past two weeks, I am closing this as resolved.