earlephilhower / arduino-pico

Raspberry Pi Pico Arduino core, for all RP2040 boards
GNU Lesser General Public License v2.1
1.92k stars 400 forks source link

DHCP renew issues > 3.6.0 #1809

Closed Ing-Dom closed 9 months ago

Ing-Dom commented 9 months ago

with core 3.6.0 I used some code like that

void IPConfigModule::loop(bool configured)
{
    if(!delayCheckMillis(_lastLinkCheck, 500))
        return;

    uint8_t newLinkState = KNX_NETIF.isLinked();

    // got link
    if(newLinkState && !_linkstate)
    {

        logInfoP("LAN Link established.");
        netif_set_link_up(KNX_NETIF.getNetIf());
    }

    // lost link
    else if(!newLinkState && _linkstate)
    {
        netif_set_link_down(KNX_NETIF.getNetIf());
        logInfoP("LAN Link lost.");
    }
    _linkstate = newLinkState;
    _lastLinkCheck = millis();
}

the purpose was:

and it worked finde with the core 3.6.0 (despite the isLinked() issue after some while). Also confirmed with Wireshark !

I am pretty sure that behaviour is a standard behaviour and it is implemented within lwip, see https://github.com/lwip-tcpip/lwip/blob/239918ccc173cb2c2a62f41a40fd893f57faf1d6/src/core/netif.c#L1018-L1049 and https://github.com/lwip-tcpip/lwip/blob/239918ccc173cb2c2a62f41a40fd893f57faf1d6/src/core/ipv4/dhcp.c#L897-L925

since the fix of the isLinked issue with this commit https://github.com/earlephilhower/arduino-pico/commit/ea936f00e1732b9fc861c81f33c216075f03b03c this is not working anymore.

when detecting link up in the loop, there is no ip address assigned, and I don't see any DHCP messages from the device (previously, I did). It works when link is up when powering up.

I can't really explain, but I checked it twice by running the exact same program, the only difference is platform_packages = framework-arduinopico @ https://github.com/earlephilhower/arduino-pico.git#8ac616e and platform_packages = framework-arduinopico @ https://github.com/earlephilhower/arduino-pico.git (which is #ea936f0)

I can't really explain why, but that is what I think: the code causing the "lags" which also disappeard with #ea936f0 was also the code handling some event-driven stuff in lwip, such as renewing dhcp... and now, it seems that code is not executed anymore, or, at least in some different manner.

maybe, something to do with LWIP_ASSERT_CORE_LOCKED ? but it seems to be defined empty in your lwipopts.h ?

earlephilhower commented 9 months ago

While I believe you're seeing what you're reporting, the change in #1805 is completely isolated from LWIP and the isLinked() function itself is never used in the core/libraries proper. The locking is only around the SPI calls which do not know about LWIP, either. It is very likely you experiences memory corruption w/o this locking.

Per the LWIP docs, it seems that you need to explicitly tell DHCP to restart once a link/network change has occurred: https://www.nongnu.org/lwip/2_0_x/dhcp_8c.html#a04f3824720223c439165243527906002

After link up, can you add a dhcp_network_changed(KNX_NETIF.getNetIf());

Ing-Dom commented 9 months ago

I'll give it a try, but netif_set_link_up should call dhcp_network_changed anyway.

I don't understand this. dhcp_network_changed_link_up(KNX_NETIF.getNetIf()); works.

But: netif_set_link_up calls dhcp_network_changed_link_up also - why it isn't working? unfortunately, I'm not able to debug lwip, ozone does not find the source in /home/earle/....

netif_set_link_up(struct netif *netif)
{
  LWIP_ASSERT_CORE_LOCKED();

  LWIP_ERROR("netif_set_link_up: invalid netif", netif != NULL, return);

  if (!(netif->flags & NETIF_FLAG_LINK_UP)) {
    netif_set_flags(netif, NETIF_FLAG_LINK_UP);

#if LWIP_DHCP
    dhcp_network_changed_link_up(netif);
#endif /* LWIP_DHCP */

https://github.com/lwip-tcpip/lwip/blob/239918ccc173cb2c2a62f41a40fd893f57faf1d6/src/core/netif.c#L1018-L1049 ??

earlephilhower commented 9 months ago

That's darn odd. If this is stable, it should be reproducible on any wired Ethernet setup just by calling link down/up. It's in the queue here after #1812...

Ing-Dom commented 9 months ago

I got the debugger running with some path replacement, I'll try to find out something. Do you know how I can enable LWIP debug trace?

earlephilhower commented 9 months ago

In the libpico directory there's a LWIP config file you can edit. You'll also need to enable Tools->Debug Port->Serial (or else printf won't output anything). With that configured you can get some really verbose dumps.

Ing-Dom commented 9 months ago

so lwip is NOT precompiled? These changes really have an effect?

earlephilhower commented 9 months ago

LWIP/SDK are precompiled (in multiple forms). Edit the header and rerun the ./make-libpico.sh script to make it work. The cmake SDK and the Arduino build system are definitely not compatible. 😢 Sorry if not clear!

Ing-Dom commented 9 months ago

no, there is a lot ongoing in the core that is absolutley not clear to me ;)

I use Plattform IO, but with -DDEBUG_RP2040_PORT=Serial I was able to enable the debug output.

But I only get this stuff, no lwip message. Whats clear now, because lwip is precompiled. unfortunately I'm on a windows machine and not really familiar with setting up cmake etc...

:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 18
:urd 18, 18, 0
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 21
:urd 21, 21, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 18
:urd 18, 18, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 18
:urd 18, 18, 0
:urn 19
:urd 19, 19, 0
:urn 21
:urd 21, 21, 0
:urn 17
:urd 17, 17, 0
:urd 17, 17, 0
:urd 17, 17, 0
:urn 17
:urch 45, 18
:urd 17, 17, 0
:urd 18, 18, 0
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 17
:urd 17, 17, 0
:urn 17
:urd 17, 17, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
:urn 19
:urd 19, 19, 0
earlephilhower commented 9 months ago

Expand and overwrite the files in your lib directory with these prebuilt .as: dbg.zip and you should start getting verbose output like:

etharp_find_entry: no empty entry found and not allowed to recycle
etharp_input: incoming ARP request
etharp_input: ARP request was not for us.
pbuf_free(0x2000ca98)
pbuf_free: deallocating 0x2000ca98
pbuf_alloc(length=60)
pbuf_alloc(length=60) == 0x2000ca98
ethernet_input: dest:ff:ff:ff:ff:ff:ff, src:c4:2f:90:1c:eb:16, type:806
pbuf_remove_header: old 0x2000caa8 new 0x2000cab6 (14)
etharp_update_arp_entry: 192.168.1.22 - c4:2f:90:1c:eb:16
etharp_find_entry: found empty entry 0
etharp_find_entry: no empty entry found and not allowed to recycle
etharp_input: incoming ARP request
etharp_input: ARP request was not for us.
pbuf_free(0x2000ca98)
pbuf_free: deallocating 0x2000ca98
Ing-Dom commented 9 months ago

got it running, thanks! There is now a lot ongoing on the serial, I'll need to setup and test my cases, then I will report back.

Ing-Dom commented 9 months ago

console_lwipdebug_core_4625d2c_without_extradhcp4.txt console_lwipdebug_core_4625d2c_without_extradhcp4 txt This is a trace with the current master branch of the core, and with your modified binaries. Ethernet is linked on startup. The first DHCP works. And is visible in wireshark. After a unplug-plug (LAN Link established." Line 1799 netif_set_link_up is called from my application. lwip is doing some DHCP stuff, and if I understand it right, sends some DHCP packes. But they don't show up on Wireshark, nor are they answered.

console_lwipdebug_core_4625d2c_without_extradhcp5_nolinkstartup.txt console_lwipdebug_core_4625d2c_without_extradhcp5_nolinkstartup txt This is a trace with the current master branch of the core, and with your modified binaries. Ethernet is DOWN on startup. After pluging (LAN Link established." Line 261 netif_set_link_up is called from my application. lwip is doing some DHCP stuff, and if I understand it right, sends some DHCP packets. But they don't show up on Wireshark, nor are they answered.

earlephilhower commented 9 months ago

....

lwip is doing some DHCP stuff, and if I understand it right, sends some DHCP packes. But they don't show up on Wireshark, nor are they answered.

I see in console_lwipdebug_core_4625d2c_without_extradhcp4.txt line 1942 the link going up and multiple DHCP requests trying to be sent out.

I think the state of the interface is such that it drops any packets sent its way on the floor.

Try adding either KNX_NETIF.getNetIf()->flags |= NETIF_FLAG_UP; or netif_set_up(KNX_NETIF.getNetIf()); immediately after you netif_set_link_up.

Per the docs: https://www.nongnu.org/lwip/2_0_x/group__netif.html#gaf19693be401a265a52d2a56c65753121

Bring an interface up, available for processing traffic.

Ing-Dom commented 9 months ago

and now I added an additional dhcp_network_changed_link_up

    // got link
    if(newLinkState && !_linkstate)
    {

        logInfoP("LAN Link established.");
        netif_set_link_up(KNX_NETIF.getNetIf());

        // work around TODO
        dhcp_network_changed_link_up(KNX_NETIF.getNetIf());
    }
that results in the correct behaviour:

console_lwipdebug_core_4625d2c_with_extradhcp1.txt console_lwipdebug_core_4625d2c_with_extradhcp1 txt

there is an additional dhcp_reboot() directly after the first (i think triggered by the add. dhcp_network_changed_link_up ) and there is an answer...

Do you agree, that it SHOULD work with no additional dhcp_network_changed_link_up? Because netif_set_link_up calls dhcp_network_changed_link_up anyway?

Ing-Dom commented 9 months ago

I haven't recognized your reply before posting this, I read it now.

earlephilhower commented 9 months ago

I'm not an expert, but to even get DHCP working on initial setup there is an add'l step required to set the netif_up flag or DHCP requests will not be sent out. So I think you really do need a multi-step restart after a link down event to get things working.

I think the core and LWIP are all behaving as intended, and this is just one of those LWIP quirks that people run into.

Ing-Dom commented 9 months ago

you are right. After a unplug - plug and calling netif_set_link_down / up no communication is possible. I does also not respond to ping anymore. netif_set_up(KNX_NETIF.getNetIf()); doesn't change the behaviour.

but, as I wrote before, dhcp_network_changed_link_up does.

I'll try unplug - plug now WITHOUT calling netif_set_link_xxx ...

Ing-Dom commented 9 months ago

I think the core and LWIP are all behaving as intended, and this is just one of those LWIP quirks that people run into.

that may be the case... The only thing what is really strange that it worked with only calling set_link_up/down with 3.6.0. And after the spi fix, not anymore. I really would like to understand that, but I likely never will find out ;)

Ing-Dom commented 9 months ago

If I find out how to do this correctly, I think the link state handling should be located in the driver. PR regarding that welcome?

earlephilhower commented 9 months ago

Sure, if you think it actually makes a difference.

Right now you're the only person ever taking a link down/up (that includes the PicoW wifi driver) so I'm not quite sure that all your work is needed. But, if it doesn't break things then it would be more "correct" which is a nice property to have. 😆

earlephilhower commented 9 months ago

I think we're done with this debug. Closing for now.