espressif / arduino-esp32

Arduino core for the ESP32
GNU Lesser General Public License v2.1

ETH PHY disconnects *very* frequently whenever AP or STA is active #7796

Open 0x0fe opened 1 year ago

0x0fe commented 1 year ago

Board

ESP32 Dev Module

Device Description

Proprietary board based on ESP32-WROOM-32E and RTL8201F, using APLL on GPIO17.

Hardware Configuration

PHY CLK from APLL via GPIO17.

Version

latest master (checkout manually)

IDE Name

arduino IDE

Operating System

windows 10

Flash frequency

80M

PSRAM enabled

no

Upload speed

921600

Description

The Ethernet link is up, the IP is assigned by DHCP, and the standard client test is used (GET on google). When AP and STA are both off there is no problem: all client calls are processed correctly and there are no Ethernet disconnects. However, when AP or STA is enabled, the Ethernet link disconnects at pretty much every client call, as can be seen below. While I admit it is dubious to enable STA mode at the same time as Ethernet, I tested it because the clients said they "configure" the STA "without connecting it" (whatever this means), so I had to test. On the other hand, I see no reason why AP mode would cause any problem: it can be useful or necessary to have the AP running at the same time as Ethernet, and I don't see why or how they could interfere with each other.

The board power rails are as follows: DC 5V power in -> 600R 0805 FB -> 1 22uF 50V electrolytic + 10uF 16V ceramic -> [RT9013] -> 600R 0805 FB -> 2 22uF 16V -> ESP32 VDD -> 600R 0805 FB -> 1 22uF 16V -> RTL8201F AVDD -> 1 22uF 16V -> RTL8201F DVDD

Of course there are multiple 100nF X7R decoupling caps all over, in particular before and after the LDO, and 6 around the RTL8201F, etc.

The RT9013 can provide 500mA @ 3.3V with good PSRR. While 500mA does not leave much margin for peaks, I doubt available current is the problem, given all the storage capacitance on the board. I will still verify that with a scope and sourcemeter. I did not get any brownout either.

I suspect this is related to the TCP/IP handling or the ETH driver.

Sketch

#include <Arduino.h>
#include <ETH.h>

//#define ETH_CLK_MODE    ETH_CLOCK_GPIO0_IN  
#define ETH_CLK_MODE    ETH_CLOCK_GPIO17_OUT
#define ETH_TYPE        ETH_PHY_RTL8201 
#define ETH_POWER_PIN   5
#define ETH_ADDR        0x01 
#define ETH_MDC_PIN     23
#define ETH_MDIO_PIN    18

IPAddress voidAddr = IPAddress(0,0,0,0);
static bool eth_connected=false;
unsigned long c=0;

void EthEvent(WiFiEvent_t event){

  switch (event)
  {
    case ARDUINO_EVENT_ETH_START:
      Serial.println("ETH Started");
      ETH.setHostname("esp32-ethernet");
    break;
    case ARDUINO_EVENT_ETH_CONNECTED:
      Serial.println("ETH Connected");
    break;
    case ARDUINO_EVENT_ETH_GOT_IP:
      Serial.print("ETH MAC: ");
      Serial.println(ETH.macAddress());
      Serial.print("IPv4: ");
      Serial.println(ETH.localIP());
      if (ETH.fullDuplex()){Serial.print("FULL_DUPLEX");}
      Serial.print(" ");
      Serial.print(ETH.linkSpeed());
      Serial.println("Mbps");
      eth_connected = true;
    break;
    case ARDUINO_EVENT_ETH_DISCONNECTED:
      Serial.println("\nETH Disconnected");
      eth_connected = false;
    break;
    case ARDUINO_EVENT_ETH_STOP:
      Serial.println("ETH Stopped");
      eth_connected = false;
    break;
    default:
    break;
  }
}
void testClient(const char *host, uint16_t port) {

  Serial.print("connect ");
  Serial.print(host);
  WiFiClient client;
  if(!client.connect(host, port)) {
    Serial.println(" ... failed");
    return;
  }
  client.printf("GET / HTTP/1.1\r\nHost: %s\r\n\r\n", host);
  while (client.connected() && !client.available()) delay(1);
  //while (client.available()) Serial.write(client.read());
  while (client.available()){ client.read();  }
  Serial.println(" ... closing");
  client.stop();
}
void wifi_sta(void){

  WiFi.begin("SFR_6200","inlandser");
  Serial.print("Connecting");
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
    Serial.print(".");
  }
  Serial.println();
  Serial.print("Wifi connected, IP address: ");
  Serial.println(WiFi.localIP());
}
void wifi_ap(void){

  WiFi.mode(WIFI_AP);
  WiFi.softAP("ESP32_AP");
  Serial.print("AP Created with IP Gateway ");
  Serial.println(WiFi.softAPIP());
}
void wifi_off(void){

  WiFi.disconnect(true);
  WiFi.mode(WIFI_OFF);  
}
void setup() {

  Serial.begin(115200);
  Serial.println("system init");
  wifi_off();
  //wifi_sta();
  wifi_ap();
  WiFi.onEvent(EthEvent);
  ETH.begin(ETH_ADDR,ETH_POWER_PIN,ETH_MDC_PIN,ETH_MDIO_PIN,ETH_TYPE,ETH_CLK_MODE);
  // trick to force DHCP re-acquisition https://github.com/espressif/arduino-esp32/issues/7795
  ETH.config(voidAddr,voidAddr,voidAddr,voidAddr);
}
void loop() {

  if(eth_connected && c++==50) {
    testClient("www.google.com", 80);
    c=0;
  }
  delay(5);
}

Debug Message

***************with only STA active**************
system init
Connecting.
Wifi connected, IP address: 192.168.1.34
ETH Started
ETH Connected
ETH MAC: 94:B5:55:6B:5D:B3
IPv4: 192.168.1.37
FULL_DUPLEX 100Mbps
connect www.google.com ... closing
connect www.google.com
ETH Disconnected
 ... closing
ETH Connected
ETH MAC: 94:B5:55:6B:5D:B3
IPv4: 192.168.1.37
FULL_DUPLEX 100Mbps
connect www.google.com ... closing
connect www.google.com
ETH Disconnected
 ... closing
ETH Connected
ETH MAC: 94:B5:55:6B:5D:B3
IPv4: 192.168.1.37
FULL_DUPLEX 100Mbps
connect www.google.com ... closing
connect www.google.com
ETH Disconnected
 ... closing
ETH Connected
ETH MAC: 94:B5:55:6B:5D:B3
IPv4: 192.168.1.37
FULL_DUPLEX 100Mbps
connect www.google.com ... closing
connect www.google.com
ETH Disconnected
 ... closing
ETH Connected
ETH MAC: 94:B5:55:6B:5D:B3
IPv4: 192.168.1.37
FULL_DUPLEX 100Mbps
connect www.google.com ... closing
connect www.google.com
ETH Disconnected
 ... closing
ETH Connected
ETH MAC: 94:B5:55:6B:5D:B3
IPv4: 192.168.1.37
FULL_DUPLEX 100Mbps
connect www.google.com ... closing
connect www.google.com ... closing

ETH Disconnected
ETH Connected
ETH MAC: 94:B5:55:6B:5D:B3
IPv4: 192.168.1.37
FULL_DUPLEX 100Mbps
connect www.google.com ... closing
connect www.google.com ... closing

***************with only AP active**************
system init
AP Created with IP Gateway 192.168.4.1
ETH Started
ETH Connected
ETH MAC: 94:B5:55:6B:5D:B3
IPv4: 192.168.1.37
FULL_DUPLEX 100Mbps
connect www.google.com ... closing
connect www.google.com
ETH Disconnected
[  3786][E][WiFiClient.cpp:268] connect(): socket error on fd 48, errno: 113, "Software caused connection abort"
 ... failed
ETH Connected
ETH MAC: 94:B5:55:6B:5D:B3
IPv4: 192.168.1.37
FULL_DUPLEX 100Mbps
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com
ETH Disconnected
[ 21786][E][WiFiClient.cpp:268] connect(): socket error on fd 48, errno: 113, "Software caused connection abort"
 ... failed

***************with WiFi OFF**************
system init
ETH Started
ETH Connected
ETH MAC: 94:B5:55:6B:5D:B3
IPv4: 192.168.1.37
FULL_DUPLEX 100Mbps
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing
connect www.google.com ... closing

Other Steps to Reproduce

No response

I have checked existing issues, online documentation and the Troubleshooting Guide

SuGlider commented 1 year ago

@0x0fe - Interesting issue.

A few quick thoughts:

0x0fe commented 1 year ago

When ESP32 has ETH and STA working, how should it work? What is the default route? IDF has no router software implemented. Therefore, it may be necessary for the application to do all the routing.

When ETH and STA are UP, we should ideally be able to define at the client level which link is used, or a priority. Two hypothetical scenarios as illustration:

If we could define which link should be used by which client instance, it would be easy to set up.
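The idea of choosing a link per client, or by priority, can be sketched roughly as follows. This is a plain C++ model only: `Link` and `pickLink` are hypothetical names made up for illustration, not part of lwIP, ESP-IDF or the Arduino core, and the priority values are arbitrary.

```cpp
#include <string>
#include <vector>

// Hypothetical model of a network interface; not an ESP-IDF or lwIP type.
struct Link {
  std::string name;  // e.g. "eth", "sta", "ap"
  bool up;           // link is up and has an IP address
  int priority;      // higher value wins when several links are up
};

// Returns the link a client should use: the pinned one if it is up,
// otherwise the highest-priority link that is up, or nullptr if none is up.
const Link* pickLink(const std::vector<Link>& links,
                     const std::string& pinned = "") {
  const Link* best = nullptr;
  for (const Link& l : links) {
    if (!l.up) continue;                                  // skip dead links
    if (!pinned.empty() && l.name == pinned) return &l;   // client-level choice
    if (best == nullptr || l.priority > best->priority) best = &l;
  }
  return best;
}
```

A client that does not care would call `pickLink(links)` and get the preferred link that is currently up; a client that must use one specific link would pass its name as the second argument.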

When ESP32 has ETH and AP working, what is the default router address? How should it work?

When ETH and AP are UP, the router address should be the one defined by DHCP on ETH, and the AP should serve whatever device connects to it (if a client or server is attached to this link). Whether a client connecting to the AP should be able to access the WAN/router address is outside the scope of this issue ticket, but I guess it should be allowed too, if the current TCP/IP stack can support it.

Is ESP32 IDF lwIP really handling more than one network interface? I think that one may be messing with the other.

lwIP can have multiple network interfaces. I have not worked with it in years, so I really can't recall how it is handled internally or what routing is allowed. I agree that this is likely the root cause of the disconnections. Also, STA and AP disturb the ETH link in different ways: AP throws the error [ 3786][E][WiFiClient.cpp:268] connect(): socket error on fd 48, errno: 113, "Software caused connection abort", while STA does not throw errors.

0x0fe commented 1 year ago

so, i guess that would be how lwIP binds one interface to one socket https://github.com/espressif/arduino-esp32/blob/371f382db7dd36c470bb2669b222adf0a497600d/tools/sdk/esp32/include/lwip/lwip/src/include/lwip/api.h#L330

TD-er commented 1 year ago

Keep in mind that whenever you turn off the WiFi, the entry for the DNS server is erased.

Have you tested to only enable the STA/AP or STA+AP interface, without starting a network connection? A very useful use case is to act as an ESP-NOW gateway. If this is working fine, then I guess you may rule out a power issue.

On my boards using the LAN8720 chip (is different, I know), I have isolated nets for power for the LAN chip and the rest. Just to make sure high frequency signals won't cause strange issues. For this I used a single and relatively beefy 0 Ohm resistor to connect both GND planes and a Ferrite Bead with some capacitors to filter out the high frequency between 3V3 nets. The LAN8720 does show strange issues when the power supply isn't adequate (that's why I also added this filter), which can be seen by the connected speed. It then switches between 10 Mbps and 100 Mbps. Maybe that's also happening here with your board? Can you also try by forcing it to 10 Mbps? (N.B. do not try half-duplex, that's not working well, at least not on the LAN8720)

0x0fe commented 1 year ago

There was no hardware issue. What you describe is what was actually implemented (star topology with FBs); it is standard practice for a PHY. A single connection point for AGND is also standard practice, though I would not advise mixing the PHY AGND and DGND. The issues here were solely related to how lwIP is implemented on ESP32: it is not intended to have two interfaces active at the same time. lwIP supports it, but the Espressif SDK doesn't.

TD-er commented 1 year ago

Not 100% sure if this is related and if not, I will open a new issue for this.

I've been troubleshooting one of my nodes here which is running the LAN8720A. I have 2 versions of these. One has an external clock for Ethernet and one uses the internal clock of the ESP via GPIO-17 for the LAN. Also important is that I am also running the WiFi radio continuously, since these boards act as a gateway for lots and lots of ESP-NOW packets from other nodes.

On previous builds this was running fine on both boards, but on the recent builds I noticed that only the board with external crystal was working with LAN. The one using the internal clock wasn't.

My logic analyzer clearly showed the clock was missing on GPIO-17, where it was present on the older builds.

I've now managed to get it to work, but I have yet to revert all my steps to see if only the last step was the real cause, or maybe other steps were also helping in finally solving this issue.

It seems like you really need to start the Eth device soon after boot and make sure you don't start WiFi before calling ETH.begin(). I do register the WiFi callback functions, but only after calling:

WiFi.disconnect(false);

and turning the WiFi off explicitly.

What I did before was, I performed a WiFi scan, then turned WiFi off and then continued to start the Ethernet. This did add a delay and I got the impression that maybe some timer was already taking the clock timer. (maybe some LEDc call to PWM a LED???)

A few of the logs I got then:

E (14723) esp.emac: emac_esp32_init(349): reset timeout
E (14724) esp_eth: esp_eth_driver_install(214): init mac failed
E (15330) esp_netif_lwip: esp_netif_new: Failed to configure netif with config=0x3ffb2660 (config or if_key is NULL or duplicate key)
E (15332) esp.emac: esp_eth_mac_new_esp32(595): alloc emac interrupt failed

15108 : Info   : ETH PHY Type: LAE (15577) esp_netif_lwip: esp_netif_new: Failed to configure netif with config=0x3ffb26e0 (config or if_key is NULL or duplicate key)
E (15589) esp.emac: esp_eth_mac_new_esp32(595): alloc emac interrupt failed
N8710/LAN8720 PHY Addr: 0 Eth Clock mode: 50MHz APLL Inverted Output on GPIO17 MDC Pin: 23 MIO Pin: 18 Power Pin: 12

What I also noticed was that in such setups, if I got it to detect the link being up, that the clock wasn't always present. (the short interruption around 100s is a reboot)

image

As you can see, it only shows a clock for about 10sec.

So I wonder, is it possible the ESP may sometimes lose its interrupt to the clock? Since you're also using the same GPIO pin as clock, I thought it might be interesting to test whether it makes a difference if you initialize Ethernet first and then WiFi.

mrengineer7777 commented 1 year ago

In my project I always try ETH first on a cold boot (detected with RESET_REASON rr = rtc_get_reset_reason(0);). If I don't get a LAN connection in 10s then I save a flag to NVS, reboot and start WiFi. Not pretty but works reliably.
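The decision logic described above could look roughly like this. It is a sketch under assumptions: `Step` and `nextStep` are hypothetical names, the flag is assumed to be read from NVS by the caller, and a warm boot without the flag is assumed to try Ethernet as well (the post does not say).

```cpp
#include <cstdint>

// What the boot code should do next. Hypothetical names for illustration.
enum class Step { UseEthernet, UseWifi, KeepWaiting, SaveFlagAndReboot };

// coldBoot:      true on a power-on reset (Ethernet is always tried first)
// wifiFlag:      fallback flag persisted by a previous boot (e.g. in NVS)
// ethConnected:  Ethernet has a LAN connection
// elapsedMs:     time spent waiting for Ethernet since boot
// timeoutMs:     how long to wait before falling back (10 s in the post)
Step nextStep(bool coldBoot, bool wifiFlag, bool ethConnected,
              uint32_t elapsedMs, uint32_t timeoutMs = 10000) {
  if (!coldBoot && wifiFlag) return Step::UseWifi;   // rebooted for fallback
  if (ethConnected) return Step::UseEthernet;
  if (elapsedMs >= timeoutMs) return Step::SaveFlagAndReboot;
  return Step::KeepWaiting;
}
```

The caller would poll `nextStep` from its startup loop and, on `SaveFlagAndReboot`, write the flag before restarting so that the next (warm) boot goes straight to WiFi.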

mrengineer7777 commented 1 year ago

@0x0fe did you figure out the issue? Was it hardware or software?

0x0fe commented 1 year ago

@mrengineer7777 image

This issue is due to lwIP not arbitrating between multiple active interfaces. lwIP can handle multiple active interfaces, but some additional work is needed to define and handle priorities (I guess), which is not done in the Espressif API (unless this has recently changed).

I told the client about this issue. My task was only to design the hardware, validate it and provide functional test firmware; they were in charge of the application layer. Anyway, it is quite simple to manage once you are aware of the issue.

VojtechBartoska commented 10 months ago

@SuGlider can you please help with the triage, is this something we can improve or no action points are needed? Thanks

bwjohns4 commented 2 months ago

Did anyone see Eth and WiFi successfully working together?