espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0

After connecting to the MQTT broker, after a few hours, the message "esp-tls: couldn't get hostname for :xxxxxx.ap-northeast-1.amazonaws.com: getaddrinfo() returns 202, addrinfo=0x0" suddenly appears, and the connection is disconnected. (IDFGH-10734) #11952

Closed imammy-hacomono closed 1 year ago

imammy-hacomono commented 1 year ago

Answers checklist.

IDF version.

v5.0.1

Operating System used.

macOS

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

None

Development Kit.

ESP32-WROOM-32E

Power Supply used.

External 5V

What is the expected behavior?

"E (25774312) esp-tls: couldn't get hostname for :xxxxx-ats.iot.ap-northeast-1.amazonaws.com: getaddrinfo() returns 202, addrinfo=0x0" will no longer occur.

What is the actual behavior?

We are using the W5500 Ethernet chip for MQTT communication with AWS IoT over Ethernet. We have loaded the necessary certificates onto the device, and the connection with the MQTT broker works fine initially.

However, after maintaining the connection for some time, we encounter the following error:

"E (25774312) esp-tls: couldn't get hostname for :xxxxx-ats.iot.ap-northeast-1.amazonaws.com: getaddrinfo() returns 202, addrinfo=0x0"

Could you please explain what might be causing this issue and if there are any workarounds to resolve it?

Steps to reproduce.

  1. Step
  2. Step
  3. Step ...

Debug Logs.

No response

More Information.

No response

imammy-hacomono commented 1 year ago

Immediately before, the following error occurred:

E (8542591) w5500.mac: emac_w5500_alloc_recv_buf(551): invalid frame length 39
E (8542591) w5500.mac: no mem for receive buffer

After executing the Disconnect function from the coreMQTT library, when attempting to reconnect, the following error is encountered:

E (8632331) esp-tls: couldn't get hostname for :xxxxxxxxxx-ats.iot.ap-northeast-1.amazonaws.com: getaddrinfo() returns 202, addrinfo=0x0

Of course, the issue can be resolved by rebooting, but since it occurs once every few hours, I don't want to restart every time.
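
For reference when reading the log above: ESP-IDF resolves hostnames through lwIP, and, as far as I recall from lwIP's netdb.h, its getaddrinfo() error codes start at 200, which would make 202 EAI_FAIL (a non-recoverable name-resolution failure). The following is a minimal host-side C sketch, not ESP-IDF code. The table values are written from memory and should be checked against the lwIP headers in your IDF tree, and the helper names lwip_gai_code_name and parse_gai_code are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* getaddrinfo() error codes as I recall them from lwIP's netdb.h.
 * Assumption: verify these values against your own lwIP headers. */
typedef struct {
    int code;
    const char *name;
} gai_code_entry_t;

static const gai_code_entry_t k_lwip_gai_codes[] = {
    { 200, "EAI_NONAME"  },
    { 201, "EAI_SERVICE" },
    { 202, "EAI_FAIL"    },
    { 203, "EAI_MEMORY"  },
    { 204, "EAI_FAMILY"  },
};

/* Hypothetical helper: map a numeric code to its symbolic name. */
const char *lwip_gai_code_name(int code)
{
    size_t n = sizeof(k_lwip_gai_codes) / sizeof(k_lwip_gai_codes[0]);
    for (size_t i = 0; i < n; i++) {
        if (k_lwip_gai_codes[i].code == code) {
            return k_lwip_gai_codes[i].name;
        }
    }
    return "UNKNOWN";
}

/* Hypothetical helper: pull the numeric code out of an esp-tls log line.
 * Returns -1 if the line does not contain the expected marker. */
int parse_gai_code(const char *log_line)
{
    static const char marker[] = "getaddrinfo() returns ";
    assert(log_line != NULL);
    const char *p = strstr(log_line, marker);
    if (p == NULL) {
        return -1;
    }
    return (int)strtol(p + strlen(marker), NULL, 10);
}
```

Applied to the esp-tls line above, parse_gai_code gives 202 and lwip_gai_code_name gives EAI_FAIL. That would be consistent with the DNS lookup failing because the Ethernet interface had already stopped passing traffic, rather than with a problem in the TLS or MQTT layers.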

kostaond commented 1 year ago

A frame length of 39 would indicate a runt frame (an Ethernet frame shorter than 60 bytes). However, the W5500 should not forward this type of corrupted frame. Are you able to ping the device when this error occurs?
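
To make the 60-byte threshold concrete, here is a small host-side C sketch of the length check, assuming standard Ethernet II framing with the 4-byte FCS already stripped. The macro and function names are local to this sketch and are not taken from the W5500 driver, whose own checks may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Standard Ethernet II sizes, excluding the 4-byte FCS.
 * Names are local to this sketch, not driver definitions. */
#define SKETCH_ETH_HEADER_LEN      14u    /* dst MAC + src MAC + EtherType */
#define SKETCH_ETH_MIN_PAYLOAD_LEN 46u
#define SKETCH_ETH_MAX_PAYLOAD_LEN 1500u
#define SKETCH_ETH_MIN_FRAME_LEN   (SKETCH_ETH_HEADER_LEN + SKETCH_ETH_MIN_PAYLOAD_LEN)
#define SKETCH_ETH_MAX_FRAME_LEN   (SKETCH_ETH_HEADER_LEN + SKETCH_ETH_MAX_PAYLOAD_LEN)

static_assert(SKETCH_ETH_MIN_FRAME_LEN == 60u,
              "minimum frame without FCS is 60 bytes");

typedef enum {
    FRAME_OK,
    FRAME_RUNT,
    FRAME_OVERSIZED
} frame_class_t;

/* Classify a received frame by its length in bytes (FCS excluded). */
frame_class_t classify_frame_len(size_t len)
{
    if (len < SKETCH_ETH_MIN_FRAME_LEN) {
        return FRAME_RUNT;
    }
    if (len > SKETCH_ETH_MAX_FRAME_LEN) {
        return FRAME_OVERSIZED;
    }
    return FRAME_OK;
}
```

Under these assumptions, the reported length of 39 is classified as a runt, since even a header plus a minimally padded payload comes to 60 bytes.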

imammy-hacomono commented 1 year ago

Hi @kostaond, thank you for responding to my issue.

I have no idea why the corrupted frames are being forwarded.

I am using the esp-aws-iot library to set up an interface for Ethernet communication and perform MQTT communication. However, I don't have much knowledge about low-level implementations, and I am really struggling with the issue.

https://github.com/espressif/esp-aws-iot/blob/1fc7681778bc271960a4e3db514a209df0380917/examples/ota/ota_http/main/ota_demo_core_http.c#L1051C13-L1051C13

I will test whether the device still responds to ping. By the way, do you have any suggestions for a possible solution?

kostaond commented 1 year ago

Unfortunately, I don't have any possible solution yet since I have no idea what's going on. We need to reproduce the issue first or isolate the issue. I'll need your help with that.

Do you always observe the same w5500 error? Do you observe other SPI communication errors? Would you please provide steps and, ideally, a minimal code example to reproduce the issue?

imammy-hacomono commented 1 year ago

@kostaond

I'm sorry, providing the code is a bit difficult. However, I'll do my best to assist you with what I can. I encounter the same W5500 error every time:

E (8542591) w5500.mac: emac_w5500_alloc_recv_buf(551): invalid frame length 39

The procedure is as follows.

Using the esp-aws-iot library, connect to the MQTT broker and leave the connection idle. (The library appears to use MQTT Ping to keep the connection from being dropped by the Keep Alive timeout.) In more detail: after completing the W5500 initialization, confirming the connection to the network, and configuring the certificates and so on, set the TLS communication functions as the transport interface of the coreMQTT library and start communication.

TransportInterface_t transport = { 0 };
transport.send = espTlsTransportSend;
transport.recv = espTlsTransportRecv;

/* Initialize the MQTT library. */
mqttStatus = MQTT_Init(pMqttContext,
                       &transport,
                       Clock_GetTimeMs,
                       mqttEventCallback,
                       &networkBuffer);

Also, the esp-idf v5.1 release includes some important fixes related to Ethernet. After changing the IDF version from v5.0.1 to v5.1, the number of errors seems to have decreased significantly. The test has only been running for a short time so far, but it might be worth trying. You can find the release notes here: https://github.com/espressif/esp-idf/releases/tag/v5.1

Added

Added Multi-interface VLAN support example 

Changed

Simplified Ethernet examples initialization
Updated DM9051 configuration to receive multicast packets
Report error if chip version is not expected.

Fixed

Reduced possibility of "no mem for receive buffer" error.
Fixed clearing of incorrect registry prior servicing the interrupt in ENC28J60 
Fixed issue when DM9051 was stopped, it was not properly started and there was no Ethernet communication 
Fixed issue when ESP32 EMAC could hang when stopped/started multiple times at 10Mbps speed mode 

Removed

Removed -Wno-format in affected examples, changed type specifiers to match types which are logged in protocols, Ethernet, network

Unfortunately, another issue has occurred where an assertion is triggered in esp_bignum. I have opened an issue for this problem (https://github.com/espressif/esp-idf/issues/12003).

kostaond commented 1 year ago

Are you able to ping the device when this error occurs?

imammy-hacomono commented 1 year ago

@kostaond

E (404317) w5500.mac: emac_w5500_alloc_recv_buf(535): invalid frame length 10

When this error occurs, the device no longer responds to ping!

kostaond commented 1 year ago

@imammy-hacomono you closed the issue. Have you found the root cause? If so, could you please share it with us?

imammy-hacomono commented 1 year ago

@kostaond

After setting the SPI clock speed to 13 MHz, the problem no longer occurs and runt frames no longer appear. (Actually, I would like to raise the clock a little more.) It is difficult for me to go any deeper than this, so I would like to leave it to Espressif to determine the cause of the problem.

Thanks for the support!

spi_device_interface_config_t spi_devcfg = {
    .command_bits = 16, // Actually it's the address phase in W5500 SPI frame
    .address_bits = 8,  // Actually it's the control phase in W5500 SPI frame
    .spics_io_num = GPIO_NUM_5,
    .mode = 0,
    .clock_speed_hz = SPI_MASTER_FREQ_13M,
    .queue_size = 20
};

kostaond commented 1 year ago

@imammy-hacomono thank you for the info. I'll keep this issue in mind. However, it's hard to draw any conclusion when we haven't been able to reproduce it :/

Note, however, that the SPI clock may not actually be set to exactly 13 MHz. Please check https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/peripherals/spi_master.html#spi-clock-frequency
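
As a rough illustration of why the requested and actual frequencies can differ, the host-side C sketch below assumes the SPI clock is derived from an 80 MHz source through a plain integer divider. That is a simplification: the real hardware restricts which dividers are available, and which neighbor the driver picks is its own decision. The function name spi_neighbor_freqs and the struct are hypothetical and are not part of the ESP-IDF API.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t below_hz;  /* highest divided frequency <= requested */
    uint32_t above_hz;  /* lowest divided frequency >= requested,
                           capped at source_hz */
} spi_freq_bounds_t;

/* Hypothetical helper: given a source clock and a requested frequency,
 * return the two nearest frequencies reachable with an integer divider. */
spi_freq_bounds_t spi_neighbor_freqs(uint32_t source_hz, uint32_t requested_hz)
{
    assert(source_hz > 0 && requested_hz > 0);

    /* Largest divider whose output is still >= requested (at least 1). */
    uint32_t div_down = source_hz / requested_hz;
    if (div_down == 0) {
        div_down = 1;
    }

    /* Smallest divider whose output is <= requested (ceiling division). */
    uint32_t div_up =
        (uint32_t)(((uint64_t)source_hz + requested_hz - 1) / requested_hz);

    spi_freq_bounds_t bounds = {
        .below_hz = source_hz / div_up,
        .above_hz = source_hz / div_down,
    };
    return bounds;
}
```

With an 80 MHz source and a request of exactly 13,000,000 Hz, the neighbors are 11,428,571 Hz (divide by 7) and 13,333,333 Hz (divide by 6), so a "13 MHz" setting cannot be met exactly under this model. On the device itself, I believe spi_device_get_actual_freq() from the SPI master driver reports the frequency that was really configured, but please confirm that against the documentation linked above.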