celestiaorg / celestia-node

Celestia Data Availability Nodes
Apache License 2.0
919 stars 917 forks source link

DNS TTL Not Respected by celestia-node Leading to Sync Issues #3570

Open smuu opened 1 month ago

smuu commented 1 month ago

Description:

We encountered an issue where changes to the DNS entries of DA nodes in Arabica caused light nodes to fail to sync. Restarting the light nodes resolved the issue, indicating that they resolve DNS once at startup and then use the same IP address indefinitely, ignoring DNS TTL.

Steps to Reproduce:

  1. Change the DNS entries for nodes.
  2. Observe light nodes failing to sync.
  3. Restart the light nodes and observe they can sync.

Suspected Cause: Light nodes resolve DNS entries only once at startup and continue using the same IP address without respecting the TTL. This affects both:

Relevant Code:

Potential Fix:

  1. Periodically re-resolve DNS entries based on the TTL.
  2. Update active connections if the resolved IP address changes.

Repositories Potentially Needing Changes:

Impact: Not respecting DNS TTL can lead to connectivity and sync issues, affecting network reliability.

Request for Assistance:

  1. Identify where DNS resolution is handled in the codebase and dependencies.
  2. Implement periodic DNS resolution based on TTL.
  3. Test changes to ensure nodes dynamically update connections based on DNS updates.
ramin commented 1 month ago

i think simplest here is to remove https://github.com/celestiaorg/celestia-node/blob/main/libs/utils/address.go#L40 which resolves the IP a single time on start, instead letting clients use the domain as passed in, and relying on the infra of the internet to work

unless there was there good a reason we HAD to resolve IP?

smuu commented 1 month ago

i think simplest here is to remove https://github.com/celestiaorg/celestia-node/blob/main/libs/utils/address.go#L40 which resolves the IP a single time on start, instead letting clients use the domain as passed in, and relying on the infra of the internet to work

unless there was there good a reason we HAD to resolve IP?

If I understand the code correctly, this would only resolve one part of the issue: the connection between the DA BN and the consensus node. From my understanding, this code is not called when resolving the DNS in a multiaddr.

smuu commented 1 month ago

One workaround for this issue would be to recreate the connection once it fails after the IP address changes. This way, we don't need to add support to handle the DNS TTL, and the node would request the new IP address from the DNS server.

renaynay commented 1 month ago

What's status on this @ramin @smuu ?