celestiaorg / celestia-core

A fork of CometBFT
Apache License 2.0

DNS TTL Not Respected by celestia-core Leading to Potential Future Sync Issues #1425

Closed (smuu closed 1 month ago)

smuu commented 2 months ago

Description:

We identified a potential issue: celestia-core does not pick up changes to the DNS entries of nodes, similar to the issue observed in celestia-node. We have not yet encountered sync issues due to this behavior, but it could become a problem in the future. Nodes appear to resolve DNS entries only once at startup and then keep using the same IP address indefinitely, ignoring the DNS TTL.

Steps to Reproduce:

  1. Change the DNS entries for nodes.
  2. Observe that nodes continue to use the old IP address without re-resolving the DNS entries.

Suspected Cause: Nodes resolve DNS entries only once at startup and continue using the same IP address without respecting the TTL. This could affect:

Relevant Code:

Potential Fix:

  1. Periodically re-resolve DNS entries based on the TTL.
  2. Update active connections if the resolved IP address changes.
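
A minimal Go sketch of the two steps above, not taken from celestia-core. One caveat: Go's standard `net` resolver does not expose record TTLs, so this sketch re-resolves on a fixed interval as a stand-in for the TTL. A TTL-aware version would need a DNS client library. The names `lookupFunc`, `addrsChanged`, and `watchHost` are hypothetical.

```go
package main

import (
	"context"
	"sort"
	"strings"
	"time"
)

// lookupFunc has the same signature as (*net.Resolver).LookupHost, so the
// real resolver or a fake one can be passed in. Hypothetical name.
type lookupFunc func(ctx context.Context, host string) ([]string, error)

// addrsChanged reports whether two resolved address sets differ,
// ignoring the order in which the resolver returned them.
func addrsChanged(prev, next []string) bool {
	a := append([]string(nil), prev...)
	b := append([]string(nil), next...)
	sort.Strings(a)
	sort.Strings(b)
	return strings.Join(a, ",") != strings.Join(b, ",")
}

// watchHost resolves host immediately and then once per interval. It calls
// onChange with the new addresses whenever the resolved set differs from the
// previous one (this includes the first successful lookup). Lookup errors
// are skipped and retried on the next tick. It returns when ctx is cancelled.
func watchHost(ctx context.Context, lookup lookupFunc, host string,
	interval time.Duration, onChange func(addrs []string)) {
	var last []string
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		addrs, err := lookup(ctx, host)
		if err == nil && addrsChanged(last, addrs) {
			last = addrs
			onChange(addrs) // assumption: the caller re-dials the peer here
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```

Passing `net.DefaultResolver.LookupHost` as `lookup` uses the system resolver. How `onChange` replaces an active peer connection depends on the p2p layer and is left open here.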

Repositories Potentially Needing Changes:

Impact: Not respecting DNS TTL can lead to potential connectivity and sync issues, affecting network reliability in the future.

Request for Assistance:

  1. Implement periodic DNS resolution based on TTL.
  2. Test changes to ensure nodes dynamically update connections based on DNS updates.

smuu commented 2 months ago

For reference, the corresponding issue in celestia-node: https://github.com/celestiaorg/celestia-node/issues/3570

smuu commented 2 months ago

One workaround for this issue would be to recreate the connection once it fails after the IP address changes. This way, we don't need to add support to handle the DNS TTL, and the node would request the new IP address from the DNS server.

cmwaters commented 2 months ago

Yeah, we could add code such that if the connection fails, it tries to resolve the DNS again to see whether the IP has changed and, if so, reconnects to the peer. Ideally we'd only try to reconnect if the error was specifically a network connection error and not the result of malicious behaviour from the peer.
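
The decision described here could be sketched in Go roughly as follows. This is an illustration under assumptions, not celestia-core code. It treats any error satisfying the standard `net.Error` interface as a network failure, and a real implementation would likely need a finer classification. The names `errPeerMisbehaved`, `isNetworkError`, `nextDialAddr`, and `exampleResetErr` are hypothetical, and the port in the usage note is only an example value.

```go
package main

import (
	"errors"
	"net"
	"syscall"
)

// errPeerMisbehaved is a hypothetical sentinel for protocol-level failures
// (bad handshake, invalid message) that should NOT trigger a re-dial.
var errPeerMisbehaved = errors.New("peer misbehaved")

// isNetworkError reports whether err, or any error it wraps, implements
// net.Error (for example *net.OpError for timeouts, resets, and refusals).
func isNetworkError(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr)
}

// nextDialAddr decides whether to reconnect after a connection failure.
// resolved is the result of a fresh DNS lookup for the peer's hostname.
// It returns "ip:port" for the first resolved IP that differs from lastIP,
// or "" if the failure was not network-related or DNS still points at lastIP.
func nextDialAddr(cause error, lastIP, port string, resolved []string) string {
	if !isNetworkError(cause) {
		return ""
	}
	for _, ip := range resolved {
		if ip != lastIP {
			return net.JoinHostPort(ip, port)
		}
	}
	return ""
}

// exampleResetErr builds the kind of error a read on a dead TCP connection
// typically returns. It exists only to exercise the sketch.
func exampleResetErr() error {
	return &net.OpError{Op: "read", Net: "tcp", Err: syscall.ECONNRESET}
}
```

For instance, `nextDialAddr(exampleResetErr(), "10.0.0.1", "26656", []string{"10.0.0.2"})` yields a new dial address, while passing `errPeerMisbehaved` as the cause yields an empty string, so a misbehaving peer is not automatically re-dialed.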