celestiaorg / celestia-node

Celestia Data Availability Nodes
Apache License 2.0

docker: Potential issues with bootstrapping when running node in docker on macos #3533

Open vlad2095 opened 4 days ago

vlad2095 commented 4 days ago

Celestia Node version

0308aea76857ff27484946fce99004ebf10a3cb8 / v0.13.7

OS

ghcr.io/celestiaorg/celestia-node container on MacOS M1.

Install tools

docker/podman

Others

No response

Steps to reproduce it

celestia light init --p2p.network celestia --core.ip rpc.celestia.pops.one
celestia light start --p2p.network celestia --core.ip rpc.celestia.pops.one

Expected result

successfully running light node

Actual result

Error: node: failed to start: header/p2p: failed to open a new stream: failed to dial: failed to dial 12D3KooWQpuTFELgsUypqp9N4a1rKBccmrmQVY8Em9yhqppTJcXf: all dials failed

Relevant log output

2024-06-18T06:08:16.254Z        INFO    node    nodebuilder/module.go:26        Accessing keyring...
2024-06-18T06:08:16.275Z        INFO    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/levels.go:171   All 1 tables opened in 1ms
2024-06-18T06:08:16.276Z        INFO    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/discard.go:66   Discard stats nextEmptySlot: 1
2024-06-18T06:08:16.276Z        INFO    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/db.go:368       Set nextTxnTs to 143
2024-06-18T06:08:17.150Z        INFO    pidstore        pidstore/pidstore.go:67 Loaded peers from disk  {"amount": 0}
2024-06-18T06:08:17.150Z        INFO    module/header   header/config.go:66     No trusted peers in config, initializing with default bootstrappers as trusted peers
2024-06-18T06:08:17.170Z        INFO    header/p2p      p2p/subscriber.go:81    joining topic   {"topic ID": "/celestia/header-sub/v0.0.1"}
2024-06-18T06:08:17.170Z        INFO    header/p2p      p2p/exchange.go:99      client: starting client {"protocol ID": "/celestia/header-ex/v0.0.3"}
2024-06-18T06:08:17.170Z        INFO    pidstore        pidstore/pidstore.go:67 Loaded peers from disk  {"amount": 0}
2024-06-18T06:08:17.468Z        INFO    pidstore        pidstore/pidstore.go:84 Persisted peers successfully    {"amount": 0}
2024-06-18T06:08:17.468Z        WARN    header/p2p      p2p/subscriber.go:92    unregistering validator: no validator for topic /celestia/header-sub/v0.0.1
2024-06-18T06:08:17.468Z        INFO    basichost       basic/natmgr.go:112     DiscoverNAT error:no NAT found
2024-06-18T06:08:17.469Z        INFO    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/db.go:546       Lifetime L0 stalled for: 0s

2024-06-18T06:08:17.476Z        INFO    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/db.go:625       
Level 0 [ ]: NumTables: 00. Size: 0 B of 0 B. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 16 MiB
Level 1 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 2 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 3 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 4 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 5 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level 6 [B]: NumTables: 01. Size: 3.5 KiB of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Level Done
Error: node: failed to start: header/p2p: failed to open a new stream: failed to dial: failed to dial 12D3KooWQpuTFELgsUypqp9N4a1rKBccmrmQVY8Em9yhqppTJcXf: all dials failed
  * [/ip4/163.172.132.189/tcp/2121] dial backoff

Notes

Whenever I try to run a light node on any of the p2p networks, it fails to connect via p2p.

However, it seems to be a network-related issue, because when I switched to a mobile internet provider the node started successfully. I remember having a similar issue in one of my projects, where we solved it by increasing the p2p timeout a bit.

Is it possible to configure celestia with an increased p2p timeout, or to solve the issue in some other way?

renaynay commented 4 days ago

@vlad2095 what network are you trying to connect to? if it's mainnet, you do not have to pass the --p2p.network flag, or if you want to, you can pass --p2p.network mainnet

vlad2095 commented 4 days ago

Hi @renaynay I have the same results

Is there a way to increase the p2p timeout?

vlad2095 commented 4 days ago

Building and running the node manually from this repo succeeded:

make build
./build/celestia light init --p2p.network celestia --core.ip rpc.celestia.pops.one
./build/celestia light start --p2p.network celestia --core.ip rpc.celestia.pops.one

Output:

Started celestia DA node 
node version:   v0.14.0-14-g1fa5aa6a
node type:      light
network:        celestia

/_____/  /_____/  /_____/  /_____/  /_____/ 

The p2p host is listening on:
*  /ip4/79.110.128.121/tcp/36799/p2p/12D3KooWJasWutWWCjKxiuJSg3Zg47VnvGrU8rNgmJEhyARAvVyD
2024-06-27T15:33:08.031+0300    INFO    das     das/worker.go:94        finished sampling headers       {"type": "catchup", "from": 1, "to": 1, "errors": 0, "# of headers skipped as outside of sampling window": 1, "finished (s)": 0.000020708}
*  /ip4/79.110.128.121/udp/36799/quic-v1/p2p/12D3KooWJasWutWWCjKxiuJSg3Zg47VnvGrU8rNgmJEhyARAvVyD
*  /ip4/79.110.128.121/udp/36799/quic-v1/webtransport/certhash/uEiBJc3-yrgTAxE_cxd25ZXMRu4x1GN-A0ZxudceRSoTmew/certhash/uEiDsXeQEqY3k3fWoFW3_JLwyV2TmK-R-J4z9zHN21eNSaA/p2p/12D3KooWJasWutWWCjKxiuJSg3Zg47VnvGrU8rNgmJEhyARAvVyD
*  /ip4/127.0.0.1/udp/2121/quic-v1/webtransport/certhash/uEiBJc3-yrgTAxE_cxd25ZXMRu4x1GN-A0ZxudceRSoTmew/certhash/uEiDsXeQEqY3k3fWoFW3_JLwyV2TmK-R-J4z9zHN21eNSaA/p2p/12D3KooWJasWutWWCjKxiuJSg3Zg47VnvGrU8rNgmJEhyARAvVyD
*  /ip4/192.168.50.254/tcp/2121/p2p/12D3KooWJasWutWWCjKxiuJSg3Zg47VnvGrU8rNgmJEhyARAvVyD
*  /ip4/192.168.50.254/udp/2121/quic-v1/p2p/12D3KooWJasWutWWCjKxiuJSg3Zg47VnvGrU8rNgmJEhyARAvVyD
*  /ip4/192.168.50.254/udp/2121/quic-v1/webtransport/certhash/uEiBJc3-yrgTAxE_cxd25ZXMRu4x1GN-A0ZxudceRSoTmew/certhash/uEiDsXeQEqY3k3fWoFW3_JLwyV2TmK-R-J4z9zHN21eNSaA/p2p/12D3KooWJasWutWWCjKxiuJSg3Zg47VnvGrU8rNgmJEhyARAvVyD
*  /ip6/::1/tcp/2121/p2p/12D3KooWJasWutWWCjKxiuJSg3Zg47VnvGrU8rNgmJEhyARAvVyD
*  /ip6/::1/udp/2121/quic-v1/p2p/12D3KooWJasWutWWCjKxiuJSg3Zg47VnvGrU8rNgmJEhyARAvVyD
*  /ip6/::1/udp/2121/quic-v1/webtransport/certhash/uEiBJc3-yrgTAxE_cxd25ZXMRu4x1GN-A0ZxudceRSoTmew/certhash/uEiDsXeQEqY3k3fWoFW3_JLwyV2TmK-R-J4z9zHN21eNSaA/p2p/12D3KooWJasWutWWCjKxiuJSg3Zg47VnvGrU8rNgmJEhyARAvVyD

2024-06-27T15:33:12.495+0300    INFO    share/discovery discovery/discovery.go:325      discovered wanted peers {"topic": "full", "amount": 5}
renaynay commented 4 days ago

hey @vlad2095 my apologies, celestia is actually a valid input for --p2p.network

Interesting -- there might be some issues with reachability running the node in docker on Mac. We will investigate, thanks.

vlad2095 commented 4 days ago

@renaynay it's not exactly related to Mac. I thought so too, but when I switched my internet connection to mobile 4G I was able to connect. Also, my friend in Lisbon didn't have this issue in the first place while running the node the same way in Docker on an M1.

Most likely related to p2p timeout.

renaynay commented 4 days ago

Thanks for the info @vlad2095

vlad2095 commented 4 days ago

@renaynay I just pulled the celestia-node repo and played with it a bit. With the default config it ran successfully on my Mac 👍
But when I changed it by adding a libp2p timeout option below nodebuilder/p2p/host.go#L85

setting libp2p.WithDialTimeout(time.Microsecond * 100)

I get the same result as I did inside Docker


Increasing it to 500 milliseconds fixed the issue


So I guess Docker introduces some delays. Is it possible to make this parameter configurable with a flag?

Wondertan commented 4 days ago

@vlad2095, that must be a flake. The default setting for this timeout is 15 seconds, so setting 500ms decreases the timeout.
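
To make the comparison concrete, here is a minimal sketch that only compares the three durations mentioned in this thread. The constant names are illustrative and are not identifiers from celestia-node or go-libp2p; the 15-second value is simply the default quoted above.

```go
package main

import (
	"fmt"
	"time"
)

// Durations quoted in this thread. The names are hypothetical labels,
// not identifiers from celestia-node or go-libp2p.
const (
	failingTimeout     = time.Microsecond * 100 // value that reproduced the failure
	workingTimeout     = 500 * time.Millisecond // value that "fixed" it
	defaultDialTimeout = 15 * time.Second       // default stated in the comment above
)

// ratio reports how many times longer b is than a.
func ratio(a, b time.Duration) float64 {
	return float64(b) / float64(a)
}

func main() {
	fmt.Println("failing:", failingTimeout)
	fmt.Println("working:", workingTimeout)
	fmt.Println("default:", defaultDialTimeout)
	fmt.Printf("default is %.0fx longer than the 500ms override\n",
		ratio(workingTimeout, defaultDialTimeout))
}
```

Both overrides are shorter than the default, so neither of them gives a dial more time than an unmodified node already allows.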

Wondertan commented 4 days ago

(I don't have any good ideas about what could be causing this issue for you. The fact that 4G works suggests that it is unlikely to be related to Docker)

Wondertan commented 4 days ago

@vlad2095, another thing you could try is increasing the StartupTimeout in the node config

vlad2095 commented 4 days ago

@Wondertan how would I increase StartupTimeout?

Yeah, and 4G in my region is actually slower than cable, so maybe it's a more "low-level" network issue. We tested in different locations on MacOS with Docker and got different results: it works in Portugal but fails in Poland and Ukraine.

Wondertan commented 4 days ago

how would I increase StartupTimeout?

In the config .celestia-light/config.toml

vlad2095 commented 4 days ago

unfortunately, doesn't work

Wondertan commented 4 days ago

Can you ping those IPs? At least to see that they are accessible from your network on IP/ICMP level
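
Since ping only exercises ICMP, a TCP-level check of the port from the error output may also be useful. Below is a minimal sketch using only the Go standard library; the `hostPort` helper and the 5-second timeout are assumptions made for illustration, and the address is the one from the "dial backoff" line in the log above. It does not use libp2p, so it only shows whether the TCP port accepts connections, not whether the p2p handshake succeeds.

```go
package main

import (
	"fmt"
	"net"
	"strings"
	"time"
)

// hostPort converts a simple "/ip4/<addr>/tcp/<port>" or
// "/ip6/<addr>/tcp/<port>" multiaddr string into "<addr>:<port>".
// It is a hypothetical helper for this sketch, not a libp2p API.
func hostPort(maddr string) (string, error) {
	parts := strings.Split(strings.Trim(maddr, "/"), "/")
	if len(parts) != 4 || (parts[0] != "ip4" && parts[0] != "ip6") || parts[2] != "tcp" {
		return "", fmt.Errorf("unsupported multiaddr: %q", maddr)
	}
	return net.JoinHostPort(parts[1], parts[3]), nil
}

func main() {
	// Address copied from the "dial backoff" line in the log output.
	addr, err := hostPort("/ip4/163.172.132.189/tcp/2121")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}

	// The 5-second timeout is an arbitrary choice for this check.
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		fmt.Println("TCP dial failed:", err)
		return
	}
	conn.Close()
	fmt.Println("TCP port reachable:", addr)
}
```

Running it both on the host and inside the container would show whether the two environments differ at the TCP level.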

vlad2095 commented 3 days ago

Looks like they are

smuu commented 3 days ago

Hello @vlad2095,

If this issue only happens when running the node in a docker container, please try the following:

Please let me know if the issue still persists after trying these things.