celestiaorg / celestia-node


Node store flag doesn't work #3450

Closed mindstyle85 closed 3 months ago

mindstyle85 commented 3 months ago

Celestia Node version

v0.13.6

OS

ubuntu

Install tools

No response

Others

No response

Steps to reproduce it

From Figment:

/home/celestia/mocha-4/bin/celestia-node-v0.13.6 bridge start --p2p.network mocha --node.store /home/celestia/mocha-4-bridge --node.config /home/celestia/mocha-4-bridge/config.toml --metrics --metrics.tls=false

but the logs show:

ERROR shrex/nd shrexnd/server.go:128 handling request {"source": "server", "peer": "12D3KooWJzT9VwXa2hKvrujBKgnTcdVfohfBxQitcRW5eJCEANeD", "namespace": "00000000000000000000000000000000000000a66006f4805643fb8f24", "hash": "81E8AF5CFC8931C50EEF6D23D1E922A50F187720B7448C713A656336B14E618C", "err": "retrieving DAH: eds/store: failed to get CAR file: failed to get accessor: failed to acquire shard: shard is in errored state; err: failed to acquire reader of mount so we can return the accessor: open /root/.celestia-bridge-mocha-4/blocks/81E8AF5CFC8931C50EEF6D23D1E922A50F187720B7448C713A656336B14E618C: permission denied"}

Expected result

The node uses the node.store path passed via the --node.store flag.

Actual result

n/a

Relevant log output

No response

Notes

No response

Liver-23 commented 3 months ago

I'm getting the same issue after restoring from a snapshot, even without an explicit --node.store; as I recall, I didn't see it before when the bridge was initialized and synced from zero. Adding the --node.store flag doesn't fix the problem.

I also tried putting WorkingDirectory= into the systemd unit, with no luck. It seems the snapshot carries some index of the original store path and keeps trying to use those files even when the store path is overridden.
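Roughly what I tried in systemd looked like the following; the unit name, user, binary path, and store path here are only illustrative guesses, not the exact setup:

# hypothetical drop-in override for a celestia-bridge systemd unit
sudo mkdir -p /etc/systemd/system/celestia-bridge.service.d
sudo tee /etc/systemd/system/celestia-bridge.service.d/override.conf <<'EOF'
[Service]
User=celestia
WorkingDirectory=/home/celestia/mocha-4-bridge
ExecStart=
ExecStart=/usr/local/bin/celestia bridge start --p2p.network mocha --node.store /home/celestia/mocha-4-bridge
EOF
sudo systemctl daemon-reload && sudo systemctl restart celestia-bridge

The node still resolved blocks against the old absolute path regardless of this override.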

node version:   v0.13.6
node type:      bridge
network:        mocha-4

maxzonder commented 3 months ago

Nodes synced from 0 height with a custom --node.store path are running without any problem. This only happens after restoring from a snapshot. Just reproduced:

  1. A full_storage node snapshot was taken from a node running from the default path /home/mzonder/.celestia-full-mocha-4/
  2. On another server a full_storage node was initialized with a custom path: celestia full init --p2p.network mocha-4 --node.store /home/mzonder/full-node/.celestia-full-mocha-4 ...
  3. The snapshot was restored
  4. The node was started: celestia full start --node.store /home/mzonder/full-node/.celestia-full-mocha-4 --node.config /home/mzonder/full-node/.celestia-full-mocha-4/config.toml
  5. The node successfully started catching up until the first ERROR share/eds eds/store.go:199 removing shard after failure appeared

Logs of this moment:

2024-05-30T14:18:17.743Z        INFO    header/store    store/store.go:375      new head        {"height": 1938724, "hash": "3A70C118301E105275522FE74BA4227FABFF415558B5D93AC26145D5A15E9E14"}
2024-05-30T14:18:17.763Z        INFO    das     das/worker.go:161       sampled header  {"type": "recent", "height": 1938724, "hash": "3A70C118301E105275522FE74BA4227FABFF415558B5D93AC26145D5A15E9E14", "square width": 32, "data root": "382787D61ADC29BDF867AE399BC6D68922F652E4C510F0BCF94E9F9BADF36B71", "finished (s)": 0.019159128}
2024-05-30T14:18:17.863Z        INFO    pidstore        pidstore/pidstore.go:84 Persisted peers successfully    {"amount": 87}
2024-05-30T14:18:22.779Z        INFO    bitswap-server  server/server.go:547    Bitswap Client ReceiveError: Application error 0x0 (remote)
2024-05-30T14:18:29.432Z        INFO    header/store    store/store.go:375      new head        {"height": 1938725, "hash": "EBC6904307023B55AFE0C56F3C3E135549FD6028FD0621C22173C213537036EE"}
2024-05-30T14:18:29.457Z        INFO    das     das/worker.go:161       sampled header  {"type": "recent", "height": 1938725, "hash": "EBC6904307023B55AFE0C56F3C3E135549FD6028FD0621C22173C213537036EE", "square width": 32, "data root": "6AEBBE03F041138B9C156E254256BA1EFE89EF82F5BA9D0617C0DA6E9C5D1857", "finished (s)": 0.025673256}
2024-05-30T14:18:31.937Z        INFO    das     das/store.go:60 stored checkpoint to disk: SampleFrom: 1938726, NetworkHead: 1938725
2024-05-30T14:18:37.976Z        INFO    bitswap-server  server/server.go:547    Bitswap Client ReceiveError: Application error 0x0 (remote)
2024-05-30T14:18:41.188Z        INFO    header/store    store/store.go:375      new head        {"height": 1938726, "hash": "3A2C01F72D6AFFE1B788584DEA45361678DFA8CE2A22B35126BA08CB7CBE6B9B"}
2024-05-30T14:18:41.280Z        INFO    das     das/worker.go:161       sampled header  {"type": "recent", "height": 1938726, "hash": "3A2C01F72D6AFFE1B788584DEA45361678DFA8CE2A22B35126BA08CB7CBE6B9B", "square width": 32, "data root": "223C1F1229C6B751CC503B1851A4F0CAC5ED8BC5C6DB0381628C32B5EA8A0C30", "finished (s)": 0.092379655}
2024-05-30T14:18:43.232Z        INFO    bitswap-server  server/server.go:547    Bitswap Client ReceiveError: Application error 0x0 (remote)
2024-05-30T14:18:46.267Z        INFO    bitswap-server  server/server.go:547    Bitswap Client ReceiveError: Application error 0x0 (remote)
2024-05-30T14:18:52.920Z        INFO    header/store    store/store.go:375      new head        {"height": 1938727, "hash": "EB455DB3CB3FEBC870C840EC62576D95AB6896A8D8A24E7516226E1BC44C8D56"}
2024-05-30T14:18:52.943Z        INFO    das     das/worker.go:161       sampled header  {"type": "recent", "height": 1938727, "hash": "EB455DB3CB3FEBC870C840EC62576D95AB6896A8D8A24E7516226E1BC44C8D56", "square width": 32, "data root": "0958396F6244A7A2D6471620816AB48A8E92A45FC714D20452DB9A0727D9D69E", "finished (s)": 0.022327369}
2024-05-30T14:18:54.234Z        INFO    bitswap-server  server/server.go:547    Bitswap Client ReceiveError: Application error 0x0 (remote)
2024-05-30T14:19:03.382Z        INFO    bitswap-server  server/server.go:547    Bitswap Client ReceiveError: Application error 0x0 (remote)
2024-05-30T14:19:04.606Z        INFO    header/store    store/store.go:375      new head        {"height": 1938728, "hash": "A3F719895405149824135AAA8F4539710F9ED70572E13887EB7BC05253A75675"}
2024-05-30T14:19:04.787Z        INFO    das     das/worker.go:161       sampled header  {"type": "recent", "height": 1938728, "hash": "A3F719895405149824135AAA8F4539710F9ED70572E13887EB7BC05253A75675", "square width": 64, "data root": "FB0D0BDA1ACD289A4DE9C3552245DC38B7D05FE634FF6F42AFECCC0815A046AC", "finished (s)": 0.180223214}
2024-05-30T14:19:12.989Z        WARN    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/db.go:460       Block cache might be too small. Metrics: hit: 2159 miss: 50481648 keys-added: 3044 keys-updated: 129 keys-evicted: 3024 cost-added: 42162384969 cost-evicted: 41894938305 sets-dropped: 2 sets-rejected: 1278 gets-dropped: 16911552 gets-kept: 33534144 gets-total: 50483807 hit-ratio: 0.00
2024-05-30T14:19:12.989Z        WARN    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/db.go:461       Cache life expectancy (in seconds):
 -- Histogram:
Min value: 0
Max value: 344
Count: 2516
50p: 2.00
75p: 2.00
90p: 16.00
[0, 2) 2177 86.53% 86.53%
[2, 4) 15 0.60% 87.12%
[4, 8) 8 0.32% 87.44%
[8, 16) 271 10.77% 98.21%
[16, 32) 27 1.07% 99.28%
[32, 64) 13 0.52% 99.80%
[128, 256) 2 0.08% 99.88%
[256, 512) 3 0.12% 100.00%
 --
2024-05-30T14:19:16.224Z        INFO    header/store    store/store.go:375      new head        {"height": 1938729, "hash": "31749D16D9E6E1306512E2C305911FB29E2B55C821CFFE465CBAB0DAE08D889E"}
2024-05-30T14:19:16.248Z        INFO    das     das/worker.go:161       sampled header  {"type": "recent", "height": 1938729, "hash": "31749D16D9E6E1306512E2C305911FB29E2B55C821CFFE465CBAB0DAE08D889E", "square width": 32, "data root": "0660AC93E2805B625B503F091640DCBBC89C0CC624868C51A6C40EF7AE6B56D3", "finished (s)": 0.023504637}
2024-05-30T14:19:22.730Z        INFO    bitswap-server  server/server.go:547    Bitswap Client ReceiveError: Application error 0x0 (remote)
2024-05-30T14:19:28.349Z        INFO    header/store    store/store.go:375      new head        {"height": 1938730, "hash": "81B492B99429DB9FD386033D1D1265B60C598495B9BF8521392F7C2D0262E8A3"}
2024-05-30T14:19:28.433Z        INFO    das     das/worker.go:161       sampled header  {"type": "recent", "height": 1938730, "hash": "81B492B99429DB9FD386033D1D1265B60C598495B9BF8521392F7C2D0262E8A3", "square width": 64, "data root": "61C3EF7F399008C4D046434F0428227246057F923B4BF88103ACC38791D77E2D", "finished (s)": 0.084622077}
2024-05-30T14:19:30.631Z        WARN    dagstore        dagstore@v0.0.0-20230824094345-537c012aa403/dagstore_async.go:39        acquire: failed to fetch from mount upgrader    {"shard": "39F0AB4F339363D96607AC9F3F4A84198982BC392617DB9943EED3C131CBA6A9", "error": "open /home/mzonder/.celestia-full-mocha-4/blocks/39F0AB4F339363D96607AC9F3F4A84198982BC392617DB9943EED3C131CBA6A9: no such file or directory"}
2024-05-30T14:19:30.632Z        ERROR   share/eds       eds/store.go:199        removing shard after failure    {"key": "39F0AB4F339363D96607AC9F3F4A84198982BC392617DB9943EED3C131CBA6A9", "err": "failed to acquire reader of mount so we can return the accessor: open /home/mzonder/.celestia-full-mocha-4/blocks/39F0AB4F339363D96607AC9F3F4A84198982BC392617DB9943EED3C131CBA6A9: no such file or directory"}
2024-05-30T14:19:30.632Z        ERROR   shrex/eds       shrexeds/server.go:120  server: get CAR {"peer": "12D3KooWFKs93f3KGuLoAGrGVXNvRvsQgj4M9bmNzkM3EReXLa7w", "hash": "39F0AB4F339363D96607AC9F3F4A84198982BC392617DB9943EED3C131CBA6A9", "err": "failed to get accessor: failed to acquire shard: open /home/mzonder/.celestia-full-mocha-4/blocks/39F0AB4F339363D96607AC9F3F4A84198982BC392617DB9943EED3C131CBA6A9: no such file or directory"}
2024-05-30T14:19:39.999Z        INFO    header/store    store/store.go:375      new head        {"height": 1938731, "hash": "EDED983A73058DF8139EBE29B2E9141CF43B341319D9D46AF4D0D422D427765B"}
2024-05-30T14:19:40.188Z        INFO    das     das/worker.go:161       sampled header  {"type": "recent", "height": 1938731, "hash": "EDED983A73058DF8139EBE29B2E9141CF43B341319D9D46AF4D0D422D427765B", "square width": 32, "data root": "B52CA4208F5982222F10B745392C608C879266E0D0EE503C6500C519A3278A3B", "finished (s)": 0.188342383}
2024-05-30T14:19:45.503Z        WARN    dagstore        dagstore@v0.0.0-20230824094345-537c012aa403/dagstore_async.go:39        acquire: failed to fetch from mount upgrader    {"shard": "50315DB9C299434389571117D20AEE205CA2D8122C174823FCF0B52F9F73E070", "error": "open /home/mzonder/.celestia-full-mocha-4/blocks/50315DB9C299434389571117D20AEE205CA2D8122C174823FCF0B52F9F73E070: no such file or directory"}
2024-05-30T14:19:45.504Z        ERROR   shrex/eds       shrexeds/server.go:120  server: get CAR {"peer": "12D3KooWFKs93f3KGuLoAGrGVXNvRvsQgj4M9bmNzkM3EReXLa7w", "hash": "50315DB9C299434389571117D20AEE205CA2D8122C174823FCF0B52F9F73E070", "err": "failed to get accessor: failed to acquire shard: open /home/mzonder/.celestia-full-mocha-4/blocks/50315DB9C299434389571117D20AEE205CA2D8122C174823FCF0B52F9F73E070: no such file or directory"}
2024-05-30T14:19:45.504Z        ERROR   share/eds       eds/store.go:199        removing shard after failure    {"key": "50315DB9C299434389571117D20AEE205CA2D8122C174823FCF0B52F9F73E070", "err": "failed to acquire reader of mount so we can return the accessor: open /home/mzonder/.celestia-full-mocha-4/blocks/50315DB9C299434389571117D20AEE205CA2D8122C174823FCF0B52F9F73E070: no such file or directory"}
2024-05-30T14:19:46.323Z        INFO    bitswap-server  server/server.go:547    Bitswap Client ReceiveError: Application error 0x0 (remote)
2024-05-30T14:19:52.035Z        INFO    header/store    store/store.go:375      new head        {"height": 1938732, "hash": "72950186E617A221DE5781A9C4470C85B999021E75ADD41B7AC4EED825612BE7"}
2024-05-30T14:19:52.058Z        INFO    das     das/worker.go:161       sampled header  {"type": "recent", "height": 1938732, "hash": "72950186E617A221DE5781A9C4470C85B999021E75ADD41B7AC4EED825612BE7", "square width": 32, "data root": "B74A96F1ECFD741DA66CA805995D8A543D49985A1BEBD2CACB84FC5AEB01D9D1", "finished (s)": 0.023267012}
2024-05-30T14:19:53.232Z        INFO    bitswap-server  server/server.go:547    Bitswap Client ReceiveError: Application error 0x0 (remote)
2024-05-30T14:19:56.267Z        INFO    bitswap-server  server/server.go:547    Bitswap Client ReceiveError: Application error 0x0 (remote)
2024-05-30T14:19:57.976Z        INFO    bitswap-server  server/server.go:547    Bitswap Client ReceiveError: Application error 0x0 (remote)
2024-05-30T14:20:03.743Z        INFO    header/store    store/store.go:375      new head        {"height": 1938733, "hash": "558BDEF3F578589BA1E234B390D895839E8241B83DDA4BB9905BEA43B60BCB9A"}

Liver-23 commented 3 months ago

The funny thing is that after applying the snapshot you can run the node with the store on a different path; but then, if you move the store back to the same path used in the snapshot, you get the same errors for all the blocks you already synced with the new store. Each block records its location, and that cannot be overridden by the --node.store flag.
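A crude way to see this on a restored store, assuming the shard index lives under <node-store>/index and stores mount paths as plain strings (both assumptions on my side), is to grep for the old absolute path:

# look for the snapshot's original absolute path inside the restored store's index
# paths are placeholders; this may find nothing if the index values are compressed
grep -ral "/root/.celestia-bridge-mocha-4" </path/to/new/node-store>/index

If the old path shows up there, it would explain why --node.store alone cannot redirect reads for already-indexed blocks.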

maxzonder commented 3 months ago

Restored the snapshot to the identical path (repeated all the steps above except changing the path) and it works fine.

RdtBui commented 3 months ago

@Liver-23 Does yours default to /root/.celestia-bridge-mocha-4/... too?

Liver-23 commented 3 months ago

Yes, I used this snapshot: https://kingsuper.org/snapshots/celestia-bridge-testnet. Curious whether @maxzonder used the same one or another that has the same default dir; it looks like the data saves the path, and even though you want to use a different one, it goes to wherever it was initially stored on the original node.

maxzonder commented 3 months ago

I used my own snapshot, so mine is stuck at /home/mzonder/..

Liver-23 commented 3 months ago

Yup, I tried to sync my own, then moved the store to a different location and reconfigured for it, and then started seeing those messages whenever the node tries to find previously synced blocks in the old location.

ramin commented 3 months ago

@Liver-23 @mindstyle85 @maxzonder is this new behavior since 0.13.6 or something that maybe existed for longer and is just discovered now?

mindstyle85 commented 3 months ago

I haven't seen reports of this before, but I'm not sure, since we never used a snapshot or a different directory ourselves.

Liver-23 commented 3 months ago

I joined with 0.13.6 already, so I can't be sure.

ramin commented 3 months ago

Thanks for the info @Liver-23 and @mindstyle85, I'm going to start looking into this now.

maxzonder commented 3 months ago

I think I had the same issue on 0.13.1.

ramin commented 3 months ago

OK, a few things here:

1) I've confirmed that syncing a bridge node and then moving the store to a new path does mean the new --node.store path is seemingly "ignored". Excellent find and thanks for reporting. No one had run into this before 😸

2) This does not appear to be caused by improper handling of the node.store param, but is a side effect of one of the EDS datastores indexing absolute paths (it appears to be the dagstore index): https://github.com/celestiaorg/celestia-node/blob/main/share/eds/store.go#L103

3) I think we could, and normally would, seek to fix it. BUT, as it does not seem to be a quick and simple fix, and as we have a significant storage rewrite close to shipping that will remove this behavior entirely in the next two months (which @walldiss can confirm), I propose that we close this without fixing for now and make sure that moving the store location on the filesystem and recovering from a snapshot are tested and work properly in the new storage engine.

As an interim solution, it would seem that a symlink from the path the snapshot originally assumed to the new path would solve this issue acceptably for now?
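For the paths in the original report, a minimal sketch of that workaround could look like the following (assuming the snapshot was produced from the default /root/.celestia-bridge-mocha-4 location and the real store now lives at /home/celestia/mocha-4-bridge):

# place a symlink at the path baked into the snapshot, pointing at the real store
sudo ln -s /home/celestia/mocha-4-bridge /root/.celestia-bridge-mocha-4

If the node runs as a non-root user, permissions on /root also need to allow traversal; see the comment below for details.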

Liver-23 commented 3 months ago

Interesting possible workaround with the symlink, actually; curious if it will work (it should, I guess). If so, that will simplify things for those who need to restore from a snapshot quickly until the final fix.

clemensgg commented 3 months ago

Hi all, using a symlink works, but if you use a dedicated non-root user you might run into permission issues:

# create a symlink at the default path the snapshot was taken from, pointing at your actual store
sudo ln -s </path/to/your/.celestia-bridge> /root/.celestia-bridge

# ensure the symlink /root/.celestia-bridge itself is owned by the celestia user
sudo chown -h celestia:celestia /root/.celestia-bridge

Even with the symlink in place, celestia-bridge run by the celestia user fails to access the db:

celestia[173708]: 2024-06-15T07:02:54.607Z        ERROR        share/eds        eds/store.go:199        removing shard after failure        {"key": "F2150A832142748239CC18E51F9ADFF93CC5C30F5809A68BEC6A812315EEDBFC", "err": "failed to acquire reader of mount so we can return the accessor: open /root/.celestia-bridge/blocks/F2150A832142748239CC18E51F9ADFF93CC5C30F5809A68BEC6A812315EEDBFC: permission denied"}

An unsafe fix for this would be to grant execute permission on the whole /root directory.
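For reference, that unsafe variant would be something along these lines (not recommended, since it opens /root traversal to every user on the machine):

# NOT recommended: allow all users to traverse /root
sudo chmod o+x /root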

A better option is to use ACLs to grant the celestia user access to the /root directory specifically for the symlink without changing the global permissions of /root:

# Set ACL for the celestia user
sudo setfacl -m u:celestia:x /root

# Verify the ACL
getfacl /root
# file: root
# owner: root
# group: root
user::rwx
user:celestia:--x
group::--x
mask::--x
other::---

Restart celestia-bridge and observe logs:

celestia[174507]: 2024-06-15T07:25:48.146Z        INFO        header/store        store/store.go:375        new head        {"height": 1675259, "hash": "2AA94B4658DE2658ECCC102AB3A8C2D172874AA5AA4E78DF379FBE94B201FD83"}

ramin commented 3 months ago

Excellent! Yes, I meant to mention that, in addition to the path issue, there seemed to be a potential user permission issue as well; great explanation of your solution.

I'll ensure this snapshot issue does not persist (😂) with our new storage engine in the coming months.

lesnikutsa commented 5 days ago

Hey hey, after several months do we have a fix for this bug? Or is there still no fix except for symbolic links?

ramin commented 4 days ago

Hi @lesnikutsa, the new storage engine that resolves this is not yet released; you can follow progress here: https://github.com/celestiaorg/celestia-node/pull/3675

lesnikutsa commented 4 days ago

I am testing bridge snapshots in different directories but with one (root) user. I downloaded a snapshot created from the / directory (root user), unzipped it to the /bigdata directory (root user), and I don't see any global ERRORS (after syncing).
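Roughly how this test was set up; the archive name, compression, and the exact store directory under /bigdata are approximations, not the precise snapshot I used:

# snapshot taken from a bridge store under / (root user), restored to /bigdata
mkdir -p /bigdata && cd /bigdata
tar -xzf <bridge-snapshot>.tar.gz
celestia bridge start --p2p.network mocha --node.store /bigdata/.celestia-bridge-mocha-4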

So far the only errors I saw were these, at the moment of synchronization (while my node was catching up with the network): [screenshot]

lesnikutsa commented 4 days ago

I don't see any more errors, and after successful synchronization my logs look like this: [screenshot]

lesnikutsa commented 4 days ago

Maybe the problem is with different users...

lesnikutsa commented 3 days ago

So, after another reboot of the bridge I see only errors. It is strange that after the first launch there were no errors. [screenshot]