Closed kkomelin closed 2 months ago
Thanks for the report. Does this issue persist? The output indicates an error during the RPC call to verify the blob status; this might have been a temporary issue with the full node in use.
If the issue persists, however, I need to take a closer look.
Hey @mlegner,
I saw this issue a few days ago with the previous version of Walrus and Site Builder. I thought my stuff was just outdated. Today I found time to upgrade both Walrus and Site Builder and experienced the issue again.
Just now, I ran the deployment again and received a slightly different error message:
2024-07-05T10:40:55.894813Z INFO site_builder: initializing site builder
2024-07-05T10:40:55.899457Z INFO site_builder: configuration loaded config=Config { portal: "walrus.site", package: 0x514cf7ce2df33b9e2ca69e75bc9645ef38aca67b6f2852992a34e35e9f907f58, general: GeneralArgs { rpc_url: None, wallet: Some("/home/kos/suibase/workdirs/testnet/config/client.yaml"), walrus_binary: Some("walrus"), walrus_config: None, gas_budget: Some(500000000) } }
2024-07-05T10:40:55.900230Z INFO site_builder::util: Using wallet configuration from /home/kos/suibase/workdirs/testnet/config/client.yaml
Parsing the directory ./dist and locally computing blob IDs ... [Ok]
Storing resource on Walrus: /android-chrome-192x192.png ... [Ok]
Storing resource on Walrus: /android-chrome-512x512.png ... [Ok]
Storing resource on Walrus: /apple-touch-icon.png ... [Ok]
Storing resource on Walrus: /assets/index-Bl1oXb_-.js ... [Ok]
Storing resource on Walrus: /assets/index-Cc9PsNB6.css ... [Ok]
Storing resource on Walrus: /browserconfig.xml ... [Ok]
Storing resource on Walrus: /emoji/1.svg ... [Ok]
Storing resource on Walrus: /emoji/10.svg ...
Error during execution
Error: running the command exited with error: 2024-07-05T10:42:21.800741Z INFO walrus: running in JSON mode
2024-07-05T10:42:21.800871Z INFO walrus_service::cli_utils: Using Walrus configuration from /home/kos/.walrus/client_config.yaml
2024-07-05T10:42:21.801002Z INFO walrus_service::cli_utils: Using wallet configuration from /home/kos/suibase/workdirs/testnet/config/client.yaml
2024-07-05T10:42:21.963419Z WARN sui_sdk::wallet_context: Client/Server api version mismatch, client api version : 1.27.2, server api version : 1.28.2
[warn] Client/Server api version mismatch, client api version : 1.27.2, server api version : 1.28.2
2024-07-05T10:42:22.429975Z INFO walrus: Storing blob read from the filesystem file=./dist/emoji/10.svg
Error: client internal error: Error in ObjectResponse: NotExists { object_id: 0xb2dddc6ff6e5905d4e16364f7ffc64db841444cf958bd0b704b8dcdb5ed5efc3 }
Looks like android-chrome-512x512.png is not the cause of the problem.
Hmm... Also the new error indicates a problem with the full node, which doesn't find some Sui object (although related to a different file now).
I haven't been able to reproduce either error locally. For example, when I check for the status of the file android-chrome-512x512.png, I get:
$ walrus blob-status -b 6VQgs6WPlApGbLY03w8DSNyX-Tkqe4EFngEAb32onnc
2024-07-05T12:20:31.728928Z INFO walrus_service::cli_utils: Using Walrus configuration from /Users/markuslegner/.walrus/client_config.yaml
2024-07-05T12:20:31.729113Z INFO walrus_service::cli_utils: Using wallet configuration from /Users/markuslegner/.sui/sui_config/client.yaml
2024-07-05T12:20:31.729493Z INFO walrus_service::cli_utils: Using RPC URL set in wallet configuration
2024-07-05T12:20:32.318631Z WARN sui_sdk::wallet_context: Client/Server api version mismatch, client api version : 1.27.2, server api version : 1.28.2
[warn] Client/Server api version mismatch, client api version : 1.27.2, server api version : 1.28.2
Status for blob ID 6VQgs6WPlApGbLY03w8DSNyX-Tkqe4EFngEAb32onnc: certified
End epoch: 1
Related event: (tx: G73DLxgEVrt3aigiCiBnbMTe5PBNnpe3iQEGcjxPNeRf, seq: 0)
What full node is configured in your /home/kos/suibase/workdirs/testnet/config/client.yaml?
Sure, here are my configs:
/home/kos/suibase/workdirs/testnet/config/client.yaml
---
keystore:
  File: /home/kos/suibase/workdirs/testnet/config-default/sui.keystore
envs:
  - alias: testnet_proxy
    rpc: "http://0.0.0.0:44342"
    ws: ~
    basic_auth: ~
  - alias: testnet
    rpc: "https://fullnode.testnet.sui.io:443"
    ws: ~
    basic_auth: ~
active_env: testnet_proxy
active_address: "0xc2e474f74bf8e8b9e013cb7f223f4f56c24aeba51299459bf5a9ed64d1a63888"
walrus.yaml - site-builder config
package: 0x514cf7ce2df33b9e2ca69e75bc9645ef38aca67b6f2852992a34e35e9f907f58
Thanks for the info. Could it be a problem with the currently active local testnet_proxy? That is, can you check if the problem persists if you switch the env to testnet?
@mlegner Thank you for your help! You were right.
I've updated my local testnet Sui CLI and removed --wallet ~/suibase/workdirs/testnet/config/client.yaml from:
site-builder --config ./walrus.yaml --wallet ~/suibase/workdirs/testnet/config/client.yaml publish ./dist
to use the Sui CLI directly (not through the Suibase proxy), and it worked.
Will report this issue to Suibase.
@kkomelin I suspect it might be related to some RPC services purging their older data. Since Suibase may load-balance among multiple servers, that might also explain why the issue is inconsistent.
Can you please dig in to identify the exact JSON-RPC call (so I can curl it)? Then I can manually try it with various servers.
If it is confirmed that the problem relates to purging, then the solution is to find how Suibase may help make Walrus more robust by:
( In the end, I assume Mysten Labs server is not the intended solution for production )
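For anyone who wants to replay a request by hand as suggested above, here is a minimal sketch of a sui_getObject JSON-RPC payload. It is not taken from Walrus: the helper names (build_get_object_request, post_json_rpc) and the choice of options (showType, showOwner) are assumptions for illustration, and the thread does not confirm that this is the exact call Walrus issues. The object ID is the one from the error log above.

```python
import json
import urllib.request


def build_get_object_request(object_id: str, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 payload for the sui_getObject method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "sui_getObject",
        # The options below are an assumption; adjust them to what you need.
        "params": [object_id, {"showType": True, "showOwner": True}],
    }


def post_json_rpc(rpc_url: str, payload: dict, timeout: float = 10.0) -> dict:
    """Send the payload to a full node and return the decoded JSON response."""
    request = urllib.request.Request(
        rpc_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Object ID copied from the NotExists error in the log above.
OBJECT_ID = "0xb2dddc6ff6e5905d4e16364f7ffc64db841444cf958bd0b704b8dcdb5ed5efc3"
payload = build_get_object_request(OBJECT_ID)
print(json.dumps(payload, indent=2))
```

The printed JSON can be pasted into curl as the request body, or sent with post_json_rpc("https://fullnode.testnet.sui.io:443", payload) to compare the answers of different servers.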
@mario4tier Thanks. I can try to debug further. Does Suibase have a verbose mode or collect logs somewhere?
An update from my side... After updating the Sui CLI today and now Suibase, the site deployment went well with no issues. For Walrus Sites, I have to keep the Sui CLI updated because Walrus Site Builder actually uses the official Sui CLI (not the one that Suibase provides).
Anyways, I think we can consider this issue resolved until I see it again (hopefully not). Thank you very much everyone for all your help 🙏
(@kkomelin thanks for sharing the logs offline)
I have verified that the Walrus RPC sui_getObject succeeds with all community nodes integrated with Suibase.
BTW, I was probably wrong about the history "theory"... all existing objects should be observable with sui_getObject. Pruning is more relevant when trying to retrieve old transactions (which I assume does not matter for Walrus).
What is the root cause? Likely a temporary glitch/rate limiting from one server, which might become more common once Mysten Labs goes ahead with their planned rate limiting.
What's next (for me)? On detecting a response failure, I will make Suibase retry with up to 2 other community servers. This retry would apply only to "read-only" methods such as "sui_getObject".
@mlegner Does Walrus already retry on sui_getObject failure? I assume not, but asking just in case.
@mlegner is it possible that there is a race condition between the moment an object is created and the moment it can be "read back" by all RPC services worldwide? (Sorry if it does not apply, admittedly I have limited knowledge of Walrus)
--> asking this because the error in the Walrus logs says "NotExists", which suggests the server responded normally (it neither rate limited the request nor failed to respond).
This potential cause sounds very probable to me. @mlegner What do you think about adding a 1-2 second delay before the object existence check in Walrus/Walrus Site Builder?
Thanks for your investigations, @kkomelin and @mario4tier. I'm answering here in bulk (let me know if I missed something).
I suspect it might be related to some RPC services purging their older data.
BTW, I was probably wrong about the history "theory"... all existing objects should be observable with sui_getObject. Pruning is more relevant when trying to retrieve old transactions (which I assume does not matter for Walrus).
This could actually be part of the problem. Walrus uses events for some purposes, including communicating the status of blobs. These events are also checked by the client to verify information obtained from storage nodes.
Given that some full nodes purge event data, we may have to use a different mechanism to verify the status of a blob. I've created an internal issue for this (https://github.com/MystenLabs/walrus/issues/586).
For Walrus Sites, I have to keep the Sui cli updated because Walrus Site Builder actually uses the official Sui cli.
Just a remark here: Neither Walrus nor the site builder uses the Sui CLI directly. They just use the configuration files defining the active address and the RPC node.
Does Walrus already retry on sui_getObject failure? I assume not, but asking just in case.
No, in this case we don't. It's also unclear how this would help as the full node would probably reply with the same error when asked again (unless it's a proxy that sends requests to different full nodes).
is it possible that there is a race condition between the moment an object is created and the moment it can be "read back" by all RPC services worldwide?
What do you think about adding a 1-2 second delay before the object existence check in Walrus/Walrus Site Builder?
I've looked into the Error: client internal error: Error in ObjectResponse: NotExists { object_id: 0xb2dddc6ff6e5905d4e16364f7ffc64db841444cf958bd0b704b8dcdb5ed5efc3 } a bit more, and I believe this is indeed such a race condition.
The reason why this occurs is that the on-chain registration of a blob is a two-step process consisting of (1) obtaining a storage resource and (2) using that resource to register the blob. The problem doesn't occur when using the same RPC node for both requests as it obviously has the data of the first request, but it can happen when the requests are proxied to different RPC nodes (which is what I suspect happened here).
A delay would be one option, but in this specific case I think using a PTB to combine the two calls would be a better solution. I've created an internal issue for this (https://github.com/MystenLabs/walrus/issues/587).
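The race described above can be illustrated with a toy model. This is not Sui or Walrus code: ToyFullNode, register_in_two_requests, and the object names are all hypothetical, and real full nodes sync state asynchronously rather than never. The sketch only shows why the second step fails when it lands on a node that has not yet seen the result of the first.

```python
class ToyFullNode:
    """A toy full node that only knows the objects it has already seen."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.known_objects = set()

    def get_object(self, object_id: str) -> str:
        if object_id not in self.known_objects:
            raise LookupError(f"NotExists {{ object_id: {object_id} }}")
        return object_id


def register_in_two_requests(first: ToyFullNode, second: ToyFullNode) -> str:
    # Step 1: obtain a storage resource; only `first` has seen it so far.
    first.known_objects.add("storage-resource")
    # Step 2: register the blob; this reads the resource through `second`.
    second.get_object("storage-resource")
    second.known_objects.add("blob")
    return "registered"


def attempt(first: ToyFullNode, second: ToyFullNode) -> str:
    try:
        return register_in_two_requests(first, second)
    except LookupError as error:
        return f"failed: {error}"


# Both requests hit the same node: the resource is visible in step 2.
same_node = ToyFullNode("a")
same_node_result = attempt(same_node, same_node)

# A proxy sends step 2 to a node that has not seen the resource yet.
node_a, node_b = ToyFullNode("a"), ToyFullNode("b")
proxied_result = attempt(node_a, node_b)

print(same_node_result)  # registered
print(proxied_result)    # failed: NotExists { object_id: storage-resource }
```

Combining both steps into a single transaction removes the intermediate read, so in this model there would be no window in which a lagging node could be asked for the resource.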
Again, thank you so much for reporting and investigating these issues! This type of feedback is extremely valuable for us to make the whole system more robust.
Thanks for the clear explanations and path forward. Learning a few things along the way :smile:
Suibase proxy now mitigates with "retries on NotExists" with up to 3 different servers (and 1 second delay in-between).
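As a rough sketch of what such a mitigation could look like (this is not Suibase's actual implementation), the function below tries up to 3 servers in order and waits 1 second between attempts. ObjectNotFound, get_object_with_fallback, and the fetch callback are hypothetical names introduced only for this example.

```python
import time
from typing import Callable, Optional, Sequence


class ObjectNotFound(Exception):
    """Hypothetical error raised when a server answers NotExists."""


def get_object_with_fallback(
    object_id: str,
    servers: Sequence[str],
    fetch: Callable[[str, str], dict],
    max_servers: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Ask up to `max_servers` servers in turn, pausing between attempts."""
    last_error: Optional[Exception] = None
    for index, server in enumerate(list(servers)[:max_servers]):
        if index > 0:
            sleep(delay_seconds)  # give lagging servers time to catch up
        try:
            return fetch(server, object_id)
        except ObjectNotFound as error:
            last_error = error
    if last_error is None:
        raise ValueError("no servers configured")
    raise last_error


# Demo with a fake fetch: server "a" has not seen the object yet, "b" has.
attempts = []
sleeps = []


def fake_fetch(server: str, object_id: str) -> dict:
    attempts.append(server)
    if server == "a":
        raise ObjectNotFound(object_id)
    return {"objectId": object_id, "servedBy": server}


result = get_object_with_fallback(
    "0x1", ["a", "b", "c"], fake_fetch, sleep=sleeps.append
)
print(result)  # {'objectId': '0x1', 'servedBy': 'b'}
```

Restricting this kind of retry to read-only methods, as proposed earlier in the thread, avoids submitting the same state-changing request twice.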
@mlegner @mario4tier Thank you very much guys for your diverse help here! I really appreciate it.
I've just tested my app deployment with the latest Suibase snapshot. No issues noticed.
Do you think we can close this issue for now?
Yes, I'll close this issue; we will work on the things we noticed here in the background. 🙂
Console log
Just in case you need it, the file that caused the error is https://github.com/kkomelin/sui-dapp-starter/blob/main/packages/frontend/public/android-chrome-512x512.png
Versions
OS: Ubuntu 22.04 LTS
Walrus and Walrus Site Builder: Latest (intentionally updated)
How to reproduce