Walrus Site Builder: Error on deploy: Could not find the referenced transaction

kkomelin commented 3 months ago

Console log

2024-07-05T07:38:21.043318Z  INFO site_builder: initializing site builder
2024-07-05T07:38:21.043555Z  INFO site_builder: configuration loaded config=Config { portal: "walrus.site", package: 0x514cf7ce2df33b9e2ca69e75bc9645ef38aca67b6f2852992a34e35e9f907f58, general: GeneralArgs { rpc_url: None, wallet: Some("/home/kos/suibase/workdirs/testnet/config/client.yaml"), walrus_binary: Some("walrus"), walrus_config: None, gas_budget: Some(500000000) } }
2024-07-05T07:38:21.043600Z  INFO site_builder::util: Using wallet configuration from /home/kos/suibase/workdirs/testnet/config/client.yaml
Parsing the directory ./dist and locally computing blob IDs ... [Ok]
Storing resource on Walrus: /android-chrome-192x192.png ... [Ok]
Storing resource on Walrus: /android-chrome-512x512.png ... 
Error during execution
Error: running the command exited with error: 2024-07-05T07:39:11.642004Z  INFO walrus: running in JSON mode
2024-07-05T07:39:11.642097Z  INFO walrus_service::cli_utils: Using Walrus configuration from /home/kos/.walrus/client_config.yaml
2024-07-05T07:39:11.642197Z  INFO walrus_service::cli_utils: Using wallet configuration from /home/kos/suibase/workdirs/testnet/config/client.yaml
2024-07-05T07:39:11.793691Z  WARN sui_sdk::wallet_context: Client/Server api version mismatch, client api version : 1.27.2, server api version : 1.29.0
[warn] Client/Server api version mismatch, client api version : 1.27.2, server api version : 1.29.0
2024-07-05T07:39:12.344443Z  INFO walrus: Storing blob read from the filesystem file=./dist/android-chrome-512x512.png
2024-07-05T07:39:13.139700Z  WARN reserve_and_store_blob{blob_id="6VQgs6WPlApGbLY03w8DSNyX-Tkqe4EFngEAb32onnc"}:verify_blob_status{blob_id=BlobId(6VQgs6WPlApGbLY03w8DSNyX-Tkqe4EFngEAb32onnc) status=Existent { end_epoch: 1, status: Certified, status_event: EventID { tx_digest: TransactionDigest(G73DLxgEVrt3aigiCiBnbMTe5PBNnpe3iQEGcjxPNeRf), event_seq: 0 } }}: walrus_service::client: error=RPC call failed: ErrorObject { code: InvalidParams, message: "Could not find the referenced transaction [TransactionDigest(G73DLxgEVrt3aigiCiBnbMTe5PBNnpe3iQEGcjxPNeRf)].", data: None }
Error: did not receive a valid blob status from the quorum of nodes

Just in case you need it, the file that caused the error is https://github.com/kkomelin/sui-dapp-starter/blob/main/packages/frontend/public/android-chrome-512x512.png

Versions

OS: Ubuntu 22.04 LTS
Walrus and Walrus Site Builder: Latest (intentionally updated)

How to reproduce

# Clone the reproduction project
git clone git@github.com:kkomelin/sui-dapp-starter.git
cd sui-dapp-starter
# Install deps
pnpm install
# Run local testnet interfaces (through Suibase)
pnpm testnet:start
# Get current testnet address from the output of this command
pnpm testnet:address
# Fund the address via Sui Discord #testnet-faucet channel
# and then try to deploy the app to Walrus
pnpm frontend:deploy:walrus

mlegner commented 3 months ago

Thanks for the report. Does this issue persist? The output indicates an error during the RPC call that verifies the blob status; this might have been a temporary issue with the full node in use.

If the issue persists, however, I need to take a closer look.

kkomelin commented 3 months ago

Hey @mlegner,

I saw this issue a few days ago with the previous versions of Walrus and Site Builder, and thought my setup was just outdated. Today I found time to upgrade both Walrus and Site Builder and experienced the issue again.

Just now, I ran the deployment again and received a slightly different error message:

2024-07-05T10:40:55.894813Z  INFO site_builder: initializing site builder
2024-07-05T10:40:55.899457Z  INFO site_builder: configuration loaded config=Config { portal: "walrus.site", package: 0x514cf7ce2df33b9e2ca69e75bc9645ef38aca67b6f2852992a34e35e9f907f58, general: GeneralArgs { rpc_url: None, wallet: Some("/home/kos/suibase/workdirs/testnet/config/client.yaml"), walrus_binary: Some("walrus"), walrus_config: None, gas_budget: Some(500000000) } }
2024-07-05T10:40:55.900230Z  INFO site_builder::util: Using wallet configuration from /home/kos/suibase/workdirs/testnet/config/client.yaml
Parsing the directory ./dist and locally computing blob IDs ... [Ok]
Storing resource on Walrus: /android-chrome-192x192.png ... [Ok]
Storing resource on Walrus: /android-chrome-512x512.png ... [Ok]
Storing resource on Walrus: /apple-touch-icon.png ... [Ok]
Storing resource on Walrus: /assets/index-Bl1oXb_-.js ... [Ok]
Storing resource on Walrus: /assets/index-Cc9PsNB6.css ... [Ok]
Storing resource on Walrus: /browserconfig.xml ... [Ok]
Storing resource on Walrus: /emoji/1.svg ... [Ok]
Storing resource on Walrus: /emoji/10.svg ... 
Error during execution
Error: running the command exited with error: 2024-07-05T10:42:21.800741Z  INFO walrus: running in JSON mode
2024-07-05T10:42:21.800871Z  INFO walrus_service::cli_utils: Using Walrus configuration from /home/kos/.walrus/client_config.yaml
2024-07-05T10:42:21.801002Z  INFO walrus_service::cli_utils: Using wallet configuration from /home/kos/suibase/workdirs/testnet/config/client.yaml
2024-07-05T10:42:21.963419Z  WARN sui_sdk::wallet_context: Client/Server api version mismatch, client api version : 1.27.2, server api version : 1.28.2
[warn] Client/Server api version mismatch, client api version : 1.27.2, server api version : 1.28.2
2024-07-05T10:42:22.429975Z  INFO walrus: Storing blob read from the filesystem file=./dist/emoji/10.svg
Error: client internal error: Error in ObjectResponse: NotExists { object_id: 0xb2dddc6ff6e5905d4e16364f7ffc64db841444cf958bd0b704b8dcdb5ed5efc3 }

Looks like android-chrome-512x512.png is not the cause of the problem.

mlegner commented 3 months ago

Hmm... Also the new error indicates a problem with the full node, which doesn't find some Sui object (although related to a different file now).

I haven't been able to reproduce either error locally. For example, when I check for the status of the file android-chrome-512x512.png, I get:

$ walrus blob-status -b 6VQgs6WPlApGbLY03w8DSNyX-Tkqe4EFngEAb32onnc
2024-07-05T12:20:31.728928Z  INFO walrus_service::cli_utils: Using Walrus configuration from /Users/markuslegner/.walrus/client_config.yaml
2024-07-05T12:20:31.729113Z  INFO walrus_service::cli_utils: Using wallet configuration from /Users/markuslegner/.sui/sui_config/client.yaml
2024-07-05T12:20:31.729493Z  INFO walrus_service::cli_utils: Using RPC URL set in wallet configuration
2024-07-05T12:20:32.318631Z  WARN sui_sdk::wallet_context: Client/Server api version mismatch, client api version : 1.27.2, server api version : 1.28.2
[warn] Client/Server api version mismatch, client api version : 1.27.2, server api version : 1.28.2
Status for blob ID 6VQgs6WPlApGbLY03w8DSNyX-Tkqe4EFngEAb32onnc: certified
End epoch: 1
Related event: (tx: G73DLxgEVrt3aigiCiBnbMTe5PBNnpe3iQEGcjxPNeRf, seq: 0)

What full node is configured in your /home/kos/suibase/workdirs/testnet/config/client.yaml?

kkomelin commented 3 months ago

Sure, here are my configs:

/home/kos/suibase/workdirs/testnet/config/client.yaml

---
keystore:
  File: /home/kos/suibase/workdirs/testnet/config-default/sui.keystore
envs:
  - alias: testnet_proxy
    rpc: "http://0.0.0.0:44342"
    ws: ~
    basic_auth: ~
  - alias: testnet
    rpc: "https://fullnode.testnet.sui.io:443"
    ws: ~
    basic_auth: ~
active_env: testnet_proxy
active_address: "0xc2e474f74bf8e8b9e013cb7f223f4f56c24aeba51299459bf5a9ed64d1a63888"

walrus.yaml - site-builder config

package: 0x514cf7ce2df33b9e2ca69e75bc9645ef38aca67b6f2852992a34e35e9f907f58
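To see which full node a wallet configuration like the client.yaml above points at, the active_env alias has to be matched against the envs list. Below is a minimal sketch, assuming the YAML has already been parsed into a Python dict (for example with a YAML library). The function name active_rpc_url is hypothetical and not part of any Sui, Suibase, or Walrus tool.

```python
def active_rpc_url(config: dict) -> str:
    """Return the RPC URL of the environment named by `active_env`."""
    active = config["active_env"]
    for env in config.get("envs", []):
        if env.get("alias") == active:
            return env["rpc"]
    raise KeyError(f"active_env {active!r} not found in envs")


# Values copied from the client.yaml shown above.
client_config = {
    "envs": [
        {"alias": "testnet_proxy", "rpc": "http://0.0.0.0:44342"},
        {"alias": "testnet", "rpc": "https://fullnode.testnet.sui.io:443"},
    ],
    "active_env": "testnet_proxy",
}

print(active_rpc_url(client_config))  # prints http://0.0.0.0:44342
```

With this configuration, requests go to the local testnet_proxy endpoint rather than directly to fullnode.testnet.sui.io.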

mlegner commented 3 months ago

Thanks for the info. Could it be a problem with the currently active local testnet_proxy? That is, can you check if the problem persists if you switch the env to testnet?

kkomelin commented 3 months ago

@mlegner Thank you for your help! You were right.

I've updated the local testnet Sui cli and removed --wallet ~/suibase/workdirs/testnet/config/client.yaml from:

site-builder --config ./walrus.yaml --wallet ~/suibase/workdirs/testnet/config/client.yaml  publish ./dist

to use Sui cli directly (not through Suibase proxy) and it worked.

Will report this issue to Suibase.

mario4tier commented 3 months ago

@kkomelin I suspect this might be related to some RPC services purging their older data. Since Suibase may load-balance among multiple servers, that could also explain why the issue is inconsistent.

Could you dig in and identify the exact JSON-RPC call (so I can curl it)? I can then try it manually against various servers.
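For reference, request bodies for the two lookups that appear in the logs could be built roughly as follows. This is a sketch under assumptions: sui_getObject and sui_getTransactionBlock are standard Sui JSON-RPC methods, but which calls and options Walrus actually sends is not confirmed in this thread, and the helper rpc_payload is hypothetical.

```python
import json


def rpc_payload(method: str, params: list, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 request body."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    )


# Object ID and transaction digest copied from the error logs in this thread.
OBJECT_ID = "0xb2dddc6ff6e5905d4e16364f7ffc64db841444cf958bd0b704b8dcdb5ed5efc3"
TX_DIGEST = "G73DLxgEVrt3aigiCiBnbMTe5PBNnpe3iQEGcjxPNeRf"

# The option flags are illustrative; Walrus may request different fields.
get_object = rpc_payload("sui_getObject", [OBJECT_ID, {"showType": True}])
get_tx = rpc_payload("sui_getTransactionBlock", [TX_DIGEST, {"showEvents": True}])

print(get_object)
print(get_tx)
```

Each printed body can then be sent to a candidate server with curl -s -X POST -H 'Content-Type: application/json' -d '<body>' <rpc-url> to compare responses.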

mario4tier commented 3 months ago

If it is confirmed that the problem relates to purging... then the solution is to find how Suibase can help make Walrus more robust by:

Forcing "history sensitive" RPC calls to be directed to servers with full history.
Having the proxy automatically retry such RPC calls (with a different server) in case one server does not have the requested object.

(In the end, I assume the Mysten Labs server is not the intended solution for production.)

kkomelin commented 3 months ago

@mario4tier Thanks. I can try to debug further. Does Suibase have a verbose mode or collect logs somewhere?

kkomelin commented 3 months ago

An update from my side... After updating Sui cli today and now Suibase, the site deployment went well with no issues. For Walrus Sites, I have to keep the Sui cli updated because Walrus Site Builder actually uses the official Sui cli (not the one that Suibase provides).

Anyways, I think we can consider this issue resolved until I see it again (hopefully not). Thank you very much everyone for all your help 🙏

mario4tier commented 3 months ago

(@kkomelin thanks for sharing the logs offline)

I have verified that the Walrus RPC call sui_getObject succeeds with all community nodes integrated with Suibase.

BTW, I was probably wrong about the history "theory"... all existing objects should be observable with sui_getObject. Pruning is more relevant when trying to retrieve old transactions (which I assume does not matter for Walrus).

What is the root cause? Likely a temporary glitch/rate limiting from one server, which might become more common once Mysten Labs go ahead with their planned rate limiting.

What's next (for me)? On detecting a response failure, I will make Suibase retry with up to 2 other community server(s). This retry would apply only to "read-only" methods such as "sui_getObject".

@mlegner Does Walrus already retry on sui_getObject failure? I assume not, but asking just in case.

mario4tier commented 3 months ago

@mlegner is it possible that there is a race condition between the moment an object is created and the moment it can be "read back" by all RPC services worldwide? (Sorry if it does not apply, admittedly I have limited knowledge of Walrus)

--> asking this because the error in Walrus logs says "NotExists", which suggests the server neither rate-limited the request nor failed to respond.

kkomelin commented 3 months ago

@mlegner is it possible that there is a race condition between the moment an object is created and the moment it can be "read back" by all RPC services worldwide? (Sorry if it does not apply, admittedly I have limited knowledge of Walrus)

--> asking this because the error in Walrus logs says "NotExists", which suggests the server neither rate-limited the request nor failed to respond.

This potential cause sounds very probable to me. @mlegner What do you think about adding a 1-2 second delay before the object existence check in Walrus/Walrus Site Builder?

mlegner commented 2 months ago

Thanks for your investigations, @kkomelin and @mario4tier. I'm answering here in bulk (let me know if I missed something).

I suspect this might be related to some RPC services purging their older data.

BTW, I was probably wrong about the history "theory"... all existing objects should be observable with sui_getObject. Pruning is more relevant when trying to retrieve old transactions (which I assume does not matter for Walrus).

This could actually be part of the problem. Walrus uses events for some purposes, including communicating the status of blobs. These events are also checked by the client to verify information obtained from storage nodes.

Given that some full nodes purge event data, we may have to use a different mechanism to verify the status of a blob. I've created an internal issue for this (https://github.com/MystenLabs/walrus/issues/586).

For Walrus Sites, I have to keep the Sui cli updated because Walrus Site Builder actually uses the official Sui cli.

Just a remark here: Neither Walrus nor the site builder uses the Sui CLI directly. They just use the configuration files defining the active address and the RPC node.

Does Walrus already retry on sui_getObject failure? I assume not, but asking just in case.

No, in this case we don't. It's also unclear how this would help as the full node would probably reply with the same error when asked again (unless it's a proxy that sends requests to different full nodes).

is it possible that there is a race condition between the moment an object is created and the moment it can be "read back" by all RPC services worldwide?

What do you think about adding a 1-2 second delay before the object existence check in Walrus/Walrus Site Builder?

I've looked into the Error: client internal error: Error in ObjectResponse: NotExists { object_id: 0xb2dddc6ff6e5905d4e16364f7ffc64db841444cf958bd0b704b8dcdb5ed5efc3 } a bit more, and I believe this is indeed such a race condition.

The reason why this occurs is that the on-chain registration of a blob is a two-step process consisting of (1) obtaining a storage resource and (2) using that resource to register the blob. The problem doesn't occur when using the same RPC node for both requests as it obviously has the data of the first request, but it can happen when the requests are proxied to different RPC nodes (which is what I suspect happened here).
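The effect described above can be illustrated with a toy model. This is only a sketch: ToyNode and register_blob are hypothetical in-memory stand-ins, not Sui or Walrus APIs, and real full nodes catch up over time rather than staying out of sync.

```python
class ToyNode:
    """In-memory stand-in for a full node; not a real Sui client."""

    def __init__(self) -> None:
        self.objects: set[str] = set()

    def get_object(self, object_id: str) -> str:
        return "Exists" if object_id in self.objects else "NotExists"


def register_blob(write_node: ToyNode, read_node: ToyNode) -> str:
    # Step 1: obtain a storage resource. Only the node that handled this
    # request is guaranteed to know about the new object right away.
    write_node.objects.add("storage-resource")
    # Step 2: registering the blob needs to look that object up again.
    return read_node.get_object("storage-resource")


node_a = ToyNode()
same_node = register_blob(node_a, node_a)

node_a, node_b = ToyNode(), ToyNode()
different_nodes = register_blob(node_a, node_b)

print(same_node)        # prints Exists
print(different_nodes)  # prints NotExists
```

When both steps hit the same node, the lookup succeeds; when a proxy routes the second step to a node that has not yet seen the first transaction, the lookup reports NotExists.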

A delay would be one option, but in this specific case I think using a PTB to combine the two calls would be a better solution. I've created an internal issue for this (https://github.com/MystenLabs/walrus/issues/587).

Again, thank you so much for reporting and investigating these issues! This type of feedback is extremely valuable for us to make the whole system more robust.

mario4tier commented 2 months ago

Thanks for the clear explanations and path forward. Learning a few things along the way :smile:

Suibase proxy now mitigates with "retries on NotExists" with up to 3 different servers (and 1 second delay in-between).
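A mitigation of this kind could look roughly like the sketch below. It is not Suibase's actual implementation: the function name, the response shape ({"error": "NotExists"}), and the reading of "up to 3 different servers" as three servers in total are all assumptions made for illustration. The fetch callable is injected so that the logic can be exercised without a network.

```python
import time
from typing import Callable, Sequence


def get_object_with_retry(
    servers: Sequence[str],
    object_id: str,
    fetch: Callable[[str, str], dict],
    max_servers: int = 3,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Query up to `max_servers` servers in turn and return the first
    response that is not NotExists (or the last response if all are)."""
    if not servers:
        raise ValueError("at least one server is required")
    response: dict = {}
    for attempt, server in enumerate(servers[:max_servers]):
        if attempt > 0:
            sleep(delay_s)  # wait before asking the next server
        response = fetch(server, object_id)
        if response.get("error") != "NotExists":
            break
    return response


def fake_fetch(server: str, object_id: str) -> dict:
    # Hypothetical backend: one server has not yet seen the object.
    if server == "lagging":
        return {"error": "NotExists"}
    return {"data": {"objectId": object_id}}


result = get_object_with_retry(
    ["lagging", "synced"], "0xabc", fake_fetch, sleep=lambda _s: None
)
print(result)  # prints {'data': {'objectId': '0xabc'}}
```

Restricting such retries to read-only methods like sui_getObject, as proposed earlier in the thread, avoids re-submitting requests that change state.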

kkomelin commented 2 months ago

@mlegner @mario4tier Thank you very much guys for your diverse help here! I really appreciate it.

I've just tested my app deployment with the latest Suibase snapshot. No issues noticed.

Do you think we can close this issue for now?

mlegner commented 2 months ago

Yes, I'll close this issue. We will work on the things we noticed here in the background. 🙂

MystenLabs / walrus-sites