avinashbo opened this issue 1 month ago
It's not really surprising; the only question is why you need to refetch by hash afterward. If you subscribe to tx events you should already have all the info.
@ceciEstErmat can you please elaborate on why it's not surprising? I'd really appreciate that as I'm not well versed in what's going on under the hood in this scenario.
As for why: it's useful for sniping bots in various cases, and it doesn't have to be getTransaction specifically, but any call that you need to make asap on the data acquired through logsSubscribe.
Example:
- The bot uses logsSubscribe to get the newly minted token address.
- As soon as the notification arrives over websocket, the bot needs to create an associated token account for the token.
This is pretty much the same scenario as sending a getTransaction request. It doesn't have to be a bot scenario; just an example from the field.
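The subscribe-then-act-immediately flow described above can be sketched in a few lines; a toy simulation rather than real Solana client code, where `fake_log_notifications` stands in for a real `logsSubscribe` websocket stream and `create_associated_token_account` is a hypothetical placeholder for whatever follow-up call the bot must issue as soon as possible:

```python
import asyncio

async def fake_log_notifications():
    # Stand-in for a real logsSubscribe websocket stream: yields the
    # mint address extracted from each log notification.
    for mint in ["Mint111", "Mint222"]:
        await asyncio.sleep(0)  # simulated network delay
        yield mint

async def create_associated_token_account(mint: str) -> str:
    # Hypothetical follow-up RPC call the bot fires immediately
    # after receiving the notification.
    return f"ata-for-{mint}"

async def bot():
    created = []
    async for mint in fake_log_notifications():
        # The whole point: act on the data the moment it arrives.
        created.append(await create_associated_token_account(mint))
    return created

print(asyncio.run(bot()))  # prints ['ata-for-Mint111', 'ata-for-Mint222']
```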
@akegaviar - In your example, are the subscription (e.g. logsSubscribe) and follow-up query (e.g. getTransaction) being sent to the same node?
I'm trying to understand how there might be a race condition between nodes' chain state. Going from finalized to confirmed commitment level should provide a really wide window allowing all nodes to be able to observe chain state.
@akegaviar To me it is not surprising, as most RPC clusters are "eventually consistent": you might be using an RPC provider and be assigned to machine A for the event subscription but to machine B for the request. Machine B might not yet be at the same state as A.
Distributed systems are for the most part "eventually consistent"; this is why it did not surprise me, and coding such a thing I would expect this behavior.
Unsure if this clarifies what I meant.
> In your example, are the subscription (e.g. logsSubscribe) and follow-up query (e.g. getTransaction) being sent to the same node?
This is a critical question before going any further. The two previous comments stitch together a potential problem scenario:
- subscription notification from node A
- getTransaction request from node B
Suppose that node A is caught up with the tip of the cluster and node B is ~50 slots behind the tip. You could get a finalized tx hash from node A. Then, when you go to query that tx from node B, it could be the case that node B has not yet replayed that slot/tx. This could lead to node B being unable to return anything about that tx.
Depending on configuration, RPC nodes might be doing a lot of extra work and be under heavier load than "regular" nodes. We have definitely seen scenarios where heavy RPC traffic (requests) can cause an RPC node to fall behind. Rate limiting of nodes was left as something to perform outside of the validator client.
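The lagging-node scenario above can be simulated in a few lines; a toy model, not Agave code, where each node is just a map of the transactions it has replayed so far:

```python
class Node:
    def __init__(self, slot: int, txs: dict):
        self.slot = slot  # highest slot this node has replayed
        self.txs = txs    # signature -> status metadata

    def get_transaction(self, sig: str):
        # Returns None if this node has not replayed the tx yet,
        # mirroring the null response from a lagging RPC node.
        return self.txs.get(sig)

# Node A is at the tip; node B is ~50 slots behind and has not
# replayed slot 1000 yet.
node_a = Node(slot=1000, txs={"sig1": {"slot": 1000, "err": None}})
node_b = Node(slot=950, txs={})

sig = "sig1"
# The subscription on node A hands us a finalized signature...
assert node_a.get_transaction(sig) is not None
# ...but querying the lagging node B returns nothing.
assert node_b.get_transaction(sig) is None
```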
@bw-solana We were able to reproduce the issue when both logsSubscribe and getTransaction were sent to the same node.
> I'm trying to understand how there might be a race condition between nodes' chain state
Wondering the same, given that we are hitting the same node, and that too with logsSubscribe using the finalized commitment level, which takes a little longer than the confirmed commitment level.
Makes me think the Transaction Status Service must be backed up. I believe that if you request tx details before TSS has processed the updates, the RPC server will return null or an error.
I still cannot understand why it would necessitate introducing a sleep to fix this issue when it is not two different nodes serving the websocket and HTTP RPC.
> I still cannot understand why it would necessitate introducing a sleep to fix this issue when it is not two different nodes serving the websocket and HTTP RPC.
I agree. The sleep is more of a workaround. We need to figure out what's going on.
My best guess is TSS being slow, but we'll need to confirm.
I have root-caused the issue. The problem is caused by the asynchronous nature of the logsSubscribe notification and the persistence of TransactionStatusMeta by the TransactionStatusService in the ReplayStage.
When a transaction is executed, the transaction logs are collected synchronously and the transaction status batch is sent to the TransactionStatusService, which performs the persistence asynchronously in a separate thread. Then the logsSubscribe event (via CommitmentAggregationData --> NotificationEntry::Bank) is sent. Due to the asynchronous nature of these two processes, it may happen that when getTransaction is called, the TransactionStatusMeta has not been written to the blockstore yet. We have had the same mechanism for a long time, so I do not think this is a regression.
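The race described above can be reproduced in miniature; a toy model that assumes nothing about the real Agave code beyond the ordering just described: the status batch is handed to a separate persistence thread, while the notification is observable immediately.

```python
import threading
import time
import queue

blockstore = {}               # stands in for TransactionStatusMeta storage
status_queue = queue.Queue()  # batch sent to the TransactionStatusService

def transaction_status_service():
    # Persists status metadata asynchronously, like the TSS thread.
    sig, meta = status_queue.get()
    time.sleep(0.05)          # simulated persistence delay
    blockstore[sig] = meta

tss = threading.Thread(target=transaction_status_service)
tss.start()

# Transaction executes: the status batch is queued, and the
# logsSubscribe notification fires immediately afterward.
status_queue.put(("sig1", {"err": None}))
notification_received = True

# getTransaction issued right after the notification: the meta has
# not been written to the blockstore yet.
early = blockstore.get("sig1")  # None - this is the race

tss.join()
late = blockstore.get("sig1")   # persisted by now

print(early, late)
```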
Recommendations:
This might help a little https://github.com/anza-xyz/agave/pull/3026
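Until the ordering changes on the node side, a bounded retry is a more robust client-side workaround than a fixed sleep; a sketch, where `fetch` stands in for the real getTransaction call and `fake_fetch` is a toy that only "persists" the tx after a couple of attempts:

```python
import time

def get_transaction_with_retry(fetch, sig, retries=5, delay=0.1):
    """Retry while the node returns None, i.e. before the
    TransactionStatusService has persisted the status meta."""
    for attempt in range(retries):
        tx = fetch(sig)
        if tx is not None:
            return tx
        time.sleep(delay * (2 ** attempt))  # exponential backoff
    return None

# Toy fetch that only returns the tx from the third call onward,
# simulating the persistence lag.
calls = {"n": 0}
def fake_fetch(sig):
    calls["n"] += 1
    return {"sig": sig} if calls["n"] >= 3 else None

print(get_transaction_with_retry(fake_fetch, "sig1", delay=0.01))
```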
Problem
We are noticing some latency with txn responses on our mainnet nodes running v1.18.23. We subscribe via logSubscribe at the finalized commitment level and then call getTransaction at the confirmed commitment level. Now some of the hashes return a null, and the rate of nulls we witness decreases when we introduce a sleep between logSubscribe and getTransaction. We are trying to understand why this would happen when both the websocket and http calls are being served by the same node.
Proposed Solution