avinashbo opened this issue 1 month ago
It's not really surprising; the only question is why you need to refetch by hash afterward. If you subscribe to tx events you should already have all the info.
@ceciEstErmat can you please elaborate on why it's not surprising? I'd really appreciate that as I'm not well versed in what's going on under the hood in this scenario.
As for why: it's useful for sniping bots in various cases, and it doesn't have to be getTransaction specifically, but any call that you need to make asap on the data acquired through logsSubscribe.
Example:
- The bot uses logsSubscribe to get the newly minted token address.
- As soon as the notification arrives over websocket, the bot needs to create an associated token account for the token.
This is pretty much the same scenario as sending a getTransaction request. It doesn't have to be a bot scenario; just an example from the field.
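The subscribe-then-act-immediately flow described above can be sketched in a few lines; a toy simulation rather than real Solana client code, where `fake_log_notifications` stands in for a real `logsSubscribe` websocket stream and `create_associated_token_account` is a hypothetical placeholder for whatever follow-up call the bot must issue as soon as possible:

```python
import asyncio

async def fake_log_notifications():
    # Stand-in for a real logsSubscribe websocket stream: yields the
    # mint address extracted from each log notification.
    for mint in ["Mint111", "Mint222"]:
        await asyncio.sleep(0)  # simulated network delay
        yield mint

async def create_associated_token_account(mint: str) -> str:
    # Hypothetical follow-up RPC call the bot fires immediately
    # after receiving the notification.
    return f"ata-for-{mint}"

async def bot():
    created = []
    async for mint in fake_log_notifications():
        # The whole point: act on the data the moment it arrives.
        created.append(await create_associated_token_account(mint))
    return created

print(asyncio.run(bot()))  # prints ['ata-for-Mint111', 'ata-for-Mint222']
```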
@akegaviar - In your example, are the subscription (e.g. logsSubscribe) and follow-up query (e.g. getTransaction) being sent to the same node?
I'm trying to understand how there might be a race condition between nodes' chain state. Going from finalized to confirmed commitment level should provide a really wide window allowing all nodes to be able to observe chain state.
@akegaviar To me it is not surprising, as most RPC clusters are "eventually consistent": you might be using an RPC provider and be assigned to machine A for the event subscription but to machine B for the request. Machine B might not yet be at the same state as A.
Distributed systems are for the most part "eventually consistent"; this is why it did not surprise me, and coding such a thing I would expect this behavior.
Unsure if this clarifies what I meant.
> In your example, are the subscription (e.g. logsSubscribe) and follow-up query (e.g. getTransaction) being sent to the same node?
This is a critical question before going any further. The two previous comments stitch together a potential problem scenario:
- subscription notification from node A
- getTransaction request from node B
Suppose that node A is caught up with the tip of the cluster and node B is ~50 slots behind the tip. You could get a finalized tx hash from node A. Then, when you go to query that tx from node B, it could be the case that node B has not yet replayed that slot/tx. This could lead to node B being unable to return anything about that tx.
Depending on configuration, RPC nodes might be doing a lot of extra work and be under heavier load than "regular" nodes. We have definitely seen scenarios where heavy RPC traffic (requests) can cause an RPC node to fall behind. Rate limiting of nodes was left as something to perform outside of the validator client.
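The lagging-node scenario above can be simulated in a few lines; a toy model, not Agave code, where each node is just a map of the transactions it has replayed so far:

```python
class Node:
    def __init__(self, slot: int, txs: dict):
        self.slot = slot  # highest slot this node has replayed
        self.txs = txs    # signature -> status metadata

    def get_transaction(self, sig: str):
        # Returns None if this node has not replayed the tx yet,
        # mirroring the null response from a lagging RPC node.
        return self.txs.get(sig)

# Node A is at the tip; node B is ~50 slots behind and has not
# replayed slot 1000 yet.
node_a = Node(slot=1000, txs={"sig1": {"slot": 1000, "err": None}})
node_b = Node(slot=950, txs={})

sig = "sig1"
# The subscription on node A hands us a finalized signature...
assert node_a.get_transaction(sig) is not None
# ...but querying the lagging node B returns nothing.
assert node_b.get_transaction(sig) is None
```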
@bw-solana We were able to reproduce the issue when both logsSubscribe and getTransaction were sent to the same node.
> I'm trying to understand how there might be a race condition between nodes' chain state
Wondering the same, given that we are hitting the same node, and that too with logsSubscribe using the finalized commitment level, which takes a little longer than the confirmed commitment level.
Makes me think the Transaction Status Service must be backed up. I believe that if you request tx details before TSS has processed the updates, the RPC server will return null or an error.
I still cannot understand why it would necessitate introducing a sleep to fix this issue when it is not two different nodes serving the websocket and HTTP RPC.
> I still cannot understand why it would necessitate introducing a sleep to fix this issue when it is not two different nodes serving the websocket and HTTP RPC.
I agree. The sleep is more of a workaround. We need to figure out what's going on.
My best guess is TSS being slow, but we'll need to confirm.
I have root-caused the issue. The problem is caused by the asynchronous nature of the logsSubscribe notification and the persistence of TransactionStatusMeta by the TransactionStatusService in the ReplayStage.
When a transaction is executed, the transaction logs are collected synchronously and the transaction status batch is sent to the TransactionStatusService, which performs the persistence asynchronously in a separate thread. Then the logsSubscribe event (via CommitmentAggregationData --> NotificationEntry::Bank) is sent. Due to the asynchronous nature of these two processes, it may happen that when getTransaction is called, the TransactionStatusMeta has not been written to the blockstore yet. We have had the same mechanism for a long time, so I do not think this is a regression.
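The race described above can be reproduced in miniature; a toy model that assumes nothing about the real Agave code beyond the ordering just described: the status batch is handed to a separate persistence thread, while the notification is observable immediately.

```python
import threading
import time
import queue

blockstore = {}               # stands in for TransactionStatusMeta storage
status_queue = queue.Queue()  # batch sent to the TransactionStatusService

def transaction_status_service():
    # Persists status metadata asynchronously, like the TSS thread.
    sig, meta = status_queue.get()
    time.sleep(0.05)          # simulated persistence delay
    blockstore[sig] = meta

tss = threading.Thread(target=transaction_status_service)
tss.start()

# Transaction executes: the status batch is queued, and the
# logsSubscribe notification fires immediately afterward.
status_queue.put(("sig1", {"err": None}))
notification_received = True

# getTransaction issued right after the notification: the meta has
# not been written to the blockstore yet.
early = blockstore.get("sig1")  # None - this is the race

tss.join()
late = blockstore.get("sig1")   # persisted by now

print(early, late)
```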
Recommendations:
This might help a little https://github.com/anza-xyz/agave/pull/3026
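Until the ordering changes on the node side, a bounded retry is a more robust client-side workaround than a fixed sleep; a sketch, where `fetch` stands in for the real getTransaction call and `fake_fetch` is a toy that only "persists" the tx after a couple of attempts:

```python
import time

def get_transaction_with_retry(fetch, sig, retries=5, delay=0.1):
    """Retry while the node returns None, i.e. before the
    TransactionStatusService has persisted the status meta."""
    for attempt in range(retries):
        tx = fetch(sig)
        if tx is not None:
            return tx
        time.sleep(delay * (2 ** attempt))  # exponential backoff
    return None

# Toy fetch that only returns the tx from the third call onward,
# simulating the persistence lag.
calls = {"n": 0}
def fake_fetch(sig):
    calls["n"] += 1
    return {"sig": sig} if calls["n"] >= 3 else None

print(get_transaction_with_retry(fake_fetch, "sig1", delay=0.01))
```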
Problem
We are noticing some latency with txn responses on our mainnet nodes running v1.18.23. We subscribe via logSubscribe at the finalized commitment level and then call getTransaction at the confirmed commitment level. Now some of the hashes return a null, and the rate of nulls we witness decreases when we introduce a sleep between logSubscribe and getTransaction. We are trying to understand why this would happen when both the websocket and http calls are being served by the same node.
Proposed Solution