PoH - Reserve Space - Githubissues

apfitzge commented 5 months ago

Problem

In transaction processing, one of the main bottlenecks for simple transactions (such as in bench-tps) is recording transactions into PoH.
The reason this is slow is because of the communication pattern between execution and poh threads:
- execution thread sends message to poh thread
- poh thread, occasionally checks channel for messages from execution threads
- records any messages from execution threads in order and sends messages back
The slowness mainly comes from the fact that the poh thread does not constantly check the channel from execution threads. It only does so at specific intervals between hash batches. By default the poh thread will do 64 hashes between each checking of the channel (only hashing if channel is empty).

Assuming cluster average hash-rate of 10M hashes/sec. This means the interval of checking the channel is around 6.4us.

Proposed Solution

Create a mechanism for execution threads to "reserve" poh space.
Reserve should not fail if there is still time left in the block, poh wise.
Before the block ends, poh thread will "flip a bit" to prevent any further reservations, and then record all messages before the last tick is inserted.
This will allow execution threads to more quickly move on, since they do not need to wait for the actual recording into poh to occur - just the "reserve" operation.

Need to compare alternate implementations:

Atomic using CAS for reserve mechanism
Mutex for reserve mechanism

ryoqun commented 5 months ago

oh, i think this can be solved more elegantly just with https://github.com/crossbeam-rs/crossbeam/pull/959 (cc @behzadnouri; he also came up with the similar idea: https://github.com/crossbeam-rs/crossbeam/issues/861)

the idea is that we can guarantee block insertion just just from the simple fact of succeeding in (one-sided) sending to the poh thread. so, no need to wait replies from the poh thread.

that said, that upstream pr is kind of stale. we can patch the crossbeam-channel or just wrap the crossbeam channel with mutex (ugh).

apfitzge commented 5 months ago

the idea is that we can guarantee block insertion just just from the simple fact of succeeding in (one-sided) sending to the poh thread. so, no need to wait replies from the poh thread.

Exactly, the whole point of this issue is to have this guarantee. If we can achieve that in another than suggested, I'm happy.

I'm not sure I understand the suggestion w/ crossbeam though. At least currently we use the same channel to send to poh thread throughout the lifetime; do you mean use the disconnect approach, and need to swap out channels for every new slot?

ryoqun commented 5 months ago

do you mean use the disconnect approach, and need to swap out channels for every new slot?

exactly. so that senders can be notified about end of slots.

apfitzge commented 5 months ago

Yeah that seems like an approach that'd work. I think Mutex-wrapping the channels is not an option. Is there any expected timeline on those upstream PRs? they seem to have been there for quite a long time.

ryoqun commented 5 months ago

I think Mutex-wrapping the channels is not an option.

Well, I think we need some bench to preclude that. Now that Mutex is futex-based, it's not so different from having another crossbeam channel. I'm not saying this is a permanent solution, though. Also, come to think of it, RwLock could be used to mitigate contentions among the external (i.e. tx-sending) threads, because it's guaranteed that there's the single receiver.

Is there any expected timeline on those upstream PRs? they seem to have been there for quite a long time.

Sadly, none. Seems the maintainer is too busy... I'm replying as fast as i can.

At this moment, i think patching or forking makes sense.

So, ultimately, which solution (reserve, Mutex, or disconnect) to be taken could largely depends on the enginener's preference, i guess. Fortunately, this isn't consensus change and the change is quite localized. so, relatively easy to switch the solutions.

apfitzge commented 5 months ago

Also, come to think of it, RwLock could be used to mitigate contentions among the external (i.e. tx-sending) threads, because it's guaranteed that there's the single receiver.

My main contention with mutex was blocking of execution threads by other execution threads. However, even with using a RwLock the execution threads will be blocked by the recording thread.

I guess I'm also not quite sure how this approach would work. In crossbeam the Sender and Receiver are separate objects/types - a lock around the Receiver would not affect the Sender afaik - unless we do some sort of Channel wrapper that holds the lock around both.

ryoqun commented 5 months ago

However, even with using a RwLock the execution threads will be blocked by the recording thread.

recording thread will only write-lock at the end of slots. But as said below, use of RwLock is awkward.

I guess I'm also not quite sure how this approach would work. In crossbeam the Sender and Receiver are separate objects/types

ohh, true.

a lock around the Receiver would not affect the Sender afaik

this is true as well

unless we do some sort of Channel wrapper that holds the lock around both.

Yeah, that works. but it's odd even more. Instead, I think we need is_reached_max_height: Arc<RwLock<bool>> besides senders/receivers. i.e. an indirect way to create a critical section by accompanying this synchronization along with them at the cost of losing type safety to some extent. this is like insert_shreds_lock: Mutex<()> in solana-ledger.

ryoqun commented 4 months ago

Instead, I think we need is_reached_max_height: Arc<RwLock<bool>> besides senders/receivers. i.e. an indirect way to create a critical section by accompanying this synchronization along with them

I created such a wrapper: https://gist.github.com/ryoqun/80adbd49c1658812b47fcf35bacb058f

ryoqun commented 4 months ago

further update: I created revamped pr upstream: https://github.com/crossbeam-rs/crossbeam/pull/1114

apfitzge commented 3 months ago

Reviewing this, I think we still need/want some reserve method instead. We currently return the transaction index w/ the result of recording which the worker threads use for something(?).

I'm imaginging we could do something like having a Mutex<Option<usize>> or AtomicI64. Use lock or CAS to get the current index for the batch we're recording, if it's None or negative then we consider the recording to fail.

alessandrod commented 2 weeks ago

The slowness mainly comes from the fact that the poh thread does not constantly check the channel from execution threads. It only does so at specific intervals between hash batches. By default the poh thread will do 64 hashes between each checking of the channel (only hashing if channel is empty).

I think the slowness comes from the fact that when the poh thread sends results back, the banking threads are always sleeping so it needs to syscall to wake them up. Ban syscalls.

apfitzge commented 2 weeks ago

I think the slowness comes from the fact that when the poh thread sends results back, the banking threads are always sleeping so it needs to syscall to wake them up. Ban syscalls.

Yeah. those threads are always sleeping. Communication should be "largely" one way. Workers send to stuff to PoH, but get back immediately as part of send process the info it needs to continue. No messages from PoH, just something to flag the end of slot so banks dont' record past end of slot.

alessandrod commented 2 weeks ago

This sounds like a fixed capacity ringbuffer with head and tail that can advance independently without contention

apfitzge commented 2 weeks ago

This sounds like a fixed capacity ringbuffer with head and tail that can advance independently without contention

yeah ideally we could have some sort of ring-buffer, large enough we expect to never fail to insert. still need a way to communicate "slot is over" to workers, as well as the transaction_index stuff that's used for something.

anza-xyz / agave

PoH - Reserve Space #1227

Problem

Proposed Solution