cswinter opened 2 years ago
Got some profiles now. They're a little hard to interpret with all the threads running in parallel, but the gist is that the vast majority of time seems to be spent in thread-synchronization overhead. There are three places that account for almost all of the time: a `Condvar::wait`, and two `mpsc::Receiver` calls.
The condvar is inside Bevy (task_pool.rs:150); not sure exactly what it's doing, maybe awaiting systems to finish running? From the timings, it looks like it happens in parallel with the receive calls and might not actually be on the critical path.
```rust
146 thread_builder
147     .spawn(move || {
148         let shutdown_future = ex.run(shutdown_rx.recv());
149         // Use unwrap_err because we expect a Closed error
150         future::block_on(shutdown_future).unwrap_err(); // 45.6%
151     })
152     .expect("Failed to spawn thread.")
```
The first receive is the agents awaiting actions; the other is the environment awaiting observations. I think the reason awaiting actions takes so much more time than awaiting observations is just an artifact of there being 128 threads running the apps and only 4 threads collecting observations, though the ratios still don't quite match up.
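The two blocking receives form a rendezvous between environment and agent. A minimal std-only sketch of that handshake (`Obs`/`Action` and `run_handshake` are hypothetical stand-ins, not the actual types):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical message types standing in for the real observation/action structs.
struct Obs(u64);
struct Action(u64);

// Run a few env/agent steps over two mpsc channels and return the sum of actions.
fn run_handshake(steps: u64) -> u64 {
    // One channel per direction: env -> agent (observations), agent -> env (actions).
    let (obs_tx, obs_rx) = mpsc::channel::<Obs>();
    let (act_tx, act_rx) = mpsc::channel::<Action>();

    // Agent thread: `recv` parks until the env sends an observation; this blocking
    // handoff is the synchronization overhead that dominates the profiles.
    let agent = thread::spawn(move || {
        while let Ok(Obs(o)) = obs_rx.recv() {
            act_tx.send(Action(o + 1)).unwrap();
        }
    });

    // Environment side: send an observation, then block awaiting the action.
    let mut total = 0;
    for step in 0..steps {
        obs_tx.send(Obs(step)).unwrap();
        let Action(a) = act_rx.recv().unwrap();
        total += a;
    }
    drop(obs_tx); // closing the channel lets the agent thread exit
    agent.join().unwrap();
    total
}

fn main() {
    println!("{}", run_handshake(3)); // actions are 1, 2, 3
}
```

Every step pays for two thread parks/wakes, which is why cutting the number of threads and synchronization calls looks more promising than tuning the channel itself.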
Worth trying a crossbeam channel implementation or something in case it's faster, but it seems like the most promising avenue would be to find a way to cut down on the number of threads and synchronization calls required. Maybe there's some way of running multiple Bevy instances in a single thread to batch cross-thread comms.
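A std-only sketch of what the bounded-channel swap amounts to — `mpsc::sync_channel` is std's bounded channel with the same blocking semantics as crossbeam's `bounded`; the function name is illustrative:

```rust
use std::sync::mpsc;
use std::thread;

// `sync_channel(0)` is a rendezvous channel: every `send` blocks until a
// `recv` is ready, matching the semantics of an unbuffered bounded channel.
fn rendezvous_sum(n: u64) -> u64 {
    let (tx, rx) = mpsc::sync_channel::<u64>(0);
    let producer = thread::spawn(move || {
        for i in 0..n {
            // Blocks here until the consumer is ready -- the same per-message
            // park/wake cost as the unbounded channel, so swapping the channel
            // implementation alone doesn't remove the overhead.
            tx.send(i).unwrap();
        }
        // `tx` dropped here closes the channel and ends the consumer's iterator.
    });
    let sum = rx.iter().sum();
    producer.join().unwrap();
    sum
}

fn main() {
    println!("{}", rendezvous_sum(4)); // sums 0 + 1 + 2 + 3
}
```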
No significant difference from using `crossbeam::channel::bounded`.
For reference, throughput is ~ 32K/s on an Intel i7-10875H.
The `agent` API is not 🚀 blazingly 🚀 fast 🚀 yet. Results from benchmarking both a `low_level` and an `agent`/Bevy implementation of snake with benchmark.py:

Already not bad, but still a 20x gap. 800K steps/s is kind of overkill; I think if we can get to 200K/s or something, this will be sufficient for most practical applications.
End-to-end training throughput with examples/bevy_snake/train.py is now ~6000. It was previously closer to 4000; I think what made the difference was making the `Player` a `NonSend` resource to eliminate the `Mutex` on the `TrainAgent`. This seems to have resulted in a ~4x speedup.

With the `Mutex`:

As a `NonSend` resource without the `Mutex`:

Will get some profiles next to see what parts we can optimize.