cswinter opened 2 years ago
Got some profiles now. They're a little hard to interpret with all the threads running in parallel, but the gist is that the vast majority of time seems to be spent in thread-synchronization overhead. There are three places that account for almost all of the time: a `Condvar::wait`, and two `mpsc::Receiver` calls.
The condvar is inside Bevy (task_pool.rs:150); not sure exactly what it's doing, maybe awaiting systems to finish running? From the timings, it looks like it happens in parallel with the receive calls and might not actually be on the critical path.
```rust
146 thread_builder
147     .spawn(move || {
148         let shutdown_future = ex.run(shutdown_rx.recv());
149         // Use unwrap_err because we expect a Closed error
150         future::block_on(shutdown_future).unwrap_err(); // 45.6%
151     })
152     .expect("Failed to spawn thread.")
```
The first receive is the agents awaiting actions; the other is the environment awaiting observations. I think the reason awaiting actions takes so much more time than awaiting observations is just an artifact of there being 128 threads running the apps and only 4 threads collecting observations, though the ratios still don't quite match up.
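The two blocking receives form a rendezvous between environment and agent. A minimal std-only sketch of that handshake (`Obs`/`Action` and `run_handshake` are hypothetical stand-ins, not the actual types):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical message types standing in for the real observation/action structs.
struct Obs(u64);
struct Action(u64);

// Run a few env/agent steps over two mpsc channels and return the sum of actions.
fn run_handshake(steps: u64) -> u64 {
    // One channel per direction: env -> agent (observations), agent -> env (actions).
    let (obs_tx, obs_rx) = mpsc::channel::<Obs>();
    let (act_tx, act_rx) = mpsc::channel::<Action>();

    // Agent thread: `recv` parks until the env sends an observation; this blocking
    // handoff is the synchronization overhead that dominates the profiles.
    let agent = thread::spawn(move || {
        while let Ok(Obs(o)) = obs_rx.recv() {
            act_tx.send(Action(o + 1)).unwrap();
        }
    });

    // Environment side: send an observation, then block awaiting the action.
    let mut total = 0;
    for step in 0..steps {
        obs_tx.send(Obs(step)).unwrap();
        let Action(a) = act_rx.recv().unwrap();
        total += a;
    }
    drop(obs_tx); // closing the channel lets the agent thread exit
    agent.join().unwrap();
    total
}

fn main() {
    println!("{}", run_handshake(3)); // actions are 1, 2, 3
}
```

Every step pays for two thread parks/wakes, which is why cutting the number of threads and synchronization calls looks more promising than tuning the channel itself.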
Worth trying a crossbeam channel implementation or something in case it's faster, but it seems like the most promising avenue would be to find a way to cut down on the number of threads and synchronization calls required. Maybe there's some way of running multiple Bevy instances in a single thread to batch cross-thread comms.
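A std-only sketch of what the bounded-channel swap amounts to — `mpsc::sync_channel` is std's bounded channel with the same blocking semantics as crossbeam's `bounded`; the function name is illustrative:

```rust
use std::sync::mpsc;
use std::thread;

// `sync_channel(0)` is a rendezvous channel: every `send` blocks until a
// `recv` is ready, matching the semantics of an unbuffered bounded channel.
fn rendezvous_sum(n: u64) -> u64 {
    let (tx, rx) = mpsc::sync_channel::<u64>(0);
    let producer = thread::spawn(move || {
        for i in 0..n {
            // Blocks here until the consumer is ready -- the same per-message
            // park/wake cost as the unbounded channel, so swapping the channel
            // implementation alone doesn't remove the overhead.
            tx.send(i).unwrap();
        }
        // `tx` dropped here closes the channel and ends the consumer's iterator.
    });
    let sum = rx.iter().sum();
    producer.join().unwrap();
    sum
}

fn main() {
    println!("{}", rendezvous_sum(4)); // sums 0 + 1 + 2 + 3
}
```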
No significant difference from using `crossbeam::channel::bounded`.
For reference, throughput is ~ 32K/s on an Intel i7-10875H.
The `agent` API is not 🚀 blazingly 🚀 fast 🚀 yet. Results from benchmarking both a `low_level` and an `agent`/Bevy implementation of snake with benchmark.py:

Already not bad, but still a 20x gap. 800K steps/s is kind of overkill; I think if we can get to 200K/s or something, this will be sufficient for most practical applications.
End-to-end training throughput with examples/bevy_snake/train.py is now ~6000. It was previously closer to 4000; I think what made the difference was making the `Player` a `NonSend` resource to eliminate the `Mutex` on the `TrainAgent`. This seems to have resulted in a ~4x speedup.

With the `Mutex`:

As a `NonSend` resource without the `Mutex`:

Will get some profiles next to see what parts we can optimize.