Closed by jngrad 1 month ago
Weak scaling:
mpiexec -n 1 ./pypresso ../maintainer/benchmarks/lb.py --particles_per_core 1000 --lb_sites_per_particle 64
CPU: 9.4ms/loop
GPU before PR: 4.5ms/loop
GPU after PR: 4.5ms/loop
mpiexec -n 2 ./pypresso ../maintainer/benchmarks/lb.py --particles_per_core 1000 --lb_sites_per_particle 64
CPU: 14.9ms/loop
GPU before PR: 63.6ms/loop
GPU after PR: 13.6ms/loop
mpiexec -n 4 ./pypresso ../maintainer/benchmarks/lb.py --particles_per_core 1000 --lb_sites_per_particle 64
CPU: 17.2ms/loop
GPU before PR: 138.7ms/loop
GPU after PR: 22.2ms/loop
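The timings above can be condensed into two ratios: the speedup of the PR over the previous GPU code, and the weak-scaling efficiency. Since ideal weak scaling keeps the time per loop constant as ranks are added, the efficiency is taken here as t(1 rank) / t(N ranks); this metric is a common convention, not something defined in this PR. The numbers are copied from the runs listed above.

```python
# Time per integration loop in ms, copied from the weak-scaling runs above.
# ranks: (CPU, GPU before PR, GPU after PR)
timings = {
    1: (9.4, 4.5, 4.5),
    2: (14.9, 63.6, 13.6),
    4: (17.2, 138.7, 22.2),
}


def speedup(before, after):
    """Ratio of old to new time per loop; > 1 means the new code is faster."""
    return before / after


def weak_scaling_efficiency(t1, tn):
    """Ideal weak scaling keeps the time per loop constant, so efficiency is t1 / tn."""
    return t1 / tn


cpu_1, _, gpu_after_1 = timings[1]
for ranks, (cpu, gpu_before, gpu_after) in sorted(timings.items()):
    print(
        f"{ranks} rank(s): PR speedup {speedup(gpu_before, gpu_after):.2f}x, "
        f"GPU efficiency {weak_scaling_efficiency(gpu_after_1, gpu_after):.0%}, "
        f"CPU efficiency {weak_scaling_efficiency(cpu_1, cpu):.0%}"
    )
```

With these inputs the PR speedup is 4.68x on 2 ranks and 6.25x on 4 ranks, while the GPU weak-scaling efficiency after the PR (33% and 20%) remains below the CPU one (63% and 55%).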
Performance is unchanged on 1 MPI rank because GPUPackInfo already implements a bufferless device-to-device copy when the sending and receiving blocks belong to the same MPI rank.
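The same-rank shortcut can be illustrated with a toy model of a ghost-layer exchange between two neighboring blocks. This is a plain-Python sketch under stated assumptions, not waLBerla's GPUPackInfo API: `Block`, `exchange_ghost_layer` and `send_buffers` are hypothetical names, and Python lists stand in for device memory. On a single rank every neighbor is local, so only the direct-copy branch is ever taken, which is consistent with the 1-rank timing being unaffected by the PR.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical stand-in for a lattice block owned by one MPI rank."""
    rank: int                                   # MPI rank owning the block
    boundary: list = field(default_factory=list)  # cells sent to the neighbor
    ghost: list = field(default_factory=list)     # cells received from the neighbor


def exchange_ghost_layer(sender, receiver, send_buffers):
    """Toy halo exchange: direct copy on the same rank, buffered otherwise."""
    if sender.rank == receiver.rank:
        # Same rank: copy boundary cells straight into the ghost layer,
        # no intermediate buffer (models the device-to-device copy).
        receiver.ghost[:] = sender.boundary
        return "direct"
    # Different ranks: pack boundary cells into a per-destination buffer
    # that would then be handed to MPI.
    send_buffers.setdefault(receiver.rank, []).extend(sender.boundary)
    return "buffered"


buffers = {}
a = Block(rank=0, boundary=[1.0, 2.0])
b = Block(rank=0, ghost=[0.0, 0.0])
c = Block(rank=1, ghost=[0.0, 0.0])
print(exchange_ghost_layer(a, b, buffers), b.ghost)   # same rank
print(exchange_ghost_layer(a, c, buffers), buffers)   # remote rank
```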
Description of changes: