Implement async io

The main improvement in this PR comes from a new AsyncIOComm class, which uses extra MPI ranks to interleave IO read/write operations between frames while extracting a whole exposure. A few other changes came along for the ride: improvements to other GPU functions, and a refactoring of the spex command line program into a module function.
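To illustrate the general idea of interleaving IO with compute across frames, here is a minimal sketch. It is not the AsyncIOComm implementation: it swaps in Python threads and queues for the extra MPI ranks, and `run_pipeline`, `read`, `compute`, and `write` are hypothetical names chosen for this example.

```python
import queue
import threading

_DONE = object()  # unique sentinel marking the end of the frame stream


def run_pipeline(frames, read, compute, write, depth=2):
    """Overlap read, compute, and write across frames.

    One helper thread reads ahead, one helper thread writes behind, and the
    calling thread does the compute step. Threads stand in for the extra MPI
    ranks here; this is an illustration, not the gpu_specter code.
    No error handling: an exception in a helper thread would stall the loop.
    """
    to_compute = queue.Queue(maxsize=depth)  # bounded read-ahead buffer
    to_write = queue.Queue(maxsize=depth)    # bounded write-behind buffer
    written = []

    def reader():
        for frame in frames:
            to_compute.put(read(frame))      # blocks when the buffer is full
        to_compute.put(_DONE)

    def writer():
        while True:
            item = to_write.get()
            if item is _DONE:
                break
            written.append(write(item))

    helpers = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in helpers:
        t.start()

    # Compute loop: while frame i is processed here, frame i+1 can already be
    # read and frame i-1 can still be written by the helper threads.
    while True:
        data = to_compute.get()
        if data is _DONE:
            break
        to_write.put(compute(data))
    to_write.put(_DONE)

    for t in helpers:
        t.join()
    return written
```

With a single reader and a single writer the frames come out in the order they went in, so `run_pipeline(range(3), read_fn, compute_fn, write_fn)` returns one result per frame, in frame order.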

The tables below show before and after results for the 30-frame exposure extract script on a single node with 4 GPUs, using 2 MPI ranks per GPU on corigpu and 5 MPI ranks per GPU on dgx.

Before:

system	io	elapsed time (sec)	FPNH	FPGH
corigpu	sync	422.5	255.64	63.91
dgx	sync	244.4	441.89	110.47

This PR:

system	io	elapsed time (sec)	FPNH	FPGH	Improvement
corigpu	sync	362.9	297.57	74.39	1.16x
corigpu	async	329.4	327.86	81.97	1.28x
dgx	sync	222.0	486.39	121.60	1.10x
dgx	async	170.0	635.14	158.78	1.44x
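
For reference, the derived columns can be approximately reproduced from the elapsed times. This sketch assumes FPNH means frames per node-hour and FPGH means frames per GPU-hour (an interpretation, not stated above), with 30 frames and 4 GPUs per node; `throughput` and `speedup` are hypothetical helper names. Under that assumption the computed throughput agrees with the tabulated values to within a fraction of a percent, presumably because the elapsed times shown are rounded.

```python
def throughput(elapsed_sec, n_frames=30, gpus_per_node=4):
    """Return (frames per node-hour, frames per GPU-hour) for a one-node run.

    Assumes FPNH = frames per node-hour and FPGH = FPNH / GPUs per node.
    """
    fpnh = n_frames * 3600.0 / elapsed_sec
    return fpnh, fpnh / gpus_per_node


def speedup(before_sec, after_sec):
    """Improvement factor: baseline elapsed time over new elapsed time."""
    return before_sec / after_sec


# Elapsed times (sec) taken from the tables above.
baseline = {"corigpu": 422.5, "dgx": 244.4}
this_pr = {
    ("corigpu", "sync"): 362.9,
    ("corigpu", "async"): 329.4,
    ("dgx", "sync"): 222.0,
    ("dgx", "async"): 170.0,
}

for (system, io), elapsed in this_pr.items():
    fpnh, fpgh = throughput(elapsed)
    factor = speedup(baseline[system], elapsed)
    print(f"{system}\t{io}\t{elapsed}\t{fpnh:.2f}\t{fpgh:.2f}\t{factor:.2f}x")
```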

Cori GPU commands (sync, then async with 2 extra MPI ranks):

time srun -n 8 -c 2 --cpu-bind=cores mps-wrapper desi-extract-exposure ${INDIR} ${JOBOUTDIR} $(date +%s) --night ${NIGHT} --expid ${EXPID} --gpu
time srun -n 10 -c 2 --cpu-bind=cores mps-wrapper desi-extract-exposure ${INDIR} ${JOBOUTDIR} $(date +%s) --night ${NIGHT} --expid ${EXPID} --gpu --async-io

DGX commands (sync, then async with 2 extra MPI ranks):

time srun -n 20 -c 2 --cpu-bind=cores mps-wrapper desi-extract-exposure ${INDIR} ${JOBOUTDIR} $(date +%s) --night ${NIGHT} --expid ${EXPID} --gpu
time srun -n 22 -c 2 --cpu-bind=cores mps-wrapper desi-extract-exposure ${INDIR} ${JOBOUTDIR} $(date +%s) --night ${NIGHT} --expid ${EXPID} --gpu --async-io
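
The `-n` values in the commands above follow from the rank layout: GPUs per node times MPI ranks per GPU, plus 2 extra ranks when `--async-io` is used. A small sketch of that arithmetic, where `total_ranks` is a hypothetical helper and the count of 2 extra ranks is read off the commands rather than from the code:

```python
def total_ranks(n_gpus, ranks_per_gpu, extra_io_ranks=0):
    """Number of MPI ranks to request with srun -n.

    extra_io_ranks is the number of additional ranks used for async IO;
    the commands above use 2 of them (10 vs 8 on corigpu, 22 vs 20 on dgx).
    """
    return n_gpus * ranks_per_gpu + extra_io_ranks


print(total_ranks(4, 2))                    # corigpu, sync
print(total_ranks(4, 2, extra_io_ranks=2))  # corigpu, async
print(total_ranks(4, 5))                    # dgx, sync
print(total_ranks(4, 5, extra_io_ranks=2))  # dgx, async
```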

Unit tests are passing:

(gpu-specter-dev) dmargala@cgpu15:gpu_specter> srun -n 1 -c 2 --cpu-bind=cores python -m unittest gpu_specter.test.test_suite
.......................
----------------------------------------------------------------------
Ran 23 tests in 34.212s

OK

desihub / gpu_specter

Implement async io #56