Open Simon-Harris-IBM opened 4 years ago
Are you running multiple submissions on one instance? This will become a tiny bit tricky as we don't have a good mechanism of tracking the GPUs in use.
I do have someone working on a prototype that creates a lock file per GPU in use, but it's not an easy solution as it requires asynchronous updates.
We had intended to use 16-core, 2x GPU machines, running 2 concurrent submissions per machine. But if this is going to be tricky, I'm thinking we should use 8-core, 1x GPU machines and just run 1 submission per machine.
Incoming submissions need to be assigned to specific GPUs, which cannot be shared. Setting "gpus=all" on multiple submissions running at the same time causes at least one of the submissions to hang.
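
For illustration, the lock-file-per-GPU idea mentioned above could look roughly like the sketch below. This is not the prototype referred to in the thread; the function names (`acquire_gpu`, `release_gpu`), the lock directory, and the lock file naming are all hypothetical. It relies on `os.open` with `O_CREAT | O_EXCL`, which fails if the file already exists, so two submissions on the same machine cannot claim the same GPU.

```python
import os


def acquire_gpu(gpu_ids, lock_dir):
    """Claim the first free GPU by creating its lock file.

    Returns the claimed GPU id, or None if every GPU is already locked.
    The lock file layout (gpu<N>.lock inside lock_dir) is an assumption.
    """
    os.makedirs(lock_dir, exist_ok=True)
    for gpu_id in gpu_ids:
        lock_path = os.path.join(lock_dir, f"gpu{gpu_id}.lock")
        try:
            # O_EXCL makes creation fail if another submission holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # GPU already in use, try the next one
        with os.fdopen(fd, "w") as lock_file:
            # Record the owner so a stale lock can be investigated by hand.
            lock_file.write(str(os.getpid()))
        return gpu_id
    return None


def release_gpu(gpu_id, lock_dir):
    """Free a GPU by deleting its lock file once the submission finishes."""
    os.remove(os.path.join(lock_dir, f"gpu{gpu_id}.lock"))
```

The returned id would then be used to pin the submission to that one device instead of requesting all GPUs. The sketch does not handle stale locks left behind when a submission crashes before calling `release_gpu`, which is part of what makes this approach harder than it looks.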