flatironinstitute / disBatch

Tool to distribute a list of computational tasks over a pool of compute resources. The pool can grow or shrink.
Apache License 2.0

Add support for GPU env variables #10

Closed dylex closed 5 years ago

dylex commented 5 years ago

This adds GPU support using a generic mechanism for splitting environment-specified resources across tasks.
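To make the idea of splitting an environment-specified resource across tasks concrete, here is a minimal sketch. It is not disBatch's actual implementation: the function names `split_resource` and `task_environment` are hypothetical, and it assumes the resource is a comma-separated list such as the `CUDA_VISIBLE_DEVICES` variable that divides evenly among the task slots on a node.

```python
def split_resource(value, slots, sep=","):
    """Split a delimited resource list evenly across task slots.

    Hypothetical helper: assumes the number of items is a multiple of slots.
    """
    if slots <= 0:
        raise ValueError("slots must be positive")
    items = [item for item in value.split(sep) if item]
    if len(items) % slots != 0:
        raise ValueError(
            "cannot split %d items evenly across %d slots" % (len(items), slots)
        )
    per_slot = len(items) // slots
    return [
        sep.join(items[i * per_slot:(i + 1) * per_slot])
        for i in range(slots)
    ]


def task_environment(base_env, variable, slot_index, slots):
    """Return a copy of base_env in which variable holds only this slot's share."""
    env = dict(base_env)
    if variable in env:
        env[variable] = split_resource(env[variable], slots)[slot_index]
    return env


if __name__ == "__main__":
    # Illustrative values only: four GPUs visible on the node, four task slots.
    base = {"CUDA_VISIBLE_DEVICES": "0,1,2,3"}
    for slot in range(4):
        env = task_environment(base, "CUDA_VISIBLE_DEVICES", slot, 4)
        print(slot, env["CUDA_VISIBLE_DEVICES"])
```

With one GPU per slot, each task sees a single device ID and does not contend with its neighbors on the same node.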

jamesjun commented 5 years ago

I am excited to try it, Dylan. Could you please provide a usage example for GPUs? My guess is that it involves adding the following switches:

    -p gpu --gres=gpu:#

Question: does the # in --gres correspond to the number of GPUs per task or the number of GPUs in total? Is the number of tasks limited by the total number of available GPUs?

Do you think this would work?

    sbatch -n 16 -p ccb --qos ccb -c 5 -p gpu --gres=gpu:16 --exclusive --wrap'; %--ntasks-per-node 5 mybatch.sh

dylex commented 5 years ago

If you want to run on n nodes, with t tasks per node, each using c CPUs and 1 GPU (for a total of t×c CPUs and t GPUs per node, or n×t×c total CPUs and n×t total GPUs), you'd do:

    sbatch -N$n -c$c --ntasks-per-node=$t --gres=gpu:$t -p gpu --wrap 'disBatch.py -g $taskfile'

Do not specify --exclusive.
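The arithmetic in this recipe can be checked with a short sketch. The helper names `gpu_job_totals` and `sbatch_command` are hypothetical and not part of disBatch or Slurm. The sketch only assembles the command string from the comment above for given n, t and c, and it substitutes the task file path directly rather than relying on a `$taskfile` shell variable.

```python
import shlex


def gpu_job_totals(nodes, tasks_per_node, cpus_per_task):
    """Resource totals for n nodes, t tasks per node, c CPUs and 1 GPU per task."""
    return {
        "cpus_per_node": tasks_per_node * cpus_per_task,
        "gpus_per_node": tasks_per_node,
        "total_cpus": nodes * tasks_per_node * cpus_per_task,
        "total_gpus": nodes * tasks_per_node,
    }


def sbatch_command(nodes, tasks_per_node, cpus_per_task, taskfile, partition="gpu"):
    """Build the sbatch command line from the recipe as a single string."""
    wrapped = "disBatch.py -g " + shlex.quote(taskfile)
    args = [
        "sbatch",
        "-N%d" % nodes,
        "-c%d" % cpus_per_task,
        "--ntasks-per-node=%d" % tasks_per_node,
        # One GPU per task, so the per-node GPU request equals tasks per node.
        "--gres=gpu:%d" % tasks_per_node,
        "-p", partition,
        "--wrap", wrapped,
    ]
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Illustrative values only: 2 nodes, 4 tasks per node, 5 CPUs per task.
    print(gpu_job_totals(2, 4, 5))
    print(sbatch_command(2, 4, 5, "tasks.txt"))
```

Because `--gres` is a per-node request, it is set to t rather than to the total number of GPUs across the job.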

jamesjun commented 5 years ago

Excellent. Can I try it now or should I wait until Nick completes the review? Would it be okay to unload 'disBatch' module (v1.3) and add path to your version of disBatch.py?

dylex commented 5 years ago

If you'd like. Probably better not to use my version directly, in case I change things, but you can certainly clone this repo and run from there.

jamesjun commented 5 years ago

I get this error:

    sbatch: error: Batch job submission failed: Requested node configuration is not available

when I run the command below:

    sbatch -N16 -c1 --ntasks-per-node=5 --gres=gpu:5 -p gpu --wrap 'disBatch.py -g /mnt/ceph/users/jjun/groundtruth_irc/bionet/bionet_static/irc_v4.2.6.disbatch'

I tried to run setup.py install after cloning disBatch on the cluster, but it gave me the permission error below:

    jjun@workergpu05:disBatch$ python setup.py install
    running install
    running build
    running build_scripts
    creating build
    creating build/scripts-2.7
    copying and adjusting disBatch.py -> build/scripts-2.7
    changing mode of build/scripts-2.7/disBatch.py from 664 to 775
    running install_scripts
    copying build/scripts-2.7/disBatch.py -> /usr/bin
    error: /usr/bin/disBatch.py: Permission denied

I made sure disBatch.py is called from the GitHub clone:

    jjun@workergpu05:src$ which disBatch.py
    ~/src/disBatch/disBatch.py

Any suggestion would be appreciated.

dylex commented 5 years ago

For install: the default install for Python packages requires root. You probably want --user (python setup.py install --user), or just run directly out of the clone.

This is now specific to our cluster, so we should probably take it offline, but FI doesn't have 16 nodes with 5 GPUs each, which is why the request was rejected. See the cluster docs.
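The reasoning behind the "Requested node configuration is not available" error can be sketched as a simple count: a request for N nodes with g GPUs each can only be satisfied if at least N nodes have g or more GPUs. The inventory below is entirely made up for illustration and does not describe the FI cluster. The function names are hypothetical, and Slurm's real scheduling also considers CPUs, memory, partitions and limits.

```python
def nodes_satisfying(inventory, gpus_per_node):
    """Count nodes that have at least gpus_per_node GPUs.

    inventory maps "GPUs on a node" to "number of such nodes".
    """
    return sum(
        count for gpus, count in inventory.items() if gpus >= gpus_per_node
    )


def request_is_feasible(inventory, nodes, gpus_per_node):
    """True if enough nodes exist that could each supply gpus_per_node GPUs."""
    return nodes_satisfying(inventory, gpus_per_node) >= nodes


if __name__ == "__main__":
    # Hypothetical inventory: 20 nodes with 2 GPUs and 10 nodes with 4 GPUs.
    hypothetical_inventory = {2: 20, 4: 10}
    print(request_is_feasible(hypothetical_inventory, 16, 5))
    print(request_is_feasible(hypothetical_inventory, 16, 2))
```

Under this made-up inventory, lowering the per-node GPU count (and with it `--ntasks-per-node`) or the node count is what makes a request fit.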

njcarriero commented 5 years ago

DS: Thanks for the modified updates.