laszewsk opened this issue 3 years ago
I added a test which runs a dummy function and returns success if the function was run on a GPU. The GPU number can be set by using `cms set gpu=N`. In `cms job run`, we need to modify the job to accept the gpu as an argument. If this approach is OK, we can modify the .yaml used by `job` to store the gpu number for each configured host.
There is a `job_counter` for each host in the .yaml of `job`. We can add another parameter, `max_jobs_allowed`, and validation can be added in `cms job run` to prevent the execution of jobs exceeding `max_jobs_allowed`. Please let me know if this is an acceptable approach.
Note: I couldn't test this new test on romeo yet, as for some reason I get an 'access denied: user ketanp has no active jobs' error. I am working on running the pytest in Google Colab; the independent code worked fine in Colab.
Thank you!
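For reference, here is a minimal sketch of what the per-host entries and the validation could look like. The key names `gpu`, `job_counter`, and `max_jobs_allowed` follow the discussion above, but the exact schema of the cloudmesh-job yaml and the `pick_host` helper are assumptions, not the actual implementation (assumes PyYAML is installed):

```python
# Hypothetical sketch only: key names follow the discussion above, the real
# schema of the cloudmesh-job configuration file may differ.
import yaml

config = yaml.safe_load("""
hosts:
  - name: localhost
    ip: 127.0.0.1
    gpu: 0                 # default GPU to use on this host
    max_jobs_allowed: 2    # upper bound on concurrently running jobs
    job_counter: 0         # currently running jobs, maintained by cms job run
  - name: romeo
    ip: r-003
    gpu: 1
    max_jobs_allowed: 4
    job_counter: 4
""")

def pick_host(hosts):
    """Return the first host whose job_counter is below max_jobs_allowed."""
    for host in hosts:
        if host["job_counter"] < host["max_jobs_allowed"]:
            return host
    raise RuntimeError("all hosts are at their max_jobs_allowed limit")

host = pick_host(config["hosts"])
print(f"run job on {host['ip']} with gpu={host['gpu']}")
```

The idea is that the check happens at host-selection time, so a job is never dispatched to a host that is already at its limit.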
a) Both parameters make sense. b) Access to romeo is done via allocate, otherwise you cannot get the GPU. I think we have some manual in cloudmesh-iu.
a) okay, I will start working on those two parameters (gpu and max_jobs_allowed). b) r-allocate runs successfully for me, but if I try r-install or just r to access romeo via ssh, then I get the 'access denied' error.
We have to talk to the sysadmin; maybe you are not in the allowed group?
Hmm, we need to debug r-install and r first.
Access denied may mean many different things.
We need to do bare command access without r-allocate, running everything by hand; then we can see if we can replicate the error. This is what the sysadmin needs, otherwise he will say our commands may not work.
Try the commands in a shell without the aliases. Here are the aliases so you can see how to do it:
alias r-allocate='ssh ${JULIET} "salloc -p romeo --reservation=lijguo_11"'
alias romeo='ssh -t ${JULIET} "ssh ${JHOST}"'
alias r='ssh -t ${JULIET} "ssh ${JHOST}"'
alias j='ssh ${JULIET}'
any progress on the ssh command to log into romeo?
I face the same error with manual execution of these commands:
keTan@DESKTOP-HUC37G2 MINGW64 ~
$ ssh ${juliet} "salloc -p romeo --reservation=lijguo_11"
salloc: Granted job allocation 14865
keTan@DESKTOP-HUC37G2 MINGW64 ~
$ ssh -t ${juliet} "ssh ${JHOST}"
Access denied: user ketanp (uid=13023) has no active jobs.
Authentication failed.
Connection to juliet.futuresystems.org closed.
keTan@DESKTOP-HUC37G2 MINGW64 ~
$ echo ${juliet} ${JHOST}
ketanp@juliet.futuresystems.org r-003
I can ssh into juliet, but I get the same error if I try to ssh into r-003 from juliet:
[ketanp@j-login1 ~]$ ssh r-003
Access denied: user ketanp (uid=13023) has no active jobs.
Authentication failed.
`max_jobs_allowed` is added at the host level. Validation is modified to compare `job_counter` with `max_jobs_allowed` while selecting a host IP to run the job. Documentation is updated. `gpu` is already available at the job level, which will allow us to choose the GPU instance for each job. I will work on testing this code and adding a pytest for the same.
The policy at IU was changed: it is no longer just r-003 but also r-004. The cloudmesh-iu scripts are likely hardcoded to 003; they should be modified to allow dynamic finding of the reservation. The interesting part is that I think I had that working before, but took it away as there was no need for it, and likely lost the code. That was such a long time ago, and who knows, there may be other issues. The lesson here is that we need to integrate dynamically finding info about the allocation.
Here is the help from Allan:
Your job was allocated on node r-004, not r-003. You could also look at environment variable SLURM_NODELIST in terminal 2.
$ scontrol show job 14866
JobId=14866 JobName=bash
   UserId=ketanp(13023) GroupId=users(100)
   Priority=4294887009 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:34:13 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-12-27T15:18:29 EligibleTime=2020-12-27T15:18:29
   StartTime=2020-12-27T15:18:29 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=romeo AllocNode:Sid=j-login1:9896
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r-004 BatchHost=r-004
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0::
   Socks/Node= NtasksPerN:B:S:C=0:0:: CoreSpec=
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=lijguo_11
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null) WorkDir=/N/u/ketanp
scontrol show job 14866 | fgrep BatchHost
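Building on Allan's hint, one way the allocation lookup could be done programmatically instead of hardcoding r-003 is sketched below. The `get_allocated_node` helper is hypothetical; it only assumes that `squeue` is on the path of the login node and that SLURM exports `SLURM_NODELIST` inside an interactive allocation, both of which the output above suggests:

```python
# Hypothetical helper: ask SLURM which node was actually granted
# (r-003, r-004, ...) instead of hardcoding it in the aliases.
import os
import subprocess

def get_allocated_node(user=None):
    """Return the node list of the user's running allocation, e.g. 'r-004'."""
    user = user or os.environ["USER"]
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%N", "-t", "RUNNING"],
        capture_output=True, text=True, check=True,
    )
    nodes = out.stdout.split()
    if not nodes:
        raise RuntimeError(f"no active allocation found for user {user}")
    return nodes[0]

# inside an interactive allocation SLURM exposes the same information directly
node = os.environ.get("SLURM_NODELIST") or get_allocated_node()
print(f"ssh {node}")
```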
Hey Ketan:
I need, in
https://github.com/cloudmesh/cloudmesh-job/tree/master/tests
a test that uses GPUs in jobs that I run on a local machine. I must be able to specify the maximum number of parallel jobs that can run on the host.
We can set this via the command line, via a cms set variable, or even in the yaml file; the latter would be better. Also, let's assume we have multiple hosts: I must be able to determine how many jobs can run at the same time on a given GPU.
I am not yet sure if this is possible. Fugang, do you have feedback on running multiple tasks at the same time on a GPU?
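As a starting point, a minimal sketch of what such a dummy GPU test could look like is below. It is an assumption, not the actual test in the repository: it checks GPU visibility via `nvidia-smi` rather than the cloudmesh-job API, and the test name and `ran_on_gpu` helper are hypothetical:

```python
# Hypothetical pytest sketch: run a dummy check and report success only if
# the requested GPU is visible on this host (queried via nvidia-smi).
import shutil
import subprocess
import pytest

def ran_on_gpu(gpu: int = 0) -> bool:
    """Return True if the requested GPU index is visible on this host."""
    if shutil.which("nvidia-smi") is None:
        return False
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    indices = [line.strip() for line in out.stdout.splitlines() if line.strip()]
    return str(gpu) in indices

@pytest.mark.skipif(shutil.which("nvidia-smi") is None, reason="no GPU on this host")
def test_dummy_gpu_job():
    # in the real test the GPU number would come from `cms set gpu=N`
    assert ran_on_gpu(gpu=0)
```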