cloudmesh / cloudmesh-queue

job: GPU test example #3

Open laszewsk opened 3 years ago

laszewsk commented 3 years ago

Hey Ketan:

I need, in

https://github.com/cloudmesh/cloudmesh-job/tree/master/tests

a test that uses GPUs in jobs that I run on a local machine. I must be able to specify the maximum number of parallel jobs that can run on the host.

We can set this via the command line, via a cms set variable, or even in the yaml file; the latter would be better. Also, let's assume we have multiple hosts: I must be able to determine how many jobs can run at the same time on a given GPU.
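For illustration, a minimal sketch of what such a yaml section could look like and how it could be read in Python; the key names (gpu, max_jobs_allowed, max_parallel_on_gpu) are hypothetical and not the current cloudmesh-job schema, and PyYAML is assumed to be installed.

# Sketch only: hypothetical yaml keys for limiting parallel GPU jobs per host.
import yaml

config_text = """
cloudmesh:
  jobs:
    max_jobs_allowed: 4          # maximum jobs running at the same time on this host
    hosts:
      localhost:
        gpu: 0                   # which GPU the jobs on this host should use
        max_parallel_on_gpu: 2   # how many jobs may share that GPU concurrently
"""

jobs = yaml.safe_load(config_text)["cloudmesh"]["jobs"]
print(jobs["max_jobs_allowed"], jobs["hosts"]["localhost"]["gpu"])

A command-line option or a cms set value could then simply override what is read from the yaml file.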

I am not yet sure if this is possible. Fugang, do you have feedback on running multiple tasks at the same time on a GPU?

kpimparkar commented 3 years ago

I added a test which runs a dummy function and returns success if the function was run on a GPU. The GPU number can be set by using cms set gpu=N.
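For reference, a minimal sketch of such a check, assuming PyTorch as the GPU library; the actual test in cloudmesh-job may use a different library, and gpu_id here stands in for the value set via cms set gpu=N.

# Sketch only: a pytest-style check that a dummy computation ran on a GPU.
import pytest
import torch

def run_dummy_on_gpu(gpu_id: int) -> bool:
    # Place a small tensor on the requested GPU and confirm where it ended up.
    if not torch.cuda.is_available():
        return False
    device = torch.device(f"cuda:{gpu_id}")
    x = torch.ones(8, device=device) * 2
    return x.device.type == "cuda"

def test_dummy_function_runs_on_gpu():
    gpu_id = 0  # hypothetical stand-in for the cms 'gpu' variable
    if not torch.cuda.is_available():
        pytest.skip("no GPU available on this machine")
    assert run_dummy_on_gpu(gpu_id)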

Note: I couldn't test this new test on romeo yet because, for some reason, I get an 'access denied: user ketanp has no active jobs' error. I am working on running the pytest in Google Colab; the independent code worked fine in Colab.

Thank you!

laszewsk commented 3 years ago

a) Both parameters make sense.
b) Access to romeo is done via allocate; otherwise you cannot get the GPU. I think we have a manual in cloudmesh-iu.

kpimparkar commented 3 years ago

a) Okay, I will start working on those two parameters (gpu and max_jobs_allowed).
b) r-allocate runs successfully for me, but if I try r-install or just r to access romeo via ssh, then I get the 'access denied' error.
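As a sketch of how max_jobs_allowed could bound concurrency on one host (assuming jobs are launched as local processes; this is not the actual cloudmesh-job implementation, and the job list here is made up):

# Sketch only: cap the number of jobs that run at the same time on a host.
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

max_jobs_allowed = 2  # hypothetical value from 'cms set max_jobs_allowed=2' or the yaml file
jobs = [[sys.executable, "-c", f"print('job-{i}')"] for i in range(6)]

def run(cmd):
    # Each job is an external command; the pool caps how many run concurrently.
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

with ThreadPoolExecutor(max_workers=max_jobs_allowed) as pool:
    for line in pool.map(run, jobs):
        print(line)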

laszewsk commented 3 years ago

We have to talk to the sysadmin; maybe you are not in the allowed group?

laszewsk commented 3 years ago

Hmm, we need to debug r-install and r first.

'Access denied' may mean many different things.

We need to try bare-command access by hand, without r-allocate, and then see if we can replicate the error. That is what the sysadmin needs; otherwise he will say our commands may not work.

laszewsk commented 3 years ago

Try the commands in a shell without the aliases. Here are the aliases so you can see how to do it:

alias r-allocate='ssh ${JULIET} "salloc -p romeo --reservation=lijguo_11"'
alias romeo='ssh -t ${JULIET} "ssh ${JHOST}"'
alias r='ssh -t ${JULIET} "ssh ${JHOST}"'
alias j='ssh ${JULIET}'

laszewsk commented 3 years ago

Any progress on the ssh command to log into romeo?

kpimparkar commented 3 years ago

I face the same error with manual execution of these commands:

keTan@DESKTOP-HUC37G2 MINGW64 ~
$ ssh ${juliet} "salloc -p romeo --reservation=lijguo_11"
salloc: Granted job allocation 14865

keTan@DESKTOP-HUC37G2 MINGW64 ~
$ ssh -t ${juliet} "ssh ${JHOST}"
Access denied: user ketanp (uid=13023) has no active jobs.
Authentication failed.
Connection to juliet.futuresystems.org closed.

keTan@DESKTOP-HUC37G2 MINGW64 ~
$ echo ${juliet} ${JHOST}
ketanp@juliet.futuresystems.org r-003

I can ssh into juliet, but I get the same error if I try to ssh into r-003 from juliet:

[ketanp@j-login1 ~]$ ssh r-003
Access denied: user ketanp (uid=13023) has no active jobs.
Authentication failed.

laszewsk commented 3 years ago

Policy at IU was changed: it is no longer just r-003 but also r-004. The cloudmesh-iu scripts are likely hardcoded to r-003; they should be modified to find the reservation dynamically. The interesting part is that I think I had that working before, but I took it out because there was no need for it, and I likely lost the code. That was a long time ago, and who knows, there may be other issues. The lesson here is that we need to integrate dynamically finding info about the allocation.

Here is the help from Allan:

Your job was allocated on node r-004, not r-003. You could also look at environment variable SLURM_NODELIST in terminal 2.

$ scontrol show job 14866
JobId=14866 JobName=bash
   UserId=ketanp(13023) GroupId=users(100)
   Priority=4294887009 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:34:13 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-12-27T15:18:29 EligibleTime=2020-12-27T15:18:29
   StartTime=2020-12-27T15:18:29 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=romeo AllocNode:Sid=j-login1:9896
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r-004 BatchHost=r-004
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0::
   Socks/Node= NtasksPerN:B:S:C=0:0:: CoreSpec=
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=lijguo_11
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null) WorkDir=/N/u/ketanp

laszewsk commented 3 years ago

scontrol show job 14866 | fgrep BatchHost
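A script could use the same idea to find the allocated node dynamically instead of hardcoding r-003; a sketch only, assuming the Slurm job id is known and scontrol is on the path (run on the juliet login node), and not part of cloudmesh-iu:

# Sketch only: parse the BatchHost (allocated node) out of 'scontrol show job <id>'.
import subprocess

def batch_host(jobid: str) -> str:
    out = subprocess.run(["scontrol", "show", "job", jobid],
                         capture_output=True, text=True, check=True).stdout
    for token in out.split():
        if token.startswith("BatchHost="):
            return token.split("=", 1)[1]
    raise ValueError(f"no BatchHost found for job {jobid}")

print(batch_host("14866"))  # would print r-004 for the allocation shown above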