cloudmesh / cloudmesh-queue

job: GPU test example #3

Open laszewsk opened 3 years ago

laszewsk commented 3 years ago

Hey Ketan:

I need, in

https://github.com/cloudmesh/cloudmesh-job/tree/master/tests

a test that uses GPUs in jobs that I run on a local machine. I must be able to specify the maximum number of parallel jobs that can run on the host.

We can set this via the command line, via a cms set variable, or even in the yaml file; the latter would be better. Also, let's assume we have multiple hosts: I must be able to determine how many jobs can run at the same time on a given GPU.
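For illustration, a minimal sketch of what such a yaml section could look like and how it could be read in Python; the key names (gpu, max_jobs_allowed, max_parallel_on_gpu) are hypothetical and not the current cloudmesh-job schema, and PyYAML is assumed to be installed.

# Sketch only: hypothetical yaml keys for limiting parallel GPU jobs per host.
import yaml

config_text = """
cloudmesh:
  jobs:
    max_jobs_allowed: 4          # maximum jobs running at the same time on this host
    hosts:
      localhost:
        gpu: 0                   # which GPU the jobs on this host should use
        max_parallel_on_gpu: 2   # how many jobs may share that GPU concurrently
"""

jobs = yaml.safe_load(config_text)["cloudmesh"]["jobs"]
print(jobs["max_jobs_allowed"], jobs["hosts"]["localhost"]["gpu"])

A command-line option or a cms set value could then simply override what is read from the yaml file.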

I am not yet sure if this is possible. Fugang, do you have feedback on running multiple tasks at the same time on a GPU?

kpimparkar commented 3 years ago

I added a test which runs a dummy function and returns success if the function was run on a GPU. The GPU number can be set by using cms set gpu=N.
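For reference, a minimal sketch of such a check, assuming PyTorch as the GPU library; the actual test in cloudmesh-job may use a different library, and gpu_id here stands in for the value set via cms set gpu=N.

# Sketch only: a pytest-style check that a dummy computation ran on a GPU.
import pytest
import torch

def run_dummy_on_gpu(gpu_id: int) -> bool:
    # Place a small tensor on the requested GPU and confirm where it ended up.
    if not torch.cuda.is_available():
        return False
    device = torch.device(f"cuda:{gpu_id}")
    x = torch.ones(8, device=device) * 2
    return x.device.type == "cuda"

def test_dummy_function_runs_on_gpu():
    gpu_id = 0  # hypothetical stand-in for the cms 'gpu' variable
    if not torch.cuda.is_available():
        pytest.skip("no GPU available on this machine")
    assert run_dummy_on_gpu(gpu_id)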

Note: I couldn't test this new test on romeo yet because, for some reason, I get an 'access denied: user ketanp has no active jobs' error. I am working on running the pytest in Google Colab; the independent code worked fine in Colab.

Thank you!

laszewsk commented 3 years ago

a) Both parameters make sense.
b) Access to romeo is done via allocate; otherwise you cannot get the GPU. I think we have a manual in cloudmesh-iu.

kpimparkar commented 3 years ago

a) Okay, I will start working on those two parameters (gpu and max_jobs_allowed).
b) r-allocate runs successfully for me, but if I try r-install or just r to access romeo via ssh, then I get the 'access denied' error.
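As a sketch of how max_jobs_allowed could bound concurrency on one host (assuming jobs are launched as local processes; this is not the actual cloudmesh-job implementation, and the job list here is made up):

# Sketch only: cap the number of jobs that run at the same time on a host.
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

max_jobs_allowed = 2  # hypothetical value from 'cms set max_jobs_allowed=2' or the yaml file
jobs = [[sys.executable, "-c", f"print('job-{i}')"] for i in range(6)]

def run(cmd):
    # Each job is an external command; the pool caps how many run concurrently.
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

with ThreadPoolExecutor(max_workers=max_jobs_allowed) as pool:
    for line in pool.map(run, jobs):
        print(line)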

laszewsk commented 3 years ago

We have to talk to the sysadmin; maybe you are not in the allowed group?

laszewsk commented 3 years ago

Hmm, we need to debug r-install and r first.

'Access denied' may mean many different things.

We need to try bare-command access by hand, without r-allocate, and then see if we can replicate the error. That is what the sysadmin needs; otherwise he will say our commands may not work.

laszewsk commented 3 years ago

Try the commands in a shell without the aliases. Here are the aliases so you can see how to do it:

alias r-allocate='ssh ${JULIET} "salloc -p romeo --reservation=lijguo_11"'
alias romeo='ssh -t ${JULIET} "ssh ${JHOST}"'
alias r='ssh -t ${JULIET} "ssh ${JHOST}"'
alias j='ssh ${JULIET}'

laszewsk commented 3 years ago

Any progress on the ssh command to log into romeo?

kpimparkar commented 3 years ago

I face the same error with manual execution of these commands:

keTan@DESKTOP-HUC37G2 MINGW64 ~
$ ssh ${juliet} "salloc -p romeo --reservation=lijguo_11"
salloc: Granted job allocation 14865

keTan@DESKTOP-HUC37G2 MINGW64 ~
$ ssh -t ${juliet} "ssh ${JHOST}"
Access denied: user ketanp (uid=13023) has no active jobs.
Authentication failed.
Connection to juliet.futuresystems.org closed.

keTan@DESKTOP-HUC37G2 MINGW64 ~
$ echo ${juliet} ${JHOST}
ketanp@juliet.futuresystems.org r-003

I can ssh into juliet, but I get the same error if I try to ssh into r-003 from juliet:

[ketanp@j-login1 ~]$ ssh r-003
Access denied: user ketanp (uid=13023) has no active jobs.
Authentication failed.

laszewsk commented 3 years ago

Policy at IU was changed: it is no longer just r-003 but also r-004. The cloudmesh-iu scripts are likely hardcoded to r-003; they should be modified to find the reservation dynamically. The interesting part is that I think I had that working before, but I took it out because there was no need for it, and I likely lost the code. That was a long time ago, and who knows, there may be other issues. The lesson here is that we need to integrate dynamically finding info about the allocation.

Here is the help from Allan:

Your job was allocated on node r-004, not r-003. You could also look at environment variable SLURM_NODELIST in terminal 2.

$ scontrol show job 14866
JobId=14866 JobName=bash
   UserId=ketanp(13023) GroupId=users(100)
   Priority=4294887009 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:34:13 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-12-27T15:18:29 EligibleTime=2020-12-27T15:18:29
   StartTime=2020-12-27T15:18:29 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=romeo AllocNode:Sid=j-login1:9896
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=r-004 BatchHost=r-004
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0::
   Socks/Node= NtasksPerN:B:S:C=0:0:: CoreSpec=
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=lijguo_11
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null) WorkDir=/N/u/ketanp

laszewsk commented 3 years ago

scontrol show job 14866 | fgrep BatchHost
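A script could use the same idea to find the allocated node dynamically instead of hardcoding r-003; a sketch only, assuming the Slurm job id is known and scontrol is on the path (run on the juliet login node), and not part of cloudmesh-iu:

# Sketch only: parse the BatchHost (allocated node) out of 'scontrol show job <id>'.
import subprocess

def batch_host(jobid: str) -> str:
    out = subprocess.run(["scontrol", "show", "job", jobid],
                         capture_output=True, text=True, check=True).stdout
    for token in out.split():
        if token.startswith("BatchHost="):
            return token.split("=", 1)[1]
    raise ValueError(f"no BatchHost found for job {jobid}")

print(batch_host("14866"))  # would print r-004 for the allocation shown above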