cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

--gpus=0 still doesn't work #2718

Closed misterbrandonwalker closed 3 years ago

misterbrandonwalker commented 3 years ago

Hello, on version 7.3.4 (the latest on conda), setting --gpus=0 does not tell the worker to avoid the GPU; it still tries to use it.

btovar commented 3 years ago

@bdw2292, I've just checked the code, and it is the same code that worked when we closed this issue: https://github.com/cooperative-computing-lab/cctools/issues/2669

What are you observing now? Currently, if you set --gpus=0 on a worker, that's the value sent to the manager. If your tasks have t.specify_gpus(1), then they will not be sent to that worker.

If you don't call t.specify_gpus(1), then Work Queue will not control access to the GPUs.
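
For reference, a minimal manager-side sketch of that behavior, assuming the work_queue Python bindings (the port number and the commands below are placeholders, not from this report):

import work_queue as wq

q = wq.WorkQueue(port=9123)  # placeholder port

# This task declares that it needs one GPU, so the manager only dispatches
# it to workers reporting at least one GPU; a worker started with --gpus 0
# will not receive it.
gpu_task = wq.Task("./my_gpu_program > gpu.out")  # placeholder command
gpu_task.specify_gpus(1)
gpu_task.specify_cores(1)
q.submit(gpu_task)

# This task makes no GPU declaration, so Work Queue does not restrict
# which workers may run it based on GPU availability.
cpu_task = wq.Task("./my_cpu_program > cpu.out")  # placeholder command
cpu_task.specify_cores(1)
q.submit(cpu_task)

while not q.empty():
    t = q.wait(5)
    if t:
        print("task %d finished with result %d" % (t.id, t.result))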

Could you post the logs of the worker and the manager? I am looking for lines in the manager's log from when the worker connects. They should look something like:

2021/10/13 15:22:51.14 work_queue_python[26740] wq: rx from somehost.somedomain.edu (10.32.74.132:59724): info worker-id worker-167ccbbff007f57030d78e85c9f3ca8e
2021/10/13 15:22:51.14 work_queue_python[26740] wq: rx from somehost.somedomain.edu (10.32.74.132:59724): alive
2021/10/13 15:22:51.14 work_queue_python[26740] wq: rx from somehost.somedomain.edu (10.32.74.132:59724): resource workers 1 1 1
2021/10/13 15:22:51.14 work_queue_python[26740] wq: rx from somehost.somedomain.edu (10.32.74.132:59724): resource disk 9131 9131 9131
2021/10/13 15:22:51.14 work_queue_python[26740] wq: rx from somehost.somedomain.edu (10.32.74.132:59724): resource memory 8000 8000 8000
2021/10/13 15:22:51.14 work_queue_python[26740] wq: rx from somehost.somedomain.edu (10.32.74.132:59724): resource gpus 0 0 0
2021/10/13 15:22:51.14 work_queue_python[26740] wq: rx from somehost.somedomain.edu (10.32.74.132:59724): resource cores 4 4 4
2021/10/13 15:22:51.14 work_queue_python[26740] wq: rx from somehost.somedomain.edu (10.32.74.132:59724): resource tag 0
2021/10/13 15:22:51.14 work_queue_python[26740] wq: rx from somehost.somedomain.edu (10.32.74.132:59724): info end_of_resource_update 0
2021/10/13 15:22:51.14 work_queue_python[26740] wq: rx from somehost.somedomain.edu (10.32.74.132:59724): info tasks_running 0
2021/10/13 15:22:51.14 work_queue_python[26740] wq: rx from somehost.somedomain.edu (10.32.74.132:59724): info worker-end-time 0
2021/10/13 15:22:51.17 work_queue_python[26740] tcp: got connection from 10.32.73.228 port 50878
misterbrandonwalker commented 3 years ago

Hello,

Yes, I set the task to require 1 GPU. However, I expect the worker to report that it has no GPU available when I pass --gpus 0. For now I am working around the issue by keeping separate queues (and separate work_queue_workers) for CPU jobs and GPU jobs: on nodes where I don't want GPU jobs, I simply don't have a work_queue_worker connecting to the GPU queue. Previously both kinds of jobs were in the same queue, and GPU jobs would get assigned to nodes started with --gpus 0.
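
To illustrate, the manager side of that two-project workaround might look roughly like the sketch below; the CPU project name, the second port, and the commands are hypothetical, while the GPU project name matches the -M value in the worker calls that follow. CPU-only nodes then run workers pointed only at the CPU project.

import work_queue as wq

# GPU project name taken from the worker calls below; the CPU project
# name and the ports are hypothetical.
gpu_q = wq.WorkQueue(port=9121, name="bdw2292_maingpuqueue_RenLabCluster")
cpu_q = wq.WorkQueue(port=9122, name="bdw2292_cpuqueue_RenLabCluster")

gpu_task = wq.Task("dynamic_gpu input.xyz > gpu.out")  # placeholder command
gpu_task.specify_gpus(1)
gpu_q.submit(gpu_task)

cpu_task = wq.Task("some_cpu_job > cpu.out")  # placeholder command
cpu_task.specify_cores(2)
cpu_q.submit(cpu_task)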

Input for calling workers

2021-10-22 18:13:36,679 INFO Calling: ssh node59 "source /home/bdw2292/.allpurpose.bashrc ;mkdir /scratch/bdw2292 ; work_queue_worker nova 9121 --workdir /scratch/bdw2292 -d all -o /scratch/bdw2292/worker.debug --cores 2 --memory 100 --gpus 0 -t 100000000 -M bdw2292_maingpuqueue_RenLabCluster"
2021-10-22 18:13:36,682 INFO Calling: ssh node72 "source /home/bdw2292/.allpurpose.bashrc ;mkdir /scratch/bdw2292 ; work_queue_worker nova 9121 --workdir /scratch/bdw2292 -d all -o /scratch/bdw2292/worker.debug --cores 2 --memory 100 --gpus 0 -t 100000000 -M bdw2292_maingpuqueue_RenLabCluster"
2021-10-22 18:13:36,684 INFO Calling: ssh node73 "source /home/bdw2292/.allpurpose.bashrc ;mkdir /scratch/bdw2292 ; work_queue_worker nova 9121 --workdir /scratch/bdw2292 -d all -o /scratch/bdw2292/worker.debug --cores 2 --memory 100 --gpus 0 -t 100000000 -M bdw2292_maingpuqueue_RenLabCluster"
2021-10-22 18:13:36,686 INFO Calling: ssh bme-dna "source /home/bdw2292/.allpurpose.bashrc ;mkdir /scratch/bdw2292 ; work_queue_worker nova 9121 --workdir /scratch/bdw2292 -d all -o /scratch/bdw2292/worker.debug --cores 2 --memory 100 --gpus 0 -t 100000000 -M bdw2292_maingpuqueue_RenLabCluster"
2021-10-22 18:13:36,688 INFO Calling: ssh bme-sugar "source /home/bdw2292/.allpurpose.bashrc ;mkdir /scratch/bdw2292 ; work_queue_worker nova 9121 --workdir /scratch/bdw2292 -d all -o /scratch/bdw2292/worker.debug --cores 2 --memory 100 --gpus 0 -t 100000000 -M bdw2292_maingpuqueue_RenLabCluster"
2021-10-22 18:13:36,700 INFO Waiting for input jobs
2021-10-22 18:15:51,835 INFO Submitting tasks...
2021-10-22 18:15:51,840 INFO Task ID of 1 is assigned to job dynamic_gpu Octan-1-ol_Simulation_SolvSimEle1_Vdw1_solvwaterboxproddyn.xyz -k Octan-1-ol_Simulation_SolvSimEle1_Vdw1_solvwaterboxproddyn.key 2500000 2 2 4 300 1 > Octan-1-ol_Simulation_SolvSimEle1_Vdw1.out
2021-10-22 18:15:51,840 INFO Submitting tasks...
2021-10-22 18:15:51,841 INFO Task ID of 2 is assigned to job dynamic_gpu Octan-1-ol_Simulation_SolvSimEle0-8_Vdw1_solvwaterboxproddyn.xyz -k Octan-1-ol_Simulation_SolvSimEle0-8_Vdw1_solvwaterboxproddyn.key 2500000 2 2 4 300 1 > Octan-1-ol_Simulation_SolvSimEle0.8_Vdw1.out
2021-10-22 18:15:51,842 INFO Submitting tasks...
2021-10-22 18:15:51,843 INFO Task ID of 3 is assigned to job dynamic_gpu Octan-1-ol_Simulation_SolvSimEle0-7_Vdw1_solvwaterboxproddyn.xyz -k Octan-1-ol_Simulation_SolvSimEle0-7_Vdw1_solvwaterboxproddyn.key 2500000 2 2 4 300 1 > Octan-1-ol_Simulation_SolvSimEle0.7_Vdw1.out

Output from queue

2021/10/22 18:15:57.68 work_queue_python[34823] wq: rx from node72.bme.utexas.edu (10.0.0.72:46768): info worker-id worker-fd248cf27c89f7a7b5b2ec20ac96fef4
2021/10/22 18:15:57.68 work_queue_python[34823] wq: rx from node73.bme.utexas.edu (10.0.0.73:44262): info worker-id worker-db19074408aa56bf75cfdb116c72e34a
2021/10/22 18:15:57.68 work_queue_python[34823] wq: rx from node59.bme.utexas.edu (10.0.0.59:58262): info worker-id worker-f9c4848c009d66321ba399213a54e110
2021/10/22 18:15:57.68 work_queue_python[34823] wq: rx from node60.bme.utexas.edu (10.0.0.60:51326): info worker-id worker-e0d2cd60d4321d611d0891a8dc512714
2021/10/22 18:15:57.68 work_queue_python[34823] wq: rx from bme-sugar.bme.utexas.edu (146.6.132.119:55222): feature GeForce%20GTX%20970% A
2021/10/22 18:15:57.68 work_queue_python[34823] wq: Feature found: GeForce GTX 970
2021/10/22 18:15:57.68 work_queue_python[34823] wq: rx from node72.bme.utexas.edu (10.0.0.72:46768): feature GeForce%20GTX%20970% A
2021/10/22 18:15:57.68 work_queue_python[34823] wq: Feature found: GeForce GTX 970
2021/10/22 18:15:57.68 work_queue_python[34823] wq: rx from node73.bme.utexas.edu (10.0.0.73:44262): feature GeForce%20GTX%20970% A
2021/10/22 18:15:57.68 work_queue_python[34823] wq: Feature found: GeForce GTX 970
2021/10/22 18:15:57.68 work_queue_python[34823] wq: rx from node59.bme.utexas.edu (10.0.0.59:58262): feature GeForce%20GTX%20970% A
2021/10/22 18:15:57.68 work_queue_python[34823] wq: Feature found: GeForce GTX 970
2021/10/22 18:15:57.68 work_queue_python[34823] wq: rx from node60.bme.utexas.edu (10.0.0.60:51326): feature GeForce%20RTX%202070% A
2021/10/22 18:15:57.68 work_queue_python[34823] wq: Feature found: GeForce RTX 2070

2021/10/22 18:15:57.73 work_queue_python[34823] wq: tx to node72.bme.utexas.edu (10.0.0.72:46768): infile file-162-1f97413d7725924575137bda90e0f98f-Octan-1-ol_Simulation_SolvSimEle0_Vdw0-45_solvwaterboxproddyn.key Octan-1-ol_Simulation_SolvSimEle0_Vdw0-45_solvwaterboxproddyn.key 0
2021/10/22 18:15:57.73 work_queue_python[34823] wq: tx to node72.bme.utexas.edu (10.0.0.72:46768): infile file-163-7038c3e72ed88b544c149ff5f8155390-amoebabio18.prm amoebabio18.prm 0
2021/10/22 18:15:57.73 work_queue_python[34823] wq: tx to node72.bme.utexas.edu (10.0.0.72:46768): infile file-0-42c4399822e958c1e1bbf23c780c39d4-resource_monitor cctools-monitor 1
2021/10/22 18:15:57.73 work_queue_python[34823] wq: tx to node72.bme.utexas.edu (10.0.0.72:46768): outfile file-164-57f41f67c814102ccbcd4ab5c49e0663-Octan-1-ol_Simulation_SolvSimEle0_Vdw0.45.out Octan-1-ol_Simulation_SolvSimEle0_Vdw0.45.out 0
2021/10/22 18:15:57.73 work_queue_python[34823] wq: tx to node72.bme.utexas.edu (10.0.0.72:46768): outfile file-165-f67efe477b0ea78a18418a8de2a793d2-Octan-1-ol_Simulation_SolvSimEle0_Vdw0-45_solvwaterboxproddyn.arc Octan-1-ol_Simulation_SolvSimEle0_Vdw0-45_solvwaterboxproddyn.arc 0
2021/10/22 18:15:57.73 work_queue_python[34823] wq: tx to node72.bme.utexas.edu (10.0.0.72:46768): outfile file-166-5da32b2db6d07474201f467023793c10-Octan-1-ol_Simulation_SolvSimEle0_Vdw0-45_solvwaterboxproddyn.dyn Octan-1-ol_Simulation_SolvSimEle0_Vdw0-45_solvwaterboxproddyn.dyn 0
2021/10/22 18:15:57.73 work_queue_python[34823] wq: tx to node72.bme.utexas.edu (10.0.0.72:46768): outfile file-168-cd98fb76da91aac5387d90fb655f8093-wq-34823-task-21.summary cctools-monitor.summary 0
2021/10/22 18:15:57.73 work_queue_python[34823] wq: tx to node72.bme.utexas.edu (10.0.0.72:46768): end
2021/10/22 18:15:57.73 work_queue_python[34823] wq: node72.bme.utexas.edu (10.0.0.72:46768) busy on 'dynamic_gpu Octan-1-ol_Simulation_SolvSimEle0_Vdw0-45_solvwaterboxproddyn.xyz -k Octan-1-ol_Simulation_SolvSimEle0_Vdw0-45_solvwaterboxproddyn.key 2500000 2 2 4 300 1 > Octan-1-ol_Simulation_SolvSimEle0_Vdw0.45.out'

Sample Output from worker

2021/10/22 18:15:10.33 work_queue_worker[31400] wq: rx from manager: ./cctools-monitor --no-pprint --with-output-files=cctools-monitor -L 'memory: 1' -L 'disk: 1' -L 'gpus: 1' -L 'cores: 0.000' -V 'task_id: 21' -V 'category: default' --measure-only --sh "dynamic_gpu Octan-1-ol_Simulation_SolvSimEle0_Vdw0-45_solvwaterboxproddyn.xyz -k Octan-1-ol_Simulation_SolvSimEle0_Vdw0-45_solvwaterboxproddyn.key 2500000 2 2 4 300 1 > Octan-1-ol_Simulation_SolvSimEle0_Vdw0.45.out"
2021/10/22 18:15:10.33 work_queue_worker[31400] wq: rx from manager: category default
2021/10/22 18:15:10.33 work_queue_worker[31400] wq: rx from manager: cores 0.000
2021/10/22 18:15:10.33 work_queue_worker[31400] wq: rx from manager: gpus 1
2021/10/22 18:15:10.33 work_queue_worker[31400] wq: rx from manager: memory 1
2021/10/22 18:15:10.33 work_queue_worker[31400] wq: rx from manager: disk 1

2021/10/22 18:12:52.26 work_queue_worker[31408] wq: disk 0 inuse 897051 total 897051 smallest 897051 largest
2021/10/22 18:12:52.26 work_queue_worker[31408] wq: memory 0 inuse 125500 total 125500 smallest 125500 largest
2021/10/22 18:12:52.26 work_queue_worker[31408] wq: gpus 0 inuse 1 total 1 smallest 1 largest
2021/10/22 18:12:52.26 work_queue_worker[31408] wq: cores 0 inuse 8 total 8 smallest 8 largest

btovar commented 3 years ago

@bdw2292, thanks for the logs. The worker is indeed telling the manager that it has 1 gpu. Let me verify that there was not a regression somewhere.

btovar commented 3 years ago

@bdw2292 For some reason I can't reproduce it, even when I force the gpu autodetection:

$ work_queue_worker --version
work_queue_worker version 7.3.4 FINAL (released 2021-10-20 12:33:50 +0000)
...

# implicit autodetection
$ work_queue_worker localhost 9129 -dall |& grep gpus
2021/10/25 08:53:46.88 work_queue_worker[24567] wq:     gpus      0 inuse      4 total      4 smallest      4 largest
# manager sees: resource gpus 4 4 4

# explicit autodetection
$ work_queue_worker localhost 9129 -dall --gpus -1 |& grep gpus
2021/10/25 08:54:38.02 work_queue_worker[24645] wq:     gpus      0 inuse      4 total      4 smallest      4 largest
# manager sees: resource gpus 4 4 4

# explicit no gpus
$ work_queue_worker localhost 9129 -dall --gpus 0 |& grep gpus
2021/10/25 08:55:07.98 work_queue_worker[24690] wq:     gpus      0 inuse      0 total      0 smallest      0 largest
# manager sees: resource gpus 0 0 0

Is it possible that you are using a worker from a different cctools version? Or that we are looking at the log of another worker?

misterbrandonwalker commented 3 years ago

Oh my goodness, I completely forgot that my work_queue_worker (7.3.1) was using a different environment than my work_queue (7.3.4).

btovar commented 3 years ago

I'm glad it was an easy fix!

misterbrandonwalker commented 3 years ago

Same :)
