Closed: @phobson closed this issue 2 years ago.
@gjoseph92 notes:

> If the tasks are annotated to each require a `GPU: 1` resource, and the worker is configured to have 1 GPU resource, then they shouldn't be able to run simultaneously. If they are, that's a bug. We'll want to see the code they're using to create the cluster and to create the tasks.
That would look like this:

```python
with dask.annotate(resources={'GPU': 1}):
    ...
```
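A fuller sketch of this, assuming dask is installed (the `run_inference` function is a hypothetical stand-in for the real GPU work):

```python
import dask
from dask import delayed

@delayed
def run_inference(sample):
    # Placeholder for the real GPU-heavy inference step.
    return sample * 2

# Every task created inside this block carries the GPU=1 annotation.
with dask.annotate(resources={'GPU': 1}):
    tasks = [run_inference(i) for i in range(4)]

# On a distributed cluster whose workers advertise a GPU=1 resource,
# at most one of these tasks runs per worker at a time. On the local
# scheduler the annotation is ignored and the tasks simply compute.
results = dask.compute(*tasks)
```

Note that the annotation only has an effect when the workers are actually configured with a matching `GPU` resource.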
Doc links:
@phobson Thanks, I will try this while creating the dask computation graph.
@PranjalSahu were you not using `dask.annotate` in your code currently? Were you setting `coiled.Cluster(..., worker_options=dict(resources=dict(GPU=1)))`?
Dask does not schedule based on memory usage whatsoever (GPU or otherwise). It will happily do things at the same time which, in total, will use too much memory. If you have tasks that you know cannot run at the same time as each other, then you need to use resources (or other mechanisms, like a `dask.utils.SerializableLock`) to prevent them from running concurrently.
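A minimal sketch of the lock-based alternative (the `gpu_task` function is hypothetical). Note that `SerializableLock` only coordinates threads within a single process; for mutual exclusion across workers you would use `dask.distributed.Lock` instead:

```python
import dask
from dask import delayed
from dask.utils import SerializableLock

# SerializableLock survives pickling, so tasks built around the same
# token share one underlying lock within a process.
gpu_lock = SerializableLock('gpu')

@delayed
def gpu_task(x):
    with gpu_lock:      # at most one holder at a time
        return x + 1

results = dask.compute(*[gpu_task(i) for i in range(3)])
```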
Another option, depending on what your tasks look like, would be to just give the workers 1 thread. Then only one of any type of task can ever be running at once. This could cause underutilization though, depending on what you're doing.
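For illustration, here is the one-thread-per-worker idea with a local in-process cluster (on Coiled you would pass the analogous worker options when creating the cluster):

```python
from dask.distributed import Client, LocalCluster

# With one thread per worker, each worker runs at most one task at a time.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

squares = client.gather(client.map(lambda x: x * x, range(4)))

client.close()
cluster.close()
```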
I have not used dask.annotate yet. I will use it now for GPU tasks and reply here.
@PranjalSahu how are things working? Let me know if you'd like to discuss things.
I have been running into GPU memory issues. It looks like the memory does not get freed once a task finishes execution, so eventually the task dies after working fine for a few patients. I have explicitly cleared GPU memory before and after execution using `torch.cuda.empty_cache()`.
Now I am adding the option `worker_class='dask_cuda.CUDAWorker'`.
Recently the cluster has not been able to start due to a "Process never phoned home" message, so I could not test the CUDAWorker option yet.
Hello @PranjalSahu, I'm a Coiled software engineer and just looked at this issue ("Process never phoned home"). Could you let me know how you are trying to create the cluster? Are you specifying the `worker_gpu=1` argument?
The reason I am asking is that there seem to be some driver errors; in the logs I can see `nvml error: driver not loaded`.
Thank you, and apologies for any issues that this may have caused you.
I am creating the cluster like this:

```python
cluster = coiled.Cluster(
    name='gpucluster15',
    scheduler_vm_types=['t3.large'],
    # worker_vm_types=["g4dn.xlarge", "g4dn.2xlarge", "g4ad.xlarge", "p3.2xlarge", "p2.xlarge", "g5.2xlarge"],
    worker_vm_types=["p3.2xlarge", "p2.xlarge", "g5.2xlarge"],
    n_workers=2,
    software="pranjal-sahu/gpu-test11",
    worker_options=dict(resources=dict(GPU=1)),
    worker_class='dask_cuda.CUDAWorker',
    shutdown_on_close=True,
)
```
In my current pipeline I am deleting the model after inference and calling `gc.collect()`. I also had to remove the `worker_class='dask_cuda.CUDAWorker'` and `resources=dict(GPU=1)` options.
https://stackoverflow.com/questions/70051984/how-to-clear-gpu-memory-after-using-model
I have not run into GPU memory issues since making this change. My expectation had been that the memory would get cleaned up once the task finished, i.e. once the model went out of scope.
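The cleanup pattern can be sketched in plain Python (the `Model` class here is a stand-in for the real inference model; with PyTorch you would additionally call `torch.cuda.empty_cache()` after collecting):

```python
import gc
import weakref

class Model:
    """Stand-in for a GPU inference model."""
    def predict(self, x):
        return x

def run_task(sample):
    model = Model()
    probe = weakref.ref(model)   # lets us verify the model was freed
    result = model.predict(sample)
    del model                    # drop the last strong reference
    gc.collect()                 # reclaim reference cycles right away
    assert probe() is None       # the model object is really gone
    return result
```

Explicit `del` plus `gc.collect()` frees the object at a deterministic point instead of whenever the garbage collector happens to run, which matters when the next task needs that GPU memory immediately.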
Hi, @PranjalSahu. It sounds like deleting the model after inference and calling `gc.collect()` resolved the GPU memory issue.
Is that right? (Let us know if you're still having issues or need help with this!)
@ntabris Yes, it is working. I have tested it multiple times and with a larger number of patients. Deleting the model object after inference solves the GPU memory problem.
The Dask tutorials on GPU usage focus on batch-processing examples, where the GPU memory remains allocated for the duration of the batch. But in our use case the GPU computation is done on only one sample at a time and is a heavy computation, so the GPU memory needs to be freed for the next task.
@PranjalSahu thanks for following up. I'm going to mark this issue as closed. Feel free to reopen if you think otherwise.
A user wrote Nat with the following dask question:
I'll update this issue with more context as it comes in.