cooperlab opened this issue 1 year ago
This page from Ray Tune's documentation could prove helpful. It seems like wrapping the trainable in a call to `tune.with_resources` could be worth trying, e.g.,
`tune.Tuner(tune.with_resources(trainable, {"cpu": 2, "gpu": 1}), tune_config=tune.TuneConfig(num_samples=8))`
I think this should happen around here in Search.experiment.
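A minimal, hedged sketch of that wiring (the `trainable` below is a placeholder standing in for glimr's actual trainable, and the resource numbers are only illustrative):

```python
from ray import tune

def trainable(config):
    # Placeholder trainable: build and train a model here using `config`,
    # then return the final metrics as a dict.
    return {"score": 0.0}

# Each trial reserves 2 CPUs and 1 GPU; 8 trials are sampled in total.
tuner = tune.Tuner(
    tune.with_resources(trainable, {"cpu": 2, "gpu": 1}),
    tune_config=tune.TuneConfig(num_samples=8),
)
results = tuner.fit()
```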
Branch issue-26-Allocated_resources_not_scaling_trials created!
The number of concurrently running trials depends directly on the number of GPU/CPU cores we allocate to each trial, so adding more CPU and GPU cores per trial does not increase the number of running trials; it actually decreases it. Based on the experiments, tuning speed depends far more on the number of concurrently running trials than on the amount of resources allocated to each trial.
Please take a look at PR https://github.com/PathologyDataScience/glimr/pull/56, in which I added support for multi-GPU distributed tuning.
@RaminNateghi please see my comments on the PR.
Since it is hard to saturate the GPUs during MIL training, please investigate if it is possible to allocate fractional GPU resources. For example, perhaps we can run 16 trials by allocating 0.5 GPUs / trial. This might increase utilization.
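A hedged sketch of what that could look like with `tune.with_resources` (the 0.5 GPU share and 16 samples are illustrative, assuming 8 GPUs are available; `trainable` is again a placeholder):

```python
from ray import tune

def trainable(config):
    # Placeholder trainable; real MIL training would go here.
    return {"score": 0.0}

# With 8 GPUs and 0.5 GPUs per trial, up to 16 trials can run concurrently.
tuner = tune.Tuner(
    tune.with_resources(trainable, {"cpu": 2, "gpu": 0.5}),
    tune_config=tune.TuneConfig(num_samples=16),
)
results = tuner.fit()
```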
We will also need to edit documentation and notebooks once the Search class updates are final.
Yes, it's technically possible to allocate fractional GPU resources. For example, I just set `resources_per_worker={"GPU": 0.25}`, and it enabled the tuner to run 4x as many concurrent trials.
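For reference, a hedged sketch of where that setting might live, assuming the PR wraps training in a Ray Train `TensorflowTrainer` (the `train_loop` stub and exact import paths are assumptions for illustration, not a description of the actual PR):

```python
from ray import tune
from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_loop(config):
    # Placeholder per-worker TensorFlow training loop.
    pass

trainer = TensorflowTrainer(
    train_loop_per_worker=train_loop,
    scaling_config=ScalingConfig(
        num_workers=1,
        use_gpu=True,
        # Each worker reserves a quarter of a GPU, so four trials can share one card.
        resources_per_worker={"GPU": 0.25},
    ),
)
tuner = tune.Tuner(trainer, tune_config=tune.TuneConfig(num_samples=16))
results = tuner.fit()
```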
Can you check if this increases utilization from `nvidia-smi`?
Yes, it increases utilization, but when we use fractional GPU resources some trials fail with errors like "worker/replica:0/task:0/device:GPU:0}} failed to allocate memory [Op:Cast]" or "failed copying input tensor from /job:worker/replica:0/task:0/device:CPU:0 to /job:worker/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized".
For example, in my experiment, 13 out of 64 trials failed when I allocated 0.5 GPUs per trial.
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/1469
OK - let's save that for another day. For now we can recommend integer allocations.
Perhaps this is related to the tendency of TensorFlow to allocate all GPU memory even for a small job.
https://docs.ray.io/en/latest/ray-core/tasks/using-ray-with-gpus.html#fractional-gpus
Note: It is the user’s responsibility to make sure that the individual tasks don’t use more than their share of the GPU memory. TensorFlow can be configured to limit its memory usage.
I'm not sure if this is the best solution, but it's one solution: https://discuss.ray.io/t/tensorflow-allocates-all-available-memory-on-the-gpu-in-the-first-trial-leading-to-no-space-left-for-running-additional-trials-in-parallel/7585/2
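For what it's worth, a hedged sketch of the kind of TensorFlow configuration that thread points to (the memory values are illustrative; this has to run in each trial before TensorFlow initializes the GPU):

```python
import tensorflow as tf

# Must run before any op touches the GPU.
for gpu in tf.config.list_physical_devices("GPU"):
    # Option 1: allocate GPU memory on demand instead of grabbing it all up front.
    tf.config.experimental.set_memory_growth(gpu, True)
    # Option 2 (alternative): hard-cap this process at a fixed memory budget (in MB).
    # tf.config.set_logical_device_configuration(
    #     gpu, [tf.config.LogicalDeviceConfiguration(memory_limit=4096)]
    # )
```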
For both GPU and CPU, increasing the resources allocated per trial does not increase the number of concurrently running trials; it reduces it, since fewer trials fit on the available hardware.