YHYHYHYHYHY opened this issue 3 weeks ago
Thank you for your attention to TFB! Could you be more specific about your issue? In particular, it would be appreciated if you could provide the script/command that caused it.
I guess from your description that you are trying to run multi-GPU jobs with the Ray backend, which is not fully supported, for two reasons:
1) Modern parallel training strategies usually adopt the Single Program, Multiple Data (SPMD) paradigm, which conflicts with the Ray-backend design of launching general workers for multiple tasks (I'm not sure whether the latest Ray versions have better support for this). Combining Ray with SPMD parallelism can therefore cause a range of problems.
2) Most models in our baselines do not support multi-GPU parallel training. The only multi-GPU-capable baselines I recall are the darts-based algorithms, which rely on the parallel strategies of the pytorch_lightning library (DDP by default, which is unfortunately disabled explicitly in our code...). @qiu69, please add anything I'm missing.
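For reference, this is roughly how multi-GPU DDP is enabled in recent pytorch_lightning versions; a minimal sketch for illustration only, not TFB's actual darts configuration:

```python
# Minimal sketch (illustrative only): enabling multi-GPU DDP training in
# recent pytorch_lightning versions. TFB's darts-based baselines configure
# their own Trainer and currently disable this explicitly.
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,        # number of GPUs to train on
    strategy="ddp",   # DistributedDataParallel across the 4 devices
    max_epochs=10,
)
# trainer.fit(model, datamodule)  # `model` would be a LightningModule
```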
So I'd suggest either using the sequential backend if your model only works with more than one GPU (most of our baselines require further modification to work with multiple GPUs), or using the Ray backend and limiting each model to a single GPU. We are still trying to design a more general backend compatible with SPMD training so that TFB scales to more GPUs, and we would appreciate any ideas or contributions.
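To illustrate the second suggestion, here is a minimal sketch (not TFB's actual backend) of running several single-GPU tasks in parallel with Ray, where each task is pinned to exactly one device:

```python
# Sketch only: run several single-GPU training tasks in parallel with Ray.
# Each task requests exactly one GPU, so no SPMD coordination is needed.
import ray
import torch

ray.init()  # assumes the GPUs on the machine are visible to Ray

@ray.remote(num_gpus=1)
def train_one_model(config):
    # Ray restricts CUDA_VISIBLE_DEVICES for this task to its assigned GPU,
    # so "cuda:0" here refers to that single device.
    device = torch.device("cuda:0")
    # ... build and train the model for `config` on `device` ...
    return config

futures = [train_one_model.remote(c) for c in ["cfg_a", "cfg_b", "cfg_c", "cfg_d"]]
print(ray.get(futures))
```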
Thank you for your response. To clarify my original issue further, I would like to provide the following additional details:
1) I am attempting to run a single multi-GPU worker. It would be even more beneficial if the system could also support running multiple multi-GPU or single-GPU workers.
2) In the `./scripts/run_benchmark.py` file, I have added the line `args.gpus = np.arange(torch.cuda.device_count()).tolist()` to allocate all the available GPUs to `args.gpus`. On my server, which has 4 GPUs, this results in the list `[0, 1, 2, 3]` (see the sketch after this list).
3) The script used to launch the process is the same as the one provided in the original repository.
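Concretely, the change in 2) looks roughly like this; a standalone sketch, since in the actual script `args` comes from the existing argument parser:

```python
# Standalone sketch of the modification to ./scripts/run_benchmark.py;
# in the real script, `args` is the namespace produced by its argument parser.
import argparse

import numpy as np
import torch

parser = argparse.ArgumentParser()
args, _ = parser.parse_known_args()

# Allocate every visible GPU; on a 4-GPU server this yields [0, 1, 2, 3].
args.gpus = np.arange(torch.cuda.device_count()).tolist()
print(args.gpus)
```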
I still have some points of confusion:
1) Does this imply that I can use `torch.nn.DataParallel` to implement DP with the Sequential Backend by modifying the code of the baselines?
2) In the `./ts_benchmark/utils/parallel/ray_backend.py` file, I noticed the `_get_gpus_per_worker()` function within the `RayBackend` class, which is responsible for allocating GPU resources to each worker. Does this mean that Ray already supports multiple workers on multiple GPUs?
I believe that supporting multi-GPU training is of great significance for TFB. If possible, I am more than willing to assist in implementing this additional functionality using `torch.nn.DataParallel`.
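To make the idea concrete, here is a minimal sketch of the kind of `torch.nn.DataParallel` wrapping I have in mind (illustrative only; the stand-in network is not an existing TFB baseline):

```python
# Minimal DataParallel sketch (illustrative only): wrap a baseline's inner
# network so a single sequential-backend worker can use several GPUs.
import torch
import torch.nn as nn

model = nn.Linear(32, 1)  # stand-in for a baseline's forecasting network
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
model = model.cuda()

x = torch.randn(64, 32).cuda()
y = model(x)  # the batch is split across the available GPUs
print(y.shape)
```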
Thank you for your attention.
Regarding `_get_gpus_per_worker`, it is only used to prepare or limit the GPU resources for workers, and whether that helps depends on whether the workers can actually use them. In our own experiments, we always ensure `gpu_per_worker <= 1`.
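For illustration only (this is not the actual `_get_gpus_per_worker` implementation), a per-worker cap along those lines might look like:

```python
# Hypothetical helper, not TFB's code: cap each Ray worker's GPU share at one
# device, while allowing fractional sharing below that.
def gpus_per_worker(total_gpus: int, num_workers: int) -> float:
    # Ray accepts fractional `num_gpus`, so several workers may share a GPU,
    # but no single worker is ever given more than one device.
    return min(1.0, total_gpus / max(num_workers, 1))

print(gpus_per_worker(4, 8))  # 0.5 -> two workers share each GPU
print(gpus_per_worker(4, 2))  # 1.0 -> capped at one GPU per worker
```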
I have made certain modifications to enable multi-GPU training with the Sequential Backend. The change has been tested on a server equipped with four GPUs and has been working smoothly. If possible, I can conduct more detailed tests and then incorporate this functionality into TFB.
Thank you for your interest in contributing to our project! We find it embarrassing that we do not currently have a contribution guideline, so I'll mention some key points for code contributions that come to mind:
Once you've addressed these points, feel free to submit your pull request. We look forward to reviewing your contribution!
Thank you for your collaboration and support.
TFB is truly one of the best time series benchmarks I have ever had the pleasure of using. However, I have encountered an issue when attempting to train models using multiple GPUs.
As you may be aware, the Ray backend is employed for parallel processing. When I train a model with the Ray backend on a Linux server equipped with four GPUs, only a single GPU is actually utilized. Moreover, the working GPU can change between experiments, sometimes being `cuda:0` and at other times `cuda:1`. Through extensive debugging, I am certain that the Ray backend successfully detects all the GPUs on the server; nevertheless, only one of them is ever used. It appears that multi-GPU parallel training with PyTorch is not actually taking place, or perhaps I have overlooked a relevant part of the code.
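For reference, a quick way to check what each worker actually sees; a minimal sketch using standard Ray and PyTorch calls rather than TFB's internal backend:

```python
# Diagnostic sketch (not TFB's backend): report the GPUs visible inside a
# single-GPU Ray task versus those visible to the driver process.
import os

import ray
import torch

ray.init()
print("driver sees", torch.cuda.device_count(), "GPU(s)")

@ray.remote(num_gpus=1)
def report():
    # GPU IDs that Ray assigned to this task, and what torch sees inside it.
    return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES"), torch.cuda.device_count()

print("worker sees", ray.get(report.remote()))
```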
I would greatly appreciate your assistance in addressing this issue. Thank you very much!
Best regards.