decisionintelligence / TFB

[PVLDB 2024 Best Paper Nomination] TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods
https://www.vldb.org/pvldb/vol17/p2363-hu.pdf
MIT License

How to train models with multiple GPUs? #37


YHYHYHYHYHY commented 3 weeks ago

TFB is truly one of the best time series benchmarks I have ever had the pleasure of using. However, I have encountered an issue when attempting to train models using multiple GPUs.

As you may be aware, the Ray backend is employed for parallel processing. When I train a model with the Ray backend on a Linux server equipped with four GPUs, only a single GPU is actually utilized. Moreover, the working GPU changes across experiments, sometimes being cuda:0 and at other times cuda:1. Through extensive debugging, I am certain that the Ray backend successfully detects all the GPUs on the server; nevertheless, only one is ever used. It appears that parallel training with PyTorch is not actually taking place, or perhaps I have overlooked a relevant part of the code.
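For illustration, a minimal check along these lines is how one can confirm that Ray detects all devices (a sketch, not the exact TFB code I ran):

```python
import ray
import torch

ray.init()  # connect to (or start) the local Ray runtime

# Ray's view of the machine: on my server this reports all four GPUs
print(ray.cluster_resources())    # e.g. {'CPU': 64.0, 'GPU': 4.0, ...}

# PyTorch's view from the driver process
print(torch.cuda.device_count())  # prints 4
```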

I would greatly appreciate your assistance in addressing this issue. Thank you very much!

Best regards.

luckiezhou commented 3 weeks ago

Thank you for your attention to TFB! Could you be more specific about your issue? In particular, it would be appreciated if you could provide the script/command that caused it.

I guess from your description that you are trying to run multi-GPU jobs with the Ray backend, which is not fully supported, in two respects: the Ray backend only prepares per-worker GPU visibility rather than coordinating multi-GPU training, and most of our baselines require further modification to work with multiple GPUs.

So, I'd suggest using the Sequential backend if your model works only with more than one GPU, or using the Ray backend and limiting the model to a single GPU. We are still trying to design a more general backend compatible with SPMD training so that TFB scales to more GPUs, and we would appreciate any ideas or contributions.
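For the second option, the generic CUDA mechanism is enough; no TFB-specific flag is needed (a minimal sketch):

```python
import os

# Pin the whole run to one physical device before torch/ray initialize,
# so each Ray worker can only ever see a single GPU.
# (Generic CUDA environment variable, not a TFB-specific option.)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```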

YHYHYHYHYHY commented 3 weeks ago

Thank you for your response. To clarify my original issue, here are some additional details:

1. I am attempting to run a single multi-GPU worker. It would be even better if the system could also support running multiple multi-GPU or single-GPU workers.
2. In `./scripts/run_benchmark.py`, I added the line `args.gpus = np.arange(torch.cuda.device_count()).tolist()` to allocate all available GPUs to `args.gpus` (full snippet below). On my server with 4 GPUs, this yields the list `[0, 1, 2, 3]`.
3. The launch script remains the same as the ones provided in the original repository.
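For reference, with its imports the change reads roughly as follows (the `parser.parse_args([])` line is a stand-in for the script's real argument parsing):

```python
import argparse

import numpy as np
import torch

parser = argparse.ArgumentParser()
args = parser.parse_args([])  # stand-in for run_benchmark.py's argument parsing

# Expose every CUDA device to the benchmark arguments;
# on a four-GPU server this yields [0, 1, 2, 3].
args.gpus = np.arange(torch.cuda.device_count()).tolist()
```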

I still have some points of confusion:

1. Does this imply that I can implement DP with the Sequential backend via `torch.nn.DataParallel` by modifying the baseline code?
2. In `./ts_benchmark/utils/parallel/ray_backend.py`, I noticed the `_get_gpus_per_worker()` function in the `RayBackend` class, which is responsible for allocating GPU resources to each worker. Does this mean that Ray already supports multiple workers on multiple GPUs?

I believe that supporting multi-GPU training is of great significance for TFB. If possible, I am more than willing to help implement this additional functionality using `torch.nn.DataParallel`.
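Concretely, what I have in mind is the standard DataParallel pattern applied inside the Sequential backend's training loop. A sketch follows; `wrap_for_multi_gpu` is a hypothetical helper name, and each baseline's `forward` signature would still need checking case by case:

```python
import torch
import torch.nn as nn


def wrap_for_multi_gpu(model: nn.Module, gpus: list) -> nn.Module:
    """Wrap a baseline model in DataParallel when several GPUs are given.

    `gpus` is a device-id list such as args.gpus == [0, 1, 2, 3];
    with a single entry the model is returned unchanged.
    """
    # DataParallel expects the parameters to live on device_ids[0]
    model = model.to(torch.device(f"cuda:{gpus[0]}"))
    if len(gpus) > 1:
        model = nn.DataParallel(model, device_ids=gpus)
    return model
```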

Thank you for your attention.

luckiezhou commented 3 weeks ago
1. The Sequential backend basically runs models in a for-loop, so DP is likely to work, with a few concerns.
2. To my knowledge, Ray's GPU support simply sets `CUDA_VISIBLE_DEVICES` for workers/actors, which is not sufficient to support complex training strategies (see the sketch after this list). The key difference is that Ray actors are general-purpose workers that accept a function as their workload, similar to `multiprocessing`. In strategies like DDP, by contrast, dedicated workers are launched to run the full script in a one-shot manner, i.e. from start to finish in a single execution without restarting. As for `_get_gpus_per_worker`, it only prepares or limits the GPU resources for each worker; whether workers can use them properly is another matter. In our own experiments, we always ensure `gpu_per_worker <= 1`.
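To make the `CUDA_VISIBLE_DEVICES` point concrete, here is a minimal sketch using Ray's public API (not TFB's actual backend code); each task sees exactly the one device Ray assigns to it:

```python
import os

import ray

ray.init(num_gpus=4)  # assumes a 4-GPU machine


@ray.remote(num_gpus=1)
def show_assignment():
    # Ray implements GPU assignment by setting CUDA_VISIBLE_DEVICES
    # in the worker process, so each task sees exactly one device.
    return ray.get_gpu_ids(), os.environ.get("CUDA_VISIBLE_DEVICES")


print(ray.get([show_assignment.remote() for _ in range(4)]))
```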
YHYHYHYHYHY commented 3 weeks ago

I have made some modifications to enable multi-GPU training with the Sequential backend. They have been tested on a server equipped with four GPUs and work smoothly. If possible, I can run more detailed tests and then contribute this functionality to TFB.

luckiezhou commented 3 weeks ago

Thank you for your interest in contributing to our project! Somewhat embarrassingly, we do not currently have a contribution guideline, so I'll mention some key points for code contributions that come to mind:

Once you've addressed these points, feel free to submit your pull request. We look forward to reviewing your contribution!

Thank you for your collaboration and support.