How to train models with multiple gpus?

decisionintelligence / TFB

[PVLDB 2024 Best Paper Nomination] TFB: Towards Comprehensive and Fair Benchmarking of Time Series Forecasting Methods

https://www.vldb.org/pvldb/vol17/p2363-hu.pdf

MIT License

350 stars 36 forks source link

How to train models with multiple gpus? #37

Open YHYHYHYHYHY opened 3 weeks ago

YHYHYHYHYHY commented 3 weeks ago

TFB is truly one of the best time series benchmarks I have ever had the pleasure of using. However, I have encountered an issue when attempting to train models using multiple GPUs.

As you may be aware, the Ray backend is employed for parallel processing. When I run the training of a model using the Ray backend on a Linux server equipped with four GPUs, only a single GPU is actually utilized. Moreover, the working GPU can change in different experiments, sometimes being cuda:0 and at other times cuda:1. Through extensive debugging, I am certain that the Ray backend successfully detects all the GPUs on the server. Nevertheless, only one GPU can be used. It appears that the parallel training aspect with PyTorch is not being realized, or perhaps I have overlooked a relevant part.

I would greatly appreciate your assistance in addressing this issue. Thank you very much!

Best regards.

luckiezhou commented 3 weeks ago

Thank you for your attention to TFB! Can you be more specific on your issue, namely:

are you trying to run multiple workers on multiple GPUs?
or are you trying to run one multi-GPU worker?

It would be appreciated if you could provide the script/command that caused the issue.

I guess from your description you are trying to run multi-GPU jobs with ray backend, which is not fully supported from two aspects:

modern parallel training strategies usually adopt the Single Program, Multiple Data (SPMD) paradigm, which conflicts with the ray-backend design (I'm not sure whether the latest ray versions have better support for this) which launches general workers for multiple tasks. The potential problems of combining ray with SPMD parallel include but are not limited to:
- each process launched by DDP tries to load full data;
- ray cannot manage the extra processes properly, causing the program to hang sometimes;
most models in our baseline do not support multi-GPU parallel training, the only sets of multi-GPU-supported baselines that I recall are the darts-based algorithms, which make use of the parallel strategies (DDP by default, which unfortunately is disabled explicitly in our code...) in pytorch_lightning library, @qiu69 please supplement if I'm missing anything.
- traditional DP strategy may work with ray backend, but it is not generally supported by our baselines, you can try modifying the code or use your own implementations;

So, I'd suggest using the sequential backend if your model (most of our baselines require further modification to make it work with multiple GPUs) works only with more than one GPU or using the Ray backend and limiting the model to use only one GPU. We are still trying to design a more general backend compatible with SPMD training so that TFB scales to more GPUs, and we would appreciate any idea or contribution.

YHYHYHYHYHY commented 3 weeks ago

Thank you for your response. To clarify my original issue further, I would like to provide the following additional details: 1) I am attempting to run a single multi - GPU worker. It would be even more beneficial if the system could also support running multiple multi - GPU or single - GPU workers. 2) In the ./scripts/run_benchmark.py file, I have added the line args.gpus = np.arange(torch.cuda.device_count()).tolist(). This is to allocate all the available GPUs to args.gpus. On my server, which has 4 GPUs, this results in the list [0, 1, 2, 3]. 3) The script used for launching the process remains the same as the ones provided in the original repository.

I still have some points of confusion: 1) Does this imply that I can use torch.nn.DataParallel to implement DP with the Sequential Backend by modifying the codes of the baselines? 2) In the ./ts_benchmark/utils/parallel/ray_backend.py file, I noticed the _get_gpus_per_worker() function within the RayBackend class, which is responsible for allocating GPU resources to each worker. Does this mean that Ray already supports multiple workers on multiple GPUs?

I believe that supporting multi - GPU training is of great significance for TFB. If possible, I am more than willing to assist in implementing this additional functionality using torch.nn.DataParallel.

Thank you for your attention.

luckiezhou commented 3 weeks ago

Sequential backend is basically running models in a for-loop, so DP is likely to work, with a few concerns:

The baseline code needs to be modified to support DP (and darts-based baselines need to remove the _fix_multi_gpu method, which is an overkill for sequential backend + DP).
I'm not sure whether problems like managing multi-process data loaders, GPU memory failing to release, etc., will arise in this scenario. I mean, tests are necessary.

To my knowledge, Ray's GPU support simply sets CUDA_VISIBLE_DEVICES for workers/actors, which is not sufficient to support complex training strategies. The key difference lies in that ray actors are general-purpose workers that accept a function as workload, which is similar to multiprocessing. But in strategies like DDP, dedicated workers are launched to run the full script. These workers execute in a one-shot manner, meaning they run the script from start to finish in a single execution without restarting. As for _get_gpus_per_worker, it is only used to prepare or limit the GPU resources for workers, and it depends on whether workers can use them properly. In our own experiments, we always ensure 'gpu_per_worker <= 1`.

YHYHYHYHYHY commented 3 weeks ago

I have made certain modifications to enable multi - GPU training with the Sequential Backend. This modification has been tested on a server equipped with four GPUs, and it has been working smoothly. If possible, I can conduct more detailed tests and then incorporate this function into TFB.

luckiezhou commented 3 weeks ago

Thank you for your interest in contributing to our project! We find it embarrassing that we do not currently have a contribution guideline. So I'll mention some key points for code contribution that come to my mind:

How to: You can fork the repository to your github account, work on your code in a dedicated branch, and create a pull request in TFB main repository. The branch name is better to be descriptive, for example, feat/dp-support-for-xxx-models
Code standards: Code standards help maintain consistency across the codebase. As we do not have a guideline, the code standard problems will be fed back through PR reviewing. According to our experience in past PRs, the majority of problems lie in the naming accuracy of variables, handling of unexpected exceptions, etc. Btw, we recommend formatting the code with black.
Testing: Please submit the code after you think you have done enough testing, and we will also perform some testing during the reviewing process.
Documentation: Please include docstring for any new functions or classes that you think are important, and please update existing documentation such as README.md if it is necessary to reflect your changes.
Pull Request Description: When you create a pull request, provide a detailed description of the changes you've made. Explain the problem you're solving or the feature you're adding, and include any relevant issue numbers.
Review Process: We will review your pull request as soon as possible. Please be patient and responsive to any feedback or requests for changes. This collaborative process helps us maintain high-quality code.
Licensing: Ensure that your contribution is compatible with our project's licensing terms. By submitting your code, you agree that it can be included in the project under our existing license.

Once you've addressed these points, feel free to submit your pull request. We look forward to reviewing your contribution!

Thank you for your collaboration and support.