Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

DirectML backend implementation prototype #18188

Open trupljan opened 11 months ago

trupljan commented 11 months ago

Description & Motivation

Hello, I want DirectML backend support, so I have implemented a prototype DirectML backend for pytorch-lightning as a starting point for the feature. It is based on the 2.0.6 code from PyPI. Only the single-device strategy is supported. I tested the backend on three GPUs, as I have a triple-GPU system. It seems to work for simple cases, but it is rather experimental. I want to use this for time-series prediction with the Darts framework, but training sometimes freezes with large networks, or takes a very long time to start (depending on batch size; there are no such problems with CUDA). With simple code like the one in test.py it works well.
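The single-device selection mentioned above can be illustrated with a small sketch. `parse_dml_device` is a hypothetical helper (not the prototype's actual code) showing how a `"dml"` / `"dml:N"` device string might be resolved to a device index on a multi-GPU system:

```python
def parse_dml_device(spec: str) -> int:
    """Parse a 'dml' or 'dml:N' device string into a device index.

    Hypothetical helper illustrating the kind of single-device selection a
    DirectML backend needs when several GPUs are present; a bare 'dml'
    defaults to device 0.
    """
    if spec == "dml":
        return 0
    backend, _, index = spec.partition(":")
    if backend != "dml" or not index.isdigit():
        raise ValueError(f"invalid DirectML device spec: {spec!r}")
    return int(index)
```

With three GPUs, `parse_dml_device("dml:2")` would select the third adapter, while plain `"dml"` falls back to the default one.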

Pitch

I implemented the prototype by directly editing files in site-packages in my venv; every edited file contains a DMLPatch tag in a comment at the end of the file, so the changed files are easy to find. I don't have any experience with preparing a pull request, so I would appreciate some help if you are interested: I don't know which branch to fork or how to proceed.

Here is the source code; extracting it into site-packages in a venv should be enough: pytorch_lightning.zip

Testing code: test.py.txt

Example from Jupyter: image

Alternatives

There are no alternatives. I want to be able to use GPUs other than NVIDIA's, as they are cheaper per GB of VRAM.

Additional context

Screenshots showing that it utilizes the correct GPU when launched:

Intel Arc A770 16GB LE: dml_intel

NVIDIA GTX 1650 Super 4GB: dml_nvidia

Ryzen 5600G Vega APU 16GB RAM: dml_amd

cc @borda

Borda commented 11 months ago

Hello, nice to see your integration. Have you tried or experimented with HiveMind, which we have already implemented as a strategy?

trupljan commented 11 months ago

I have managed to fork the master branch and propagate my changes into it. I also improved the module so that when torch-directml is not installed, the DML backend is simply unavailable rather than preventing Lightning from running; DirectML is therefore only an optional backend. It is also used only when the user requests "dml" or "gpu", but not with "auto", since it is slower than the CPU for smaller networks.
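The optional-backend rules described above can be sketched with a small resolver. This is a hypothetical illustration (not Lightning's actual accelerator API): "dml" requires torch-directml, "gpu" falls back to it only when available, and "auto" never selects it:

```python
import importlib.util

# Detect the optional dependency without importing it.
DML_AVAILABLE = importlib.util.find_spec("torch_directml") is not None

def resolve_accelerator(requested: str, dml_available: bool = DML_AVAILABLE) -> str:
    """Hypothetical resolver mirroring the rules above.

    "dml" is honoured only when torch-directml is installed; "gpu" prefers
    DirectML when available; "auto" never picks DirectML, since it can be
    slower than the CPU for small networks.
    """
    if requested == "dml":
        if not dml_available:
            raise RuntimeError("torch-directml is not installed")
        return "dml"
    if requested == "gpu":
        return "dml" if dml_available else "gpu"
    if requested == "auto":
        return "cpu"  # never auto-select DirectML
    return requested
```

Guarding the import with `find_spec` keeps torch-directml a soft dependency, so Lightning still imports cleanly when the package is absent.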

The changes are here: https://github.com/Lightning-AI/lightning/compare/master...trupljan:lightning:master

I will experiment with my implementation for some time and then prepare it for a pull request; please advise on anything else that should be done before I do so.

trupljan commented 11 months ago

OK, so the freezes were caused by the validation DataLoader taking all the memory, probably my fault:

dml_error

What might be the cause?