Open trupljan opened 1 year ago
Hello, nice to see your integration. Have you tried or experimented with HiveMind, which we have already implemented as a strategy...
I have managed to fork the master branch and propagate my changes into it. I also improved the module so that when torch-directml is not installed, the DML backend is simply unavailable and does not prevent Lightning from running; DirectML is thus only an optional backend. It is also used only when the user requests "dml" or "gpu", not with "auto", as it is slower than the CPU for smaller networks.
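The selection rule described above could look roughly like the following. This is only a minimal sketch, not the code from the fork: `dml_available` and `resolve_accelerator` are hypothetical names, and preferring CUDA over DirectML when "gpu" is requested is an assumption made for illustration. The only real detail relied on is that the torch-directml package is imported as `torch_directml`.

```python
import importlib.util


def dml_available() -> bool:
    """Return True if the optional torch-directml package is installed."""
    # find_spec only looks the module up; it does not import it.
    return importlib.util.find_spec("torch_directml") is not None


def resolve_accelerator(requested: str, has_dml: bool, has_cuda: bool = False) -> str:
    """Map a user request to a backend name (hypothetical helper)."""
    requested = requested.lower()
    if requested == "dml":
        if not has_dml:
            raise RuntimeError("'dml' was requested but torch-directml is not installed")
        return "dml"
    if requested == "gpu":
        # Assumption: CUDA wins over DirectML when both are present.
        if has_cuda:
            return "cuda"
        if has_dml:
            return "dml"
        raise RuntimeError("'gpu' was requested but no GPU backend is available")
    if requested == "auto":
        # DirectML is deliberately never picked by "auto",
        # since it can be slower than the CPU for small networks.
        return "cuda" if has_cuda else "cpu"
    return requested


if __name__ == "__main__":
    print(resolve_accelerator("auto", has_dml=dml_available()))
```

With this shape, a missing torch-directml install only matters when the user explicitly asks for "dml" (or "gpu" with no other GPU backend), so the rest of the library keeps working without it.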
The changes are here: https://github.com/Lightning-AI/lightning/compare/master...trupljan:lightning:master
I will play with my implementation for some time and then prepare it for a pull request. Please give any advice on what else should be done before I do so.
OK, so the freezes were caused by the validation DataLoader taking all the memory, probably my fault:
What might be the cause?
Description & Motivation
Hello, I would like DirectML backend support, so I have implemented a prototype DirectML backend for pytorch-lightning as a starting point for the feature. It is based on the 2.0.6 code from PyPI. Only the single-device strategy is supported. I tested the backend on three GPUs, as I have a triple-GPU system. It seems to work for simple cases, but it is rather experimental. I want to use it for time series prediction with the Darts framework, but it sometimes freezes with large networks, or takes a very long time before training starts (this depends on batch size, but there are no such problems with CUDA...). With simple stuff like in test.py it works well.
Pitch
I implemented the prototype by directly editing files in site-packages in my venv environment; every edited file contains the tag DMLPatch in a comment at the end of the file so that changed files are easy to find. I don't have any experience with preparing a pull request, so I would appreciate some help in case you are interested. I don't know which branch to fork to prepare a pull request or how to proceed.
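For anyone who wants to list the edited files, a small script along these lines should work. It is a sketch only: `find_patched_files` is a hypothetical helper, and it assumes the tag appears literally as the text DMLPatch somewhere in each edited `.py` file.

```python
from pathlib import Path


def find_patched_files(root, tag="DMLPatch"):
    """Return the sorted list of .py files under root that contain tag."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            # errors="ignore" keeps odd encodings from aborting the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it
        if tag in text:
            hits.append(path)
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_patched.py <path to site-packages/pytorch_lightning>
    for patched in find_patched_files(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(patched)
```

Pointing it at the extracted pytorch_lightning directory gives the set of files that would need to be carried over into a fork.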
Here is the source code; it should be enough to extract it into site-packages in the venv: pytorch_lightning.zip
Testing code: test.py.txt
Example from Jupyter:
Alternatives
There are no alternatives. I want to be able to use GPUs other than those from NVIDIA, as they are cheaper per GB of VRAM.
Additional context
Showing that it utilizes the correct GPU when launched:
Intel Arc A770 16GB LE:
NVIDIA GTX 1650 Super 4GB:
Ryzen 5600G Vega APU 16GB RAM:
cc @borda