Closed: AznamirWoW closed this 2 months ago
@deiteris is an AMD pro, maybe he can share some thoughts about these changes, because I can't test them 😅
also, the installer needs some modifications to install all the necessary libraries
Unfortunately I cannot test yet (RX 6600 is not supported by HIP SDK and recompiling it on Windows is a pita), so atm I can either trust blindly or if you can show it working :) See review comments too.
Cool reviews! Is there a huge difference in speed between ZLUDA and DML?
For the RX 6600 you just need a set of compiled libraries for gfx1032: https://cdn.discordapp.com/attachments/1207548125841989702/1262184101889441792/gfx1032.7z?ex=669f8f4c&is=669e3dcc&hm=522a86aee4e84f8294125ddd54ec7c631089269dcde7b7c25aaeea42a6b196e0&
Oh, thanks! I'm in the process of setting up everything, so I'll update you shortly.
Doesn't seem to work correctly for me, unfortunately. The ZLUDA configuration seems to apply on startup, and I can see the ZLUDA device in Advanced Settings in training.

But inference simply gets stuck (I waited more than 1 min, but it just doesn't finish). The audio sample is around 5 seconds long, so it should've been processed in no time (it takes ~70 ms with the DirectML backend). I see that GPU memory gets allocated, but no GPU usage occurs and there are no errors 😢

Here are the steps that I took:

1. Downloaded and installed the HIP SDK from https://www.amd.com/en/developer/resources/rocm-hub/eula/licenses.html?filename=AMD-Software-PRO-Edition-24.Q3-Win10-Win11-For-HIP.exe.
2. Downloaded the Tensile libraries that you provided and extracted them into `C:\Program Files\AMD\ROCm\6.1\bin\rocblas\library` (replaced files when prompted).
3. Downloaded ZLUDA from here: https://github.com/lshqqytiger/ZLUDA/releases/tag/rel.86cdab3b14b556e95eafe370b8e8a1a80e8d093b
4. Extracted ZLUDA in the Applio folder for convenience. (screenshot)
5. Downloaded and installed Python 3.10.11.
6. Created a venv with `python -m venv venv` and activated it with `.\venv\Scripts\Activate.ps1`.
7. Modified the requirements to include torch built with CUDA 11.8. (screenshot)
8. Installed the requirements with `pip install -r requirements.txt`.
9. Copied the ZLUDA DLLs into the venv folder with a batch script:
   ```
   copy zluda\cublas.dll venv\Lib\site-packages\torch\lib\cublas64_11.dll /y
   copy zluda\cusparse.dll venv\Lib\site-packages\torch\lib\cusparse64_11.dll /y
   copy zluda\nvrtc.dll venv\Lib\site-packages\torch\lib\nvrtc64_112_0.dll /y
   ```
10. Finally, ran it with `.\zluda\zluda.exe -- python .\app.py`.
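For reference, a minimal sanity check that torch actually sees the GPU when launched through ZLUDA (`check_gpu.py` is just an illustrative file name):

```python
# check_gpu.py: run through ZLUDA, e.g. .\zluda\zluda.exe -- python check_gpu.py
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))  # should report the AMD card
```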
The initial run for inference and training may take a while (5-20 minutes) because ZLUDA compiles the kernel code in the background. That only happens once.
Here's my benchmark:
7600X CPU:
Converting audio 'assets\audios\12_min_audio.mp3'... Conversion completed at 'assets\audios\12_min_audio_output.wav' in 433.26 seconds.
7600X CPU + 6700XT GPU / ZLUDA:
Converting audio 'assets\audios\12_min_audio.mp3'... Conversion completed at 'assets\audios\12_min_audio_output.wav' in 32.49 seconds.
Oh, you're right. It did take quite a while to compile (1226 seconds to be precise) and peaked at 4.2GB of RAM at some point. Inference seems to work well in general, at least I don't observe any issues with it.
Might be unrelated to ZLUDA, but inference consumes a lot of VRAM. For example, this is with 01:36 audio length:
Converting audio 'assets\audios\Pura_Pura_Lupa_VERSI_JEPANG_Mahen_-_Andi_Adinata_Cover.m4a_Vocals_Mel-RoFormer_No_Reverb.wav'...
Conversion completed at 'assets\audios\Pura_Pura_Lupa_VERSI_JEPANG_Mahen_-_Andi_Adinata_Cover.m4a_Vocals_Mel-RoFormer_No_Reverb_output.wav' in 11.57 seconds.
And this is with 04:21 audio length:
Converting audio 'assets\audios\Eminem_-_Lose_Yourself_vocals_only_explicit_content_MORE_intense_without_the_music__CHECK_IT.wav'...
Conversion completed at 'assets\audios\Eminem_-_Lose_Yourself_vocals_only_explicit_content_MORE_intense_without_the_music__CHECK_IT_output.wav' in 53.29 seconds.
Haven't tested training yet.
What's the point of having VRAM and not using it?
I'm more concerned about Applio holding ~4GB after inference is done
What I mean is that it has a very negative performance impact. Once it hits the VRAM limit, it spills over into system RAM. And in my case, it took 8 GB of VRAM + 3 GB of system RAM, which is unacceptable.
But anyway, someone with an Nvidia GPU needs to confirm this behavior. IMO, this is out of scope for this PR.
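For what it's worth, a minimal sketch of how that cached memory could be released once inference finishes, assuming the pipeline drops its references to the model and intermediate tensors first (the function name is illustrative, not Applio's actual API):

```python
import gc
import torch

def release_gpu_memory():
    # Collect any lingering Python references, then return PyTorch's
    # cached (but unused) allocator blocks to the driver.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

Note that `torch.cuda.empty_cache()` only frees blocks the caching allocator is holding onto; memory still referenced by live tensors stays allocated.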
LGTM.
If all the changes are okay, we have to find an easy way for users to run it.
I have a .bat file that does the cu118 torch download and zluda patching
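For others following along, a rough sketch of what such a .bat could look like, assuming the ZLUDA release is already unpacked into a `zluda\` folder next to the venv (the torch version pins are illustrative):

```bat
@echo off
rem Reinstall torch built against CUDA 11.8 into the existing venv
venv\Scripts\pip install torch==2.3.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118

rem Patch torch with the ZLUDA DLLs (same replacements as in the manual steps above)
copy /y zluda\cublas.dll venv\Lib\site-packages\torch\lib\cublas64_11.dll
copy /y zluda\cusparse.dll venv\Lib\site-packages\torch\lib\cusparse64_11.dll
copy /y zluda\nvrtc.dll venv\Lib\site-packages\torch\lib\nvrtc64_112_0.dll
```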
We can add `run-applio-amd` and `run-install-amd` scripts.
- Disabled cuDNN when ZLUDA is detected
- Added a fallback for FFT, since it is not supported by the HIP SDK
- Prevented a HIP crash caused by the jit decorator
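Roughly, the idea is something like the sketch below (not the exact code from this PR; the `[ZLUDA]` device-name check and the CPU round-trip for `torch.stft`/`torch.istft` are assumptions about how the detection and FFT fallback could be wired up):

```python
import torch

def is_zluda(device: str = "cuda:0") -> bool:
    # ZLUDA installs typically report the GPU name with a "[ZLUDA]" suffix (assumption)
    return torch.cuda.is_available() and "[ZLUDA]" in torch.cuda.get_device_name(device)

if is_zluda():
    # cuDNN paths are unreliable under ZLUDA, so disable them outright
    torch.backends.cudnn.enabled = False

    # No working FFT under the HIP SDK here, so route stft/istft through
    # the CPU and move the result back to the original device.
    _orig_stft, _orig_istft = torch.stft, torch.istft

    def _stft(input, *args, window=None, **kwargs):
        window = window.cpu() if window is not None else None
        return _orig_stft(input.cpu(), *args, window=window, **kwargs).to(input.device)

    def _istft(input, *args, window=None, **kwargs):
        window = window.cpu() if window is not None else None
        return _orig_istft(input.cpu(), *args, window=window, **kwargs).to(input.device)

    torch.stft, torch.istft = _stft, _istft
```

The jit-decorator part of the change (preventing the HIP crash) isn't sketched here.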