IAHispano / Applio

A simple, high-quality voice conversion tool focused on ease of use and performance.
https://applio.org
MIT License

Support for CUDA emulator for AMD GPUs #513

Closed: AznamirWoW closed this 2 months ago

AznamirWoW commented 2 months ago

- Disabled cuDNN when ZLUDA is detected.
- Added a fallback for FFT, since it is not supported by the HIP SDK.
- Prevented a HIP crash by using a jit decorator.
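
The detection and the cuDNN toggle could look roughly like this (a sketch, not the PR's actual code; the `[ZLUDA]` suffix in the reported device name is an assumption based on how lshqqytiger's ZLUDA builds commonly identify themselves, so verify locally):

```python
def device_is_zluda(device_name: str) -> bool:
    # ZLUDA builds typically report the underlying AMD GPU with a
    # "[ZLUDA]" suffix in the CUDA device name (assumption; verify locally).
    return device_name.endswith("[ZLUDA]")

# In Applio this check would gate the cuDNN toggle, roughly:
#   if torch.cuda.is_available() and device_is_zluda(torch.cuda.get_device_name(0)):
#       torch.backends.cudnn.enabled = False   # cuDNN is unsupported under ZLUDA
print(device_is_zluda("AMD Radeon RX 6700 XT [ZLUDA]"))  # → True
```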

blaisewf commented 2 months ago

@deiteris is an AMD pro, maybe he can share some thoughts about these changes, because I can't test them 😅

blaisewf commented 2 months ago

also, the installer needs some modifications to install all the necessary libraries

deiteris commented 2 months ago

Unfortunately I cannot test yet (RX 6600 is not supported by HIP SDK and recompiling it on Windows is a pita), so atm I can either trust blindly or if you can show it working :) See review comments too.

blaisewf commented 2 months ago

cool reviews! is there a huge difference in speed between zluda and dml?

AznamirWoW commented 2 months ago

> Unfortunately I cannot test yet (RX 6600 is not supported by HIP SDK and recompiling it on Windows is a pita), so atm I can either trust blindly or if you can show it working :) See review comments too.

you just need a set of compiled libraries for gfx1032: https://cdn.discordapp.com/attachments/1207548125841989702/1262184101889441792/gfx1032.7z?ex=669f8f4c&is=669e3dcc&hm=522a86aee4e84f8294125ddd54ec7c631089269dcde7b7c25aaeea42a6b196e0&

deiteris commented 2 months ago

> you just need a set of compiled libraries for gfx1032 https://cdn.discordapp.com/attachments/1207548125841989702/1262184101889441792/gfx1032.7z?ex=669f8f4c&is=669e3dcc&hm=522a86aee4e84f8294125ddd54ec7c631089269dcde7b7c25aaeea42a6b196e0&

Oh, thanks! I'm in the process of setting up everything, so I'll update you shortly.

deiteris commented 2 months ago

Doesn't seem to work correctly for me, unfortunately. The ZLUDA configuration seems to apply on startup (screenshot omitted).

And I can see the ZLUDA device in Advanced Settings in training (screenshot omitted).

But inference simply gets stuck (I waited more than 1 min, but it just doesn't finish). The audio sample is around 5 seconds long; it should've been processed in no time (takes ~70ms with the DirectML backend).

I see that GPU memory gets allocated, but no GPU usage occurs and there are no errors 😢 (screenshot omitted).

Here are the steps that I took:

1. Downloaded and installed the HIP SDK from https://www.amd.com/en/developer/resources/rocm-hub/eula/licenses.html?filename=AMD-Software-PRO-Edition-24.Q3-Win10-Win11-For-HIP.exe.
2. Downloaded the Tensile libraries that you provided and extracted them into `C:\Program Files\AMD\ROCm\6.1\bin\rocblas\library` (replaced files when prompted).
3. Downloaded ZLUDA from here: https://github.com/lshqqytiger/ZLUDA/releases/tag/rel.86cdab3b14b556e95eafe370b8e8a1a80e8d093b
4. Extracted ZLUDA into the Applio folder for convenience.
5. Downloaded and installed Python 3.10.11.
6. Created a venv with `python -m venv venv` and activated it with `.\venv\Scripts\Activate.ps1`.
7. Modified the requirements to include torch built with CUDA 11.8.
8. Installed the requirements with `pip install -r requirements.txt`.
9. Copied the ZLUDA DLLs into my venv folder with a batch script:
   ```
   copy zluda\cublas.dll venv\Lib\site-packages\torch\lib\cublas64_11.dll /y
   copy zluda\cusparse.dll venv\Lib\site-packages\torch\lib\cusparse64_11.dll /y
   copy zluda\nvrtc.dll venv\Lib\site-packages\torch\lib\nvrtc64_112_0.dll /y
   ```
10. Finally, ran it with `.\zluda\zluda.exe -- python .\app.py`.
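
The DLL renaming in step 9 could equally be scripted in Python; a minimal sketch (the mapping mirrors the `copy` commands above, the helper name is hypothetical):

```python
import shutil
from pathlib import Path

# ZLUDA DLL -> filename that the CUDA 11.x build of torch expects
DLL_MAP = {
    "cublas.dll": "cublas64_11.dll",
    "cusparse.dll": "cusparse64_11.dll",
    "nvrtc.dll": "nvrtc64_112_0.dll",
}

def patch_torch_dlls(zluda_dir, torch_lib):
    """Copy ZLUDA's DLLs over torch's CUDA runtime DLLs; return the names copied."""
    copied = []
    for src_name, dst_name in DLL_MAP.items():
        src = Path(zluda_dir) / src_name
        if src.exists():
            shutil.copy2(src, Path(torch_lib) / dst_name)
            copied.append(dst_name)
    return copied
```
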
AznamirWoW commented 2 months ago

> But inference simply gets stuck (I waited more than 1 min, but it just doesn't finish). Audio sample is around 5 seconds long, it should've been processed in no time (takes ~70ms with DirectML backend).

The initial run for inference and training may take a while (5-20 minutes) because ZLUDA compiles the kernel code in the background. That only happens once.

AznamirWoW commented 2 months ago

here's my benchmark

7600X CPU:

```
Converting audio 'assets\audios\12_min_audio.mp3'...
Conversion completed at 'assets\audios\12_min_audio_output.wav' in 433.26 seconds.
```

7600X CPU + 6700XT GPU / ZLUDA:

```
Converting audio 'assets\audios\12_min_audio.mp3'...
Conversion completed at 'assets\audios\12_min_audio_output.wav' in 32.49 seconds.
```
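
For the 12-minute clip above, those two timings work out to roughly a 13x speedup for ZLUDA over the CPU path:

```python
cpu_s, zluda_s = 433.26, 32.49  # conversion times from the benchmark above

speedup = cpu_s / zluda_s
print(f"{speedup:.1f}x")  # → 13.3x
```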

deiteris commented 2 months ago

Oh, you're right. It did take quite a while to compile (1226 seconds to be precise) and peaked at 4.2GB of RAM at some point. Inference seems to work well in general, at least I don't observe any issues with it.

Might be unrelated to ZLUDA, but inference consumes a lot of VRAM. For example, this is with 01:36 of audio (screenshot omitted):

```
Converting audio 'assets\audios\Pura_Pura_Lupa_VERSI_JEPANG_Mahen_-_Andi_Adinata_Cover.m4a_Vocals_Mel-RoFormer_No_Reverb.wav'...
Conversion completed at 'assets\audios\Pura_Pura_Lupa_VERSI_JEPANG_Mahen_-_Andi_Adinata_Cover.m4a_Vocals_Mel-RoFormer_No_Reverb_output.wav' in 11.57 seconds.
```

And this is with 04:21 of audio (screenshot omitted):

```
Converting audio 'assets\audios\Eminem_-_Lose_Yourself_vocals_only_explicit_content_MORE_intense_without_the_music__CHECK_IT.wav'...
Conversion completed at 'assets\audios\Eminem_-_Lose_Yourself_vocals_only_explicit_content_MORE_intense_without_the_music__CHECK_IT_output.wav' in 53.29 seconds.
```

Haven't tested training yet.

AznamirWoW commented 2 months ago

What's the point of having VRAM and not using it?

I'm more concerned about Applio holding ~4GB after inference is done

(screenshot omitted)
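
That held memory is most likely torch's caching allocator keeping freed blocks for reuse rather than a leak; a common (not Applio-specific) remedy after inference is sketched below, with the torch calls in comments since they need a working CUDA/ZLUDA runtime:

```python
def gib(nbytes: int) -> float:
    """Convert bytes to GiB for readable memory reports."""
    return round(nbytes / 2**30, 2)

# The ~4GB held after inference is typically torch's caching allocator
# retaining freed blocks for reuse. To hand it back to the driver:
#     import gc, torch
#     gc.collect()                  # drop Python references to tensors first
#     torch.cuda.empty_cache()      # release cached blocks back to the driver
#     print(gib(torch.cuda.memory_reserved()))  # should drop substantially
print(gib(4 * 2**30))  # → 4.0
```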

deiteris commented 2 months ago

> What's the point of having VRAM and not using it?

What I mean is that it has a very negative performance impact. Once it hits the VRAM limit, it spills over into system RAM. And in my case, it took 8GB of VRAM + 3GB of system RAM, which is unacceptable.

But anyway, someone with an Nvidia GPU needs to confirm this behavior. IMO, this is out of the scope of this PR.

LGTM.

blaisewf commented 2 months ago

if all the changes are okay, we have to find an easy way for the users to run it

AznamirWoW commented 2 months ago

> if all the changes are okay, we have to find an easy way for the users to run it

I have a .bat file that does the cu118 torch download and the ZLUDA patching.
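
The torch part of such a script boils down to installing from the official CUDA 11.8 wheel index (https://download.pytorch.org/whl/cu118); a sketch that just builds the pip command (the helper name is hypothetical):

```python
import sys

CU118_INDEX = "https://download.pytorch.org/whl/cu118"  # official cu118 wheel index

def torch_cu118_install_cmd(python_exe=None):
    """Build the pip command that installs torch from the CUDA 11.8 wheel index."""
    return [python_exe or sys.executable, "-m", "pip", "install",
            "torch", "--index-url", CU118_INDEX]

print(torch_cu118_install_cmd("python")[4:])
# → ['torch', '--index-url', 'https://download.pytorch.org/whl/cu118']
```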

blaisewf commented 2 months ago

> I have a .bat file that does the cu118 torch download and zluda patching

we can add run-applio-amd and run-install-amd scripts