getsolus / packages

Solus Package Monorepo & Issue Tracker

Packaging rest of the ROCm stack (T10310) #197

Open celticmagic opened 1 year ago

celticmagic commented 1 year ago
Jacek Jagosz (#Jacek), 2022-07-12 10:03:31 UTC

Now that T6614 is mostly complete, Solus should soon have OpenCL support on AMD as well as the base HIP package. But this is not enough for most programs that support AMD GPU acceleration using ROCm to work. Here is an incomplete list of packages that could make use of ROCm and necessary dependencies:

## Locally

See the [rocm-5.5.x](https://github.com/GZGavinZhao/packages/tree/rocm-5.5) branch.

- [x] hipify
- [x] rocBLAS
- [x] hipBLAS
- [x] rocFFT
- [x] hipFFT
- [x] rocPRIM
- [x] hipCUB
- [x] rocRAND
- [x] hipRAND
- [x] rocSPARSE
- [x] hipSPARSE
- [x] rocSOLVER
- [x] hipSOLVER
- [x] hipmagma
- [x] miopen
- [x] rccl
- [x] amd-aql-profile
- [x] rocprofiler
- [x] roctracer
- [x] PyTorch
- [x] torchvision
- [ ] torchaudio
- [x] Blender
- [ ] Tensorflow

## Submitted

- [x] hipify
- [x] rocBLAS
- [x] hipBLAS
- [x] rocFFT
- [x] hipFFT
- [x] rocPRIM
- [x] hipCUB
- [x] rocRAND
- [x] hipRAND
- [x] rocSPARSE
- [x] hipSPARSE
- [x] rocSOLVER
- [x] hipSOLVER
- [x] hipmagma
- [x] miopen
- [x] rccl
- [x] amd-aql-profile
- [x] rocprofiler
- [x] roctracer
- [x] PyTorch
- [x] torchvision
- [ ] torchaudio
- [x] Blender ~~(turn off GPU Subdivision if you're getting crashes; this is a separate issue from ROCm that we're investigating)~~ this issue has been fixed
- [ ] Tensorflow

[Here is my repository where I gather all I do with ROCm stack](https://github.com/JacekJagosz/rocm-Solus/blob/main/rocblas/package.yml), including WIP packages. Another one by @GZGavinZhao is [here](https://github.com/GZGavinZhao/solus-rocm).

All help with packaging and testing is welcome, as well as ideas on useful packages.

Also, all packages dependent on HIP seem to have spotty GPU support and take a lot of time to compile, because a separate kernel is generated for each supported GPU.
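
To illustrate why per-GPU kernels make HIP packages slow to build, here is a minimal Python sketch. It assumes the build passes one `--offload-arch=<target>` flag per GPU to the HIP compiler and that the ROCm libraries take their target list through the `AMDGPU_TARGETS` CMake variable. The helper names, the example target list, and the source count are hypothetical and not taken from any Solus `package.yml`.

```python
from typing import List


def unique_targets(targets: List[str]) -> List[str]:
    """Drop duplicate GPU targets while keeping the original order."""
    return list(dict.fromkeys(targets))


def offload_arch_flags(targets: List[str]) -> List[str]:
    """Return one --offload-arch flag per distinct GPU target."""
    return [f"--offload-arch={t}" for t in unique_targets(targets)]


def cmake_targets_arg(targets: List[str]) -> str:
    """Return the semicolon-separated AMDGPU_TARGETS argument for CMake."""
    return "-DAMDGPU_TARGETS=" + ";".join(unique_targets(targets))


def device_compilations(num_sources: int, targets: List[str]) -> int:
    """Rough count of device-side compilations: every HIP source file
    is compiled once for every distinct target."""
    return num_sources * len(unique_targets(targets))


# Hypothetical example values, chosen only for illustration.
example_targets = ["gfx900", "gfx1030", "gfx1032"]
print(offload_arch_flags(example_targets))
print(cmake_targets_arg(example_targets))
print(device_compilations(200, example_targets))
```

The device-side work grows linearly with the number of targets, so trimming the target list is the most direct way to cut build time, at the cost of narrower GPU support.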
celticmagic commented 1 year ago
Gavin Zhao (#GZGavinZhao), 2022-11-23 19:11:24 UTC

The main holdback is LLVM 15. For the past month or so, I've attempted both ROCm 5.1 and 5.2, and here are the main takeaways:

- ROCm 5.1
  - `rocfft` can't build due to cryptic LLVM machine code errors, which basically prevents building any ML software for ROCm.
  - Can't build `blender` because it requires the `__noinline__` attribute to be defined, which is only available for LLVM 15.
- ROCm 5.2
  - All packages can be built successfully with the current LLVM 14, but it requires 6 patches (including the `__noinline__` one above).
  - Even with the patched LLVM, `blender` still fails to build due to some other cryptic LLVM machine code errors.
  - PyTorch fails to detect the correct device machine code ("hipErrorNoBinaryForGpu: Unable to find code object for all current devices!") no matter what I do. `HSA_OVERRIDE_GFX_VERSION` with `AMD_LOG_LEVEL` debugging shows that it jumps between machine code detections (e.g. says the file is for `gfx1032` when I'm emulating `gfx1030`, and says the file is for `gfx1030` when I'm on my own `gfx1032` device).

Full list of patches required to build ROCm with our LLVM:

1. Code object v5: afc9d674fe5a14b95c50a38d8605a159c2460427
2. Link code objects correctly: 092f15ac40ce35d077e0225a4462bc4dfa379391
3. `__noinline__`: d4e4ef2e81e03246e29e9b6eaa2929ebd4e77784 (the rest are required to apply the `__noinline__` patch)
4. 6655c5a6bb13a7db483d1eea6e1071972b13a62d
5. 223b8240223541d3feb0c96b7f9bac114cd72f46
6. 56e7d6bd444cef8d879adc35dcf461cb4d2ed6d5

Repo for 5.2.3 [here](https://github.com/GZGavinZhao/solus-rocm) if anyone wants to take a stab.
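
For anyone trying to reproduce the device-detection debugging described above, here is a rough Python sketch that builds the environment for such a run. It assumes the ROCm runtime reads the override from `HSA_OVERRIDE_GFX_VERSION` in `major.minor.stepping` form (so `gfx1030` becomes `10.3.0`) and that `AMD_LOG_LEVEL` takes a small integer. The helper names and the default log level of 3 are hypothetical, and targets with a letter in the name (such as `gfx90a`) are rejected rather than guessed at.

```python
import os
import re
from typing import Dict, Optional


def gfx_to_override(gfx: str) -> str:
    """Convert a purely numeric target such as 'gfx1030' to '10.3.0'.

    Assumption: the last two digits are minor and stepping, and
    everything before them is the major version.
    """
    match = re.fullmatch(r"gfx(\d+)(\d)(\d)", gfx)
    if match is None:
        raise ValueError(f"unsupported or non-numeric target: {gfx!r}")
    major, minor, stepping = match.groups()
    return f"{int(major)}.{minor}.{stepping}"


def debug_env(gfx: str, log_level: int = 3,
              base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the override and logging set."""
    env = dict(os.environ if base is None else base)
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_to_override(gfx)
    env["AMD_LOG_LEVEL"] = str(log_level)
    return env


# Show only the two variables that would be added for a gfx1030 run.
print(debug_env("gfx1030", base={}))
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the program under test, which keeps the override out of the parent shell.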
celticmagic commented 1 year ago
Joey Riches (#joebonrichie), 2022-12-01 09:12:33 UTC

Feel free to package the ROCm-specific bundled LLVM for the time being if you want. I think it'll be at least until LLVM 17 before ROCm and upstream LLVM are better aligned.
celticmagic commented 1 year ago
TraceyC (@TraceyC77), 2023-06-17 05:32:01 UTC

@GZGavinZhao - is this task deprecated or still relevant after the recent work on the ROCm stack?
celticmagic commented 1 year ago
Rune Morling (@ermo), 2023-06-17 13:53:11 UTC

Re-assigning this to Jacek as this will be one of his responsibilities when he joins.
celticmagic commented 1 year ago
Jacek Jagosz (@JacekJagosz), 2023-06-17 16:50:02 UTC

There are a few things to be worked on with ROCm:

- Make Blender build with our ROCm stack. No additional dependencies are needed; this looks like something with Blender's build system, but might need some tweaks to -HIP
- Package all dependencies necessary for the likes of PyTorch or Tensorflow. Even if we do all the work, it will still need to be decided if we want to enable it in our repo, as building kernels for all GPU architectures will take a lot of time
- When a new LLVM eventually comes, we will have to update to ROCm 5.4.x

Gavin has done a lot of work on packaging Tensorflow already, and we both spent some time trying to make Blender build. Not sure who will take those tasks on, or when.
davidjharder commented 11 months ago

Hey @GZGavinZhao, can you take a look at this? I ticked off Blender.