ROCm / Tensile

Stretching GPU performance for GEMMs and tensor contractions.
MIT License

Use fallback libraries for archs without optimized logic (v2) #1897

Closed GZGavinZhao closed 7 months ago

GZGavinZhao commented 7 months ago

Fixes #1757. Reintroducing #1862.

Enables architectures that don't have optimized logic files to also produce libraries when --separate-architectures or --lazy-library-loading is turned on. Previously, both flags had to be disabled for rocBLAS to run on architectures like gfx1010.

A bug in Tensile's solution indexing caused #1862 to be reverted. That bug appears to have been fixed in #1888.

Test plan:

cmake -GNinja -B build -S . \
    -DCMAKE_C_COMPILER=hipcc \
    -DCMAKE_CXX_COMPILER=hipcc \
    -DBUILD_CLIENTS_TESTS=ON \
    -DBUILD_CLIENTS_BENCHMARKS=OFF \
    -DBUILD_CLIENTS_SAMPLES=OFF \
    -DBUILD_TESTING=ON \
    -DBUILD_WITH_TENSILE=ON \
    -DTensile_PRINT_DEBUG=ON \
    -DTensile_LIBRARY_FORMAT=msgpack \
    -DTensile_CPU_THREADS=14 \
    -DTensile_LAZY_LIBRARY_LOADING=ON \
    -DAMDGPU_TARGETS="..."

With AMDGPU_TARGETS being one of the following

In all cases, $ROCM_PATH/lib/rocblas/library/TensileLibrary_lazy_gfx1010.dat is produced and all other *.dat files remain unchanged.
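A quick way to confirm this after installing is to test for the lazy-loading library file of the target architecture. This is only a minimal sketch for a POSIX shell: the function name `has_lazy_library` is hypothetical, and the directory and file naming are taken from the path quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tensile or rocBLAS): succeeds if the
# lazy-loading library file for the given architecture exists in the
# given rocBLAS library directory.
has_lazy_library() {
    lib_dir=$1   # e.g. "$ROCM_PATH/lib/rocblas/library"
    arch=$2      # e.g. gfx1010
    [ -f "$lib_dir/TensileLibrary_lazy_$arch.dat" ]
}

# Example usage, only meaningful once the build above has been installed:
# has_lazy_library "$ROCM_PATH/lib/rocblas/library" gfx1010 && echo "found"
```

The check only covers the presence of the gfx1010 file. Confirming that the other *.dat files are unchanged would need a separate comparison against a previous build.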

In the second case, ./build/clients/staging/rocblas-test --gtest_filter='*gemm_ex_get_solutions*' that previously failed now passes. I cannot run the full test suite due to limited memory on my GPU (I often get hipOutOfMemory when running stress tests). If this PR doesn't cause extra failures on AMD's CI or if someone can run the full test suite to ensure no additional failures are introduced, then I believe this PR should be good to go. Hopefully this PR can make it in before ROCm 6.1.

hiepxanh commented 7 months ago

Yes, you are my hope ❤️❤️❤️❤️❤️ @GZGavinZhao

GZGavinZhao commented 7 months ago

TensileLibrary_gfx1010.co is not produced while TensileLibrary_gfx1030.co is. Is this expected? I tried running tests emulating my gfx1032 as gfx1010 and they passed, so I think this is fine? Never mind, I was building the wrong configuration.

wangxing7714436 commented 3 months ago

Could any of you upload the gfx1010 .dat file before this PR is merged into ROCm, please? I can't wait to run ollama on my 5700 XT. Many thanks. @GZGavinZhao

GZGavinZhao commented 3 months ago

@wangxing7714436 I can't guarantee it will work, but I can give it a try. What ROCm version do you have?

wangxing7714436 commented 3 months ago

> @wangxing7714436 I can't guarantee it will work, but I can give it a try. What ROCm version do you have?

I installed ROCm 5.7 on Windows, and it has no Tensile files for gfx1010. Many thanks for your reply. @GZGavinZhao

GZGavinZhao commented 3 months ago

@wangxing7714436 Uh, this is a little tricky. I'm not familiar with how rocBLAS works on Windows. If you can't find files like TensileLibrary_Type_4xi8I_HPA_Contraction_l_Alik_Bjlk_Cijk_Dijk_fallback_gfx1010.hsaco, then I'm almost certain this won't work for you. If you can find these files, extract the attached .dat file, put it in the same directory where you found them, and see if it works. If it doesn't, then I think you would unfortunately have to wait until the next ROCm Windows SDK release.

TensileLibrary_lazy_gfx1010.dat.zip
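
The steps described above can be sketched as a small script, assuming a POSIX shell is available (on Windows this would need something like Git Bash). The function name `install_lazy_dat` is hypothetical, and the location of the rocBLAS library directory on Windows is not stated in this thread, so it is left as a placeholder. This is an illustration of the procedure, not a tested installation method.

```shell
#!/bin/sh
# Hypothetical helper (not part of ROCm): copy the extracted .dat file into
# lib_dir only if that directory already holds fallback code objects for
# the given architecture.
install_lazy_dat() {
    lib_dir=$1    # directory expected to contain *_fallback_<arch>.hsaco files
    dat_file=$2   # extracted TensileLibrary_lazy_<arch>.dat
    arch=$3       # e.g. gfx1010

    # Expand the glob into the positional parameters. If nothing matches,
    # the pattern stays literal and the -e test below fails.
    set -- "$lib_dir"/*_fallback_"$arch".hsaco
    if [ ! -e "$1" ]; then
        echo "no fallback .hsaco files for $arch in $lib_dir" >&2
        return 1
    fi
    cp "$dat_file" "$lib_dir/"
}

# Example usage (the library path is a placeholder, not a known location):
# install_lazy_dat /path/to/rocblas/library ./TensileLibrary_lazy_gfx1010.dat gfx1010
```

Refusing to copy when no fallback .hsaco files are present mirrors the caveat above: without those files the .dat file is not expected to help.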