ROCm / rocFFT

Next generation FFT implementation for ROCm
https://rocm.docs.amd.com/projects/rocFFT/en/latest/
Other
175 stars 84 forks source link

rocfft_aot_helper input checking #430

Closed trixirt closed 8 months ago

trixirt commented 1 year ago

There are several problems

On Fedora, building rocFFT master branch against Fedora's ROCm packages There is this failure

FAILED: library/src/rocfft_kernel_cache.db /home/trix/work/rocm/rocFFT/build/library/src/rocfft_kernel_cache.db
cd /home/trix/work/rocm/rocFFT/build/library/src && /home/trix/work/rocm/rocFFT/build/library/src/rocfft_aot_helper "" /home/trix/work/rocm/rocFFT/build/library/src/rocfft_kernel_cache.db /home/trix/work/rocm/rocFFT/build/library/src/rocfft_rtc_helper gfx906 gfx908 gfx90a gfx1030 gfx1100 gfx1101 gfx1102

Which maps to this command COMMAND rocfft_aot_helper \"${ROCFFT_BUILD_KERNEL_CACHE_PATH}\" ${ROCFFT_KERNEL_CACHE_PATH} $ ${AMDGPU_TARGETS_AOT}

The cmake variable ROCFFT_BUILD_KERNEL_CACHE_PATH is required but not checked for.

The input checking on rocfft_aot_helper does not cache this because it is only

int main(int argc, char** argv)
{
if(argc < 5)
{
puts("Usage: rocfft_aot_helper temp_cachefile.db output_cachefile.db "
"path/to/rocfft_rtc_helper gfx000 gfx001 ...");
return 1;
}

This checking should be improved to check if arguments intended to be files are files and gpu target are targets.

If this offline compiling is optional, and it seems to be, then it should be controlled by some cmake variable like rocBLAS does with -DBUILD_WITH_TENSILE=OFF

evetsso commented 1 year ago

Hi @trixirt , thanks for the bug report.

Do you have the whole build log, and can you attach it here if so? It seems likely that this is the same issue as #422, so I would like to confirm if the process is segfaulting.

rocfft_aot_helper should already be gracefully handling an empty ROCFFT_BUILD_KERNEL_CACHE_PATH, so I'm skeptical that this is the root cause of your problem.

trixirt commented 1 year ago

I was looking into the seg faulting issue as that is a blocker for getting rocFFT into Fedora. I am not sure if this issue is related, but it is not clear to me if the cache generation ever worked. As-is it will fall over unless the config stage sets the ROCFFT_BUILD_KERNEL_CACHE_PATH which is hidden build option with no default. I see this is set by some of .jenkins/* files, but my expectation is the user should not have to dig through these to see how to build the project. Some rough hacking out of the cache generation appears to get past the the seg fault so i am looking at the build option to bypass it.

cgmb commented 1 year ago

rocfft_aot_helper should already be gracefully handling an empty ROCFFT_BUILD_KERNEL_CACHE_PATH

I haven't tried with rocfft from ROCm 5.6, but I can confirm that it did in ROCm 5.5.

evetsso commented 1 year ago

ROCFFT_BUILD_KERNEL_CACHE_PATH is optional and allowed to be empty. It could be documented better, but AFAICT it's not related to the problem you're observing.

It sounds like the segfault in #422 is the root cause here. Modifying the build to not generate the cache will bypass the segfault at build time. But even though that would work, I don't think Fedora (or anyone) should seriously consider doing this, as it would mean:

The effort would be better spent, I think, on fixing the root cause.

evetsso commented 8 months ago

Closing, as the root cause here was in #422. 984913b287b460da0d7ea08c56432de974b1dce6 has further clarified that an empty value for ROCFFT_BUILD_KERNEL_CACHE_PATH is allowed, which should address the confusion here.