ROCm / rocBLAS

Next generation BLAS implementation for ROCm platform
https://rocm.docs.amd.com/projects/rocBLAS/en/latest/

Hotfix guard kernel launches #1364

Closed TorreZuk closed 10 months ago

TorreZuk commented 10 months ago
amcamd commented 10 months ago

Can you add a section to the file docs/API_Reference_Guide.rst, after the section Asynchronous API? It should explain the use of hipPeekAtLastError(). Something like the text below, but please edit it:

Kernel launch status error checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The function hipPeekAtLastError() is called before and after kernel launches. This detects incorrect launch parameters, for example exceeding the maximum work-group size or the maximum number of work-items. It also detects if the code is running on the incorrect GPU. Note that hipPeekAtLastError() does not flush the last error. A disadvantage of using hipPeekAtLastError() is that if the last error left by an earlier kernel launch is the same as the error from the current kernel, then no error is reported. You can prevent this by flushing the last error with hipGetLastError() before calling a rocBLAS function. Note that both hipPeekAtLastError() and hipGetLastError() run synchronously on the CPU and they only check the kernel launch, not the asynchronous work done by the kernel.
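To make the masking case concrete, here is a minimal sketch in plain C++ that models the behavior described above. It does not call HIP or rocBLAS. All names in it (`Err`, `g_last_error`, `peek_last_error`, `get_last_error`, `fake_launch`, `guarded_launch`) are hypothetical stand-ins invented for illustration. It also assumes that a successful launch leaves the stored last error untouched, and that the guard reports an error only when the peeked value changes across the launch.

```cpp
#include <cassert>

// Hypothetical error codes; these are NOT the real hipError_t values.
enum class Err { Success, InvalidConfiguration, InvalidValue };

// Model of the runtime's "last error" slot (an assumption of this sketch).
static Err g_last_error = Err::Success;

// Stand-in for hipPeekAtLastError(): reads the slot without clearing it.
Err peek_last_error() { return g_last_error; }

// Stand-in for hipGetLastError(): reads the slot and resets it to Success.
Err get_last_error() {
    Err e = g_last_error;
    g_last_error = Err::Success;
    return e;
}

// Hypothetical launch: a failed launch records its error,
// a successful launch leaves the slot as it was.
void fake_launch(Err outcome) {
    if (outcome != Err::Success) g_last_error = outcome;
}

// Guard in the style described above: peek before and after the launch
// and report an error only if the peeked value changed.
Err guarded_launch(Err outcome) {
    Err before = peek_last_error();
    fake_launch(outcome);
    Err after = peek_last_error();
    return (after != before) ? after : Err::Success;
}
```

In this model, if an earlier failure left `InvalidConfiguration` in the slot, a new launch failing with the same code is reported as `Success` by `guarded_launch`. Calling `get_last_error()` first clears the slot, and the same failing launch is then reported as `InvalidConfiguration`.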

TorreZuk commented 10 months ago

Okay, I can add something like that. I think it checks thread limits but not work-item limits. The other launch styles check max work-groups, but a count of 0 work-groups is checked. ... The pre-existing error may not be from a kernel launch, though a match between the two errors may be very unlikely.

amcamd commented 10 months ago

OK, it will be best to list only some of the things it is known to check. We do not want to claim it checks something we do not know it checks, and we do not want to document hipPeekAtLastError(). We do not document what we do not implement :)

TorreZuk commented 10 months ago

> OK, it will be best to list only some of the things it is known to check. We do not want to claim it checks something we do not know it checks, and we do not want to document hipPeekAtLastError(). We do not document what we do not implement :)

You can review the last commit.