On Frontier, using ROCm 5.7 (5.6 should be the same), the sample code appears to work properly:
Number of NVIDIA GPU found on node = 8
TAL-SH has been initialized: Status 0: Host buffer size = 1072693248
Three TAL-SH tensor blocks have been constructed: Volumes: 360000, 810000, 360000: GFlops = 0.648000
Tensor contraction has been scheduled for execution: Status 0
Tensor contraction has completed successfully: Status 2000005: Time 1.461498 sec
Tensor contraction total time = 0.280260: GFlop/s = 2.312139
Tensor result was moved back to Host: Norm1 = 6.480000E+03: Correct = 6.480000E+03
Three external tensor blocks have been unregistered with TAL-SH
TAL-SH has been shut down: Status 0
Using ROCm 5.1.0, the run completes, but Norm1 does not match the reference value:
Number of NVIDIA GPU found on node = 8
TAL-SH has been initialized: Status 0: Host buffer size = 1072693248
Three TAL-SH tensor blocks have been constructed: Volumes: 360000, 810000, 360000: GFlops = 0.648000
Tensor contraction has been scheduled for execution: Status 0
Tensor contraction has completed successfully: Status 2000005: Time 9.924454 sec
Tensor contraction total time = 9.105721: GFlop/s = 0.071164
Tensor result was moved back to Host: Norm1 = 1.383536E-01: Correct = 6.480000E+03
Three external tensor blocks have been unregistered with TAL-SH
TAL-SH has been shut down: Status 0
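For context, the Norm1 value in the output above is just the 1-norm of the result tensor computed on the host after the contraction, compared against the analytically known value. Below is a minimal, self-contained sketch of that kind of check; the function name, tolerance, and reference value are placeholders for illustration, not the actual test.cpp code.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical check mirroring the "Norm1 ... Correct ..." lines in the log:
// sum the absolute values of the result tensor on the host and compare it to
// the analytically expected 1-norm.
bool check_norm1(const std::vector<double> &result, double norm1_ref,
                 double tol = 1e-10)
{
  double norm1 = 0.0;
  for(double v : result) norm1 += std::fabs(v);
  std::printf("Norm1 = %E: Correct = %E\n", norm1, norm1_ref);
  return std::fabs(norm1 - norm1_ref) <= tol * std::fabs(norm1_ref);
}
```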
The same mismatch occurs with ROCm 5.4.3:
Number of NVIDIA GPU found on node = 8
#WARNING(tensor_algebra_gpu_nvidia:init_gpus): Unable to set GPU SHMEM width 8: Error 2
#WARNING(tensor_algebra_gpu_nvidia:init_gpus): Unable to set GPU SHMEM width 8: Error 2
#WARNING(tensor_algebra_gpu_nvidia:init_gpus): Unable to set GPU SHMEM width 8: Error 2
#WARNING(tensor_algebra_gpu_nvidia:init_gpus): Unable to set GPU SHMEM width 8: Error 2
#WARNING(tensor_algebra_gpu_nvidia:init_gpus): Unable to set GPU SHMEM width 8: Error 2
#WARNING(tensor_algebra_gpu_nvidia:init_gpus): Unable to set GPU SHMEM width 8: Error 2
#WARNING(tensor_algebra_gpu_nvidia:init_gpus): Unable to set GPU SHMEM width 8: Error 2
#WARNING(tensor_algebra_gpu_nvidia:init_gpus): Unable to set GPU SHMEM width 8: Error 2
TAL-SH has been initialized: Status 0: Host buffer size = 1072693248
Three TAL-SH tensor blocks have been constructed: Volumes: 360000, 810000, 360000: GFlops = 0.648000
Tensor contraction has been scheduled for execution: Status 0
Tensor contraction has completed successfully: Status 2000005: Time 1.028282 sec
Tensor contraction total time = 0.329818: GFlop/s = 1.964722
Tensor result was moved back to Host: Norm1 = 4.352281E-03: Correct = 6.480000E+03
Three external tensor blocks have been unregistered with TAL-SH
TAL-SH has been shut down: Status 0
I've observed that with ROCm versions below 5.6 there is an issue with optimization levels beyond -O1 in hipcc, which can result in runtime errors such as
:0:rocdevice.cpp :2614: 3360862704560 us: 16818: [tid:0x7fff99b97700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
when running test_talsh.x.
So, coming back to ROCm 5.1.0: reducing the optimization level to -O1 for hipcc yields a Norm1 that matches the reference:
Number of NVIDIA GPU found on node = 8
TAL-SH has been initialized: Status 0: Host buffer size = 1072693248
Three TAL-SH tensor blocks have been constructed: Volumes: 360000, 810000, 360000: GFlops = 0.648000
Tensor contraction has been scheduled for execution: Status 0
Tensor contraction has completed successfully: Status 2000005: Time 9.875879 sec
Tensor contraction total time = 9.014154: GFlop/s = 0.071887
Tensor result was moved back to Host: Norm1 = 6.480000E+03: Correct = 6.480000E+03
Three external tensor blocks have been unregistered with TAL-SH
TAL-SH has been shut down: Status 0
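As an aside, and purely as an assumption on my part (I have only lowered the global flag as described above): since hipcc is clang-based, optimization can in principle be disabled per function rather than dropping the whole build to -O1, if the miscompiled kernel can be isolated. Note that this disables optimization entirely for those functions (effectively -O0), which is stronger than -O1. The kernel name below is a placeholder, not an actual TAL-SH kernel.

```cpp
#include <hip/hip_runtime.h>

#pragma clang optimize off   // functions defined until "optimize on" are not optimized
__global__ void suspect_kernel(const double *in, double *out, size_t n)
{
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n) out[i] = in[i] * 2.0;
}
#pragma clang optimize on    // later functions keep the normal optimization pipeline

// Alternatively, per function:
// __global__ __attribute__((optnone)) void suspect_kernel(const double *in, double *out, size_t n);
```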
My impression from all of this is that the issue lies in ROCm rather than in TAL-SH, and that it could be worth listing ROCm 5.6 as a minimum requirement (and/or documenting the reduced optimization level for earlier ROCm versions).
I have looked at this again, and it turns out ROCm 5.6.0 still yields incorrect Norm1 values unless -O1 is used. Code compiled with BUILD_TYPE=PRF:
-O3
Number of NVIDIA GPU found on node = 8
TAL-SH has been initialized: Status 0: Host buffer size = 1072693248
Three TAL-SH tensor blocks have been constructed: Volumes: 360000, 810000, 360000: GFlops = 0.648000
Tensor contraction has been scheduled for execution: Status 0
Tensor contraction has completed successfully: Status 2000005: Time 0.059791 sec
Tensor contraction total time = 0.280899: GFlop/s = 2.306882
Tensor result was moved back to Host: Norm1 = 9.368648E-02: Correct = 6.480000E+03
Three external tensor blocks have been unregistered with TAL-SH
TAL-SH has been shut down: Status 0
-O2
Number of NVIDIA GPU found on node = 8
TAL-SH has been initialized: Status 0: Host buffer size = 1072693248
Three TAL-SH tensor blocks have been constructed: Volumes: 360000, 810000, 360000: GFlops = 0.648000
Tensor contraction has been scheduled for execution: Status 0
Tensor contraction has completed successfully: Status 2000005: Time 0.056695 sec
Tensor contraction total time = 0.321033: GFlop/s = 2.018483
Tensor result was moved back to Host: Norm1 = 4.342731E-01: Correct = 6.480000E+03
Three external tensor blocks have been unregistered with TAL-SH
TAL-SH has been shut down: Status 0
-O1
Number of NVIDIA GPU found on node = 8
TAL-SH has been initialized: Status 0: Host buffer size = 1072693248
Three TAL-SH tensor blocks have been constructed: Volumes: 360000, 810000, 360000: GFlops = 0.648000
Tensor contraction has been scheduled for execution: Status 0
Tensor contraction has completed successfully: Status 2000005: Time 0.059132 sec
Tensor contraction total time = 0.307055: GFlop/s = 2.110372
Tensor result was moved back to Host: Norm1 = 6.480000E+03: Correct = 6.480000E+03
Three external tensor blocks have been unregistered with TAL-SH
TAL-SH has been shut down: Status 0
With that, I would amend the suggestion: ROCm 5.6 still requires the workaround, and ROCm 5.7.0 is the minimum requirement without it.
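If the requirement is documented, a compile-time notice could also help. A minimal sketch, assuming the HIP version macros from hip/hip_version.h track the ROCm release (true for the 5.x series as far as I know):

```cpp
// Hypothetical guard: warn at compile time when building against a HIP/ROCm
// release older than 5.7, where the HIP objects should be compiled with -O1.
#include <hip/hip_version.h>

#if (HIP_VERSION_MAJOR < 5) || (HIP_VERSION_MAJOR == 5 && HIP_VERSION_MINOR < 7)
#warning "ROCm/HIP < 5.7 detected: compile the HIP objects with -O1 to avoid incorrect contraction results"
#endif
```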
I see error messages when contracting tensors of type complex double (C8) on AMD GPUs. I consistently see this error with ROCm versions 4.5.0, 4.5.2, and 5.1.0. Below is a slimmer version of test.cpp which only runs the test_talsh_c routine; I additionally changed the R8 occurrences to C8 to reproduce the error. It looks like the call to gpu_tensor_block_contract_dlf is where things go wrong: this call returns a task error code that is > 0 when the tensor type is C8.
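For what it is worth, a host-side reference contraction in std::complex<double> gives something to compare the GPU C8 result against once the task error itself is resolved. This is only an illustrative sketch with made-up dimensions and a matrix-multiply-like contraction pattern, not TAL-SH code or the actual test_talsh_c routine.

```cpp
#include <complex>
#include <cstdio>
#include <vector>

using c8 = std::complex<double>;  // "C8" in TAL-SH terms: complex double

// Hypothetical reference contraction D(a,b) += L(a,i) * R(i,b) on the host.
// Dimensions and element values are placeholders chosen only for illustration.
int main()
{
  const int na = 4, nb = 5, ni = 3;
  std::vector<c8> L(na * ni, c8(0.01, 0.001));
  std::vector<c8> R(ni * nb, c8(0.001, -0.002));
  std::vector<c8> D(na * nb, c8(0.0, 0.0));

  for(int a = 0; a < na; ++a)
    for(int b = 0; b < nb; ++b)
      for(int i = 0; i < ni; ++i)
        D[a * nb + b] += L[a * ni + i] * R[i * nb + b];

  double norm1 = 0.0;
  for(const c8 &v : D) norm1 += std::abs(v);
  std::printf("Host reference Norm1 = %E\n", norm1);
  return 0;
}
```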