ROCm / clr

[Issue]: Failing Unit Tests #84

Closed pvelesko closed 5 days ago

pvelesko commented 4 months ago

Problem Description

The following tests FAILED:
         24 - Unit_deviceAllocation_Malloc_PerThread_PrimitiveDataType (Failed)
         25 - Unit_deviceAllocation_New_PerThread_PrimitiveDataType (Failed)
         26 - Unit_deviceAllocation_Malloc_PerThread_StructDataType (Failed)
         27 - Unit_deviceAllocation_New_PerThread_StructDataType (Failed)
         28 - Unit_deviceAllocation_InOneThread_AccessInAllThreads (Failed)
         29 - Unit_deviceAllocation_Malloc_AcrossKernels (Failed)
         30 - Unit_deviceAllocation_New_AcrossKernels (Failed)
         31 - Unit_deviceAllocation_Malloc_ComplexDataType (Failed)
         32 - Unit_deviceAllocation_New_ComplexDataType (Failed)
         33 - Unit_deviceAllocation_Malloc_UnionType (Failed)
         34 - Unit_deviceAllocation_New_UnionType (Failed)
         35 - Unit_deviceAllocation_Malloc_SingleCodeObj (Failed)
         36 - Unit_deviceAllocation_New_SingleCodeObj (Failed)
         37 - Unit_deviceAllocation_Malloc_PerThread_Graph (Failed)
         38 - Unit_deviceAllocation_New_PerThread_Graph (Failed)
         39 - Unit_deviceAllocation_Malloc_DeviceFunc (Failed)
         40 - Unit_deviceAllocation_New_DeviceFunc (Failed)
         41 - Unit_deviceAllocation_VirtualFunction (Failed)
         42 - Unit_deviceAllocation_Malloc_MulKernels_MulThreads (Failed)
         43 - Unit_deviceAllocation_New_MulKernels_MulThreads (Failed)
         44 - Unit_deviceAllocation_Malloc_SingKernels_MulThreads (Failed)
         45 - Unit_deviceAllocation_New_SingKernels_MulThreads (Failed)
         46 - Unit_deviceAllocation_Malloc_MulCodeObj (Failed)
         47 - Unit_deviceAllocation_New_MulCodeObj (Failed)
        631 - Unit_hipMemPrefetchAsync_NonPageSz (Failed)
        946 - Unit_hipMemPrefetchAsync_Basic (Failed)
        1044 - Unit_hipHostMalloc_CoherentTst (Bus error)
        1045 - Unit_hipMallocManaged_CoherentTst (Bus error)
        1322 - Unit_printf_flags (Failed)
        1323 - Unit_printf_specifier (Failed)
        1615 - Unit_hipStreamPerThread_MangdMem (Failed)
        1659 - Unit_hipCGMultiGridGroupType (Bus error)
        1660 - Unit_hipCGMultiGridGroupType_BaseType (Bus error)
        1661 - Unit_hipCGMultiGridGroupType_PublicApi (Bus error)
        1662 - Unit_coalesced_groups_shfl_down (Failed)
        1663 - Unit_coalesced_groups_shfl_up (Failed)
        1664 - Unit_coalesced_groups (Failed)
        1711 - Unit_hipHostMalloc_WthEnv1 (Failed)
        1712 - Unit_hipHostMalloc_WthEnv1Flg1 (Failed)
        1713 - Unit_hipHostMalloc_WthEnv1Flg2 (Failed)
        1714 - Unit_hipHostMalloc_WthEnv1Flg3 (Failed)
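
To see at a glance how the failures split between ordinary failures and bus errors, the summary above can be grouped by the status ctest prints in parentheses. This is a minimal sketch, not part of the original report: the helper name `group_failures` and the regular expression are assumptions, and the sample contains only three lines copied from the list above.

```python
import re
from collections import defaultdict

# Matches ctest summary lines such as "  24 - Unit_foo (Failed)".
LINE_RE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\((.+)\)\s*$")


def group_failures(summary):
    """Group test names by the status ctest printed in parentheses.

    `group_failures` is a hypothetical helper written for this sketch.
    """
    groups = defaultdict(list)
    for line in summary.splitlines():
        match = LINE_RE.match(line)
        if match:
            _number, name, status = match.groups()
            groups[status].append(name)
    return dict(groups)


# Three lines copied from the failure list above; the header line is skipped
# because it does not match LINE_RE.
sample = """\
The following tests FAILED:
         24 - Unit_deviceAllocation_Malloc_PerThread_PrimitiveDataType (Failed)
        1044 - Unit_hipHostMalloc_CoherentTst (Bus error)
        1659 - Unit_hipCGMultiGridGroupType (Bus error)
"""

groups = group_failures(sample)
for status, names in groups.items():
    print(f"{status}: {len(names)} test(s)")
```

Pasting the full summary into `sample` would list every test under its status, which makes it easier to see which test families hit bus errors.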

Operating System

35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

CPU

13th Gen Intel(R) Core(TM) i9-13900K

GPU

AMD Radeon VII

ROCm Version

ROCm 5.7.1

ROCm Component

HIP

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

I opened an issue on hip-tests but got no reply: https://github.com/ROCm/hip-tests/issues/462. In that issue, I tested with ROCm 6.1.

Since the last official release for gfx906 was ROCm 5.7, I downgraded the installation and ran the tests again.

╭─pvelesko@cupcake ~/HIPAMD/hip-tests ‹rocm-5.7.x›
╰─$ dpkg -l | grep rocm
ii  rocm-clang-ocl                             0.5.0.50700-63~20.04                                            amd64        OpenCL compilation with clang compiler.
ii  rocm-cmake                                 0.10.0.50700-63~20.04                                           amd64        rocm-cmake built using CMake
ii  rocm-core                                  5.7.0.50700-63~20.04                                            amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-dbgapi                                0.70.1.50700-63~20.04                                           amd64        Library to provide AMD GPU debugger API
ii  rocm-debug-agent                           2.0.3.50700-63~20.04                                            amd64        Radeon Open Compute Debug Agent (ROCdebug-agent)
ii  rocm-dev                                   5.7.0.50700-63~20.04                                            amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-device-libs                           1.0.0.50700-63~20.04                                            amd64        Radeon Open Compute - device libraries
ii  rocm-dkms                                  5.7.0.50700-63~20.04                                            amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-gdb                                   13.2.50700-63~20.04                                             amd64        ROCgdb
ii  rocm-llvm                                  17.0.0.23352.50700-63~20.04                                     amd64        ROCm compiler
ii  rocm-ocl-icd                               2.0.0.50700-63~20.04                                            amd64        clr built using CMake
ii  rocm-opencl                                2.0.0.50700-63~20.04                                            amd64        clr built using CMake
ii  rocm-opencl-dev                            2.0.0.50700-63~20.04                                            amd64        clr built using CMake
ii  rocm-smi-lib                               5.0.0.50700-63~20.04                                            amd64        AMD System Management libraries
ii  rocm-utils                                 5.7.0.50700-63~20.04                                            amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocminfo                                   1.0.0.50700-63~20.04                                            amd64        Radeon Open Compute (ROCm) Runtime rocminfo tool

cjatin commented 3 months ago

Looks like a PCIe atomics issue.

Can you share a few more details: which PCIe generation are you on? I think Radeon VII supports 3.0.

Also, is Large BAR enabled? It will be "4G decode" or something similar in your motherboard's BIOS menu.
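
One possible way to check the atomics side of this on Linux is to read the `AtomicOpsCap` and `AtomicOpsCtl` fields that `lspci -vvv` (from pciutils) prints for the GPU and for the root port above it. The sketch below only parses text that has already been captured. The helper name `atomic_flags` is hypothetical, and the sample string is illustrative rather than output from the reporter's machine; the exact field layout may differ between pciutils versions.

```python
import re


def atomic_flags(lspci_text):
    """Extract PCIe AtomicOps flags from `lspci -vvv` text.

    Hypothetical helper: assumes flags appear as tokens ending in "+"
    (set) or "-" (clear) after "AtomicOpsCap:" or "AtomicOpsCtl:".
    """
    flags = {}
    for field in ("AtomicOpsCap", "AtomicOpsCtl"):
        match = re.search(rf"{field}:\s*(.*)", lspci_text)
        if not match:
            continue
        for token in match.group(1).split():
            if token[-1] in "+-":
                flags[f"{field}.{token[:-1]}"] = token.endswith("+")
    return flags


# Illustrative text only, NOT captured from the system in this issue.
sample_lspci = (
    "\t\t\t AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-\n"
    "\t\t\t AtomicOpsCtl: ReqEn+\n"
)

pcie_atomics = atomic_flags(sample_lspci)
for name, enabled in pcie_atomics.items():
    print(f"{name}: {'yes' if enabled else 'no'}")
```

To use it on a real system, the output of `sudo lspci -vvv` for the GPU and its upstream port would be passed in as `lspci_text`; a cleared `64bit` capability or a cleared `ReqEn` would be worth reporting in the thread.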

pvelesko commented 3 months ago

Can you share a few more details: What PCI-e gen you are on? I think Radeon VII supports 3.0.

https://www.gigabyte.com/Motherboard/B660-DS3H-AC-DDR4-rev-10-12#kf

PCIe 4.0

Also is Large BAR enabled?

yes

harkgill-amd commented 5 days ago

Hi @pvelesko, let's use https://github.com/ROCm/hip-tests/issues/462 to continue investigating this issue.