ROCm / rocBLAS

Next generation BLAS implementation for ROCm platform
https://rocm.docs.amd.com/projects/rocBLAS/en/latest/

[Bug]: Incorrect results when using GPUs with different architectures #1346

Closed opcod3 closed 6 months ago

opcod3 commented 1 year ago

Describe the bug

rocBLAS returns incorrect results when used on two GPUs with different architectures.

This issue was first encountered in turboderp/exllama#173; the reproduction code provided here is based on rocBLAS-Examples.

When rocBLAS is used to perform computations on two GPUs with different architectures, the first computation on each card is correct, but any subsequent computation on the first card is incorrect.

To Reproduce

Steps to reproduce the behavior:

  1. Ensure the current system has at least two GPUs and that the architecture of GPU0 is different from GPU1

  2. Install ROCm and rocBLAS v5.6.0 (the bug is also present on 5.5.1, and possibly on earlier versions)

  3. Run make to compile the example code (bug-report.zip)

  4. Run ./gemm

  5. Observe that the first two calculations pass, while all subsequent ones that execute on GPU0 fail

Expected behavior

It is expected that all calculations complete correctly.

Log-files

Current device: 0 (gfx906:sramecc+:xnack-)
PASS: max. relative err. = 1.17549e-38

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38

Current device: 0 (gfx906:sramecc+:xnack-)
FAIL: max. relative err. = 0.5

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38

Current device: 0 (gfx906:sramecc+:xnack-)
FAIL: max. relative err. = 0.5

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38
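A note for readers interpreting the PASS/FAIL lines: 1.17549e-38 is the smallest positive normal float (std::numeric_limits<float>::min()), which would be consistent with an error accumulator that starts at that value and is never exceeded, i.e. an exact match. The reproducer's source is only available as the bug-report.zip attachment, so the following is a minimal sketch of such a check under that assumption. max_relative_error is a hypothetical helper, not the attachment's code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: largest element-wise relative error between a CPU
// reference result and the result copied back from the GPU.
float max_relative_error(const std::vector<float>& reference,
                         const std::vector<float>& result)
{
    // Assumption: the accumulator starts at the smallest positive normal
    // float, so an exact match reports 1.17549e-38 rather than 0.
    const float tiny = std::numeric_limits<float>::min();
    float max_err = tiny;

    const std::size_t n = std::min(reference.size(), result.size());
    for (std::size_t i = 0; i < n; ++i) {
        // Guard the division when the reference entry is zero.
        const float denom = std::max(std::fabs(reference[i]), tiny);
        const float err = std::fabs(reference[i] - result[i]) / denom;
        max_err = std::max(max_err, err);
    }
    return max_err;
}
```

With this definition, a result that is off by half of the reference value in at least one element reports 0.5, the same figure as the FAIL lines, although the log alone does not say which elements were wrong.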

Running AMD_LOG_LEVEL=2 ./gemm produces the following log:

Current device: 0 (gfx906:sramecc+:xnack-)
:1:hip_code_object.cpp      :606 : 96578653125 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT64x16x8_SE_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA906_IU1_K1_KLA_LBSPP0_LPA0_LPB1_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_2_TLDS0_USFGRO1_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_8_1_WGM1 
:1:hip_module.cpp           :83  : 96578653147 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT64x16x8_SE_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL1_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA906_IU1_K1_KLA_LBSPP0_LPA0_LPB1_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT4_2_TLDS0_USFGRO1_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_8_1_WGM1 for module: 0x2bbceeb0 

PASS: max. relative err. = 1.17549e-38

Current device: 1 (gfx1030)
:1:hip_code_object.cpp      :606 : 96578659797 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 96578659807 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x2ccfe920 

PASS: max. relative err. = 1.17549e-38

Current device: 0 (gfx906:sramecc+:xnack-)
:1:devprogram.cpp           :1874: 96578660829 us: 167308: [tid:0x7f2bf74c7c00] Error: The program ISA amdgcn-amd-amdhsa--gfx1030 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-Error: create kernel metadata map using COMgr
Error: Cannot Find Global Var Sizes
Error: Cannot create kernels.

:1:hip_code_object.cpp      :606 : 96578660844 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 96578660850 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x2bbceeb0 

:1:hip_code_object.cpp      :606 : 96578660857 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 96578660866 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x2bbee400 

FAIL: max. relative err. = 0.5

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38

Current device: 0 (gfx906:sramecc+:xnack-)
:1:devprogram.cpp           :1874: 96578661515 us: 167308: [tid:0x7f2bf74c7c00] Error: The program ISA amdgcn-amd-amdhsa--gfx1030 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-Error: create kernel metadata map using COMgr
Error: Cannot Find Global Var Sizes
Error: Cannot create kernels.

:1:hip_code_object.cpp      :606 : 96578661526 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 96578661532 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x2bbceeb0 

:1:hip_code_object.cpp      :606 : 96578661542 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 
:1:hip_module.cpp           :83  : 96578661549 us: 167308: [tid:0x7f2bf74c7c00] Cannot find the function: Cijk_Ailk_Bljk_SB_MT128x64x8_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTL0_DTVA0_DVO0_ETSP_EPS0_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR0_RK0_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TT8_4_TLDS0_USFGRO0_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_16_1_WGM8 for module: 0x2bbee400 

FAIL: max. relative err. = 0.5

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38

I believe the key log entries are the following:

Current device: 0 (gfx906:sramecc+:xnack-)
:1:devprogram.cpp           :1874: 96578660829 us: 167308: [tid:0x7f2bf74c7c00] Error: The program ISA amdgcn-amd-amdhsa--gfx1030 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-Error: create kernel metadata map using COMgr
Current device: 0 (gfx906:sramecc+:xnack-)
:1:devprogram.cpp           :1874: 96578661515 us: 167308: [tid:0x7f2bf74c7c00] Error: The program ISA amdgcn-amd-amdhsa--gfx1030 is not compatible with the device ISA amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-Error: create kernel metadata map using COMgr
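When scanning longer AMD_LOG_LEVEL output for this failure, the mismatch line can be picked out mechanically. The sketch below is plain C++ with no ROCm dependency. IsaMismatch, gfx_after and parse_isa_mismatch are hypothetical names, and the parsing assumes the message wording shown in the log entries above, which may differ between ROCm versions.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Result of scanning one log line for an ISA mismatch report.
struct IsaMismatch {
    bool found = false;
    std::string program_arch; // e.g. "gfx1030"
    std::string device_arch;  // e.g. "gfx906"
};

// Returns the first "gfx..." token that follows `label`, or "" if absent.
// Only the alphanumeric part is kept, so feature suffixes such as
// ":sramecc+:xnack-" are dropped.
static std::string gfx_after(const std::string& line, const std::string& label)
{
    const std::size_t at = line.find(label);
    if (at == std::string::npos) return "";
    const std::size_t begin = line.find("gfx", at + label.size());
    if (begin == std::string::npos) return "";
    std::size_t end = begin;
    while (end < line.size() &&
           std::isalnum(static_cast<unsigned char>(line[end]))) {
        ++end;
    }
    return line.substr(begin, end - begin);
}

// Detects the "program ISA ... is not compatible with the device ISA ..."
// message and extracts both architectures.
IsaMismatch parse_isa_mismatch(const std::string& line)
{
    IsaMismatch m;
    if (line.find("is not compatible with the device ISA") == std::string::npos) {
        return m;
    }
    m.program_arch = gfx_after(line, "program ISA ");
    m.device_arch = gfx_after(line, "device ISA ");
    m.found = !m.program_arch.empty() && !m.device_arch.empty();
    return m;
}
```

Applied to the entries above, this reports a gfx1030 program being loaded on a gfx906 device, which is the combination highlighted in this report.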

Environment

Hardware description
CPU AMD Ryzen 9 3950X
GPU AMD Radeon VII
GPU AMD Radeon RX 6800 XT
Software version
rocm-core 5.6.0-1
rocblas 5.6.0-1

environment.txt

This has also been reproduced in the rocm/dev-ubuntu-22.04:5.5.1-complete docker container.

Additional context

According to other users in turboderp/exllama#173, the issue also occurs between MI25 and MI50 cards. I can also report that it occurs between any combination of the two cards listed above and a 7900 XTX.

Inverting the order of the computations (running a calculation on GPU1 first and then on GPU0) results in the same exact behavior, but with the failing card being GPU1 instead of GPU0 as before.

Current device: 1 (gfx1030)
PASS: max. relative err. = 1.17549e-38

Current device: 0 (gfx906:sramecc+:xnack-)
PASS: max. relative err. = 1.17549e-38

Current device: 1 (gfx1030)
FAIL: max. relative err. = 0.5

Current device: 0 (gfx906:sramecc+:xnack-)
PASS: max. relative err. = 1.17549e-38

From looking at more logs and rocBLAS internals, I believe the error is related to the Tensile library. The behavior seems to indicate that when a second .hsaco file is loaded, it somehow overrides the original one, which had the correct architecture for the first card. I am unsure whether this is an issue in Tensile itself or in the way rocBLAS uses it.
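To make this hypothesis concrete, here is a toy model in plain C++. It is purely illustrative and does not reflect Tensile's real data structures or the actual root cause: ToyLazyLoader and replay are hypothetical names, and the single shared slot is an assumption chosen only because it reproduces the observed PASS/FAIL pattern.

```cpp
#include <set>
#include <string>
#include <vector>

// Toy model only: one shared slot holds the most recently loaded code object.
class ToyLazyLoader {
public:
    // Returns true if the code object that would run matches the device arch.
    bool run_on(int device, const std::string& device_arch)
    {
        // A code object is loaded only the first time a device is seen...
        if (seen_.insert(device).second) {
            // ...and it overwrites whatever was loaded for another device.
            slot_arch_ = device_arch;
        }
        return slot_arch_ == device_arch;
    }

private:
    std::set<int> seen_;
    std::string slot_arch_;
};

// Replays a schedule of device indices (0 or 1 only) and records, for each
// step, whether the loaded architecture matched the device.
std::vector<bool> replay(const std::vector<int>& order)
{
    const std::string arch[2] = {"gfx906", "gfx1030"};
    ToyLazyLoader loader;
    std::vector<bool> ok;
    for (int d : order) {
        ok.push_back(loader.run_on(d, arch[d]));
    }
    return ok;
}
```

Replaying the order 0, 1, 0, 1, 0, 1 yields pass, pass, fail, pass, fail, pass, and replaying 1, 0, 1, 0 moves the failures to GPU1, matching both logs in this report. A model that keyed the loaded objects by architecture instead of using one slot would pass every step.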

In my opinion attempting to execute a kernel with an incorrect architecture should produce a crash or an error, instead of carrying on as normal and returning incorrect results.

opcod3 commented 1 year ago

I just did some more tests and the issue can be reproduced in the following docker images:

rocm/dev-ubuntu-22.04:5.5.1-complete
rocm/dev-ubuntu-22.04:5.4.2-complete
rocm/dev-ubuntu-22.04:5.3-complete

I did not test any other versions, but I assume the bug is present in all versions since at least ROCm 5.3.

IMbackK commented 1 year ago

I can confirm that this issue exists when the example above is executed with any combination of MI25, MI50 and RX 6800 XT, but does not exist (as expected) when only two MI50s are present.

opcod3 commented 1 year ago

Building rocBLAS without Tensile (BUILD_WITH_TENSILE=OFF) appears to fix the issue.

YellowRoseCx commented 1 year ago

I also have this issue with a 6800 XT and a Vega 64.

I also experience similar issues when using multi-GPU Torch with ROCm. I have a collection of my errors and debugging notes for the Torch case here: https://rentry.org/tcahd

opcod3 commented 1 year ago

Doing some more troubleshooting, apparently calling rocblas_initialize() before using any other functions fixes the issue.

YellowRoseCx commented 1 year ago

Doing some more troubleshooting, apparently calling rocblas_initialize() before using any other functions fixes the issue.

This is true for AMD, but I've had people report that it will break hipBLAS usage on NVIDIA and Intel GPUs, since it's a call to rocBLAS instead of a HIP function.
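One way to apply the workaround without touching non-AMD builds is to guard the call at compile time. The sketch below is hedged on a few assumptions: that the HIP toolchain defines __HIP_PLATFORM_AMD__ when targeting AMD GPUs, that the rocBLAS header is installed as <rocblas/rocblas.h> (older installs may use <rocblas.h>), and that the program is linked against rocBLAS in that configuration. warm_up_blas_backend and kBuiltForAmd are hypothetical names.

```cpp
// Pull in rocBLAS only when building for AMD GPUs, so the same source still
// compiles for other hipBLAS back ends or for a plain host build.
#if defined(__HIP_PLATFORM_AMD__)
#include <rocblas/rocblas.h>
constexpr bool kBuiltForAmd = true;
#else
constexpr bool kBuiltForAmd = false;
#endif

// Hypothetical helper: call once at start-up, before any other BLAS call.
// Returns true if the rocBLAS warm-up was actually performed.
inline bool warm_up_blas_backend()
{
#if defined(__HIP_PLATFORM_AMD__)
    // Workaround discussed in this thread: initialise rocBLAS explicitly
    // before the first BLAS call on any device.
    rocblas_initialize();
    return true;
#else
    // Nothing to do on other platforms.
    return false;
#endif
}
```

Calling warm_up_blas_backend() at the top of main(), before creating any BLAS handle, keeps the workaround in one place and makes it easy to remove once a fixed rocBLAS release is in use. Note that rocblas_initialize() may add noticeable start-up time, which is a trade-off of this approach.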

IMbackK commented 1 year ago

It's also an optional call, so this is still a serious bug.

rkamd commented 1 year ago

Thanks for reporting the issue. We are currently investigating it and will provide an update as soon as possible. rocblas_initialize() loads all the Tensile code objects (for all supported GFX ISA targets), hence the results are as expected when rocblas_initialize() is used.

IMbackK commented 12 months ago

I would like to note that I find it pretty incredible that a ROCm major release (5.7) was allowed to go forward with this extremely fundamental and trivially reproducible bug, which simply breaks every single setup where heterogeneous architectures are in a system.

rkamd commented 11 months ago

@opcod3, thanks for bringing this to our notice. A fix has been merged and should be available in a future release.

rocBLAS commit: https://github.com/ROCmSoftwarePlatform/rocBLAS/commit/bc4d8f57ec6b3b2c91c4eaa5351bcc35ced66d52

Tensile commit: https://github.com/ROCmSoftwarePlatform/Tensile/commit/24d54d7644bd20e6855aa94a1262aae1d8269767

rkamd commented 11 months ago

I would like to note that I find it pretty incredible that a ROCm major release (5.7) was allowed to go forward with this extremely fundamental and trivially reproducible bug, which simply breaks every single setup where heterogeneous architectures are in a system.

@IMbackK, can you please use the workaround above for ROCm 5.7? A fix has been implemented and it should be in the next ROCm release.

nktice commented 11 months ago

I would like to note that I find it pretty incredible that a ROCm major release (5.7) was allowed to go forward with this extremely fundamental and trivially reproducible bug, which simply breaks every single setup where heterogeneous architectures are in a system.

@IMbackK, can you please use the workaround above for ROCm 5.7? A fix has been implemented and it should be in the next ROCm release.

Please note that this page explains the ROCm roadmap and current versions: https://github.com/RadeonOpenCompute/ROCm/releases. Note that 5.7 is the last in the series, based on their roadmap, and that 6.0 may not be comparable with the 5.x versions.
I'd like to suggest, as it is a simple fix (to an existing bug) and as there may not be a 5.7.1 version due to the roadmap, that it be added to the 5.7 version to minimize the wait.

IMbackK commented 11 months ago

I would like to note that I find it pretty incredible that a ROCm major release (5.7) was allowed to go forward with this extremely fundamental and trivially reproducible bug, which simply breaks every single setup where heterogeneous architectures are in a system.

@IMbackK, can you please use the workaround above for ROCm 5.7? A fix has been implemented and it should be in the next ROCm release.

@rkamd I can recompile rocBLAS/Tensile with the patch; that is not the issue. I am merely worried about ROCm stability policies, as it would appear from the outside that there is no internal mechanism to block a release when a serious issue is found. I don't see what issue besides "silently returns incorrect results for every operation on a supported platform 100% of the time" could possibly be more serious in the world of scientific compute.

I also concur with @opcod3 that it's worrying that the ROCm runtime does not throw an error when a kernel launch fails due to the arch being wrong, but instead silently continues with garbage data and only logs this as a warning. In my opinion, a failed kernel launch of this kind should cause an assert. Please confirm whether or not you have raised this problem as a bug internally. Otherwise, I would like to file a bug against the runtime.

I would also respectfully request that a system with heterogeneous architecture is included in internal conformance testing, if such a system is not available already.

That said, thank you for fixing this issue and including the unsupported legacy platforms in the fix, your (and AMD's in general) efforts in providing an open source compute platform are much appreciated. Indeed great progress has been made in this direction in recent years.

nktice commented 7 months ago

I'd like to report this issue appears resolved for me at this time! Here's the guide I wrote with the instructions I used and have it working - https://github.com/nktice/AMD-AI/blob/main/ROCm6.0.md

xiaobo1025 commented 7 months ago

First of all, this is the wrong report. A clear and concise description of what the problem is.

-- OS detected is ubuntu
/usr/bin/python3.8 -m venv /root/workspace/rocBLAS/build/virtualenv --system-site-packages --clear
The virtual environment was not created successfully because ensurepip is not available. On Debian/Ubuntu systems, you need to install the python3-venv package using the following command.

apt install python3.8-venv

You may need to use sudo with that command. After installing the python3-venv package, recreate your virtual environment.

Failing command: ['/root/workspace/rocBLAS/build/virtualenv/bin/python3.8', '-Im', 'ensurepip', '--upgrade', '--default-pip']

CMake Error at cmake/virtualenv.cmake:23 (message):
  1
Call Stack (most recent call first):
  cmake/virtualenv.cmake:49 (virtualenv_create)
  CMakeLists.txt:139 (virtualenv_install)

-- Configuring incomplete, errors occurred!

Then I ran apt update and apt install python3.8-venv to update and install:

Fetched 5452 B in 1s (3877 B/s)
Selecting previously unselected package python3.8-venv.
(Reading database ... 49445 files and directories currently installed.)
Preparing to unpack .../python3.8-venv_3.8.10-0ubuntu1~20.04.9_amd64.deb ...
Unpacking python3.8-venv (3.8.10-0ubuntu1~20.04.9) ...
Setting up python3.8-venv (3.8.10-0ubuntu1~20.04.9) ...

After that, the error is as follows:

raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.VersionConflict: (setuptools 44.0.0 (/root/workspace/rocBLAS/build/virtualenv/lib/python3.8/site-packages), Requirement.parse('setuptools>=62.4'))

ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
CMake Error at cmake/virtualenv.cmake:68 (message):
  1
Call Stack (most recent call first):
  CMakeLists.txt:139 (virtualenv_install)

Then I ran pip install --upgrade setuptools to update:

Installing collected packages: setuptools
  Attempting uninstall: setuptools
    Found existing installation: setuptools 69.0.2
    Uninstalling setuptools-69.0.2:
      Successfully uninstalled setuptools-69.0.2
Successfully installed setuptools-69.0.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.3.1 -> 23.3.2
[notice] To update, run: python3 -m pip install --upgrade pip

But it still reports the same error:

raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.VersionConflict: (setuptools 44.0.0 (/root/workspace/rocBLAS/build/virtualenv/lib/python3.8/site-packages), Requirement.parse('setuptools>=62.4'))

ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
CMake Error at cmake/virtualenv.cmake:68 (message):
  1
Call Stack (most recent call first):
  CMakeLists.txt:139 (virtualenv_install)

xiaobo1025 commented 7 months ago

Could you help me solve this problem? Thank you very much!

IMbackK commented 6 months ago

@xiaobo1025 please don't spam this bug with unrelated issues.

@rkamd I can confirm this seems to be fixed in 6.0.

rkamd commented 6 months ago

@IMbackK , Thanks for verifying.

Closing this issue.