ROCm / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Tried to build and install DeepSpeed, but it reports errors. #26

Closed by terU3760 2 years ago

terU3760 commented 3 years ago

On my platform, running the command /opt/rocm/bin/rocminfo outputs:

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 2700X Eight-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 2700X Eight-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3700                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32871168(0x1f59300) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32871168(0x1f59300) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx906                             
  Uuid:                    GPU-5ad8292173c64529               
  Marketing Name:          Vega 20 [Radeon VII]               
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26287(0x66af)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1801                               
  BDFID:                   2560                               
  Internal Node ID:        1                                  
  Compute Unit:            60                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc-:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx906                             
  Uuid:                    GPU-0294892173c71c06               
  Marketing Name:          Vega 20 [Radeon VII]               
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 26287(0x66af)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1801                               
  BDFID:                   3328                               
  Internal Node ID:        2                                  
  Compute Unit:            60                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx906:sramecc-:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Running the command /opt/rocm/opencl/bin/clinfo outputs:

Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.0 AMD-APP (3275.0)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback 

  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               2
  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     1002h
  Board name:                    Vega 20 [Radeon VII]
  Device Topology:               PCI[ B#10, D#0, F#0 ]
  Max compute units:                 60
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               256
  Preferred vector width char:           4
  Preferred vector width short:          2
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          4
  Native vector width short:             2
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1801Mhz
  Address bits:                  64
  Max memory allocation:             14588628168
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                16384
  Max image 3D height:               16384
  Max image 3D depth:                8192
  Max samplers within kernel:            26287
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                17163091968
  Constant buffer size:              14588628168
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 65536
  Max pipe arguments:                16
  Max pipe active reservations:          16
  Max pipe packet size:              1703726280
  Max global variable size:          14588628168
  Max global variable preferred total size:  17163091968
  Max read/write image args:             64
  Max on device events:              1024
  Queue on device max size:          8388608
  Max on device queues:              1
  Queue on device preferred size:        262144
  SVM capabilities:              
    Coarse grain buffer:             Yes
    Fine grain buffer:               Yes
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     64
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                Yes
    Profiling :                  Yes
  Platform ID:                   0x7f860713cdd0
  Name:                      gfx906:sramecc-:xnack-
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 2.0 
  Driver version:                3275.0 (HSA1.1,LC)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 2.0 
  Extensions:                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     1002h
  Board name:                    Vega 20 [Radeon VII]
  Device Topology:               PCI[ B#13, D#0, F#0 ]
  Max compute units:                 60
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               1024
  Max work group size:               256
  Preferred vector width char:           4
  Preferred vector width short:          2
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          4
  Native vector width short:             2
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1801Mhz
  Address bits:                  64
  Max memory allocation:             14588628168
  Image support:                 Yes
  Max number of images read arguments:       128
  Max number of images write arguments:      8
  Max image 2D width:                16384
  Max image 2D height:               16384
  Max image 3D width:                16384
  Max image 3D height:               16384
  Max image 3D depth:                8192
  Max samplers within kernel:            26287
  Max size of kernel argument:           1024
  Alignment (bits) of base address:      1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               64
  Cache size:                    16384
  Global memory size:                17163091968
  Constant buffer size:              14588628168
  Max number of constant args:           8
  Local memory type:                 Scratchpad
  Local memory size:                 65536
  Max pipe arguments:                16
  Max pipe active reservations:          16
  Max pipe packet size:              1703726280
  Max global variable size:          14588628168
  Max global variable preferred total size:  17163091968
  Max read/write image args:             64
  Max on device events:              1024
  Queue on device max size:          8388608
  Max on device queues:              1
  Queue on device preferred size:        262144
  SVM capabilities:              
    Coarse grain buffer:             Yes
    Fine grain buffer:               Yes
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     64
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:                
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:              
    Out-of-Order:                No
    Profiling :                  Yes
  Queue on Device properties:                
    Out-of-Order:                Yes
    Profiling :                  Yes
  Platform ID:                   0x7f860713cdd0
  Name:                      gfx906:sramecc-:xnack-
  Vendor:                    Advanced Micro Devices, Inc.
  Device OpenCL C version:           OpenCL C 2.0 
  Driver version:                3275.0 (HSA1.1,LC)
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 2.0 
  Extensions:                    cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 

I then ran the following commands:

git clone https://github.com/ROCmSoftwarePlatform/DeepSpeed
cd DeepSpeed/
cd DeepSpeedExamples/
git submodule update --init --recursive
cd ..
DS_BUILD_FUSED_ADAM=1 DS_BUILD_FUSED_LAMB=1 DS_BUILD_CPU_ADAM=1 DS_BUILD_TRANSFORMER=1 DS_BUILD_STOCHASTIC_TRANSFORMER=1 DS_BUILD_UTILS=1 ./install.sh --allow_sudo

The build started and then reported the following errors:

/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:26:38: error: no member named 'cg_coalesced_tile' in namespace 'cooperative_groups::internal'
    cg::thread_group g(cg::internal::cg_coalesced_tile, tbSize);
                       ~~~~~~~~~~~~~~^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:27:7: error: no member named 'tiled_partition' in 'cooperative_groups::thread_group'
    g.tiled_partition(b, tbSize);
    ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:72:23: error: no member named 'shfl_xor' in 'cooperative_groups::thread_group'
        auto temp = g.shfl_xor(max_val, i);
                    ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:91:27: error: no member named 'shfl_xor' in 'cooperative_groups::thread_group'
            auto temp = g.shfl_xor(max_val, i);
                        ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:95:21: error: no member named 'shfl' in 'cooperative_groups::thread_group'
        max_val = g.shfl(max_val, threadIdx.x / tbSize);
                  ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:108:52: error: no member named 'shfl_xor' in 'cooperative_groups::thread_group'
    for (int i = 1; i < tbSize; i *= 2) { sum += g.shfl_xor(sum, i); }
                                                 ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:124:55: error: no member named 'shfl_xor' in 'cooperative_groups::thread_group'
        for (int i = 1; i < iters; i *= 2) { sum += g.shfl_xor(sum, i); }
                                                    ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:126:17: error: no member named 'shfl' in 'cooperative_groups::thread_group'
        sum = g.shfl(sum, threadIdx.x / tbSize);
              ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:455:38: error: no member named 'cg_coalesced_tile' in namespace 'cooperative_groups::internal'
    cg::thread_group g(cg::internal::cg_coalesced_tile, tbSize);
                       ~~~~~~~~~~~~~~^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:456:7: error: no member named 'tiled_partition' in 'cooperative_groups::thread_group'
    g.tiled_partition(b, tbSize);
    ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:480:55: error: no member named 'shfl_xor' in 'cooperative_groups::thread_group'
    for (int i = 1; i < tbSize; i *= 2) grad_reg += g.shfl_xor(grad_reg, i);
                                                    ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:491:58: error: no member named 'shfl_xor' in 'cooperative_groups::thread_group'
        for (int i = 1; i < iters; i *= 2) grad_reg += g.shfl_xor(grad_reg, i);
                                                       ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:493:22: error: no member named 'shfl' in 'cooperative_groups::thread_group'
        grad_reg = g.shfl(grad_reg, id / tbSize);
                   ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:532:38: error: no member named 'cg_coalesced_tile' in namespace 'cooperative_groups::internal'
    cg::thread_group g(cg::internal::cg_coalesced_tile, WARP_SIZE);
                       ~~~~~~~~~~~~~~^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:533:7: error: no member named 'tiled_partition' in 'cooperative_groups::thread_group'
    g.tiled_partition(b, WARP_SIZE);
    ~ ^
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:535:54: error: no member named 'shfl_xor' in 'cooperative_groups::thread_group'
    for (int i = 1; i < WARP_SIZE; i <<= 1) sum += g.shfl_xor(sum, i);
                                                   ~ ^
16 errors generated when compiling for gfx803.
error: command '/opt/rocm-4.2.0/bin/hipcc' failed with exit status 1
Error on line 155
Fail to install deepspeed

What is the cause, and how can I fix it?
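As a reading aid for the log above: its sixteen diagnostics name only four distinct missing members, all on cooperative_groups types. Below is a minimal shell sketch, assuming standard grep, sed, and sort are available, that condenses such a log to the unique member names. The function names are hypothetical, and the embedded sample is just the first five error lines copied from the output above.

```shell
# Print each distinct "no member named '...'" identifier found on stdin.
summarize_missing_members() {
    grep -o "no member named '[A-Za-z_]*'" \
        | sed "s/.*'\(.*\)'/\1/" \
        | LC_ALL=C sort -u
}

# Illustrative sample: the first five error lines from the log above.
sample_log() {
cat <<'EOF'
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:26:38: error: no member named 'cg_coalesced_tile' in namespace 'cooperative_groups::internal'
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:27:7: error: no member named 'tiled_partition' in 'cooperative_groups::thread_group'
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:72:23: error: no member named 'shfl_xor' in 'cooperative_groups::thread_group'
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:91:27: error: no member named 'shfl_xor' in 'cooperative_groups::thread_group'
/******/DeepSpeed/csrc/transformer/softmax_kernels.hip:95:21: error: no member named 'shfl' in 'cooperative_groups::thread_group'
EOF
}

summary=$(sample_log | summarize_missing_members)
printf '%s\n' "$summary"
# Prints, one per line: cg_coalesced_tile, shfl, shfl_xor, tiled_partition
```

To run it against a real build log, pipe the saved log into `summarize_missing_members` in place of `sample_log`.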

terU3760 commented 3 years ago

I have also tried some hacks on the two files hip_cooperative_groups.h and hip_cooperative_groups_helper.h in the directory "DeepSpeed/csrc/includes/patch/hip/hcc_detail", but the build still reports errors such as:

/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:34:38: error: no member named 'cg_coalesced_tile' in namespace 'cooperative_groups::internal'
    cg::thread_group g(cg::internal::cg_coalesced_tile, WARP_SIZE);
                       ~~~~~~~~~~~~~~^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:35:7: error: no member named 'tiled_partition' in 'cooperative_groups::thread_group'
    g.tiled_partition(b, WARP_SIZE);
    ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:60:48: error: no member named 'shfl_down' in 'cooperative_groups::thread_group'
    for (int i = 1; i < 32; i *= 2) { sum += g.shfl_down(sum, i); }
                                             ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:72:69: error: no member named 'shfl_down' in 'cooperative_groups::thread_group'
    for (int i = 1; i < (iteration_stride >> 5); i *= 2) { sum += g.shfl_down(sum, i); }
                                                                  ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:74:13: error: no member named 'shfl' in 'cooperative_groups::thread_group'
    sum = g.shfl(sum, 0);
          ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:84:53: error: no member named 'shfl_down' in 'cooperative_groups::thread_group'
    for (int i = 1; i < 32; i *= 2) { variance += g.shfl_down(variance, i); }
                                                  ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:96:74: error: no member named 'shfl_down' in 'cooperative_groups::thread_group'
    for (int i = 1; i < (iteration_stride >> 5); i *= 2) { variance += g.shfl_down(variance, i); }
                                                                       ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:97:18: error: no member named 'shfl' in 'cooperative_groups::thread_group'
    variance = g.shfl(variance, 0);
               ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:324:38: error: no member named 'cg_coalesced_tile' in namespace 'cooperative_groups::internal'
    cg::thread_group g(cg::internal::cg_coalesced_tile, 32);
                       ~~~~~~~~~~~~~~^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:325:7: error: no member named 'tiled_partition' in 'cooperative_groups::thread_group'
    g.tiled_partition(b, 32);
    ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:350:48: error: no member named 'shfl_down' in 'cooperative_groups::thread_group'
    for (int i = 1; i < 32; i *= 2) { sum += g.shfl_down(sum, i); }
                                             ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:362:69: error: no member named 'shfl_down' in 'cooperative_groups::thread_group'
    for (int i = 1; i < (iteration_stride >> 5); i *= 2) { sum += g.shfl_down(sum, i); }
                                                                  ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:364:13: error: no member named 'shfl' in 'cooperative_groups::thread_group'
    sum = g.shfl(sum, 0);
          ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:372:53: error: no member named 'shfl_down' in 'cooperative_groups::thread_group'
    for (int i = 1; i < 32; i *= 2) { variance += g.shfl_down(variance, i); }
                                                  ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:384:74: error: no member named 'shfl_down' in 'cooperative_groups::thread_group'
    for (int i = 1; i < (iteration_stride >> 5); i *= 2) { variance += g.shfl_down(variance, i); }
                                                                       ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:385:18: error: no member named 'shfl' in 'cooperative_groups::thread_group'
    variance = g.shfl(variance, 0);
               ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:632:38: error: no member named 'cg_coalesced_tile' in namespace 'cooperative_groups::internal'
    cg::thread_group g(cg::internal::cg_coalesced_tile, TILE_DIM);
                       ~~~~~~~~~~~~~~^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:633:7: error: no member named 'tiled_partition' in 'cooperative_groups::thread_group'
    g.tiled_partition(b, TILE_DIM);
    ~ ^
/******/DeepSpeed/csrc/transformer/normalize_kernels.hip:669:17: error: no member named 'shfl_down' in 'cooperative_groups::thread_group'
        s1 += g.shfl_down(s1, i);
              ~ ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated when compiling for gfx803.
error: command '/opt/rocm-4.2.0/bin/hipcc' failed with exit status 1
Error on line 155
Fail to install deepspeed

jithunnair-amd commented 2 years ago

Hi, sorry we missed this. You are correct that the cooperative_groups headers need some hacks to work, so we'd recommend using the following Dockerfile to build DeepSpeed: https://github.com/ROCmSoftwarePlatform/DeepSpeed/blob/master/docker/Dockerfile.rocm
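
A minimal sketch of how the suggested Dockerfile might be used, not a verified procedure. The image tag, the build context (the repository root), and the device flags passed to docker run (the ones commonly used for ROCm containers) are assumptions that do not come from this thread, and the Dockerfile itself may expect something different. By default the script only prints the commands; it executes them when DRY_RUN=0 is set.

```shell
# Sketch: build and enter a DeepSpeed ROCm container instead of building on the host.
REPO_URL="https://github.com/ROCmSoftwarePlatform/DeepSpeed"
IMAGE_TAG="deepspeed-rocm:local"   # hypothetical tag, pick any name
DRY_RUN="${DRY_RUN:-1}"            # 1 = print only, 0 = actually execute

run() {
    # Echo the command, then execute it only when DRY_RUN=0.
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Chain with && so a failed step stops the remaining ones.
run git clone "$REPO_URL" &&
# Assumption: the repository root is a suitable build context.
run docker build -f DeepSpeed/docker/Dockerfile.rocm -t "$IMAGE_TAG" DeepSpeed &&
# Assumption: typical ROCm device access flags for GPU containers.
run docker run -it --device=/dev/kfd --device=/dev/dri --group-add video "$IMAGE_TAG"
```

Printing the commands first makes it easy to review or adjust the assumed flags before anything is cloned or built.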