ROCm / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

[Issue]: 2.6.2-cktile - Won't build from source. #73

Open nktice opened 1 month ago

nktice commented 1 month ago

Problem Description

Unable to compile from source... the new Python packaging errors don't give much info.

cd ~
# get sources... I've tried both of these; the first doesn't work either, so it is commented out in favor of the default install instructions
#git clone --recurse-submodules https://github.com/ROCm/flash-attention.git
git clone https://github.com/ROCm/flash-attention.git
cd flash-attention
# install... fails either way; here are a couple of things I've tried, for posterity...
# pip install . 
python setup.py build
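A few other invocations that may be worth trying, for the record; note that GPU_ARCHS is my guess at what this fork's setup.py reads for target selection, so check setup.py before relying on it:

# install build deps up front, then use pip's PEP 517 path without isolation
pip install ninja packaging wheel
pip install . --use-pep517 --no-build-isolation -v
# guess: restrict compilation to one target, if setup.py honors GPU_ARCHS
GPU_ARCHS="gfx1100" python setup.py build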

This concludes with the following message, which lacks detail...

Successfully preprocessed all matching files.
Total number of unsupported CUDA function calls: 0

Total number of replaced kernel launches: 1
/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/setuptools/__init__.py:85: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
!!

        ********************************************************************************
        Requirements should be satisfied by a PEP 517 installer.
        If you are using pip, you can try `pip install --use-pep517`.
        ********************************************************************************

!!
  dist.fetch_build_eggs(dist.setup_requires)
Traceback (most recent call last):
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/pkg_resources/__init__.py", line 3395, in _dep_map
    return self.__dep_map
           ^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/pkg_resources/__init__.py", line 3175, in __getattr__
    raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/n/flash-attention/setup.py", line 490, in <module>
    setup(
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/setuptools/__init__.py", line 107, in setup
    _install_setup_requires(attrs)
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/setuptools/__init__.py", line 80, in _install_setup_requires
    _fetch_build_eggs(dist)
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/setuptools/__init__.py", line 85, in _fetch_build_eggs
    dist.fetch_build_eggs(dist.setup_requires)
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/setuptools/dist.py", line 612, in fetch_build_eggs
    return _fetch_build_eggs(self, requires)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/setuptools/installer.py", line 38, in _fetch_build_eggs
    resolved_dists = pkg_resources.working_set.resolve(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/pkg_resources/__init__.py", line 901, in resolve
    new_requirements = dist.requires(req.extras)[::-1]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/pkg_resources/__init__.py", line 3098, in requires
    dm = self._dep_map
         ^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/pkg_resources/__init__.py", line 3397, in _dep_map
    self.__dep_map = self._compute_dependencies()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/pkg_resources/__init__.py", line 3414, in _compute_dependencies
    common = types.MappingProxyType(dict.fromkeys(reqs_for_extra(None)))
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/pkg_resources/__init__.py", line 3411, in reqs_for_extra
    if not req.marker or req.marker.evaluate({'extra': extra}):
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/packaging/markers.py", line 241, in evaluate
    return _evaluate_markers(self._markers, current_environment)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/packaging/markers.py", line 150, in _evaluate_markers
    lhs_value, rhs_value = _normalize(lhs_value, rhs_value, key=environment_key)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/packaging/markers.py", line 124, in _normalize
    return tuple(canonicalize_name(v) for v in values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/packaging/markers.py", line 124, in <genexpr>
    return tuple(canonicalize_name(v) for v in values)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/fa/lib/python3.11/site-packages/packaging/utils.py", line 34, in canonicalize_name
    value = _canonicalize_regex.sub("-", name).lower()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected string or bytes-like object, got 'NoneType'
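For what it's worth, the final TypeError reproduces in isolation: when pkg_resources evaluates a requirement's 'extra' marker with extra=None, newer packaging versions try to canonicalize that None as if it were a name. A minimal sketch of the failing call, using the same packaging wheel as in the trace:

python -c "from packaging.utils import canonicalize_name; canonicalize_name(None)"
# TypeError: expected string or bytes-like object, got 'NoneType'

As far as I can tell this is the deprecated fetch_build_eggs path colliding with a newer packaging, not flash-attention itself; upgrading setuptools in the env, or pre-installing the setup_requires packages, may sidestep it.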

Operating System

NAME="Ubuntu" VERSION="24.04 LTS (Noble Numbat)"

CPU

model name : AMD Ryzen 9 5950X 16-Core Processor

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.1.0

ROCm Component

No response

Steps to Reproduce

I've been maintaining instructions for installing a few AI tools on AMD gear, at the following address (elsewhere on GitHub): http://github.com/nktice/AMD-AI There are a few versions based on different configs; note this dev version: https://github.com/nktice/AMD-AI/blob/main/dev.md It has the commands for where flash-attention was being installed from. In the stable version I had removed this section due to FA's age; with the new version I would like to add it back once it is working.
As these instructions are no longer functional, I thought I would write in, since others are likely to hit these issues and may like to know what resolution can be found so they can use it.

With the new version I cannot build from source (older versions did build).
See the message above in the problem description for the console text...

Could you please post the instructions you are using to get it to build, and any possibly missing requirements that could be at fault?

As a workaround I tried the pre-built packages that have been posted, and wrote this page on installing from them: https://github.com/nktice/AMD-AI/blob/main/flash-attention.md It is version-specific and has some odd requirements, so I wrote that down.
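After installing one of those wheels, a quick sanity check that the build matches the environment (torch.version.hip is only set on ROCm builds of torch):

python -c "import torch, flash_attn; print(flash_attn.__version__, torch.version.hip)"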

TWG's exllama loader wasn't using the new FA2, so I wrote to turboderp to report these issues there, so they are aware: https://github.com/turboderp/exllamav2/issues/397

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.7.0 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.13
Runtime Ext Version:     1.4
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 9 5950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 9 5950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3400                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    65747228(0x3eb391c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65747228(0x3eb391c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    65747228(0x3eb391c) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1100                            
  Uuid:                    [redacted]
  Marketing Name:          Radeon RX 7900 XTX                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      6144(0x1800) KB                    
    L3:                      98304(0x18000) KB                  
  Chip ID:                 29772(0x744c)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2431                               
  BDFID:                   3072                               
  Internal Node ID:        1                                  
  Compute Unit:            96                                 
  SIMDs per CU:            2                                  
  Shader Engines:          6                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 202                                
  SDMA engine uCode::      20                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1100         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx1100                            
  Uuid:                    [redacted]               
  Marketing Name:          Radeon RX 7900 XTX                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      6144(0x1800) KB                    
    L3:                      98304(0x18000) KB                  
  Chip ID:                 29772(0x744c)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2431                               
  BDFID:                   3840                               
  Internal Node ID:        2                                  
  Compute Unit:            96                                 
  SIMDs per CU:            2                                  
  Shader Engines:          6                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 202                                
  SDMA engine uCode::      20                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1100         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Additional Information

No response

nktice commented 3 weeks ago

I tried this again, and thought I would share the output from that attempt...

git clone --recurse-submodules https://github.com/ROCmSoftwarePlatform/flash-attention.git
cd flash-attention
python setup.py build | grep -i error
# output follows 
/home/n/flash-attention/csrc/composable_kernel/include/ck_tile/host/hip_check_error.hpp -> /home/n/flash-attention/csrc/composable_kernel/include/ck_tile/host/hip_check_error.hpp [skipped, no changes]
Successfully preprocessed all matching files.
/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/__init__.py:85: _DeprecatedInstaller: setuptools.installer and fetch_build_eggs are deprecated.
!!

        ********************************************************************************
        Requirements should be satisfied by a PEP 517 installer.
        If you are using pip, you can try `pip install --use-pep517`.
        ********************************************************************************

!!
  dist.fetch_build_eggs(dist.setup_requires)
Emitting ninja build file /home/n/flash-attention/build/temp.linux-x86_64-cpython-311/build.ninja...
Compiling objects...
Using envvar MAX_JOBS (6) as the number of workers...
/home/n/flash-attention/csrc/composable_kernel/include/ck_tile/core/arch/generic_memory_space_atomic_hip.hpp:66:19: error: static assertion failed due to requirement '(std::is_same<_Float16, int>::value && (4 == 1)) || (std::is_same<_Float16, unsigned int>::value && (4 == 1)) || (std::is_same<_Float16, float>::value && (4 == 1 || 4 == 2)) || (std::is_same<_Float16, double>::value && (4 == 1 || 4 == 2)) || (std::is_same<_Float16, unsigned short>::value && (4 == 2 || 4 == 4))': wrong! not implemented
1 error generated when compiling for gfx1100.
[the same static assertion failure at generic_memory_space_atomic_hip.hpp:66:19, followed by "1 error generated when compiling for gfx1100.", repeats five more times, once per parallel compile job]
Traceback (most recent call last):
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2104, in _run_ninja_build
    subprocess.run(
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '6']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/n/flash-attention/setup.py", line 490, in <module>
    setup(
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/__init__.py", line 108, in setup
    return distutils.core.setup(**attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 184, in setup
    return run_commands(dist)
           ^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 200, in run_commands
    dist.run_commands()
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 970, in run_commands
    self.run_command(cmd)
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/dist.py", line 945, in run_command
    super().run_command(command)
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 989, in run_command
    cmd_obj.run()
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/_distutils/command/build.py", line 135, in run
    self.run_command(cmd_name)
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 316, in run_command
    self.distribution.run_command(command)
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/dist.py", line 945, in run_command
    super().run_command(command)
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 989, in run_command
    cmd_obj.run()
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 93, in run
    _build_ext.run(self)
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 359, in run
    self.build_extensions()
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 868, in build_extensions
    build_ext.build_extensions(self)
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 479, in build_extensions
    self._build_extensions_serial()
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 505, in _build_extensions_serial
    self.build_extension(ext)
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 254, in build_extension
    _build_ext.build_extension(self, ext)
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/Cython/Distutils/build_ext.py", line 135, in build_extension
    super(build_ext, self).build_extension(ext)
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 560, in build_extension
    objects = self.compiler.compile(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 681, in unix_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1784, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "/home/n/miniconda3/envs/textgen/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2120, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

So I note errors at flash-attention/csrc/composable_kernel/include/ck_tile/core/arch/generic_memory_space_atomic_hip.hpp:66:19, where something is missing for my architecture (gfx1100).
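For reference, judging by the compiler message, the static_assert at that line whitelists atomic element types of int, unsigned int, float, double, and unsigned short at certain vector widths; _Float16 at width 4 matches none of them, so fp16 atomics simply aren't implemented on this path. The check can be inspected directly (line number taken from the error above):

# show the static_assert around line 66 of the offending header
sed -n '55,75p' csrc/composable_kernel/include/ck_tile/core/arch/generic_memory_space_atomic_hip.hpp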