Closed bobrobey closed 2 months ago
@bobrobey Apologies for the lack of response. Do you still need assistance with this ticket? Thanks!
I just tested it on ROCm 6.2.0
salloc -p MI250 -N 1 --gpus=1
rocminfo works, and rocm_agent_enumerator gives gfx90a
So I think it has been fixed
Looking at rocm_agent_enumerator from ROCm 6.2.0, it looks like it is now checking for read access:

    if os.path.isfile(prop_path) and os.access(prop_path, os.R_OK):

It also handles a permission error on the read (lines 203-206):

    except PermissionError:
        # We may have a subsystem (e.g. scheduler) limiting device visibility which
        # could cause a permission error.
        line = ''
It looks like it was fixed in ROCm 5.5.0
From 5.4.6 (lines 189-213):

    def readFromKFD():
        target_list = []

        topology_dir = '/sys/class/kfd/kfd/topology/nodes/'
        if os.path.isdir(topology_dir):
            for node in sorted(os.listdir(topology_dir)):
                node_path = os.path.join(topology_dir, node)
                if os.path.isdir(node_path):
                    prop_path = node_path + '/properties'
                    if os.path.isfile(prop_path):
                        target_search_term = re.compile("gfx_target_version.+")
                        with open(prop_path) as f:
                            line = f.readline()
                            while line != '' :
                                search_result = target_search_term.search(line)
                                if search_result is not None:
                                    device_id = int(search_result.group(0).split(' ')[1], 10)
                                    if device_id != 0:
                                        major_ver = int((device_id / 10000) % 100)
                                        minor_ver = int((device_id / 100) % 100)
                                        stepping_ver = int(device_id % 100)
                                        target_list.append("gfx" + format(major_ver, 'd') + format(minor_ver, 'x') + format(stepping_ver, 'x'))
                                line = f.readline()

        return target_list
From 5.5.0 (lines 192-218):

    topology_dir = '/sys/class/kfd/kfd/topology/nodes/'
    if os.path.isdir(topology_dir):
        for node in sorted(os.listdir(topology_dir)):
            node_path = os.path.join(topology_dir, node)
            if os.path.isdir(node_path):
                prop_path = node_path + '/properties'
                if os.path.isfile(prop_path) and os.access(prop_path, os.R_OK):
                    target_search_term = re.compile("gfx_target_version.+")
                    with open(prop_path) as f:
                        try:
                            line = f.readline()
                        except PermissionError:
                            # We may have a subsystem (e.g. scheduler) limiting device visibility which
                            # could cause a permission error.
                            line = ''
                        while line != '' :
                            search_result = target_search_term.search(line)
                            if search_result is not None:
                                device_id = int(search_result.group(0).split(' ')[1], 10)
                                if device_id != 0:
                                    major_ver = int((device_id / 10000) % 100)
                                    minor_ver = int((device_id / 100) % 100)
                                    stepping_ver = int(device_id % 100)
                                    target_list.append("gfx" + format(major_ver, 'd') + format(minor_ver, 'x') + format(stepping_ver, 'x'))
                            line = f.readline()

    return target_list
Note the addition of the os.access(prop_path, os.R_OK) check and the except PermissionError handler around the first readline().
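For anyone wondering how the gfx90a string falls out of those listings: both versions decode the gfx_target_version integer with the same arithmetic. Below is a minimal standalone sketch of just that step. The function name gfx_target_name is hypothetical, the sample values are only illustrative inputs, and it uses integer division where the script uses float division followed by int(), which gives the same result for these non-negative values.

```python
def gfx_target_name(device_id: int) -> str:
    """Decode a gfx_target_version integer into a gfx target string.

    Hypothetical helper mirroring the arithmetic in the listings:
    the value is read as major * 10000 + minor * 100 + stepping, and
    minor and stepping are printed as hexadecimal digits.
    """
    major_ver = (device_id // 10000) % 100
    minor_ver = (device_id // 100) % 100
    stepping_ver = device_id % 100
    return "gfx" + format(major_ver, "d") + format(minor_ver, "x") + format(stepping_ver, "x")


if __name__ == "__main__":
    # 90010 -> major 9, minor 0, stepping 10 (hex 'a')
    print(gfx_target_name(90010))  # gfx90a
```

The stepping value 10 is what becomes the trailing "a", since it is formatted as hex.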
Thanks @bobrobey. Closing ticket as fixed.
SLURM disables read access to any additional GPUs that were not requested with salloc. So rocm_agent_enumerator needs a check for read access to the properties files under the 0, 1, 2, 3 node directories, e.g. with os.access('joe.txt', os.R_OK), at about line 200.
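A rough sketch of what such a check could look like, outside the real script. The function name readable_target_versions and the can_read parameter are hypothetical; can_read only exists so the behaviour can be tried without actually changing file permissions. By default it falls back to os.access(path, os.R_OK). The directory layout (one properties file per node directory, with a "gfx_target_version <int>" line) is taken from the listings in this thread.

```python
import os
import re


def readable_target_versions(topology_dir, can_read=None):
    """Return (node, gfx_target_version) pairs for nodes we are allowed to read.

    Hypothetical sketch: nodes whose properties file is missing or not
    readable (e.g. hidden by the scheduler) are skipped, not treated as errors.
    """
    if can_read is None:
        def can_read(path):
            return os.access(path, os.R_OK)

    pattern = re.compile(r"gfx_target_version\s+(\d+)")
    found = []
    if not os.path.isdir(topology_dir):
        return found

    for node in sorted(os.listdir(topology_dir)):
        prop_path = os.path.join(topology_dir, node, "properties")
        # Skip nodes we cannot read instead of failing on open()/read().
        if not (os.path.isfile(prop_path) and can_read(prop_path)):
            continue
        try:
            with open(prop_path) as f:
                text = f.read()
        except PermissionError:
            # The access check can pass and the read still be refused.
            continue
        match = pattern.search(text)
        if match is not None:
            version = int(match.group(1))
            if version != 0:  # 0 is skipped, as in the original script
                found.append((node, version))
    return found


if __name__ == "__main__":
    print(readable_target_versions("/sys/class/kfd/kfd/topology/nodes/"))
```

Keeping the except PermissionError as well as the access check is deliberate: the check alone is not a guarantee that the subsequent read succeeds.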