ROCm / rocm-install-on-linux

https://rocm.docs.amd.com/projects/install-on-linux/en/latest/
MIT License

[Issue]: Memory Not Polling #324

Closed MarkCogan closed 1 month ago

MarkCogan commented 1 month ago

Problem Description

Driver version: 60140092
Err (Properties): no error
Device name: AMD Instinct MI210
Device arch: gfx90a:sramecc+:xnack-
Device pciBusID: 47
Device pciDeviceID: 0
Err (Count): no error
Device count: 8
Err (Mem): invalid argument. <--
Total Mem: 20140771
Avail Mem: 140736205064816

hipMemGetInfo(&freeMem, &totalMem) fails with "invalid argument" and does not return valid values.
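One way to confirm that the reported numbers are invalid, rather than merely unexpected, is to compare them with the pool size that rocminfo reports for the same GPU (67092480 KB in the output further down). The following is a minimal sketch only: `check_mem_report` is a hypothetical helper, it assumes the Total Mem and Avail Mem figures are byte counts as written by hipMemGetInfo, and the "less than half the pool" threshold is an arbitrary choice for illustration.

```python
def check_mem_report(free_bytes, total_bytes, pool_size_kb):
    """Return a list of reasons the (free, total) pair looks implausible.

    pool_size_kb is the GLOBAL pool size printed by rocminfo, in KB.
    The thresholds are illustrative assumptions, not AMD guidance.
    """
    problems = []
    pool_bytes = pool_size_kb * 1024
    # Free memory can never legitimately exceed total memory.
    if free_bytes > total_bytes:
        problems.append("free exceeds total")
    # Total memory should not be larger than the pool rocminfo reports.
    if total_bytes > pool_bytes:
        problems.append("total exceeds rocminfo pool size")
    # A total far below the pool size suggests the value was never filled in.
    if total_bytes < pool_bytes // 2:
        problems.append("total is far below rocminfo pool size")
    return problems


# Values copied from the problem description and the rocminfo output.
for reason in check_mem_report(
    free_bytes=140736205064816,
    total_bytes=20140771,
    pool_size_kb=67092480,
):
    print(reason)
```

An empty list means the pair is at least self-consistent; it does not prove the driver stack is healthy.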

Operating System

Rocky Linux release 8.7 (Green Obsidian)

CPU

AMD EPYC 7713 64-Core Processor

GPU

AMD Instinct MI210

ROCm Version

ROCm 6.1.0

ROCm Component

HIP, ROCm

Steps to Reproduce

We are unable to determine whether we're still missing a component or package. The system is air-gapped and runs a local repository for amdgpu and rocm_6.1. Comparing to other systems is not feasible.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

=====================
HSA System Attributes
=====================
Runtime Version: 1.13
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: NO

==========
HSA Agents
==========


Agent 1


Name: AMD EPYC 7713 64-Core Processor
Uuid: CPU-XX
Marketing Name: AMD EPYC 7713 64-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3720
BDFID: 0
Internal Node ID: 0
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32783820(0x1f43dcc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32783820(0x1f43dcc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32783820(0x1f43dcc) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:

******* Agent 16 *******
Name: gfx90a
Uuid: GPU-3dbcb8ccc0d65682
Marketing Name: AMD Instinct MI210
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 15
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29711(0x740f)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1700
BDFID: 52992
Internal Node ID: 15
Compute Unit: 104
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 2048(0x800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 63
SDMA engine uCode:: 8
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 67092480(0x3ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 4
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

Additional Information

No response

MarkCogan commented 1 month ago

Any help identifying a missing RPM or component would be appreciated. I'm fairly sure that's our issue, but since we can't do an online install, we're flying a bit blind trying to compare against another system. These are HPC systems, so we have a lot going on.

harkgill-amd commented 1 month ago

Hi @MarkCogan, an internal ticket has been created to further investigate this issue. In the future, please report your issues under the ROCm/ROCm repository as it is more frequently monitored.

MarkCogan commented 1 month ago

I think we can close this issue. Part of the fix was fully updating the OS and kernel (among other factors). Moving to the latest OS and ROCm repository solved the issue for us, and my users now see memory reported as expected. My sticking point, I think, was that installing ROCm 6 gave no warning about the OS kernel or possible compatibility issues.
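Since the installer gave no warning, a small pre-install check can at least print the facts that need comparing against the ROCm compatibility matrix by hand, which is useful on an air-gapped system. This is a rough sketch under stated assumptions: `parse_os_release` and `collect_platform_facts` are hypothetical helper names, the script only reads the standard /etc/os-release file and the running kernel release, and it deliberately contains no list of supported versions because those must come from AMD's documentation for the ROCm release being installed.

```python
import platform
from pathlib import Path


def parse_os_release(text):
    """Parse KEY=VALUE lines in os-release format into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def collect_platform_facts(os_release_path="/etc/os-release"):
    """Gather the kernel release and distribution name/version."""
    path = Path(os_release_path)
    fields = parse_os_release(path.read_text()) if path.is_file() else {}
    return {
        "kernel": platform.release(),
        "os_name": fields.get("NAME", "unknown"),
        "os_version": fields.get("VERSION_ID", "unknown"),
    }


# Print the values to compare manually against the compatibility matrix.
for key, value in collect_platform_facts().items():
    print(f"{key}: {value}")
```

Running this before and after an OS update makes it easy to record which kernel and distribution version the working configuration used.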