[Issue]: Crash during concurrent DMA operations "pthread_mutex_lock.c:94: ___pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed."

chris-barnes-at-etherform-com commented 1 month ago

Problem Description

Our application uses OpenCL (ROCm 6.1.2) and issues read and write operations from multiple threads to enqueue DMA transfers between CPU and GPU memory spaces (depending on parameters, 10-15 threads may be enqueuing reads/writes at once). This worked fine on previous ROCm builds, but after updating to the most recent build we get a crash with the message "pthread_mutex_lock.c:94: ___pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed". We loaded a debug build of libhsa-runtime64.so and see that the assertion fires here:

https://github.com/ROCm/ROCR-Runtime/blob/3ca6209de17a63d1c949852a314a99a3ff809e6e/src/core/runtime/amd_gpu_agent.cpp#L924

When the assertion is triggered, we see that the mutex's "__owner" is equal to 0, so it does not make sense that the assertion fails. We suspected memory corruption in our own code, but we have spent several days scrubbing it and can't find anything.

It appears that this lock is new-ish code and is apparently not needed for our use case (I am not sure what a "sdma gang" is, and we do not appear to ever be using the case where "gang_factor" > 1).

My best guess is that sdma_gang_lock is getting corrupted somehow or that multiple threads are trying to call into the lock at once (and/or releasing it? somehow?) and that is causing __owner to not be zero (even though it looks like it is).

Since we're using OpenCL I don't know what code owns the GpuAgent object or possibly how it could be corrupted....

In any case we're kind of screwed at the moment so any help would be appreciated.

(Note: we have reproduced this issue on another machine with a 7900X CPU and a W6800 GPU.)

Operating System

Ubuntu 22.04

CPU

AMD 7900X

GPU

AMD Radeon RX 7900 XTX, AMD Radeon Pro W6800

Other

No response

ROCm Version

ROCm 6.0.0

ROCm Component

ROCR-Runtime

Steps to Reproduce

My best guess is to use a ROCm OpenCL program that has multiple threads enqueuing reads and writes to/from GPU memory space.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.7.0 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.13
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES

==========
HSA Agents
==========

Agent 1

Name: AMD Ryzen 9 7900X 12-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 7900X 12-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5733
BDFID: 0
Internal Node ID: 0
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 31965172(0x1e7bff4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 31965172(0x1e7bff4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 31965172(0x1e7bff4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:

Agent 2

Name: gfx1100
Uuid: GPU-4c9cc6eed449cab4
Marketing Name: Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2304
BDFID: 768
Internal Node ID: 1
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 202
SDMA engine uCode:: 20
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32

Agent 3

Name: gfx1036
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 256(0x100) KB
Chip ID: 5710(0x164e)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2200
BDFID: 3584
Internal Node ID: 2
Compute Unit: 2
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 20
SDMA engine uCode:: 9
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1036
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Done

Additional Information

No response

dayatsin-amd commented 1 month ago

Hi Chris, we have a patch that will be included in ROCm 6.2. It seems there is an issue with the Release() function of the mutex. Can you please check whether removing this line fixes it for you:

  ScopedAcquire lock(&sdma_gang_lock_);
- if (gang_factor == 1) sdma_gang_lock_.Release();
  // Manage internal gang signals
  std::vector<core::Signal*> gang_signals;
  if (gang_factor > 1) {

chris-barnes-at-etherform-com commented 1 month ago

OK, we will run that test in about 10 minutes. We are currently running a test without the lock since it doesn't seem to be needed for us. :) The main issue is that we would need to have a special build of ROCm for our customer which is not ideal. What is the schedule for 6.2?

chris-barnes-at-etherform-com commented 1 month ago

That change appears to fix the problem (also removing the lock fixes the issue for whatever that's worth).

dayatsin-amd commented 1 month ago

Thank you for checking. The lock is necessary for multithreaded applications: it is valid for two separate threads to call the memory-copy functions at the same time.

chris-barnes-at-etherform-com commented 1 month ago

Oh, are you saying that the lock should not be released because it is necessary even in the case that "gang_factor == 1" (i.e., where it is released in the current 6.1.2 version)?

As a side comment: It seems that this makes 6.1.2 essentially unusable for a production application that is using > 1 thread (I imagine this is common..?), so it would seem like this should be in a hot fix or something...
