ROCm / ROCT-Thunk-Interface

ROCm's Thunk Interface
Other
83 stars 71 forks source link

Bug in hsaKmtCreateQueue #78

Closed misos1 closed 1 year ago

misos1 commented 1 year ago

When is hsaKmtCreateQueue called first time for node doorbells[NodeId].size is initialized to zero in init_process_doorbells but used here:

https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/026fae434a141faa10da109a2c1b03dc9e06db3f/src/queues.c#L574-L577

Only after these lines is doorbells[NodeId].size initialized to the right value through map_doorbell in get_doorbell_map_info. It works just by accident because doorbells[NodeId].size is uint32_t so -1 will be 0xFFFFFFFF which is zero extended into 0x00000000FFFFFFFF and it will work as long as mmap offset bits are not within lower 32 bits. There should be used something like DOORBELLS_PAGE_SIZE(DOORBELL_SIZE(q->gfxv)) - 1 instead of doorbells[NodeId].size - 1.

Also why KFD cannot return a doorbell pointer directly? Why must there be this additional mapping of some weird mmap offset?

fxkamd commented 1 year ago

Thanks for pointing out this issue.

The reason that KFD cannot return a doorbell pointer is, because it doesn't control virtual address mappings. The virtual address space in ROCm is a shared virtual address space between the CPU and all GPUs. It's managed in user mode using the mmap system call. When the queue is created, KFD only decides the offset of this queue within the per-process doorbell map. Those are the lower bits of the doorbell address.

The mapping of doorbells is delayed to the first queue creation because that avoids allocating doorbells unnecessarily for processes that don't use any queues. This is particularly important on multi-GPU systems where we don't want every process to allocate doorbells on all GPUs when they are only using one of the GPUs.

There is a bit of a chicken and egg problem for the first queue. I think the easiest fix is to call get_doorbell_map_info before calculating the doorbell_mmap_offset.

misos1 commented 1 year ago

I think the easiest fix is to call get_doorbell_map_info before calculating the doorbell_mmap_offset.

That would not play well with this check: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/026fae434a141faa10da109a2c1b03dc9e06db3f/src/queues.c#L217

There is a bit of a chicken and egg problem for the first queue.

This size is actually a constant based on the gpu version which is known at hsaKmtCreateQueue. What would be wrong with just DOORBELLS_PAGE_SIZE(DOORBELL_SIZE(q->gfxv))?

fxkamd commented 1 year ago

That check is there to prevent mapping the doorbells twice. It could be replaced with a check for doorbells[NodeId].mapping. Using DOORBELLS_PAGE_SIZE would work too.

misos1 commented 1 year ago

Yes, of course. But you would need to set doorbells[NodeId].size with locked mutex doorbells[NodeId].mutex to avoid UB. Actually as I am thinking about that also accessing doorbells[NodeId].size here is UB: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/026fae434a141faa10da109a2c1b03dc9e06db3f/src/queues.c#L574-L577

kentrussell commented 1 year ago

Confirmed to be resolved in ROCm 5.5 (next release).

misos1 commented 1 year ago

Do you know when ROCM 5.5 will be released?

kentrussell commented 1 year ago

I'm not a PM, so I'm not allowed to say the planned release dates. But I confirmed that Felix's patch was part of the 5.5 branch that is currently undergoing testing. So you won't be waiting too long.