Quuxplusone / LLVMBugzillaTest


[AMDGPU] missing MRT export causes GPU hangs #42930

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR43960
Status NEW
Importance P enhancement
Reported by Samuel Pitoiset (samuel.pitoiset@gmail.com)
Reported on 2019-11-11 02:28:25 -0800
Last modified on 2020-01-14 08:56:29 -0800
Version trunk
Hardware PC Linux
CC carl.ritson@amd.com, cwabbott0@gmail.com, david.stuttard@amd.com, htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org, Matthew.Arsenault@amd.com, nhaehnle@gmail.com, tpr.ll@botech.co.uk
Fixed by commit(s)
Attachments while-inside-switch.log (9711 bytes, text/x-log)
Blocks
Blocked by
See also
Created attachment 22792
output

Hi,

dEQP-VK.graphicsfuzz.while-inside-switch hangs the GPU with RADV. According to
my investigation, it's because of a missing null export. See the 'output'
attachment; I think the compiler should emit 'exp null off, off, off, off done
vm' before 's_endpgm'.

FWIW, that test works with ACO.
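For reference, a minimal sketch of the expected pixel-shader epilogue (AMDGPU assembly; surrounding code and target details omitted):

```
; Sketch: when no MRT export has executed on a path, the hardware still
; expects at least one export marked 'done' before the wave terminates.
; A "null" export with all channels disabled satisfies that requirement.
exp null off, off, off, off done vm
s_endpgm
```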
Quuxplusone commented 5 years ago

Attached while-inside-switch.log (9711 bytes, text/x-log): output

Quuxplusone commented 5 years ago
Hi,

I believe that, in the example, changing "br label %loop6" to "br label %endloop1"
will fix the output so the export is not skipped.
It is my understanding that the backend expects control flow to converge after
a kill, so that the export is reached by all lanes.
So this could be a RADV issue?

As a related note, I have experimented with implementing a backend pass to add
missing null exports on control paths where no 'export done' will be visited;
however, the required analysis quickly becomes complex. So while it is possible
for the backend to handle this, I think it is preferable that the front-end (in
this case RADV) simply ensure there is an 'export done' on all control paths.

Thanks,

Carl
Quuxplusone commented 5 years ago
> I believe with the example changing "br label %loop6" to "br label
> %endloop1" will fix the output so the export is not skipped.
> It is my understanding that the backend expects control flow convergences
> after a kill for the export to be reached by all lanes.
> So this could be a RADV issue?

Changing "br label %loop6" to "br label %endloop1" should work, yes. Although
this chunk of code [1] is totally dumb, it's correct, and I think the backend
compiler should understand it and detect that the loop is actually not a loop?

It could also be "fixed" at the NIR level (the IR used by Mesa), but that looks
like a workaround to me. FWIW, ACO handles it correctly.

[1]
loop6:                                            ; preds = %endif2, %loop6
  call void @llvm.amdgcn.kill(i1 false) #3
  br label %loop6
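For comparison, the rewrite suggested in the earlier comment would replace the self-loop with a branch to the convergence point, roughly (the %endloop1 block name is taken from that comment; the rest of the function is omitted):

```llvm
loop6:                                            ; preds = %endif2
  call void @llvm.amdgcn.kill(i1 false) #3
  br label %endloop1    ; rejoin the common exit instead of self-looping,
                        ; so all lanes reach the block containing the export
```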
Quuxplusone commented 5 years ago

I think the design of the kill intrinsic is broken. It should produce a boolean to branch to somewhere on, such as a block ending in unreachable.
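A hypothetical sketch of what a boolean-producing kill could look like in IR (the intrinsic name and signature here are illustrative, not an existing LLVM intrinsic):

```llvm
; Hypothetical: kill returns whether the current lane survived, and the
; caller must branch on it, making the dead path explicit in the CFG.
%live = call i1 @llvm.amdgcn.kill.cond(i1 %keep)  ; hypothetical intrinsic
br i1 %live, label %continue, label %dead

dead:                                             ; killed lanes end here
  unreachable
```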

Quuxplusone commented 4 years ago

Any plans to fix this?

Quuxplusone commented 4 years ago
(In reply to Samuel Pitoiset from comment #4)
> Any plans to fix this?

Can I convince all the users to migrate to a new intrinsic?
Quuxplusone commented 4 years ago
(In reply to Matt Arsenault from comment #5)
> (In reply to Samuel Pitoiset from comment #4)
> > Any plans to fix this?
>
> Can I convince all the users to migrate to a new intrinsic?

I think so, at least for Mesa.
Quuxplusone commented 4 years ago

I haven’t looked into the details of what’s required, but since we have the callbr instruction now, we should probably use it for kills. It avoids the problem of possibly having instructions between the kill and the terminator, and forces clients to make some sensible choice of where the control flow logically goes. I don’t think this requires a change to the intrinsic itself, only to how it’s called.

Additional work will be needed to have the control-flow passes understand it, but I optimistically think it’s a manageable amount of work. This specific case shouldn’t really be much different from return, and we don’t need to handle callbr in general.
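A hypothetical sketch of a callbr-based kill (illustrative syntax only: callbr is currently specified for inline asm, so using it with an intrinsic callee assumes a language extension):

```llvm
; Hypothetical: kill as a terminator with explicit successors. The
; killed destination ends in unreachable, so the backend can see that
; no export is required along that path.
callbr void @llvm.amdgcn.kill(i1 %cond)
        to label %live [label %killed]

killed:
  unreachable
```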

Quuxplusone commented 4 years ago

I don't think the main problem is the discard itself. From what I understand, this shader hangs because BB18_8 doesn't contain a null export.

Quuxplusone commented 4 years ago

I've posted a patch for review https://reviews.llvm.org/D70781 which fixes the problem with this shader.