Quuxplusone / LLVMBugzillaTest


[AMDGPU] missing MRT export causes GPU hangs #42930

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR43960
Status NEW
Importance P enhancement
Reported by Samuel Pitoiset (samuel.pitoiset@gmail.com)
Reported on 2019-11-11 02:28:25 -0800
Last modified on 2020-01-14 08:56:29 -0800
Version trunk
Hardware PC Linux
CC carl.ritson@amd.com, cwabbott0@gmail.com, david.stuttard@amd.com, htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org, Matthew.Arsenault@amd.com, nhaehnle@gmail.com, tpr.ll@botech.co.uk
Fixed by commit(s)
Attachments while-inside-switch.log (9711 bytes, text/x-log)
Blocks
Blocked by
See also
Created attachment 22792
output

Hi,

dEQP-VK.graphicsfuzz.while-inside-switch hangs the GPU with RADV. According to
my investigation, it's because of a missing null export. See the 'output'
attachment; I think the compiler should emit 'exp null off, off, off, off done
vm' before 's_endpgm'.

FWIW, that test works with ACO.
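For reference, a minimal sketch of the expected pixel-shader epilogue (AMDGPU assembly; surrounding code and target details omitted):

```
; Sketch: when no MRT export has executed on a path, the hardware still
; expects at least one export marked 'done' before the wave terminates.
; A "null" export with all channels disabled satisfies that requirement.
exp null off, off, off, off done vm
s_endpgm
```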
Quuxplusone commented 5 years ago

Attached while-inside-switch.log (9711 bytes, text/x-log): output

Quuxplusone commented 5 years ago
Hi,

I believe that, in the example, changing "br label %loop6" to "br label %endloop1"
will fix the output so the export is not skipped.
It is my understanding that the backend expects control flow to converge after
a kill, so that the export is reached by all lanes.
So this could be a RADV issue?

As a related note, I have experimented with implementing a backend pass to add
missing null exports on control paths where no 'export done' will be visited;
however, the required analysis quickly becomes complex. So while it is possible
for the backend to handle this, I think it is preferable that the front-end (in
this case RADV) simply ensure there is an 'export done' on all control paths.

Thanks,

Carl
Quuxplusone commented 5 years ago
> I believe with the example changing "br label %loop6" to "br label
> %endloop1" will fix the output so the export is not skipped.
> It is my understanding that the backend expects control flow convergences
> after a kill for the export to be reached by all lanes.
> So this could be a RADV issue?

Changing "br label %loop6" to "br label %endloop1" should work, yes. Although
this chunk of code [1] is totally dumb, it's correct, and I think the backend
compiler should understand it and detect that the loop is actually not a loop?

It could also be "fixed" at the NIR level (the IR used by Mesa), but that looks
like a workaround to me. FWIW, ACO handles it correctly.

[1]
loop6:                                            ; preds = %endif2, %loop6
  call void @llvm.amdgcn.kill(i1 false) #3
  br label %loop6
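For comparison, the rewrite suggested in the earlier comment would replace the self-loop with a branch to the convergence point, roughly (the %endloop1 block name is taken from that comment; the rest of the function is omitted):

```llvm
loop6:                                            ; preds = %endif2
  call void @llvm.amdgcn.kill(i1 false) #3
  br label %endloop1    ; rejoin the common exit instead of self-looping,
                        ; so all lanes reach the block containing the export
```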
Quuxplusone commented 5 years ago

I think the design of the kill intrinsic is broken. It should produce a boolean to branch to somewhere on, such as a block ending in unreachable.
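A hypothetical sketch of what a boolean-producing kill could look like in IR (the intrinsic name and signature here are illustrative, not an existing LLVM intrinsic):

```llvm
; Hypothetical: kill returns whether the current lane survived, and the
; caller must branch on it, making the dead path explicit in the CFG.
%live = call i1 @llvm.amdgcn.kill.cond(i1 %keep)  ; hypothetical intrinsic
br i1 %live, label %continue, label %dead

dead:                                             ; killed lanes end here
  unreachable
```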

Quuxplusone commented 4 years ago

Any plans to fix this?

Quuxplusone commented 4 years ago
(In reply to Samuel Pitoiset from comment #4)
> Any plans to fix this?

Can I convince all the users to migrate to a new intrinsic?
Quuxplusone commented 4 years ago
(In reply to Matt Arsenault from comment #5)
> (In reply to Samuel Pitoiset from comment #4)
> > Any plans to fix this?
>
> Can I convince all the users to migrate to a new intrinsic?

I think so, at least for Mesa.
Quuxplusone commented 4 years ago

I haven’t looked into the details of what’s required, but since we have the callbr instruction now, we should probably use it for kills. It avoids the problem of possibly having instructions between the kill and the terminator, and forces clients to make some sensible choice of where the control flow logically goes. I don’t think this requires a change to the intrinsic itself, only to how it’s called.

Additional work will be needed to have the control-flow passes understand it, but I optimistically think it’s a manageable amount of work. This specific case shouldn’t really be much different from return, and we don’t need to handle callbr in general.
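A hypothetical sketch of a callbr-based kill (illustrative syntax only: callbr is currently specified for inline asm, so using it with an intrinsic callee assumes a language extension):

```llvm
; Hypothetical: kill as a terminator with explicit successors. The
; killed destination ends in unreachable, so the backend can see that
; no export is required along that path.
callbr void @llvm.amdgcn.kill(i1 %cond)
        to label %live [label %killed]

killed:
  unreachable
```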

Quuxplusone commented 4 years ago

I don't think the main problem is the discard itself. From what I understand, this shader hangs because BB18_8 doesn't contain a null export.

Quuxplusone commented 4 years ago

I've posted a patch for review https://reviews.llvm.org/D70781 which fixes the problem with this shader.