CodeGen crash with "Queue is empty!" on lshr i256 instruction

Quuxplusone commented 9 years ago


Bugzilla Link	PR22304
Status	RESOLVED FIXED
Importance	P normal
Reported by	Paweł Bylica (chfast@gmail.com)
Reported on	2015-01-23 03:44:01 -0800
Last modified on	2015-11-11 17:33:19 -0800
Version	trunk
Hardware	PC All
CC	atrick@apple.com, chfast@gmail.com, llvm-bugs@lists.llvm.org, quentin.colombet@gmail.com, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments	`lshr_codegen_bug_queue_is_empty.ll` (608 bytes, application/octet-stream) `lshr_bug_queue_is_empty.ll` (270 bytes, application/octet-stream) `codegen_crash.log` (159351 bytes, text/x-log) `bug_codegen_crash_ctlz_i256.ll` (261 bytes, application/octet-stream) `bug_codegen_crash_ctlz_i256_2.ll` (276 bytes, application/octet-stream) `bug_scheduledag4.ll` (279 bytes, application/octet-stream)
Blocks
Blocked by
See also	PR21184, PR19797, PR19594, PR23405, PR25498

Created attachment 13724
CodeGen test

Assertion failed: !Queue.empty() && "Queue is empty!", file llvm\lib\CodeGen\SelectionDAG\ScheduleDAGRRList.cpp, line 1701
Stack dump:
0.      Program arguments: h:\Sources\libs\build\llvm-3.5.0-x64\Debug\bin\llc.exe bugpoint-reduced-simplified.ll
1.      Running pass 'Function Pass Manager' on module 'bugpoint-reduced-simplified.ll'.
2.      Running pass 'X86 DAG->DAG Instruction Selection' on function '@lshr_codegen_bug_queue_is_empty'

It happens on Windows and Linux, LLVM 3.5.0. I will check current trunk.

I used bugpoint to reduce the test. I'm pretty sure the point of failure is
the lshr i256 instruction, because I reimplemented my algorithm without it and
it worked.

Quuxplusone commented 9 years ago

Confirmed with LLVM 3.7 SVN r226904.

Quuxplusone commented 9 years ago

Bug 21184 is another case of a larger-than-legal shift going wrong.

Both of these bugs appear to be manifestations of a bigger problem: LLVM doesn't know how to deal with jumbo types, but the documentation doesn't admit it. This is discussed in bug 19797.

Quuxplusone commented 9 years ago

FWIW, I also asked about this on the dev list recently:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-January/080875.html

Quuxplusone commented 9 years ago

I use a lot of i256 arithmetic in my library and it is not so bad. I needed to implement mul & div myself and avoid shift left, but in general it works.

I would like to fix some bugs, but I'm not sure what the best workflow for that would be. I started by sending some tests to phabricator.

Quuxplusone commented 9 years ago

Attached lshr_codegen_bug_queue_is_empty.ll (608 bytes, application/octet-stream): CodeGen test

Quuxplusone commented 9 years ago

Attached lshr_bug_queue_is_empty.ll (270 bytes, application/octet-stream): CodeGen test

Quuxplusone commented 9 years ago

I reduced the test manually

Quuxplusone commented 9 years ago

(In reply to comment #4)
> I would like to fix some bugs, but I'm not sure what the best workflow for
> that would be. I started by sending some tests to phabricator.

Hi Paweł -

Thanks for reducing the test case. I haven't looked at the details of IR
legalization, but if you want to fix the bug, I think you would look at
LegalizeIntegerTypes.cpp:
http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

Quuxplusone commented 9 years ago

Hi Sanjay,

I've spent some time in the debugger to track down the problem. The failure happens in "register scheduling", which is black magic to me. The problem might be register-related, because if one more instruction is eliminated from the test, llc works correctly.

We know that the IR is valid. Is there a next lower level of abstraction that I could check instruction by instruction?

I can also try to trace the type legalization if you think the problem is there.

Quuxplusone commented 9 years ago

(In reply to comment #8)
> I've spent some time in the debugger to track down the problem. The failure
> happens in "register scheduling", which is black magic to me. The problem
> might be register-related, because if one more instruction is eliminated
> from the test, llc works correctly.
>
> We know that the IR is valid. Is there a next lower level of abstraction
> that I could check instruction by instruction?

Ah - you've already made some progress. Great!

The machine level is also black magic to me. :)

So yes, it looks like legalization has already happened successfully, and we're
dying very near the end of compilation while optimizing the machine
instructions.

-debug shows:
*** Unscheduling [65]: SU(24): 0x7f8ae4803240: i64,i32 = ADC64ri8 0x7f8ae4055350, 0x7f8ae48054e0, 0x7f8ae48039b0:1 [ORD=2] [ID=24]

I don't have much experience here other than to look closely at the -debug
output from llc and then step through in the debugger. If you have questions,
you may try to contact Quentin Colombet on IRC or the mailing lists. He's
listed as the code owner of the register allocators.

Quuxplusone commented 9 years ago

To get specific logs you can use -debug-only=pre-RA-sched

Quuxplusone commented 9 years ago

Attached codegen_crash.log (159351 bytes, text/x-log): pre-RA-sched log and assert callstack

Quuxplusone commented 9 years ago

I attached a debug log. Quentin, can you check if there is anything suspicious in the log?

Quuxplusone commented 9 years ago

Attached bug_codegen_crash_ctlz_i256.ll (261 bytes, application/octet-stream): CodeGen test 2

Quuxplusone commented 9 years ago

Attached bug_codegen_crash_ctlz_i256_2.ll (276 bytes, application/octet-stream): CodeGen test 3

Quuxplusone commented 9 years ago

CC'ed Andy since he is the scheduler expert.

I haven't looked at the problem myself yet, but usually when we fail to schedule the DAG, it is because we have too many constraints, in particular ones that cannot be satisfied.

Paweł, are your DAGs cycle-free?
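
A minimal sketch of why an unsatisfiable set of constraints ends with an empty ready queue. This is a toy list scheduler, not LLVM's ScheduleDAGRRList; `ToyDAG` and `canScheduleAll` are hypothetical names invented for illustration. A node becomes ready once all of its predecessors are scheduled, so if the dependence edges form a cycle, the queue runs dry while nodes are still waiting.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Toy dependence graph: edges[u] lists the nodes that must come after u.
// Hypothetical type for illustration; not part of LLVM.
struct ToyDAG {
    std::vector<std::vector<int>> edges;
    explicit ToyDAG(std::size_t n) : edges(n) {}
    void addEdge(int from, int to) { edges[from].push_back(to); }
};

// Returns true if every node can be scheduled. Returns false when the ready
// queue becomes empty while nodes remain, the toy analogue of the
// "Queue is empty!" assertion.
bool canScheduleAll(const ToyDAG &dag) {
    // Count unscheduled predecessors for each node.
    std::vector<int> pendingPreds(dag.edges.size(), 0);
    for (const auto &succs : dag.edges)
        for (int s : succs)
            ++pendingPreds[s];

    // Nodes without predecessors are ready immediately.
    std::queue<int> ready;
    for (std::size_t i = 0; i < pendingPreds.size(); ++i)
        if (pendingPreds[i] == 0)
            ready.push(static_cast<int>(i));

    std::size_t scheduled = 0;
    while (!ready.empty()) {
        int u = ready.front();
        ready.pop();
        ++scheduled;
        // Scheduling u may release its successors.
        for (int s : dag.edges[u])
            if (--pendingPreds[s] == 0)
                ready.push(s);
    }
    return scheduled == dag.edges.size();
}
```

With edges 0 -> 1 -> 2 the function returns true; adding 2 -> 1 makes it return false after only node 0 has been scheduled.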

Quuxplusone commented 9 years ago

How can I check if it's cycle-free? Can that be seen in the isel graph?

Quuxplusone commented 9 years ago

(In reply to comment #16)
> How can I check if it's cycle-free? Can that be seen in the isel graph?

You should be able to use -view-sched-dags (or something like that).

Quuxplusone commented 9 years ago

Both the isel and sched DAGs look acyclic. I haven't found any edge going down.

Quuxplusone commented 9 years ago

I had a quick look at the produced DAG, and my guess is that the scheduler
failed to schedule the nodes that touch EFLAGS.
Indeed, a lot of instructions write to it, with a lot of glue operators.

Andy knows this code much better than I do, so he will have a better diagnosis.

Quuxplusone commented 9 years ago

I've seen problems like this before with larger-than-legal types. The
SelectionDAG scheduler has always had a very hard time with instructions that
write EFLAGS and have multiple users. This comes up when operations on large
types are split into smaller operations with carry-flag, like ADC.
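
As a rough illustration of that splitting, here is a sketch of a 256-bit addition written as four 64-bit additions with an explicit carry. `U256` and `add256` are hypothetical names, and this is ordinary C++ rather than anything from the LLVM legalizer. Each step consumes the carry produced by the previous one, which is the dependence that an ADD/ADC chain expresses through the carry flag.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Four 64-bit limbs, least significant first. Hypothetical stand-in for i256.
using U256 = std::array<std::uint64_t, 4>;

// Adds two 256-bit values limb by limb, threading the carry from each limb
// into the next one, as an add-with-carry chain would.
U256 add256(const U256 &a, const U256 &b) {
    U256 result{};
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        std::uint64_t sum = a[i] + b[i];           // may wrap around
        std::uint64_t carryOut = (sum < a[i]) ? 1 : 0;
        sum += carry;                              // add the incoming carry
        if (sum < carry)                           // wrapped a second time
            carryOut = 1;
        result[i] = sum;
        carry = carryOut;                          // the final carry is dropped
    }
    return result;
}
```

For example, adding 1 to a value whose two low limbs are all ones carries through both of them and sets the third limb to 1.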

The SelectionDAG scheduler has logic for cloning instructions to back itself
out of an unschedulable case. See CopyAndMoveSuccessors. But before cloning, it
tries to backtrack and add dependence edges in a way that the EFLAGS conflict
won't occur.
See DEBUG output "ARTIFICIAL edge from SU..."

The scheduler should check WillCreateCycle to avoid creating an unschedulable
graph, but it may be that the book-keeping for marking dependence edges as
scheduled is missing a case somewhere.
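
A minimal sketch of the idea behind such a check, assuming a plain successor-list graph. `Graph`, `reaches`, and `addEdgeIfAcyclic` are hypothetical names and do not reflect how WillCreateCycle is actually implemented. Adding an edge pred -> succ closes a cycle exactly when pred is already reachable from succ, so the sketch tests reachability first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy graph: succs[u] holds the nodes that depend on u.
using Graph = std::vector<std::vector<int>>;

// True if `to` can be reached from `from` along existing edges
// (iterative depth-first search).
bool reaches(const Graph &succs, int from, int to) {
    std::vector<bool> seen(succs.size(), false);
    std::vector<int> stack{from};
    while (!stack.empty()) {
        int u = stack.back();
        stack.pop_back();
        if (u == to)
            return true;
        if (seen[u])
            continue;
        seen[u] = true;
        for (int s : succs[u])
            stack.push_back(s);
    }
    return false;
}

// Adds the edge pred -> succ only if the graph stays acyclic.
// Returns false and leaves the graph unchanged otherwise.
bool addEdgeIfAcyclic(Graph &succs, int pred, int succ) {
    if (reaches(succs, succ, pred))
        return false;
    succs[pred].push_back(succ);
    return true;
}
```

On a three-node graph, adding 0 -> 1 and 1 -> 2 succeeds, while a later attempt to add 2 -> 0 is rejected.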

Forcing it to call into CopyAndMoveSuccessors might be a way to brute-force out
of this situation. It would be interesting to know if that solves it.

Otherwise you need to debug the trace from -debug-only=pre-RA-sched and keep
track of which dependence edges have been added and what nodes should be
"ready".

I hope that helps a little.

Quuxplusone commented 9 years ago

You were right, Andrew. Without backtracking, the tests compile. I also ran the regression tests to check that nothing was broken, and everything seems to be fine. I will try to investigate the backtracking itself further.

Quuxplusone commented 9 years ago

Attached bug_scheduledag4.ll (279 bytes, application/octet-stream): CodeGen test 4

Quuxplusone commented 9 years ago

When the initial DAG is printed you see all the obvious dependencies between instructions. However, instructions that read/write EFLAGS don't have explicit dependencies. As the scheduler chooses instructions, it tries to sort out the EFLAGS readers/writers. At that time, it may add new edges. To see the new edges that are added, you'll need to trace the scheduler; I think -debug-only=pre-RA-sched will do it.

Quuxplusone commented 9 years ago

Thanks for the suggestions.

The following additional check fixes the problem:
http://reviews.llvm.org/D8556

Quuxplusone commented 9 years ago

Author: Andrew Trick <atrick@apple.com>
Date:   Thu Mar 26 20:44:13 2015

    Fix a bug in SelectionDAG scheduling backtracking code: PR22304.

    It can happen (by line CurSU->isPending = true; // This SU is not in
    AvailableQueue right now.) that a SUnit is mark as available but is
    not in the AvailableQueue. For SUnit being selected for scheduling
    both conditions must be met.

    This patch mainly defensively protects from invalid removing a node
    from a queue. Sometimes nodes are marked isAvailable but are not in
    the queue because they have been defered due to some hazard.

    Patch by Pawel Bylica!

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@233351 91177308-0d34-0410-b5e6-96231b3b80d8
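
The condition described in the commit message can be sketched as follows. This is a simplified model under stated assumptions, not the code from D8556: `ToyUnit` and `ToyAvailableQueue` are hypothetical types that model only the two flags mentioned above. A unit that was deferred stays marked available without being in the queue, so removal checks both conditions before touching the container.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified scheduling unit with only the flags discussed above.
struct ToyUnit {
    int id;
    bool isAvailable;  // marked as ready to schedule
    bool isPending;    // deferred, so not in the available queue right now
    explicit ToyUnit(int id) : id(id), isAvailable(false), isPending(false) {}
};

// Hypothetical stand-in for the scheduler's queue of available units.
struct ToyAvailableQueue {
    std::vector<ToyUnit *> units;

    void push(ToyUnit *u) {
        u->isAvailable = true;
        u->isPending = false;
        units.push_back(u);
    }

    bool contains(const ToyUnit *u) const {
        return std::find(units.begin(), units.end(), u) != units.end();
    }

    // Removes the unit only when it is both marked available and actually
    // present in the queue. Returns false and changes nothing otherwise.
    bool removeIfQueued(ToyUnit *u) {
        if (!u->isAvailable || !contains(u))
            return false;
        units.erase(std::find(units.begin(), units.end(), u));
        return true;
    }
};
```

A unit with `isAvailable` and `isPending` both set, but never pushed, is left alone by `removeIfQueued`, whereas a unit that was pushed is removed normally.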

Quuxplusone commented 9 years ago

The fix here probably also resolves the question I had a while back in:
bug 19594, comment 3

Quuxplusone / LLVMBugzillaTest

CodeGen crash with "Queue is empty!" on lshr i256 instruction #22303