Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

instruction scheduler is overly pessimistic about instruction availability #10514

Open Quuxplusone opened 12 years ago

Quuxplusone commented 12 years ago
Bugzilla Link PR11589
Status NEW
Importance P enhancement
Reported by Hal Finkel (hfinkel@anl.gov)
Reported on 2011-12-15 13:58:28 -0800
Last modified on 2011-12-19 23:58:11 -0800
Version trunk
Hardware PC All
CC atrick@apple.com, efriedma@quicinc.com, evan.cheng@apple.com, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments tsc_s000.ll (9931 bytes, application/octet-stream)
Blocks
Blocked by
See also
Created attachment 7748
Test case

The scoreboard hazard detector that I've added for the PPC 440 is not
detecting hazards as it should (which certainly could be my fault
somehow, but...). For example, it will produce a schedule that looks
like...

SU(28): 0x127969b0: f64,ch = LFD 0x12793aa0, 0x1277b4f0,
0x127965b0<Mem:LD8[%scevgep100](tbaa=!"double")> [ORD=41] [ID=28]
SU(46): 0x12796ab0: f64 = FADD 0x127969b0, 0x127968b0 [ORD=42] [ID=46]
SU(27): 0x12796cb0: ch = STFD 0x12796ab0, 0x12793aa0, 0x1277b3f0,
0x127969b0:1<Mem:ST8[%scevgep103](tbaa=!"double")> [ORD=46] [ID=27]
SU(26): 0x127970b0: f64,ch = LFD 0x127941a0, 0x1277b4f0,
0x12796cb0<Mem:LD8[%scevgep94](align=16)(tbaa=!"double")> [ORD=50]
[ID=26]
SU(47): 0x127972c0: f64 = FADD 0x127970b0, 0x127968b0 [ORD=51] [ID=47]
SU(25): 0x127974c0: ch = STFD 0x127972c0, 0x127941a0, 0x1277b3f0,
0x127970b0:1<Mem:ST8[%scevgep97](align=16)(tbaa=!"double")> [ORD=55]
[ID=25]

in other words, it produces a set of load, add, store triples,
non-interleaved, in order. The problem is that the result of the load is
not immediately available, and either is the result of the add. The
loads are covered by the itinerary:

  InstrItinData<LdStLFD     , [InstrStage<1, [IFTH1, IFTH2]>,
                               InstrStage<1, [PDCD1, PDCD2]>,
                               InstrStage<1, [DISS1, DISS2]>,
                               InstrStage<1, [LRACC]>,
                               InstrStage<1, [AGEN]>,
                               InstrStage<1, [CRD]>,
                               InstrStage<2, [LWB]>],
                              [9, 5, 5],
                              [NoBypass, GPR_Bypass, GPR_Bypass]>,

the add is covered by the itinerary:

  InstrItinData<FPGeneral   , [InstrStage<1, [IFTH1, IFTH2]>,
                               InstrStage<1, [PDCD1, PDCD2]>,
                               InstrStage<1, [DISS1, DISS2]>,
                               InstrStage<1, [FRACC]>,
                               InstrStage<1, [FEXE1]>,
                               InstrStage<1, [FEXE2]>,
                               InstrStage<1, [FEXE3]>,
                               InstrStage<1, [FEXE4]>,
                               InstrStage<1, [FEXE5]>,
                               InstrStage<1, [FEXE6]>,
                               InstrStage<1, [FWB]>],
                              [10, 4, 4],
                              [FPR_Bypass, FPR_Bypass, FPR_Bypass]>,

the store is covered by:

  InstrItinData<LdStUX      , [InstrStage<1, [IFTH1, IFTH2]>,
                               InstrStage<1, [PDCD1, PDCD2]>,
                               InstrStage<1, [DISS1, DISS2]>,
                               InstrStage<1, [LRACC]>,
                               InstrStage<1, [AGEN]>,
                               InstrStage<1, [CRD]>,
                               InstrStage<1, [LWB]>],
                              [8, 5, 5],
                              [NoBypass, GPR_Bypass, GPR_Bypass]>,

So, say that the load dispatches in cycle 1. According to the itinerary,
the result of the load is not available until cycle 9. The add
dispatches in the same cycle, or one cycle later. In the best case, it
dispatches one cycle later (in cycle 2). It expects to read its inputs 4
cycles later in cycle number 6. The input, however, will not be
available until cycle 9 yielding a 3 cycle stall. As far as I can tell
by looking at the debug output, no hazard was reported by the scoreboard
detector. Is this a bug or am I doing something wrong?

I've attached a small test case, run with llc -mcpu=440
Quuxplusone commented 12 years ago

Attached tsc_s000.ll (9931 bytes, application/octet-stream): Test case

Quuxplusone commented 12 years ago
Looking into this more, it seems that the problem is that the scheduler never
has more than one instruction from which to choose.

The instruction stream is essentially a series of:
  %inc15 = or i32 %i.013, 1
  %arrayidx.1 = getelementptr inbounds [16000 x double]* @Y, i32 0, i32 %inc15
  %1 = load double* %arrayidx.1, align 8, !tbaa !0
  %add.1 = fadd double %1, 1.000000e+00
  %arrayidx5.1 = getelementptr inbounds [16000 x double]* @X, i32 0, i32 %inc15
  store double %add.1, double* %arrayidx5.1, align 8, !tbaa !0
  %inc.116 = or i32 %i.013, 2
  %arrayidx.2 = getelementptr inbounds [16000 x double]* @Y, i32 0, i32 %inc.116
  %2 = load double* %arrayidx.2, align 16, !tbaa !0
  %add.2 = fadd double %2, 1.000000e+00
  %arrayidx5.2 = getelementptr inbounds [16000 x double]* @X, i32 0, i32 %inc.116
  store double %add.2, double* %arrayidx5.2, align 16, !tbaa !0

And during scheduling we have:
Examining Available:
Examining Available:
Height 39: SU(9): 0x34c4db0: ch = STFD 0x34c49b0, 0x34c6ad0, 0x34a2bf0,
0x34c48b0:1<Mem:ST8[%scevgep18](align=16)(tbaa=!"double")> [ORD=126] [ID=9]

*** Scheduling [39]: SU(9): 0x34c4db0: ch = STFD 0x34c49b0, 0x34c6ad0,
0x34a2bf0, 0x34c48b0:1<Mem:ST8[%scevgep18](align=16)(tbaa=!"double")> [ORD=126]
[ID=9]
...
Examining Available:
Height 42: SU(55): 0x34c49b0: f64 = FADD 0x34c48b0, 0x34bb4f0 [ORD=122] [ID=55]

*** Scheduling [42]: SU(55): 0x34c49b0: f64 = FADD 0x34c48b0, 0x34bb4f0
[ORD=122] [ID=55]
...
Examining Available:
Height 48: SU(10): 0x34c48b0: f64,ch = LFD 0x34c6ad0, 0x34a2cf0,
0x34c46b0<Mem:LD8[%scevgep21](align=16)(tbaa=!"double")> [ORD=121] [ID=10]

*** Scheduling [48]: SU(10): 0x34c48b0: f64,ch = LFD 0x34c6ad0, 0x34a2cf0,
0x34c46b0<Mem:LD8[%scevgep21](align=16)(tbaa=!"double")> [ORD=121] [ID=10]
...

Is this an alias-analysis problem?
Quuxplusone commented 12 years ago
Is this caused by the TODO in the following comment in
CodeGen/ScheduleDAGInstrs.cpp (BuildSchedGraph)?

    // Add chain dependencies.
    // Chain dependencies used to enforce memory order should have
    // latency of 0 (except for true dependency of Store followed by
    // aliased Load... we estimate that with a single cycle of latency
    // assuming the hardware will bypass)
    // Note that isStoreToStackSlot and isLoadFromStackSLot are not usable
    // after stack slots are lowered to actual addresses.
    // TODO: Use an AliasAnalysis and do real alias-analysis queries, and
    // produce more precise dependence information.
Quuxplusone commented 12 years ago

It seems that the answer to that is no; that piece of code is not executed. The dependencies seem to be added because the SDNodes are like:

SU(6): 0x228add0: ch = STFD 0x228abd0, 0x228a7d0, 0x228d3a0, 0x228a9d0:1<Mem:ST8%arrayidx5(tbaa=!"double")> [ORD=6] [ID=6]

SU(15): 0x228ec70: i32 = LA 0x228ea70, 0x228e970 [ID=15]

SU(5): 0x228b1d0: f64,ch = LFD 0x228d4a0, 0x228ec70, 0x228add0<Mem:LD8%arrayidx.1> [ORD=9] [ID=5]

So the last operand of the load is the result of the store. Because the value type is MVT::Other, ScheduleDAGSDNodes::AddSchedEdges() considers this to be part of the critical chain. Why are the loads and stores (into independent arrays) connected in this way?

Quuxplusone commented 12 years ago
(In reply to comment #3)
> So the last operand of the load is the result of the store. Because the value
> type is MVT::Other, ScheduleDAGSDNodes::AddSchedEdges() considers this to be
> part of the critical chain. Why are the loads and stores (into independent
> arrays) connected in this way?

That's a known missed optimization.
Quuxplusone commented 12 years ago

I am working on a patch to SelectionDAGBuilder to correct this. I'll submit it for review soon.