History Domination: bug with PERP

Quincunx271 commented 4 years ago

When running cactus with PERP and APPLY_HISTORY_DOMINATION set to YES, the ApplyBndRobin:63 block has a spill cost of 4, when it should be 2.

sched.ini file follows. I set ApplyBndRobin as the only hot function so that it wouldn't waste time compiling the rest of the benchmark:

# Use optimizing scheduling
# YES
# NO : No scheduling is done.
# HOT_ONLY: Only use scheduler with hot functions.
USE_OPT_SCHED HOT_ONLY

# Print spill counts
# Same options as use optimal scheduling.
PRINT_SPILL_COUNTS YES

# Use two pass scheduling approach. 
# First pass minimizes RP and second pass tries to balances RP and ILP.
# YES
# NO
USE_TWO_PASS NO

# These 3 flags control which schedulers will be used.
# Each one can be individually toggled. The heuristic
# list scheduler or ACO must be run before the 
# enumerator.
# VALUES:
# YES
# NO
# HEUR_ENABLED is the Heuristic scheduler.
HEUR_ENABLED YES
# ACO_ENABLED is the Ant Colony Optimization scheduler.
ACO_ENABLED NO
# ENUM_ENABLED is the Branch and Bound scheduler.
ENUM_ENABLED YES

# Controls when ACO should be run, either before or
# after the enumerator. Both can be enabled at the
# same time. ACO is disabled if both are disabled.
# Run ACO before the enumerator
# VALUES:
# YES
# NO
ACO_BEFORE_ENUM YES
# Run ACO after the enumerator
ACO_AFTER_ENUM NO

# A time limit for the whole region (basic block) in milliseconds. Defaults to no limit.
# Interpretation depends on the TIMEOUT_PER setting.
# Not used when the two pass scheduling approach is enabled.
REGION_TIMEOUT 5

# A time limit for each schedule length in milliseconds. Defaults to no limit.
# Interpretation depends on the TIMEOUT_PER setting.
# Not used when the two pass scheduling approach is enabled.
LENGTH_TIMEOUT 5

# A time limit for the whole region in milliseconds. Defaults to no limit.
# Only used when two pass scheduling is enabled.
# A time limit for the whole region.
FIRST_PASS_REGION_TIMEOUT 5
# A time limit for each schedule length.
FIRST_PASS_LENGTH_TIMEOUT 5

# A time limit for the second pass in milliseconds.
# Only used when two pass scheduling is enabled.
# A time limit for the whole region.
SECOND_PASS_REGION_TIMEOUT 5
# A time limit for each schedule length.
SECOND_PASS_LENGTH_TIMEOUT 5

# How to interpret the timeout value? Valid options:
# INSTR : multiply the time limits in the above fields by the number of
# instructions in the block
# BLOCK : use the time limits in the above fields as is
TIMEOUT_PER INSTR

# The heuristic used for the list scheduler. Valid values are any combination of:
# CP: critical path
# LUC: last use count
# UC: use count
# SC: successor count
# NID: node ID
# LLVM: LLVM’s default list scheduler order
# Example: LUC_CP_NID
HEURISTIC LUC_CP_NID

# The heuristic used for the enumerator. If the two pass scheduling
# approach is enabled, then this value will be used for the first pass.
# Same valid values as HEURISTIC.
ENUM_HEURISTIC LUC_CP_NID

# The heuuristic used for the enumerator in the second pass in the two-pass scheduling approach.
# Same valid values as HEURISTIC.
SECOND_PASS_ENUM_HEURISTIC LUC_CP_NID

# The spill cost function to be used. Valid values are:
# PERP: peak excess reg pressure
# PRP: peak reg pressure
# SUM: sum of excess reg pressures across the block
# PEAK_PLUS_AVG: peak excess reg pressure plus the avg reg pressure across the block
# SLIL: sum of live interval lengths for each block
# SPILLS: number of spills after running a register allocator (doesn't work with enumerator)
# TARGET: use target specific register pressure tracking
SPILL_COST_FUNCTION PERP

# The weight of the spill cost in the objective function. This factor
# defines the importance of spill cost relative to schedule length. A good
# value for this factor should be found experimentally, but is is expected
# to be large on architectures with hardware scheduling like x86 (thus
# making spill cost minimization the primary objective) and smaller on
# architectures with in-order execution like SPARC (thus making scheduling
# the primary objective).
SPILL_COST_WEIGHT 10000

# Precision of latency info:
# PRECISE: use precise latencies from the machine_model.cfg file
# LLVM: use latencies from LLVM
# UNIT: use unit latencies
LATENCY_PRECISION UNIT

# The scheduler used to find an initial feasible schedule.
# LIST: List scheduler
# SEQ: Sequential list scheduler
HEUR_SCHED_TYPE LIST

#use 3-tournament
ACO_TOURNAMENT YES

#use fixd value for bias or not. If not, use ratio instaed
ACO_USE_FIXED_BIAS YES

#Fixed number of evaporation
ACO_FIXED_BIAS 10

# 0 to 1, ratio that will use bias
ACO_BIAS_RATIO 0.9

ACO_LOCAL_DECAY 0.1

ACO_DECAY_FACTOR 0.1

ACO_ANT_PER_ITERATION 10

ACO_TRACE NO

# The importance of the heuristic in ACO. ACO uses (1/heuristic)^importance, so
# importance of 0 means don't use the heuristic.
ACO_HEURISTIC_IMPORTANCE 1

# ACO will stop after this many iterations with no improvement.
ACO_STOP_ITERATIONS 50

# Whether LLVM mutations should be applyed to the DAG.
LLVM_MUTATIONS NO

# (Chris) If using the SLIL cost function, enabling this option
# will force the B&B scheduler to skip DAGs with zero PERP.
FILTER_BY_PERP NO

# If a register type has a MAX pressure below a certain threshold it is ignored.
FILTER_REGISTERS_TYPES_WITH_LOW_PRP NO

# (Chris) This setting chooses which blocks to keep and which blocks to discard.
# The scheduler will fall back to LLVM if the block is discarded.
# Valid options:
#   ALL: always take the block
#   IMPROVED: only take improved blocks
#   OPTIMAL: only take optimal blocks
#   IMPROVED_OR_OPTIMAL: only take improved or optimal blocks
#   ZERO_COST: only take zero-cost blocks
BLOCKS_TO_KEEP ALL

# (Chris) Override USE_OPT_SCHED to apply to specific regions.
# When SCHEDULE_SPECIFIC_REGIONS is set to YES, the scheduler
# will only schedule the regions specified by
# REGIONS_TO_SCHEDULE, which is a comma-separated list of
# scheduling regions.
SCHEDULE_SPECIFIC_REGIONS NO
REGIONS_TO_SCHEDULE fft1D_512:114

# Whether to use suffix concatenation. Disabled automatically if
# history domination is disabled.
ENABLE_SUFFIX_CONCATENATION NO

# Whether to apply the node superiority graph transformation.
STATIC_NODE_SUPERIORITY NO

# Whether to apply node superiority in multiple passes.
MULTI_PASS_NODE_SUPERIORITY NO

# Whether to apply relaxed pruning. Defaults to YES.
APPLY_RELAXED_PRUNING YES

# Whether to apply spill-cost pruning. Defaults to YES.
APPLY_SPILL_COST_PRUNING YES

# Whether to apply history-based domination. Defaults to YES.
APPLY_HISTORY_DOMINATION YES

# Use simple register types. In the machine scheduler this means
# use the first PSet associated with a RegUnit.
USE_SIMPLE_REGISTER_TYPES NO

# Should we simulate register allocation to the evaluate the effect
# of scheduling decisions on estimated spill count.
# BEST: Only simulate RA with the best (lowest cost) schedule.
# LIST: Only simulate RA with the list schedule.
# BOTH: Simulate RA using the best schedule and the list schedule.
# TAKE_SCHED_WITH_LEAST_SPILLS: Simulate RA using the best schedule and the list schedule, and
# take the schedule that generates the least spills.
# NO: Do not simulate register allocation.
SIMULATE_REGISTER_ALLOCATION BEST

# Should we ignore ilp and only schedule for register pressure.
SCHEDULE_FOR_RP_ONLY NO

# Whether to enumerate schedules containing stalls (no-op instructions).
# In certain cases, such as having unpipelined instructions, this may
# result in a better schedule. Defaults to YES.
ENUMERATE_STALLS YES

# Whether to generate missing parts of the machine model using information from LLVM.
# Requires a generator class for the target in machine_model.cfg.
GENERATE_MACHINE_MODEL NO

#The algorithm to use for determining the lower bound. Valid values are:
# RJ: Rim and Jain's algorithm.
# LC: Langevin and Cerny's algorithm.
# Defaults to LC.
LB_ALG LC

# Whether to verify that calculated schedules are optimal. Defaults to NO.
VERIFY_SCHEDULE YES

# Whether to apply dynamic node superiority. Defaults to NO.
DYNAMIC_NODE_SUPERIORITY NO

# An option to treat data dependencies of type ORDER as data dependencies.
TREAT_ORDER_DEPS_AS_DATA_DEPS NO

# The number of bits in the hash table used in history-based domination.
HIST_TABLE_HASH_BITS 16

The relevant part of the log (actual):

266647 INFO: ********** Opt Scheduling ********** (Time = 0 ms)
266648 INFO: --------------------------------------------------------------------------- (Time = 0 ms)
266649 INFO: Processing DAG ApplyBndRobin:63 with 43 insts and max latency 1. (Time = 0 ms)
266650 INFO: Lower bound of cost before scheduling: 430 (Time = 0 ms)
266651 INFO: The list schedule is of length 43 and spill cost 6. Tot cost = 60000 (Time = 0 ms)
266652 INFO: Enumerating at target length 43 (Time = 0 ms)
266653 INFO: Found a feasible sched. of length 43, spill cost 4 and tot cost 40000 (Time = 0 ms)
266654 INFO: DAG solved optimally in 39 ms with length=43, spill cost = 4, tot cost = 40000, cost imp=20000. (Time = 39 ms)
266655 INFO: Best schedule for DAG ApplyBndRobin:63 has cost 40000 and length 43. The schedule is optimal (Time = 39 ms)
266656 INFO: OPT_SCHED LOCAL RA: DAG Name: ApplyBndRobin:63 Number of spills: 5 (Time = 39 ms)
266657 INFO: Number of stores 5 (Time = 39 ms)
266658 INFO: Number of loads 0 (Time = 39 ms)
266659 INFO: Schedule verified successfully (Time = 39 ms)

The relevant part of the log after history domination is turned off (expected):

268587 INFO: --------------------------------------------------------------------------- (Time = 0 ms)
268588 INFO: Processing DAG ApplyBndRobin:63 with 43 insts and max latency 1. (Time = 0 ms)
268589 INFO: Lower bound of cost before scheduling: 430 (Time = 0 ms)
268590 INFO: The list schedule is of length 43 and spill cost 6. Tot cost = 60000 (Time = 0 ms)
268591 INFO: Enumerating at target length 43 (Time = 0 ms)
268592 INFO: Found a feasible sched. of length 43, spill cost 4 and tot cost 40000 (Time = 0 ms)
268593 INFO: Found a feasible sched. of length 43, spill cost 2 and tot cost 20000 (Time = 0 ms)
268594 INFO: DAG solved optimally in 70 ms with length=43, spill cost = 2, tot cost = 20000, cost imp=40000. (Time = 70 ms)
268595 INFO: Best schedule for DAG ApplyBndRobin:63 has cost 20000 and length 43. The schedule is optimal (Time = 70 ms)
268596 INFO: OPT_SCHED LOCAL RA: DAG Name: ApplyBndRobin:63 Number of spills: 5 (Time = 71 ms)
268597 INFO: Number of stores 5 (Time = 71 ms)
268598 INFO: Number of loads 0 (Time = 71 ms)
268599 INFO: Schedule verified successfully (Time = 71 ms)

vangthao95 commented 4 years ago

This bug also happens with the PRP spill cost function so there may be something wrong with history domination.

kerbowa commented 4 years ago

Probably has something to do with a898f8d75c74861888f5f3f3dc39dc068a89dfd6

You could try manually reverting that change. It would be nice to have compiler temps so that we can add a testcase for history domination once this is fixed. Also I wont be able to debug this myself without the IR.

Quincunx271 commented 4 years ago

Probably can't give compiler temps because of licensing for SPEC CPU2006. It may be worth replicating this with our own code.

CSUS-LLVM / OptSched

History Domination: bug with PERP #57