Closed rgokulsm closed 4 years ago
Hi @rgokulsm,
The "InitializeBranchDecorator" is used for dynamic trace generation, in that pass we decorate the branches with a N/NT sequences. Then the code generation process can generate a dynamic trace as if the code was actually executed. We use such decoration whenever we interact with a trace driven simulator , which require the dynamic trace of executed instructions. Similarly, instructions accessing to memory can also be decorated, so that the dynamic memory access pattern can be reproduced.
Going to your question, the control of branch behavior is a two step process:
Thanks Ramon! Okay I guess I sort of understand, or this is my interpretation: All the branches can be BEQ (for instance) and they can compare a reserved register to the zero register (if it exists). Then we would probably need only 1 reserved registers per branch stream. The value the register could potentially be written to before every branch. The value written to the register can be, for example, a sequence: 0 0 0 1 0 0 0 1, thus creating a T T T NT T T T NT stream.
I've added a new example in the RISCV examples folder showing how to control the branch pattern behavior using a combination of other passes. The possible knobs to control the pattern are the following:
Not sure how 'smart' are the branch predictors implemented in your environment, but I hope that with these settings one can generate patterns with a given % of miss-predictions.
The CI/CD pipeline is currently running and I assume the repositories will be updated in the following hours.
Repository and pip packages updated.
In my previous comment, I forgot to mention that a particular pattern (taken or not taken) will be determined by the conditional branch being executed. The code generated in the example I've implemented compares always one of the 2 predefined registers (one set to zero and another set to one) against the current branch value (which can be 0 or 1) that is modified on each iteration. So, there is not need to force the usage of a particular type of branch (e.g. BEQ), although the pattern generated will be different for each type of comparison.
Thanks Ramon. I will test it out and update. In order to be able to tune the patterns on the fly (I will probably focus on the global pattern), I believe I will have to play with 'char' in the code below:
if char == "0":
global_pattern_regs.append(self.target.isa.registers["X0"])
else:
global_pattern_regs.append(self.target.isa.registers["X8"])
Just as FYI if someone interested, prior literature (https://users.elis.ugent.be/~leeckhou/papers/iiswc06-joshi.pdf) has suggested controlling a) % of branch directions i.e. T/NT and b) transition rate i.e. number of times the pattern switches between T and NT.
I guess I will try to add some randomized functions which can control 'char' based on the parameters above.
Good point. The example I coded provides the low-level interface, where you can provide "01" sequences for global and local patterns. You can build an abstraction on top of that, i.e. given a % of transition rate, it gets translated to "01" pattern accordingly (although there are multiple "01" patterns for a given transition rate ;) ). I assume you want such higher level 'knob' as it is more easy to control...
Hi @rbertran,
Apologies for leaving this idle for long. I have some thoughts / questions below but I'm not sure if they are correct, please take a look.
I was looking through the example code again and I'm a bit confused whether register dependency distance (which is configured by the parameter passed to register.DefaultRegisterAllocationPass) will not be applicable to the branches since the branch has both its operands read from reserved registers.
In other words, I think in order to have branches respect the dependency distance of 'dd', one of its registers (say r1) should have been written to by some other instruction, 'dd' instructions earlier.
Is that correct? If so, I believe this may possible:
some instruction writes local branch pattern value (in some reserved register to rx) to r1
.
.
.
<dd instructions later>
.
.
branch if r1 == ry (where ry is the global branch pattern value)
Maybe this is already happening I'm not sure... there is a 'andi x7, x5, 1' which I am unable to understand, which maybe is doing the above?
On a related note, I think x8 should also be a reserved register in the code below
# X5 will be used to store the local branch pattern
# X6 will be used to store the loop count
# X7 will be used to store current branch
# X8 will be used to store constant 1 (we already have
# X0 with constant 0)
reserved_registers = ["X5", "X6", "X7"]
Thanks, let me know if this makes sense!
Not problem @rgokulsm , these are very good observations.
The register dependency does not hold for the branches because the operands of the branch instructions are already defined (by other passes) before the register.DefaultRegisterAllocationPass
pass is executed. In general, most of the passes, do not modify existing code (created by previous passes) and it is up to the user to put the passes in the correct order to create the target code. This way we avoid situations were a pass changes the code/intentions of previous passes.
Not sure how to force the dependency distance always across all instructions. We can do what you suggested, i.e. write the required value to a given register some DD
instruction before the branch using instruction Y
. But then, the problem will be moved to instruction Y
instead of the branch (i.e. instruction Y
with hard-coded operands to generate the required pattern will not respect the dependency distance w.r.t. previous instructions). Also forcing instruction Y
to be of certain type, will affect the final instruction distribution (if you intend to mimic one). So, not sure how to avoid this issue in an easy way...
One thing I have used in the past (need to check the current implementation status) is to compute the final dependency distance and instruction distribution to get some feedback about how close the final generated code respects what has been specified. I can take a look on that and provide some instructions if you are interested in that checking mechanism. Maybe this way if you want an average DD=2
but because of the # of branches you get a real average DD=3
, you might then want to specify a lower DD
in your pass to get the target real DD=2
. ... it kind of a workaround, and it might increase the test case generation because you might need some iterations to achieve the target, it can help to control better the final properties of the generated code.
Regarding the X8
, good catch! I'll fix for the next round of repo updates.
Hello Ramon,
Thanks for the detailed reply. At the moment, I guess I will not get into checking the 'DD' value - maybe that would be something to think about later. I will try to achieve the required branch prediction rate, for which I should have all the controls (I will update on this once I check it).
I'm adding some more thoughts below, just from a discussion standpoint:
I guess at a high level, something to think about would be - should branch misprediction rate be the important factor or should branch misprediction penalty be the important factor (i.e. in building the synthetic workload).
I am not too clear on this but I feel that the time it takes to resolve the branch prediction should ideally be influenced by DD, and thus should influence the performance. If the branch is DD independent, even with higher misprediction rate, the penalty caused might be lesser since it can resolve quickly, thus resulting in lesser impact on performance.
I agree the penalty of miss-prediction is going to be hidden as most branches can be executed earlier (no dependencies). In that case, creating a dependency chain for the branch inputs as you pointed out will help, That said, depending on the aggressiveness of the OoO pipeline of a particular architecture, just adding 1 instruction dependency chain might not be enough, and we might need to add multiple chained instructions.
Hi Ramon,
Are you sure the following works as intended:
# At the end of each iteration, update the current
# branch register based on loop count
p = initialization.AddFinalizationAssemblyPass(
"srl x7, x5, x6"
)
synth.add_pass(p)
SRL:
Description: Shifts a register value right by the shift amount (shamt) and places the value in the destination register. Zeroes are shifted in. Operation: $d = $t >> h;
I think we want to right shift $t and then put only the right most bit of $tinto $d?
So maybe it should be
# At the end of each iteration, update the current
# branch register based on loop count
p = initialization.AddFinalizationAssemblyPass(
"srl x7, x5, x6 \n" +
"andi x7, x7, 1"
)
synth.add_pass(p)
Let me know what you think, thanks!
Uops! Yes I missed that. I'll update the code accordingly.
Repo updated!
Hi Ramon,
Thanks! I've been running some tests - and it feels like this pattern might be easy for the branch predictor to learn (though I have not methodically tried to stress test, this is my intuition).
Assume I use only BEQ and BNE, every iteration the value of x7 either becomes 0 or 1 according to the update at the end of the iteration. If it moves from 0 -> 0, then all predictions remain same as previous iteration. If it moves from 0 -> 1, then all predictions flip. And both of these cases can become obvious to the predictor just after the 1st branch in the iteration..
I believe that is the reason my branch accuracy is always 99+%-ish. My hypothesis is that the local pattern is not really adding any complexity to the prediction and the global pattern is learnt in 1-2 iterations. Even with other branches like BGE / BLT, there is similar to the above.
I think a different 'pass' for branch prediction could be the following:
use BLT or BGE. For 1st operand have a random value (say register r1). For 2nd operand (have random value in a tunable number of registers eg: r2-only or r2, r3, r4).
2nd operand register values remain fixed. the global aspect is tuned as a knob by how many of these unique registers are used.
1st operand register value can change after every iteration (using some logical operation like an xor). Whether it should change or not is controlled by another register r8, whose initial value can be tuned as a knob. This would add a tunable local pattern aspect.
Let me know what you think, we can discuss more today during the meeting. Thanks!!
It is not easy to trick branch predictors and that the same time control the branch the pattern ;)
For local branches, the pattern repeats every 64 executions of the branch (or lower if the 01 pattern provided is shorter). So, if local history tables are > 64, the pattern will be 'learnt'. With the 'switch' flag, this local branch pattern becomes 128 long . So if local history tables > 128, the pattern will be learnt eventually.
Similarly, for global pattern, it is N long, where N is the min(number of branches in a loop, min(length of the 01 pattern provided, 64)). Assuming we have many branches and a long pattern, it will have a worst case of 64 bits long. Eventually, if the GHT are >64 bits long, the pattern will be learnt.
As you suggest, we can reserve more registers to store longer branch patterns (64 -> 128 -> ... long patterns) that will stress the predictors. Another way is to generate capacity miss-predictions, i.e. include enough # of branches so that all the prediction patterns can not be store simultaneously.
Without changing current code, we can explore the reason why we get such high level prediction accuracy. Can you try various experiments with the different predictors (global, local) enabled disabled, as well as with different branch history sizes? This will help us understand what we need to do... my guess is that we probably need to larger patterns and more branches (bigger stressmarks) to really stress the predictors.
There is another factor to take into account... not sure how the architecture the fact that taken and not taken branches end up to the same instruction, so effectively , there should not be a miss-prediction penalty. That is something we need to work on.
I just want to follow up based on our discussion to make sure we are on the same page. I'm going to extend the RISCV branch-related example with the following characteristics (tunable knobs):
DefaultRegisterDependency
pass will set them according to it DD. (this has the risk that eventually the data values might converge or not change over time, making the branch predictable). I believe that the probability of randomness, we let us generate the right amount of miss-predictions. Let me know if I'm missing anything or if we need other parameters.
Hello Ramon,
1) I assume that randomization is unique per-branch (and not per-iteration)? I believe it should be so, else the first branch of the iteration might set the trend for the entire iteration.
2) In the previous branch pass/example, we were only comparing 0 and 1. i.e. the branch operands have 0 or 1. If the branch is supposed to be randomized, do we still use only 0 and 1, but just that the randomization can potentially cause one operand to be flipped?
3) Options for generating randomness: As you have mentioned, I think the first option is better, since it would be enforcing the randomness.
4) If what I've described in (2) above is correct then what we maybe need is a register with an initial random value which keeps rotating and say its LS bit (which would be 0 or 1) is copied to the branch operand.
Thanks!
To provide a heads up, I'm been updating the example and due to some bugs in my old installation of the risc-v toolchain (need to be updated), the validation is taking a bit longer.
So far, the following changes have been made:
Sounds great. I will test it out whenever available, thanks!
Also going back to my previous comment, I think that the randomizing of the branch might benefit from a) using larger values in the branch operands and not just '1' and '0' and b) Doing randomization per-branch and not just per-iteration.
Not sure if that makes things too complex though.
Repository and distribution updated. I've modified the branch example to provide examples of the new capabilities (BranchBraidNextPass
) and the new RandomizeBranchNext
pass. Support for adding extra randomization code has not been implemented yet, but based on my observations based only on random input data (note the new pass to initialize the register to random values) and ensuring the inputs of branches do not converge (add some random value at the end of each iteration), I was able to generate completely random patterns. I'll add the extra support needed, but I think that currently it is already possible to control the randomness of the branches.
Hey @rbertran - I am able to use both the passes but have a couple of questions wrt the randomization.
1)
ADDI x8, x0, 0x0
SLLI x8, x8, 32
LUI x31, 0
ADDIW x31, x31, 0x0
SLLI x31, x31, 8
ADDI x31, x31, 0xe3
SLLI x31, x31, 8
ADDI x31, x31, 0xd6
SLLI x31, x31, 8
ADDI x31, x31, 0xe4
SLLI x31, x31, 8
ADDI x31, x31, 0xb9
OR x8, x31, x8
SLLI x8, x8, 32
LUI x31, 0
ADDIW x31, x31, 0x0
SLLI x31, x31, 8
ADDI x31, x31, 0xd9
SLLI x31, x31, 8
ADDI x31, x31, 0x6e
SLLI x31, x31, 8
ADDI x31, x31, 0x18
SLLI x31, x31, 8
ADDI x31, x31, 0x2d
OR x8, x31, x8
ADDI x9, x0, 0x0
SLLI x9, x9, 32
LUI x31, 0
ADDIW x31, x31, 0x0
SLLI x31, x31, 8
ADDI x31, x31, 0xe3
SLLI x31, x31, 8
ADDI x31, x31, 0xd6
SLLI x31, x31, 8
ADDI x31, x31, 0xe4
SLLI x31, x31, 8
ADDI x31, x31, 0xb9
OR x9, x31, x9
SLLI x9, x9, 32
LUI x31, 0
ADDIW x31, x31, 0x0
SLLI x31, x31, 8
ADDI x31, x31, 0xd9
SLLI x31, x31, 8
ADDI x31, x31, 0x6e
SLLI x31, x31, 8
ADDI x31, x31, 0x18
SLLI x31, x31, 8
ADDI x31, x31, 0x2d
OR x9, x31, x9
This is a code snippet from the beginning of the code. Are all the registers (like x8 and x9) initialized with the same value? It seems like the same operations are being performed on x8 and x9 in the code snippet above.
If so, is that the goal or should all of them be initialized differently?
2)
ADDI x0, x0, 0x14d
ADDI x5, x5, 0x154
ADDI x6, x6, 0x1bc
ADDI x7, x7, 0x147
ADDI x8, x8, 0x5f
ADDI x9, x9, 0x24
ADDI x10, x10, 0x1b0
ADDI x11, x11, 0x17c
ADDI x12, x12, 0xf
ADDI x13, x13, 0x40
ADDI x14, x14, 0x15d
At the end of the iteration this addi is performed. I assume this is to change the value of the registers over iterations. Is that correct? If so, maybe this can be changed to some other bitwise operation? Since we are use BGE, with ADDI might just mean that for first N iterations it might be a Taken, and after that it might just be a Taken. (Though I am not sure if this is correct because I don't know exactly what happens to the 1st operand of the BGE or what changes happen to it over iterations).
Thanks!
Edit: The addi might also be restricted by the fact that the max addition is 7ff.
To add to the above, I used 100% BGE (i.e. N=1) and end up with less than 0.1% branch misprediction. I looked at the branch predictor's predictions. Interestingly over 90% of the branches are 'Taken'. Maybe this relates to my previous comment.
RNDINT()
as the initial value, which call RNDINT and the initializes all the registers to the same random value. The pass should provide a callable
object, i.e. RNDINT
(without parenthesis) so that each register is initialized with a different random value. I've updated the example accordingly. I'm polishing the code and the example, and the fixes will be committed in the following hours. The only remaining item, will be ensuring dependency distance. Now any necessary code (a single 'SLLI' instruction) is added at the beginning of the basic block containing the random branch.
Sounds awesome, will try it out once available. Thanks!
Repo and wheel packages updated. Try it out and let's see still have some issue pending.
branch.RandomizeByTypePass(
branch_instrs, # instructions to replace
self.target.isa.instructions['BGE_V0'], # new instruction
self.args.random_branch, # every N instructions
"slli @@BRREG@@, @@REG@@, @@COUNT@@", # randomization code
# to add before branch
distance=3, # distance between randomization code and branch
musage=62, # max. time @@REG@@ can be used before reset
reset=('rnd', (-0x7ff, 0x7ff)) # add to branch reg every iteration
),
Hello @rbertran,
In the above code, should ideally only the value of N ( self.args.random_branch) be varied?
What about distance and musage? Also the example in riscv_branch.py has musage=62. Is that intentional?
Thanks!
Happy to report that by varying the value of N from 1 to 100, I am able to get branch misprediction rates ranging from ~50% to <1%. I think this is perfect!
I only run into 1 issue which occurs sometimes - I get the following error when I try to compile:
mkdir -p riscv_flex
make -C riscv_flex -f /research/sgokul/MicroProbe_Dev/microprobe/targets/riscv/examples/build_flex/../riscv_flex/Makefile abs_top_srcdir=/research/sgokul/MicroProbe_Dev/microprobe/targets/riscv/examples/build_flex/.. XLEN=64 src_dir=/research/sgokul/MicroProbe_Dev/microprobe/targets/riscv/examples/build_flex/../riscv_flex RISCVTOOLS=/research/sgokul/RISCV/riscv-tools
make[1]: Entering directory '/research/sgokul/MicroProbe_Dev/microprobe/targets/riscv/examples/build_flex/riscv_flex'
riscv64-unknown-elf-gcc -static -mcmodel=medany -fvisibility=hidden -nostdlib -nostartfiles -I/research/sgokul/MicroProbe_Dev/microprobe/targets/riscv/examples/build_flex/.. -I/research/sgokul/RISCV/riscv-tools -I/research/sgokul/RISCV/riscv-tools/riscv-tests/env/p -T/research/sgokul/RISCV/riscv-tools/riscv-tests/env/p/link.ld -o riscv_new-p-gokul_12345 /research/sgokul/MicroProbe_Dev/microprobe/targets/riscv/examples/build_flex/../riscv_flex/gokul_12345.S
/tmp/ccsdrFlo.o: In function `stream1_pcrel_10_braid':
(.text.init+0x1a34): relocation truncated to fit: R_RISCV_JAL against `*UND*'
/research/sgokul/BOOM/boom-template/lib/gcc/riscv64-unknown-elf/7.2.0/../../../../riscv64-unknown-elf/bin/ld: final link failed: Symbol needs debug section which does not exist
collect2: error: ld returned 1 exit status
/research/sgokul/MicroProbe_Dev/microprobe/targets/riscv/examples/build_flex/../riscv_flex/Makefile:30: recipe for target 'riscv_new-p-gokul_12345' failed
make[1]: *** [riscv_new-p-gokul_12345] Error 1
make[1]: Leaving directory '/research/sgokul/MicroProbe_Dev/microprobe/targets/riscv/examples/build_flex/riscv_flex'
Makefile:17: recipe for target 'riscv_flex' failed
make: *** [riscv_flex] Error 2
I am using the following passes:
passes = [
structure.SimpleBuildingBlockPass(self.args.loop_size),
initialization.ReserveRegistersPass(reserved_registers), #new
instruction.SetInstructionTypeByProfilePass(thisdict),
initialization.InitializeRegistersPass(value=RNDINT,fp_value=RNDINT),
initialization.InitializeRegisterPass("X1",1), #new - global pattern option #1 (other option is X0)
initialization.InitializeRegisterPass("X2", int(local_pattern, 2)),#entire local pattern is held here
initialization.InitializeRegisterPass("X3", 0),#loop count
initialization.AddInitializationAssemblyPass("andi x4, x2, 1"), #x4 starts with lowermost bit of x2
instruction.SetInstructionOperandsByOpcodePass(branch_instrs_names, 1, self.target.isa.registers["X4"]), #new - Operand 1 of all branches will bw x4 - this is local
instruction.SetInstructionOperandsByOpcodePass(branch_instrs_names, 2, global_pattern_regs), #new - Operand 2 of all branches will be X0/X1 - this is global
# Randomize branches for BGT 0 every N branches
branch.RandomizeByTypePass(
branch_instrs, # instructions to replace
self.target.isa.instructions['BGE_V0'], # new instruction
100, # every N instructions
"slli @@BRREG@@, @@REG@@, @@COUNT@@", # randomization code
# to add before branch
distance=3, # distance between randomization code and branch
musage=32, # max. time @@REG@@ can be used before reset
reset=('rnd', (-0x7ff, 0x7ff)) # add to branch reg every iteration
),
initialization.AddFinalizationAssemblyPass(
"addi x3, x3, 1 \n" + # Updating loop count
# Compare and reset based on pattern length
"slti x4, x3, %d\n" % min(64, len(local_pattern)) + #x4 is used as a temp variable here
"bne x4, x0, 8 \n" + #
"addi x3, x0, 0 \n" + # Reset to zero
"addi x0, x0, 0" # nop
),
initialization.AddFinalizationAssemblyPass(
"srl x4, x2, x3\n"+
"andi x4, x4, 1"
), #put new value into x4
memory.GenericMemoryStreamsPass([[1, 64*1024, 1, 128, 1, 0,(1,0)]]),
register.DefaultRegisterAllocationPass(dd=6),
address.UpdateInstructionAddressesPass(),
branch.BranchBraidNextPass()
It appears independent of the value of N I set. I get error scenario as well as correct scenarios on different trails for the same N value. Maybe this is affected by the random numbers used in some instructions?
I will try to debug more and update. Thanks!
~50% misspredictions, great! answering you questions:
callable
object that is called for each branch and whenever the call returns True
, the branch becomes random
. Currently, with the integer setting the granularity is quite limited since one can only set 100%, 50%, 33%, 25%, 20% ... (every 1, 2, 3,4, 5, ...) of the branches to be random. With the callable object any % can be implemented. shift left
instruction, up to 63 times can be used a particular register before becoming a 0
value (64 bit wide reg). So, any value below will increase the number of register used (if needed) and probably increase the randomness, but I don't think it will affect the branch pattern outcome too much. Values above 63 will lead to zero after the 63rd time a register is used and therefore the following branches will not be random. 62 is just the example that minimizes register usage (63 will be better ;) ), any lower value should be ok, but depending on the amount of random branches you'll hit errors due to lack of free registers during code generation. riscv-toolchain
repo. I just implemented the better control of the branch randomization probability (repo updated). If a value N >1
is provided, 1
every N
branches are going to be randomized (prior behavior). If a fraction N
between in the [0,1]
is provided, then each branch will have a probability N
to be randomized. E.g. if 0.22
is provided and there are enough branches in the generated code, then ~22% of them will be randomized.
I'm still facing the linker error from time to time without an apparent reason from the assembly point of view. I'm still debugging that issue....
A bit more light on the linker problem . The current linker script (/research/sgokul/RISCV/riscv-tools/riscv-tests/env/p/link.ld
in your setup), contains the following:
OUTPUT_ARCH( "riscv" )
ENTRY(_start)
SECTIONS
{
. = 0x80000000;
.text.init : { *(.text.init) }
. = ALIGN(0x1000);
.tohost : { *(.tohost) }
. = ALIGN(0x1000);
.text : { *(.text) }
. = ALIGN(0x1000);
.data : { *(.data) }
.bss : { *(.bss) }
_end = .;
}
If one replaces the . = 0x80000000;
(initial address) to be . = 0x00000000
, (or any displacement below 0x100000
) the linker problem goes away. That said, I'm not sure how that affects the rest of your toolchain. In my particular case, the change breaks the interface with the spike
simulatior, but it might be ok for your setup.
This is just a workaround, but I'm still trying to understand the real bug
that we are hitting. If there is a problem with the displacement, we'd hit it in all the cases or cases above a particular loop size, but it seems that we hit the issue in particular loop size .... but not on the adjacent ones, which is quite weird... so probably has something to do with the code Microprobe generates.
I found the issue: it was in the example generation policy. If the last instruction of the sequence generated was a branch, it was not set. I enable the force
mode, and the problem is gone. Check out the new example code once I upload it in the next minutes. You can forget my previous comment ;)
Updating my comment since I didn't see the above:
Awesome - I will try it out.. Since I usually launch the microprobe based generation 100s of times in my synthetic workload analysis, I will verify that the issue is never hit. Thanks!
Not hitting any errors over multiple runs, so things look sorted out. Closing the issue, thanks!
Hello @rbertran,
Can you provide some info on how to control the branch taken / not-taken patterns. I see that there is an "InitializeBranchDecorator" pass which deals with T / NT but I'm not sure about what it does, and if its the right one how to use it. Thanks!