Implement Branch Prediction Unit and introduce it into pipeline

pavelkryukov commented 7 years ago

Let's start with #36

pavelkryukov commented 7 years ago

Regarding the task per se:

Currently we have Cache_Tag_Array class which implements cache indexing & replacement functionality. As a first step, you have to extend it as BP class which can contain some data. It should be something like:

class BPEntry {
    // ...
public:
    bool get_prediction() const;
    uint64 get_target() const;
};

class BP {
    Cache_Tag_Array tags;
    std;:vector<std::vector<BPEntry>> data;
public:
    Cache(unsigned int size_in_entries,
               unsigned int ways);
};

Then, you have to add lookup. If there is a cache miss, you fetch PC + 4. If there is a cache hit and prediction is true, you fetch get_target(). Prediction information is attached to FuncInstr's internal fields.

pavelkryukov commented 7 years ago

Then, you need to model Branch Misprediction Unit (I don't think you need a C++ class here).

It should reside on "M" stage and track all branches with incorrect prediction. If such branch is detected, pipeline flush is performed (the next instruction is ignored by BMU, all values from ports are invalidated), PC is updated with correct value and BP is updated with new information.

gkorepanov commented 7 years ago

Then, you have to add lookup. If there is a cache miss, you fetch PC + 4. If there is a cache hit and prediction is true, you fetch get_target()

Where should I place that lookup? In our scheme, there is a mutex which is controlled by BMU and BTB. I have to insert that functionality into fetch stage, haven't I?

gkorepanov commented 7 years ago

Branch Misprediction Unit (I don't think you need a C++ class here)

Again, should it be implemented within fetch stage of perf_sim?

pavelkryukov commented 7 years ago

I have to insert that functionality into fetch stage, haven't I?

Right.

Again, should it be implemented within fetch stage of perf_sim?

It should reside on "M" stage

gkorepanov commented 7 years ago

Are you satisfied with the current state of Cache_Tag_Array? Are there any problems or lack of functionality to be comprised of by BTB?

gkorepanov commented 7 years ago

It should reside on "M" stage

Sorry, I meant it exactly. Thanks.

pavelkryukov commented 7 years ago

Are you satisfied with the current state of Cache_Tag_Array? Are there any problems or lack of functionality to be comprised of by BTB?

I'm leaving this to you.

gkorepanov commented 7 years ago

I have Implemented the BPU but haven't introduced it to pipeline yet. I believe I will finish the work tomorrow evening or the day after tomorrow morning.

pavelkryukov commented 7 years ago

Let's align on prediction modes:

1-bit predictor (taken/no-taken)
2-bit predictor (W/S T/NT)
Multilevel predictor

Also, please add ability to get non-dynamic predictions:

Predict always taken (if target exists)
Predict taken if target is less than PC (backward jumps)

There should be command line options to select one of these predictors and its size: --bp_mode=multilevel --bp_size=32 --bp_ways=2

pavelkryukov commented 7 years ago

Also, branch prediction accuracy should be printed in the end of simulation (from 0% to 100%)

gkorepanov commented 7 years ago

While introducing the BP into the pipeline I had to make quite substantial changes in the performance simulator. I hope to finish the main work tonight or tomorrow morning.

gkorepanov commented 7 years ago

Progress:

[x] Implement BPU
[x] Write simple unit tests
[ ] Split BPEntry to different realisations
[x] Reorganise pipeline
[x] Compile new pipeline with inserted PB
[ ] Test the new pipeline
[ ] Add ability to get non-dynamic predictions
[x] Add command line options
[ ] Track branch prediction accuracy
[ ] Generate the stress test
[ ] Run the test with different parameters

gkorepanov commented 7 years ago

Finished the main part. Now compiling and testing.

gkorepanov commented 7 years ago

@pavelkryukov, I probably need your help to proceed, as I have some principal problem with understanding of how to make stalls.

Firstly, I had to introduce the BPU into the pipeline. That required more information to be send through latches (the predicted target, not taken target, predicted destination and the instruction itself), so I had to create some structs with this information, and the structure of information had to be the same for neighbouring stages.

So I've split the stages into different classes, provided latches classes holding the structure of information to be sent between stages and ports references.
That was quite straight-forward, as I just kept the initial way of operation for all the stages. But I tried to bound the code to the way the real processor work. So I found it very strange to update PC in the decode stage, as it has to be done earlier, at least on fetch stage with help of bpu.
While testing, I found out, why the PC was updated on the decode stage. When the data hazard is detected (on the decode stage), the stall is sent via DECODE_2_FETCH_STALL port to fetch stage. The problem is that by that time (on the same cycle) the subsequent instruction (the instruction fetched right after the stalling instruction) is already written to FETCH_TO_DECODE port.

On the next cycle fetch stage stalls, so does the decode stage, and the information within that port is irreversibly lost.
That is what surprised me: the data might be read from the port only at the exact moment (when cycle = cycle_written + latency) , after that it is lost. That is not the way D-latches operate, as far as I understand: they keep the data written to them until the write_enable is triggered and the info is updated.
It would be quite easy to keep information in the ports till the moment it is needed, and implement a write enable flag in them. But I guess there is some reason why it is done differently.

So I will be grateful for explanation on why the ports are operating in that way, and whether I'm free to change it or there is some other way round this problem.

I have pushed all the code to BPU tree, it's available here (it fails now to operate correctly when the data hazard is detected, due to the above-mentioned problem.

P.S. Also it's quite puzzling me that we have special ports for stalling. They operate the opposite way the main lathes (IF/ID, ID/EX...) do. There are no similar ports in "real" processor we study, as far as I understand. But at least that works flawlessly, and allows to delay the pipeline (for now, only the fetch stage, in fact).

pavelkryukov commented 7 years ago

Firstly, I had to introduce the BPU into the pipeline. That required more information to be send through latches (the predicted target, not taken target, predicted destination and the instruction itself), so I had to create some structs with this information, and the structure of information had to be the same for neighbouring stages.

After decoder, information should be kept in a struct inside FuncInstr

So I will be grateful for explanation on why the ports are operating in that way, and whether I'm free to change it or there is some other way round this problem.

No. Ports are widely used concept, I don't see any reason one should change them.

On the next cycle fetch stage stalls, so does the decode stage, and the information within that port is irreversibly lost.

So why not to keep the data in special field? In previous implementation there was a field decode_data. If there was new data from F-2-D port, it would be overwritten, otherwise it is kept the same.

pavelkryukov commented 7 years ago

You might find this article useful: http://people.csail.mit.edu/emer/papers/2002.02.computer.asim.pdf

gkorepanov commented 7 years ago

So why not to keep the data in special field? In previous implementation there was a field decode_data.

That is clear, that field was needed to keep the stalling instruction till the writeback writes sources for it. But the instruction right after stalling one would have been lost unless the PC is not updated in decode stage.

gkorepanov commented 7 years ago

After decoder, information should be kept in a struct inside FuncInstr

Why the prediction information should be kept inside FuncInstr? What this information has to do with functional simulation?

gkorepanov commented 7 years ago

You might find this article useful: http://people.csail.mit.edu/emer/papers/2002.02.computer.asim.pdf

I will read it, thank you. Seems the information is strongly related.

pavelkryukov commented 7 years ago

I still don't see an issue

F, cycle -1: PC = 0 is sent to D D, cycle -1: PC = -4 is decoded E, cycle -1: PC = -8 is executed

F, cycle 0: updating PC := 4, PC = 4 is sent to D D, cycle 0: PC = 0; stall detected, stall signal sent, current instruction is saved to "decode_data" E: cycle 0: PC = -4 is executed

F, cycle 1; stall is received, no PC update, PC = 4 is sent to D D, cycle 1: as stall was sent on previous cycle, instruction PC = 0 is decoded from "decode_data", stall signal sent E: bubble

F, cycle 2: stall is received, no PC update, PC = 4 is sent to D D, cycle 2: PC = 0 is decoded, unstall, unstall signal sent E: bubble

F, cycle 3: updating PC := 8, PC = 8 is sent to D D, cycle 3, PC = 4 is decoded E cycle 3, PC = 0 is executed

pavelkryukov commented 7 years ago

Why the prediction information should be kept inside FuncInstr? What this information has to do with functional simulation?

Well FuncInstr is just a bad name for class, it should be Instr. Maybe we will rename it someday...

gkorepanov commented 7 years ago

Well FuncInstr is just a bad name for class, it should be Instr. Maybe we will rename it someday...

Got it now. Then I should probably get rid of Latches classes, as they are too heavy to keep just one structure:

struct Data {
            bool predicted_taken;    // Predicted direction
            addr_t predicted_target; // PC, predicted by BPU
            addr_t PC;               // current PC
            uint32 raw;              // fetched instruction code
        };

and the ports might be held in PerfMIPS itself

pavelkryukov commented 7 years ago

Looks good (it should be FetchData, though)

gkorepanov commented 7 years ago

F, cycle -1: PC = 0 is sent to D D, cycle -1: PC = -4 is decoded E, cycle -1: PC = -8 is executed

F, cycle 0: updating PC := 4, PC = 4 is sent to D D, cycle 0: PC = 0; stall detected, stall signal sent, current instruction is saved to "decode_data" E: cycle 0: PC = -4 is executed

F, cycle 1; stall is received, no PC update, PC = 4 is sent to D D, cycle 1: as stall was sent on previous cycle, instruction PC = 0 is decoded from "decode_data", stall signal sent E: bubble

F, cycle 2: stall is received, no PC update, PC = 4 is sent to D D, cycle 2: PC = 0 is decoded, unstall, unstall signal sent E: bubble

F, cycle 3: updating PC := 8, PC = 8 is sent to D D, cycle 3, PC = 4 is decoded E cycle 3, PC = 0 is executed

That's is exactly how it works.

Now I understand what I have missed: even though fetch received a stall, it sends PC = 4 again. So the previous dispatch (PC = 4) is lost, but the next (again, PC = 4) is delivered with no problem.

gkorepanov commented 7 years ago

Looks good (it should be FetchData, though)

Name is just copied from my code, where the Data strict is a part of IF/ID latch

gkorepanov commented 7 years ago

Thank you very much, now the problem is over. And there is one more question, is it OK to flush ports as it is done now in the code:

/* branch misprediction unit */
    if ( (actually_taken != data.predicted_taken) ||
         ( (actually_taken == data.predicted_taken) &&
           (real_target    != data.predicted_target)))
    {
        /* flushing the pipeline */

        /* flushing data ports */
        sim.if_id.read_data_port->flush();
        sim.id_ex.read_data_port->flush();
        sim.ex_mem.read_data_port->flush();

        /* flushing stall ports */
        sim.if_id.read_stall_port->flush();
        sim.id_ex.read_stall_port->flush();
        sim.ex_mem.read_stall_port->flush();
        sim.mem_wb.read_stall_port->flush();

        /* updating the PC with actual value */
        sim.PC = real_target;

        /* delete internal data from stages if there was some */
        sim.decode_stage.flushData();

        sout << RED << "misprediction\n" << DCOLOR;
        return;
    }

gkorepanov commented 7 years ago

I should provide that the pipeline is cleared and no data will be read form ports next cycle. So it seemed to me the the most clear approach would be just to flush the ports. I'm wondering is it a good way or something else should be done like introducing the flags in the stages like data_is_valid (I disliked the latter approach as it would be hard to take everything in account when ports latencies will be different from 1 and the pipeline will be non-unified)

pavelkryukov commented 7 years ago

And there is one more question, is it OK to flush ports as it is done now in the code:

No, it is completely incorrect. The basic idea of port is to be easily trackable and understandable — with these flushes nobody will ever understand what, when, and where had destroyed the data.

Please, do not try to reinvent the ~~wheel~~ clock-precise simulation infrastructure, or consult me first at least. Even if you found a genius solution, please do not use it in this simulator — as it is used for educating students who will work in real projects in Intel and everywhere else with common worldwide accept infrastructure.

pavelkryukov commented 7 years ago

(BTW, the correct way to flush pipeline is to send new "flush" signal to each pre-execution stage via dedicated ports)

gkorepanov commented 7 years ago

Please, do not try to reinvent the wheel clock-precise simulation infrastructure, or consult me first at least. Even if you found a genius solution, please do not use it in this simulator — as it is used for educating students who will work in real projects in Intel and everywhere else with common worldwide accept infrastructure.

I do understand that there is a commonly accepted infrastructure, but unfortunately I'm not acquainted with it. In that particular case I just tried to find a way to flush the pipeline, and I didn't think asking thousands of questions and consequently wasting your time would have been a good idea. So I just done it in some way -- I don't think it was a completely useless experience -- and then asked you whether it was a good approach.

gkorepanov commented 7 years ago

(BTW, the correct way to flush pipeline is to send new "flush" signal to each pre-execution stage via dedicated ports)

Well, all of those stages will flush their data, but there is some data unread in ports, and those stages will inevitably receive it, how to handle this?

pavelkryukov commented 7 years ago

Well, all of those will flush their data, but there is some data unread in ports, and those stages will inevitably receive it, how to handle this?

Read the data and ignore it

gkorepanov commented 7 years ago

And if the latency was greater than 1? Just wait in a cycle till there is some first data and ignore it? Then how I would understand that there probably was not any data at all (there was a bubble on the previous stage, for example) and the data came shouldn't be ignored?

pavelkryukov commented 7 years ago

I do understand that there is a commonly accepted infrastructure, but unfortunately I'm not acquainted with it. In that particular case I just tried to find a way to flush the pipeline, and I didn't think asking thousands of questions and consequently wasting your time would have been a good idea.

Well, I am here to show how to do right things right, it is my duty to answer the questions.

gkorepanov commented 7 years ago

Thank you very much, I will consult with you next time first so as not to waste the time.

pavelkryukov commented 7 years ago

And if the latency was greater than 1? Just wait in a cycle till there is some first data and ignore it? Then how I would understand that there probably was not any data at all (the bubble was executed, for example) and the data came shouldn't be ignored?

In cycle 0, all modules read the data from all datapath ports and do not forward it further, so you have bubbles pushed to the next stages, right? In cycle 1, there is no data in the pipeline.

gkorepanov commented 7 years ago

All modules read the data from all datapath ports and do not forward it further, so you have bubbles pushed to the next stages, right?

Actually for now we don't have bubbles pushed to the next stages, we have nothing pushed to the next stages, that's how next stages understand they have to do nothing as well. You mean, we actually should forward further the data containing the NOP instruction or something like this?

pavelkryukov commented 7 years ago

You mean, we actually should forward further the data containing the NOP instruction or something like this?

No, I meant "no-data" case.

gkorepanov commented 7 years ago

Well, and what is the answer to the initial question then? How we can determine whether there was no data at all or it is just the latency what prevents us from getting the data from the port?

pavelkryukov commented 7 years ago

How we can determine whether there was no data at all or it is just the latency what prevents us from getting the data from the port?

I don't understand why we may need to distinguish between these two cases...

gkorepanov commented 7 years ago

I'm trying now to finish my work by writing the correct flushing code, as you suggested

(BTW, the correct way to flush pipeline is to send new "flush" signal to each pre-execution stage via dedicated ports)

I wondered, what is to be done with data left inside the ports. We should simply

Read the data and ignore it

The problem is as follows. The stage after receiving the "flush" signal should read some data. We might fall into 3 cases:

The data is immediately read and ignored.
The data is not read immediately, so stage is waiting for it and after eventually reading it ignores.
There is no data in a port at all. We wait for something, and finally get a proper instruction which has gone through pipeline already after the flush, and we ignore it thinking it's an illegal one which is to be flushed.

pavelkryukov commented 7 years ago

Please point where the issue is:

         F    | D    | E    | M | W
==================================
cycle 0: e    | d    | c    | b | a
cycle 1: f    | e    | d(!) | c | b — flush signal is sent to all previous stages, new address sent to F
cycle 2: x(*) | f(*) | e(*) | d | c — "f" and "e" are not sent to the next stages, pushing "x" instead of "g"
cycle 3: y    | x    | 0    | 0 | d — E and M receive nothing
cycle 4: z    | y(!) | x    | 0 | 0 — M and W receive nothing, "y" is stalled due to dependency and is not pushed
cycle 5: z(^) | y(!) | 0    | x | 0 — E receives nothing
cycle 6: z(^) | y(!) | 0    | 0 | x
cycle 7: z(^) | y    | 0    | 0 | 0
cycle 8: w    | z    | y    | 0 | 0

(!) — special singal sent
(*) — flush received
(^) — stall received

pavelkryukov commented 7 years ago

BTW, you will have problems with data dependencies, if "x" "depends" on "e". So you have to set valid bit for destination for canceled instructions.

gkorepanov commented 7 years ago

Please point where the issue is:

There is no issue in the suggested example, as there is a fixed stage duration and fixed ports latency, so everything is known beforehand.

But that is not right for non-unified pipeline or ports with greater latencies. Or we shouldn't assume ports might have latencies different from 1 for now?

gkorepanov commented 7 years ago

BTW, you will have problems with data dependencies, if "x" "depends" on "e". So you have to set valid bit for destination for canceled instructions.

Wow, thank you for the remark. It's the thing I missed completely while thinking over the structure.

gkorepanov commented 7 years ago

Is it OK to keep PC just as a member of PerfMIPS class? Or it is better to be implemented through ports: the fetch stage has a port from itself to itself, and there is a port from memory stage to fetch stage? (I'm wondering because actually PC is a latch in our schemes)

gkorepanov commented 7 years ago

Deducing from your description

flush signal is sent to all previous stages, new address sent to F

I think the best way will be to keep PC in fetch stage, and right PC is sent by memory stage through special port.

pavelkryukov commented 7 years ago

Is it OK to keep PC just as a member of PerfMIPS class?

Yes. It is convinient to get it from debugger, for example.

MIPT-ILab / mipt-mips

Implement Branch Prediction Unit and introduce it into pipeline #19