Implement Branch Prediction Unit and introduce it into pipeline

gkorepanov commented 7 years ago

So, it's better, than

to keep PC in fetch stage, and right PC is sent by memory stage through special port.

?

pavelkryukov commented 7 years ago

I see no difference as we don't have separate modules for now

gkorepanov commented 7 years ago

I still don't see an issue

F, cycle -1: PC = 0 is sent to D D, cycle -1: PC = -4 is decoded E, cycle -1: PC = -8 is executed

F, cycle 0: updating PC := 4, PC = 4 is sent to D D, cycle 0: PC = 0; stall detected, stall signal sent, current instruction is saved to "decode_data" E: cycle 0: PC = -4 is executed

F, cycle 1; stall is received, no PC update, PC = 4 is sent to D D, cycle 1: as stall was sent on previous cycle, instruction PC = 0 is decoded from "decode_data", stall signal sent E: bubble

F, cycle 2: stall is received, no PC update, PC = 4 is sent to D D, cycle 2: PC = 0 is decoded, unstall, unstall signal sent E: bubble

F, cycle 3: updating PC := 8, PC = 8 is sent to D D, cycle 3, PC = 4 is decoded E cycle 3, PC = 0 is executed

Now I understand what puzzles me. I the given example, the fetch stage acts in the following way:

If stall signal is not received, update PC
Fetch an instruction, instr_code = mem[PC], and send it to decode

With BPU enabled, the behaviour is slightly different, as far as I understand from our presentation about BPU:

If flush signal received, update PC with new value got from memory stage
Fetch an instruction, instr_code = mem[PC], and send it to decode
If If stall signal is not received, update a PC (acquire new value from BPU), so that it will be fetched in next cycle.

I don't see the way how to change the latter behaviour to make it work. Now it executes in the following way:

F, cycle -1: PC = 0 is sent to D, updating PC := 4
D, cycle -1: PC = -4 is decoded
E, cycle -1: PC = -8 is executed

F, cycle 0: PC = 4 is sent to D, updating PC := 8
D, cycle 0: PC = 0; stall detected, stall signal sent, current instruction is saved to "decode_data"
E: cycle 0: PC = -4 is executed

F, cycle 1; stall is received, no PC update, PC = 8 is sent to D
D, cycle 1: as stall was sent on previous cycle, instruction PC = 0 is decoded from "decode_data", stall signal sent
E: bubble

F, cycle 2: stall is received, no PC update, PC = 8 is sent to D
D, cycle 2: PC = 0 is decoded, unstall, unstall signal sent
E: bubble

F, cycle 3: PC = 8 is sent, updating PC := 12
D, cycle 3, PC = 8 is decoded (PC = 4 is LOST!)
E cycle 3, PC = 0 is executed

pavelkryukov commented 7 years ago

What if "new_PC" value is stored? Then, if you received stall, old PC is used, if you had not, use new PC and update it.

gkorepanov commented 7 years ago

That will solve a problem, of course. Thanks.

gkorepanov commented 7 years ago

Let's align on prediction modes:

1-bit predictor (taken/no-taken) 2-bit predictor (W/S T/NT) Multilevel predictor Also, please add ability to get non-dynamic predictions:

Predict always taken (if target exists) Predict taken if target is less than PC (backward jumps)

I'm trying to implement it now. I have a problem connected with language:

I would like to make BPEntry be an abstract class, and others, like MultilevelBPEntry will be derived from it. So I replace the declaration

std::vector<std::vector<BPEntry>> data;

with vector, containing pointers to base class:

std::vector<std::vector<BPEntry*>> data;

However, using raw pointer to base class is considered to be unsafe, so I wrap it with unique_ptr:

std::vector<std::vector<std::unique_ptr<BPEntry>>> data;

Now, I want to initialize that list:

data( ways, std::vector<std::unique_ptr<BPEntry>>( size_in_entries / ways, std::make_unique<MultilevelBPEntry>( *this, log)))

However, C++ documentation makes it clear that elements of an initializer list are always passed via const-reference, thus movable only unique_ptr's produce error. What is the best way of overcoming this? In my practice, I've faced only with emplace_back and manual memory management in similar cases.

pavelkryukov commented 7 years ago

I'm not sure that keeping unique_ptr inside std::vector is a good idea. As a global alternative I suggest to create base class for BP, not BPEntry.

class BaseBP
{
public:
    virtual void lookup(...); // intefaces;
    virtual ~BaseBP() { } // don't forget this 
};

template<typename T>
class BP : public BaseBP
{
    std::vector<std::vector<T>> data;
public:
    void lookup(...) final; // implementation
};

Then, to generate correct type of BP, you may use if...else comparison for input string. Or, you may create factory class:

class BPFactory {
    class BaseBPCreator {
    public:
        virtual BaseBP* create(...) const = 0;
    };

    template<typename T>
    class BPCreator : public BaseBPCreator {
    public:
        virtual BaseBP* create(...) const final { return new BP<T>(...); }
    };

    std::map<std::string, BaseBPCreator*> map;
    template<typename T>
    void add(const std::string& name) {
            // check if the name does not exist
            map[name] = new BPCreator<T>();
    } 
public:
    BPFactory() {
         map.add<SimpleBPEntry>("simple");
         map.add<ComplicatedBPEntry>("complicated");
         // ... as many as you wish
    }
   // can we use initializer list, by the way?

    BaseBP* create(const std::string& name, args ...) const {
          // you need to check if "name" exists in map, though
          return map.at(name)->create(args ...);
    }

    ~BPFactory() { // don't forget about this }
};

BPFactory bp_factory;
BaseBP* bp = bp_factory.create( config.bp_mode, config.bp_size, config.bp_ways);

gkorepanov commented 7 years ago

As a global alternative I suggest to create base class for BP, not BPEntry.

That will even probably work faster since no dynamic casts are executed, the only overhead is initialisation. Seems I still need to define a base class for BPEntry, as they have very similar interface.

Using a polymorphism in C++ was always a kind of headache for me, since it always entails big difficulties. BTW, I have studied how people deal with need to create a vector of polimorhic objects, they either use shared_ptr instead or use raw pointers or implement their own version of vector allowing for movable-only semantics of unique_ptr and making the smart pointer from raw one passed in constructor inside.

pavelkryukov commented 7 years ago

Seems I still need to define a base class for BPEntry, as they have very similar interface.

You may, but it is not required. The good thing is that you do not need to use virtual functions in BPEntry, as they would be compile-time defined.

I have studied how people deal with need to create a vector of polimorhic objects

Our case is not the good example of container of polymorphic objects, as they have the same type. Also, since vector of pointers has lower data density, we break spatial locality and performance.

gkorepanov commented 7 years ago

The good thing is that you do not need to use virtual functions

And what to use then? I thought I will provide interface-implementation structure like:

class BPEntry : protected Log {
    public: 
        /* prediction */
        virtual bool isTaken() const = 0;        
        virtual addr_t target() const = 0;  

        /* update */
        virtual void update( bool is_actually_taken, addr_t target) = 0;
};

class CompiclatedBPEntry : public BPEntry {
    public: 
        /* prediction */
        bool isTaken() const { ... } // implementation

        addr_t target() { ... }  // implementation

        /* update */
        void update( bool is_actually_taken, addr_t target) { ... } implementation
};

gkorepanov commented 7 years ago

we break spatial locality and performance

Wow, that is how cache performance might be taken into consideration while writing programs...

BTW, could you please review how the pipeline flush is done now? here are the lines

pavelkryukov commented 7 years ago

Firstly, BPEntry should not inherit from Log, this class is needed only for big structures (FE, BP, DC etc.)

Secondly, interface for BPEntry is implicitly defined by BP methods. If some method is missing, template instantiation will fail.

gkorepanov commented 7 years ago

interface for BPEntry is implicitly defined by BP methods

That is clear now. Thank you.

pavelkryukov commented 7 years ago

Some comments:

That will even probably work faster since no dynamic casts are executed, the only overhead is initialisation.

Not exactly, Irremovable overhead is introduced by virtual calls of BaseBP methods.

That will even probably work faster since no dynamic casts are executed.

Generally, dymanic_cast<> is required only in exceptional cases, usually when object is transferred from one "code space" to another. If dynamic_cast<> occurs in simple code, it is usually result of bad design.

gkorepanov commented 7 years ago

Sorry, I haven't finished my message :-)

gkorepanov commented 7 years ago

Thanks a lot for clarifications!

What's more about different types: probably it's better not to use templated BP (just explicitly implement different versions), as all they have different constructors?

Now the types of BP are:

Static predictor (always taken)
Static predictor (backward jumps)
1-bit predictor
2-bit predictor
Two-level adaptive predictor
Multi-ilevel multi-bit adaptive predictor //probably just don't need it?

And the constructor allowing for all the types would look like

create(bool log,
        std::string bp_mode,
        unsigned int   size_in_entries,
        unsigned int   ways,
        unsigned short prediction_bits,
        unsigned short prediction_level,
        unsigned short branch_ip_size_in_bits)

pavelkryukov commented 7 years ago

as all they have different constructors?

Why? We have only 3 arguments to BP constructor (plus "mode" argument passed to Factory)

size
ways
IP size

Prediction bits and prediction level are BPEntry-specific static constants.

gkorepanov commented 7 years ago

The multilevel predictor allows us to use any number of prediction pits as well as any prediction level, and so one might want to test it with different parameters. If not, then which constants should be set?

It's only

Multi-ilevel multi-bit adaptive predictor

what complicates the things, actually.

pavelkryukov commented 7 years ago

We agreed to have only 5 modes now, right? (Oh, I see, I meant two-level predictor)

gkorepanov commented 7 years ago

That's OK then, I was simply mislead by that misprint)

pavelkryukov commented 7 years ago

Any updates?

gkorepanov commented 7 years ago

I'm running out of time due to my studies, so quite little time could spend to write code yet. Today finished rewriting the pipeline in previously used way (with minor changes to current structure, but with branch prediction functionality, so as not to go ahead with classes etc.) and started to create fabric class for BPU. I doesn't look like complicated task, as it just requires moving the parts of code, splitting the parts into different classes and so on. Tomorrow morning I will have 3-4 hours, so I will try to complete the ongoing work (and apply the fixes you suggested like creating different const reading method in cache tag array, checking whether the jump executed in a different way within FuncInstr and so on).

pavelkryukov commented 7 years ago

Thanks for the update.

I suggest you to merge with the latest version — I slightly changed configuration processing. The only thing you need to do is to add these lines to perf_sim.cpp:


namespace config {
    static Value<std::string> bp_mode = // ..
    static Value<uint32> bp_size = // ..
    static Value<uint32> bp_ways = // ..
}

// constructing of PerfSim:
BPFactory bp_factory;
bp = bp_factory.create( config::bp_mode, config::bp_size, config::bp_ways);

pavelkryukov commented 7 years ago

(and BP bits, sure)

gkorepanov commented 7 years ago

I suggest you to merge with the latest version — I slightly changed configuration processing.

Yeah, I'm always tracking your updates and rebasing my local branch. Thank you for updates made, each one makes it more convenient to work with the code.

gkorepanov commented 7 years ago

(and BP bits, sure)

What do you mean by that? branch_ip_size_in_bits? Or simply the variable determining whether 1-bit predictor or 2-bit predictor would be chosen? Probably it is determined by the bp_mode already?

(we have only five predictors now,

Static predictor (always taken)
Static predictor (backward jumps)
1-bit predictor
2-bit predictor
Two-level adaptive predictor

I have simply named the corresponding modes static_always_taken, static_backward_jumps, dynamic_one_bit, dynamic_two_bit, adaptive_two_level)

pavelkryukov commented 7 years ago

branch_ip_size_in_bits?

yes

gkorepanov commented 7 years ago

Ok, thanks

gkorepanov commented 7 years ago

@pavelkryukov, could you please have a look at changes in cache_tag_array? BPU needs to know way number to update the data, so not only the hit/miss bool flag should be returned, but also the way number. Is it OK to do it by passing the plain pointer as an argument? I think, creating a special struct like

{
    bool is_hit;
    unsigned int way;
}

or returning an std::tuple<bool, way> is an overhead here, and returning something like -1 as a way number to represent a miss is particularly unclear way.

Here are the updates.

pavelkryukov commented 7 years ago

I've merged your changes, they look good

Is it OK to do it by passing the plain pointer as an argument?

Let's keep it for now — we will revise it when IC and DC models are delivered.

gkorepanov commented 7 years ago

Thank you, I'm implementing different BP entries now, will report when finish and test.

pavelkryukov commented 7 years ago

Please provide new IPC results for all traces when you ask for final pull request

pavelkryukov commented 7 years ago

Also, the name of addr_t should be Addr.

using Addr = uint32;

gkorepanov commented 7 years ago

Also, the name of addr_t should be Addr.

Changed.

I'm gradually approaching to finish, and, surprisingly, the questions have arisen with static predictors.

Prediction always taken doesn't obviously need any info from cache (and thus, as far as I understand, is called static -- it doesn't need any dynamic info acquired during execution). So I should explicitly specialise the templated BP for BPEntryAlwaysTaken class, right?
Prediction backward jumps need to know the target of the instruction to operate. Should it be stored in the cache? If yes, it is simple, but then I don't understand, why it is called static...

Currently the code for other predictors look like this:

    /* prediction */
    Direction getDirection( Addr PC) final
    {
        unsigned int way;
        if ( tags.read_no_touch( PC, &way)) // hit
            return data[ way][ set(PC)].getDirection();

        return false;
    }
    addr_t getTarget( Addr PC) final
    {
        if ( tags.read_no_touch( PC, &way)) // hit
            return data[ way][ set(PC)].getTarget();
        else // miss
            return PC + 4;
    }

    /* update */
    void update( Direction direction,
                 addr_t branch_ip,
                 addr_t target = NO_VAL32) final
    {
        unsigned int way;
        if ( !tags.read( PC, &way)) // miss
            tags.write( branch_ip, &way); // add new entry to cache

        data[ way][ set( branch_ip)].update( direction, target);
    }

pavelkryukov commented 7 years ago

So I should explicitly specialise the templated BP for BPEntryAlwaysTaken class, right?

yes

pavelkryukov commented 7 years ago

Should it be stored in the cache?

Yes. Direction is predicted static, target isn't. Later we are going to split BTB and BP to the separate units, so it would matter

pavelkryukov commented 7 years ago

Prediction always taken doesn't obviously need any info from cache.

What about target?

pavelkryukov commented 7 years ago

I also suggest to rename getDirection to "is_taken" and use boolean variables instead of Direction enum – we are not going to have 3rd direction type

gkorepanov commented 7 years ago

What about target?

Yeah, of course it should be stored just like with backward jumps predictor.

Thank you!

I'm going to have a flight in ten minutes, so can't be online anymore till departure. Thanks for so quick response -- I have something to do during the flight!)

gkorepanov commented 7 years ago

I also suggest to rename getDirection to "is_taken" and use boolean variables instead of Direction enum – we are not going to have 3rd direction type

OK, that's simpler, of course.

pavelkryukov commented 7 years ago

So, where is the code? Deadline was yesterday...

gkorepanov commented 7 years ago

I'm sorry for being so late with it, just pushed the code, but I haven't tested it yet and there are a couple of TODO's in it, I am working with it now and fixing mistakes done.

gkorepanov commented 7 years ago

Applied fixes you suggested. Now testing and fixing major implementation errors (as now even unit-tests are failed)

pavelkryukov commented 7 years ago

bpu.h:

    /* update */
    void update( bool is_taken,
                 Addr branch_ip,
                 Addr target = NO_VAL32) final
    {
        unsigned int way;
        if ( !tags.read( branch_ip, &way)) // miss
            tags.write( branch_ip, &way); // add new entry to cache

        data[ way][ set( branch_ip)].update( is_taken, target);
    }

If there is a cache miss, new entry will be ill-formed (old entry is updated with new data). Should be:

    /* update */
    void update( bool is_taken,
                 Addr branch_ip,
                 Addr target = NO_VAL32) final
    {
       unsigned int way;
       if ( !tags.read( branch_ip, &way)) // miss
       {
            tags.write( branch_ip, &way); // add new entry to cache
            data[ way][ set( branch_ip)].reset();
       }

       data[ way][ set( branch_ip)].update( is_taken, target);
    }

pavelkryukov commented 7 years ago

return std::unique_ptr<BaseBP>{ std::make_unique<BP<T>>( size_in_entries,
                                                                     ways,
                                                                     branch_ip_size_in_bits)};

It can be simpler:

return std::unique_ptr<BaseBP>( new BP<T>( size_in_entries, ways, branch_ip_size_in_bits));

pavelkryukov commented 7 years ago

    /* acquire set number from address */
    inline unsigned int set( Addr addr)
    {
        return addr & set_mask;
    }

That method can be called from CacheTagArray

pavelkryukov commented 7 years ago

bool isTaken() const { return state; }

Better to use enum values explicitly:

bool is_taken() const { return state == T; }

pavelkryukov commented 7 years ago

init_ports();

BPFactory bp_factory;
bp = bp_factory.create( config::bp_mode, config::bp_size, config::bp_ways);

I suggest to keep init_ports() in the very bottom of PerfMIPS constructor, so there is no need to think about the order if more units are added

pavelkryukov commented 7 years ago

Then, if you received stall, old PC is used, if you had not, use new PC and update it.

Your implementation updates PC before stall checks, seems it causes problems

pavelkryukov commented 7 years ago

    if ( is_flush)
    {
        /* ignoring the upcoming instruction as it is invalid */
        rp_decode_2_execute->read( &instr, cycle);
        sout << "flush\n";
        return;
    }

I suppose you have to "validate" destination register of this instruction.

MIPT-ILab / mipt-mips

Implement Branch Prediction Unit and introduce it into pipeline #19