Closed pavelkryukov closed 7 years ago
I see no difference as we don't have separate modules for now
I still don't see an issue
F, cycle -1: PC = 0 is sent to D D, cycle -1: PC = -4 is decoded E, cycle -1: PC = -8 is executed
F, cycle 0: updating PC := 4, PC = 4 is sent to D D, cycle 0: PC = 0; stall detected, stall signal sent, current instruction is saved to "decode_data" E: cycle 0: PC = -4 is executed
F, cycle 1; stall is received, no PC update, PC = 4 is sent to D D, cycle 1: as stall was sent on previous cycle, instruction PC = 0 is decoded from "decode_data", stall signal sent E: bubble
F, cycle 2: stall is received, no PC update, PC = 4 is sent to D D, cycle 2: PC = 0 is decoded, unstall, unstall signal sent E: bubble
F, cycle 3: updating PC := 8, PC = 8 is sent to D D, cycle 3, PC = 4 is decoded E cycle 3, PC = 0 is executed
Now I understand what puzzles me.
I the given example, the fetch
stage acts in the following way:
instr_code = mem[PC]
, and send it to decode
With BPU enabled, the behaviour is slightly different, as far as I understand from our presentation about BPU:
memory
stageinstr_code = mem[PC]
, and send it to decode
I don't see the way how to change the latter behaviour to make it work. Now it executes in the following way:
F, cycle -1: PC = 0 is sent to D, updating PC := 4
D, cycle -1: PC = -4 is decoded
E, cycle -1: PC = -8 is executed
F, cycle 0: PC = 4 is sent to D, updating PC := 8
D, cycle 0: PC = 0; stall detected, stall signal sent, current instruction is saved to "decode_data"
E: cycle 0: PC = -4 is executed
F, cycle 1; stall is received, no PC update, PC = 8 is sent to D
D, cycle 1: as stall was sent on previous cycle, instruction PC = 0 is decoded from "decode_data", stall signal sent
E: bubble
F, cycle 2: stall is received, no PC update, PC = 8 is sent to D
D, cycle 2: PC = 0 is decoded, unstall, unstall signal sent
E: bubble
F, cycle 3: PC = 8 is sent, updating PC := 12
D, cycle 3, PC = 8 is decoded (PC = 4 is LOST!)
E cycle 3, PC = 0 is executed
What if "new_PC" value is stored? Then, if you received stall, old PC is used, if you had not, use new PC and update it.
That will solve a problem, of course. Thanks.
Let's align on prediction modes:
1-bit predictor (taken/no-taken) 2-bit predictor (W/S T/NT) Multilevel predictor Also, please add ability to get non-dynamic predictions:
Predict always taken (if target exists) Predict taken if target is less than PC (backward jumps)
I'm trying to implement it now. I have a problem connected with language:
I would like to make BPEntry be an abstract class, and others, like MultilevelBPEntry
will be derived from it. So I replace the declaration
std::vector<std::vector<BPEntry>> data;
with vector, containing pointers to base class:
std::vector<std::vector<BPEntry*>> data;
However, using raw pointer to base class is considered to be unsafe, so I wrap it with unique_ptr:
std::vector<std::vector<std::unique_ptr<BPEntry>>> data;
Now, I want to initialize that list:
data( ways, std::vector<std::unique_ptr<BPEntry>>( size_in_entries / ways, std::make_unique<MultilevelBPEntry>( *this, log)))
However, C++ documentation makes it clear that elements of an initializer list are always passed via const-reference, thus movable only unique_ptr's produce error.
What is the best way of overcoming this? In my practice, I've faced only with emplace_back
and manual memory management in similar cases.
I'm not sure that keeping unique_ptr inside std::vector is a good idea. As a global alternative I suggest to create base class for BP, not BPEntry.
class BaseBP
{
public:
virtual void lookup(...); // intefaces;
virtual ~BaseBP() { } // don't forget this
};
template<typename T>
class BP : public BaseBP
{
std::vector<std::vector<T>> data;
public:
void lookup(...) final; // implementation
};
Then, to generate correct type of BP, you may use if...else comparison for input string. Or, you may create factory class:
class BPFactory {
class BaseBPCreator {
public:
virtual BaseBP* create(...) const = 0;
};
template<typename T>
class BPCreator : public BaseBPCreator {
public:
virtual BaseBP* create(...) const final { return new BP<T>(...); }
};
std::map<std::string, BaseBPCreator*> map;
template<typename T>
void add(const std::string& name) {
// check if the name does not exist
map[name] = new BPCreator<T>();
}
public:
BPFactory() {
map.add<SimpleBPEntry>("simple");
map.add<ComplicatedBPEntry>("complicated");
// ... as many as you wish
}
// can we use initializer list, by the way?
BaseBP* create(const std::string& name, args ...) const {
// you need to check if "name" exists in map, though
return map.at(name)->create(args ...);
}
~BPFactory() { // don't forget about this }
};
BPFactory bp_factory;
BaseBP* bp = bp_factory.create( config.bp_mode, config.bp_size, config.bp_ways);
As a global alternative I suggest to create base class for BP, not BPEntry.
That will even probably work faster since no dynamic casts are executed, the only overhead is initialisation. Seems I still need to define a base class for BPEntry, as they have very similar interface.
Using a polymorphism in C++ was always a kind of headache for me, since it always entails big difficulties. BTW, I have studied how people deal with need to create a vector of polimorhic objects, they either use shared_ptr instead or use raw pointers or implement their own version of vector allowing for movable-only semantics of unique_ptr and making the smart pointer from raw one passed in constructor inside.
Seems I still need to define a base class for BPEntry, as they have very similar interface.
You may, but it is not required. The good thing is that you do not need to use virtual functions in BPEntry, as they would be compile-time defined.
I have studied how people deal with need to create a vector of polimorhic objects
Our case is not the good example of container of polymorphic objects, as they have the same type. Also, since vector of pointers has lower data density, we break spatial locality and performance.
The good thing is that you do not need to use virtual functions
And what to use then? I thought I will provide interface-implementation structure like:
class BPEntry : protected Log {
public:
/* prediction */
virtual bool isTaken() const = 0;
virtual addr_t target() const = 0;
/* update */
virtual void update( bool is_actually_taken, addr_t target) = 0;
};
class CompiclatedBPEntry : public BPEntry {
public:
/* prediction */
bool isTaken() const { ... } // implementation
addr_t target() { ... } // implementation
/* update */
void update( bool is_actually_taken, addr_t target) { ... } implementation
};
we break spatial locality and performance
Wow, that is how cache performance might be taken into consideration while writing programs...
BTW, could you please review how the pipeline flush is done now? here are the lines
Firstly, BPEntry should not inherit from Log, this class is needed only for big structures (FE, BP, DC etc.)
Secondly, interface for BPEntry is implicitly defined by BP
interface for BPEntry is implicitly defined by BP methods
That is clear now. Thank you.
Some comments:
That will even probably work faster since no dynamic casts are executed, the only overhead is initialisation.
Not exactly, Irremovable overhead is introduced by virtual calls of BaseBP methods.
That will even probably work faster since no dynamic casts are executed.
Generally, dymanic_cast<>
is required only in exceptional cases, usually when object is transferred from one "code space" to another. If dynamic_cast<>
occurs in simple code, it is usually result of bad design.
Sorry, I haven't finished my message :-)
Thanks a lot for clarifications!
What's more about different types: probably it's better not to use templated BP (just explicitly implement different versions), as all they have different constructors?
Now the types of BP are:
//probably just don't need it?
And the constructor allowing for all the types would look like
create(bool log,
std::string bp_mode,
unsigned int size_in_entries,
unsigned int ways,
unsigned short prediction_bits,
unsigned short prediction_level,
unsigned short branch_ip_size_in_bits)
as all they have different constructors?
Why? We have only 3 arguments to BP constructor (plus "mode" argument passed to Factory)
Prediction bits and prediction level are BPEntry-specific static constants.
The multilevel predictor allows us to use any number of prediction pits as well as any prediction level, and so one might want to test it with different parameters. If not, then which constants should be set?
It's only
Multi-ilevel multi-bit adaptive predictor
what complicates the things, actually.
We agreed to have only 5 modes now, right? (Oh, I see, I meant two-level predictor)
That's OK then, I was simply mislead by that misprint)
Any updates?
I'm running out of time due to my studies, so quite little time could spend to write code yet. Today finished rewriting the pipeline in previously used way (with minor changes to current structure, but with branch prediction functionality, so as not to go ahead with classes etc.) and started to create fabric class for BPU. I doesn't look like complicated task, as it just requires moving the parts of code, splitting the parts into different classes and so on. Tomorrow morning I will have 3-4 hours, so I will try to complete the ongoing work (and apply the fixes you suggested like creating different const reading method in cache tag array, checking whether the jump executed in a different way within FuncInstr and so on).
Thanks for the update.
I suggest you to merge with the latest version — I slightly changed configuration processing. The only thing you need to do is to add these lines to perf_sim.cpp:
namespace config {
static Value<std::string> bp_mode = // ..
static Value<uint32> bp_size = // ..
static Value<uint32> bp_ways = // ..
}
// constructing of PerfSim:
BPFactory bp_factory;
bp = bp_factory.create( config::bp_mode, config::bp_size, config::bp_ways);
(and BP bits, sure)
I suggest you to merge with the latest version — I slightly changed configuration processing.
Yeah, I'm always tracking your updates and rebasing my local branch. Thank you for updates made, each one makes it more convenient to work with the code.
(and BP bits, sure)
What do you mean by that? branch_ip_size_in_bits
? Or simply the variable determining whether 1-bit predictor or 2-bit predictor would be chosen? Probably it is determined by the bp_mode
already?
(we have only five predictors now,
I have simply named the corresponding modes
static_always_taken
, static_backward_jumps
, dynamic_one_bit
, dynamic_two_bit
, adaptive_two_level
)
branch_ip_size_in_bits?
yes
Ok, thanks
@pavelkryukov, could you please have a look at changes in cache_tag_array? BPU needs to know way number to update the data, so not only the hit/miss
bool flag should be returned, but also the way number. Is it OK to do it by passing the plain pointer as an argument? I think, creating a special struct like
{
bool is_hit;
unsigned int way;
}
or returning an std::tuple<bool, way>
is an overhead here, and returning something like -1
as a way number to represent a miss is particularly unclear way.
Here are the updates.
I've merged your changes, they look good
Is it OK to do it by passing the plain pointer as an argument?
Let's keep it for now — we will revise it when IC and DC models are delivered.
Thank you, I'm implementing different BP entries now, will report when finish and test.
Please provide new IPC results for all traces when you ask for final pull request
Also, the name of addr_t should be Addr.
using Addr = uint32;
Also, the name of addr_t should be Addr.
Changed.
I'm gradually approaching to finish, and, surprisingly, the questions have arisen with static predictors.
Prediction always taken
doesn't obviously need any info from cache (and thus, as far as I understand, is called static
-- it doesn't need any dynamic info acquired during execution). So I should explicitly specialise the templated BP for BPEntryAlwaysTaken
class, right?
Prediction backward jumps
need to know the target of the instruction to operate. Should it be stored in the cache? If yes, it is simple, but then I don't understand, why it is called static...
Currently the code for other predictors look like this:
/* prediction */
Direction getDirection( Addr PC) final
{
unsigned int way;
if ( tags.read_no_touch( PC, &way)) // hit
return data[ way][ set(PC)].getDirection();
return false;
}
addr_t getTarget( Addr PC) final
{
if ( tags.read_no_touch( PC, &way)) // hit
return data[ way][ set(PC)].getTarget();
else // miss
return PC + 4;
}
/* update */
void update( Direction direction,
addr_t branch_ip,
addr_t target = NO_VAL32) final
{
unsigned int way;
if ( !tags.read( PC, &way)) // miss
tags.write( branch_ip, &way); // add new entry to cache
data[ way][ set( branch_ip)].update( direction, target);
}
So I should explicitly specialise the templated BP for BPEntryAlwaysTaken class, right?
yes
Should it be stored in the cache?
Yes. Direction is predicted static, target isn't. Later we are going to split BTB and BP to the separate units, so it would matter
Prediction always taken doesn't obviously need any info from cache.
What about target?
I also suggest to rename getDirection to "is_taken" and use boolean variables instead of Direction enum – we are not going to have 3rd direction type
What about target?
Yeah, of course it should be stored just like with backward jumps predictor.
Thank you!
I'm going to have a flight in ten minutes, so can't be online anymore till departure. Thanks for so quick response -- I have something to do during the flight!)
I also suggest to rename getDirection to "is_taken" and use boolean variables instead of Direction enum – we are not going to have 3rd direction type
OK, that's simpler, of course.
So, where is the code? Deadline was yesterday...
I'm sorry for being so late with it, just pushed the code, but I haven't tested it yet and there are a couple of TODO's in it, I am working with it now and fixing mistakes done.
Applied fixes you suggested. Now testing and fixing major implementation errors (as now even unit-tests are failed)
bpu.h
:
/* update */
void update( bool is_taken,
Addr branch_ip,
Addr target = NO_VAL32) final
{
unsigned int way;
if ( !tags.read( branch_ip, &way)) // miss
tags.write( branch_ip, &way); // add new entry to cache
data[ way][ set( branch_ip)].update( is_taken, target);
}
If there is a cache miss, new entry will be ill-formed (old entry is updated with new data). Should be:
/* update */
void update( bool is_taken,
Addr branch_ip,
Addr target = NO_VAL32) final
{
unsigned int way;
if ( !tags.read( branch_ip, &way)) // miss
{
tags.write( branch_ip, &way); // add new entry to cache
data[ way][ set( branch_ip)].reset();
}
data[ way][ set( branch_ip)].update( is_taken, target);
}
return std::unique_ptr<BaseBP>{ std::make_unique<BP<T>>( size_in_entries,
ways,
branch_ip_size_in_bits)};
It can be simpler:
return std::unique_ptr<BaseBP>( new BP<T>( size_in_entries, ways, branch_ip_size_in_bits));
/* acquire set number from address */
inline unsigned int set( Addr addr)
{
return addr & set_mask;
}
That method can be called from CacheTagArray
bool isTaken() const { return state; }
Better to use enum values explicitly:
bool is_taken() const { return state == T; }
init_ports();
BPFactory bp_factory;
bp = bp_factory.create( config::bp_mode, config::bp_size, config::bp_ways);
I suggest to keep init_ports()
in the very bottom of PerfMIPS constructor, so there is no need to think about the order if more units are added
Then, if you received stall, old PC is used, if you had not, use new PC and update it.
Your implementation updates PC before stall checks, seems it causes problems
if ( is_flush)
{
/* ignoring the upcoming instruction as it is invalid */
rp_decode_2_execute->read( &instr, cycle);
sout << "flush\n";
return;
}
I suppose you have to "validate" destination register of this instruction.
So, it's better, than
?