This PR proposes to register the fetched instruction in the fetch stage before it is consumed by the decoder.
This lets us use larger memories without timing violations, since the net delay between the memory and the decoder is considerable.
The initial implementation drops support for compressed instructions.
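The idea, roughly, is a pipeline register on the memory-to-decoder path. A minimal sketch (module and signal names here are illustrative, not the core's actual identifiers):

```systemverilog
// Hypothetical pipeline register between instruction memory and decoder.
// Breaks the long memory->decoder net into two shorter paths.
module fetch_reg (
    input  logic        clk,
    input  logic        rst,
    input  logic        fetch_valid,   // memory read data is valid this cycle
    input  logic [31:0] fetch_instr,   // raw word from instruction memory
    output logic        decode_valid,
    output logic [31:0] decode_instr   // registered copy seen by the decoder
);
    always_ff @(posedge clk) begin
        if (rst) begin
            decode_valid <= 1'b0;
        end else begin
            decode_valid <= fetch_valid;
            decode_instr <= fetch_instr; // one extra cycle of latency
        end
    end
endmodule
```

The extra flop stage is where the added FFs and the CoreMark drop below come from: decode now sees the instruction one cycle later.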
Preliminary results:
|          | Baseline | Registered |
|----------|----------|------------|
| Setup    | -0.079   | 0.015      |
| LUTs     | 3118     | 3133       |
| FFs      | 1353     | 1453       |
| CoreMark | 202      | 184        |
A small price to pay for a bit more timing headroom.
Maybe with this change we can re-add branch prediction and win the performance back?