Feedback on the accelerator interface

Hi Noam,

Thank you very much for writing up the proposal for the accelerator interface.

I have some questions and suggestion that you maybe can take into account:

ready/valid protocol

The initiator asserts valid. The assertion of valid must not depend on ready.
Once valid has been asserted all data must remain stable.
The receiver asserts ready whenever it is ready to receive the transaction.
When both valid and ready are high the transaction is successful.

Can you make it explicit that a default ready is allowed (i.e. ready = 1 while valid = 0) and that ready can be retracted at any time. You state that the handshake is according to the AMBA standard. I assume you are referring to AXI here. In AXI it is not allowed for outputs to combinatorially depend on inputs. In this case I would expect that we would actually allow ready to combinatorially depend on valid; if you agree can you please remove the 'AMBA standard' bit and state the allowed dependency explicitly?

Request channel (q)

I don't have an alternative naming proposal yet, but calling the request channel 'q channel' is going to lead to a lot of confusion as that is an ARM standard.

q_data_op

Can you state that whether q_data_op is a 32-bit instruction or a 16-bit instruction will be based on the existing RISC-V definition, so q_data_op[1:0] specify whether the instruction is 16-bit or not and q_data_op[6:2] further differentiate the instruction width as specified in the RISC-V unprivileged spec. (You might want to add an InstrWidth parameter as well.)

Please also prescribe that instructions which are shorter than the width of q_data_op will always be present in the LSBs and that the MSBs should be ignored in that case. For example for 'compressed instructions' are always present in q_data_op[15:0] and if q-data_op[1:0] != 2'b11, then q_data_op[31:16] shall be ignored (it might not contain bits related to the next instruction).

Match indicator

The request channel needs to have a 'match' signal flowing from request channel target to request channel initiator (with validity signaled with 'ready'). The signal will indicate whether a connected accelerator can/will handle the provided q_data_op. (If no match is received by the RISC-V and also the RISC-V itself cannot decode the instruction, then it will be classified as an illegal instruction). I think that we should explicitly allow that 'match' combinatorially can depend on q_data_op/valid.

A/B/C operand

As discussed there is a need for the RISC-V core to know which operands are actually required for a specific accelerated instruction. The accelerator shall signal this via 3 sepearate signals with same validity/timing as above 'match' signal. I think that we should explicitly allow that these 3 signals combinatorially can depend on q_data_op/valid.

Operand destination

Just like you defined the 'operand origin', can you please add a section on 'operand destination' (and this needs to be defined for both compressed and uncompressed instructions). On top of the obvious choice of q_data_op[11:7] as the destination for 32-bit instructions, we would like to see the ability for dual-register writeback, similar to how this is done in the P extension (e.g. for even destination registers other than x0, a dual write back can be performed to Xn and Xn+1). The accelerator can signal its need for a dual writeback by another signal flowing from request channel target to request channel initiator (with validity signaled with 'ready').

Hi Arjan, Thanks a lot for your feedback!

ready/valid protocol
The initiator asserts valid. The assertion of valid must not depend on ready.
Once valid has been asserted all data must remain stable.
The receiver asserts ready whenever it is ready to receive the transaction.
When both valid and ready are high the transaction is successful.
Can you make it explicit that a default ready is allowed (i.e. ready = 1 while valid = 0) and that ready can be retracted at any time. You state that the handshake is according to the AMBA standard. I assume you are referring to AXI here. In AXI it is not allowed for outputs to combinatorially depend on inputs. In this case I would expect that we would actually allow ready to combinatorially depend on valid; if you agree can you please remove the 'AMBA standard' bit and state the allowed dependency explicitly?

I agree. Thanks.

Request channel (q)

I don't have an alternative naming proposal yet, but calling the request channel 'q channel' is going to lead to a lot of confusion as that is an ARM standard.

I see. We could call them req/rsp or something like that.

q_data_op

Can you state that whether q_data_op is a 32-bit instruction or a 16-bit instruction will be based on the existing RISC-V definition, so q_data_op[1:0] specify whether the instruction is 16-bit or not and q_data_op[6:2] further differentiate the instruction width as specified in the RISC-V unprivileged spec. (You might want to add an InstrWidth parameter as well.)

Please also prescribe that instructions which are shorter than the width of q_data_op will always be present in the LSBs and that the MSBs should be ignored in that case. For example for 'compressed instructions' are always present in q_data_op[15:0] and if q-data_op[1:0] != 2'b11, then q_data_op[31:16] shall be ignored (it might not contain bits related to the next instruction).

I am not sure this makes sense. Here is why:

As far as I know, for compressed RISC-V instructions there always exists a corresponding non-compressed instruction. All the cores I have seen so far, the main decoder, which would emit an offload request, works with the expanded instruction data. So for non accelerator-agnostic cores it would make sense to do the same and offload the expanded instruction to the accelerator interconnect - No need for your proposed change to the spec.
Compressed instructions are tricky to decode. Some decoding needs to be done by the offloading core, to identify the source registers. Unfortunately, there are a few different formats defined for compressed instructions:

For accelerator-agnostic cores there is a problem here. As far as I can tell, there is no way to distinguish instruction formats without knowing about the instruction explicitly (I might be wrong), and therefore no way to know where the source and destination register addresses are located.

You can not offload a compressed instruction from an accelerator-agnostic core without providing more information. The required source and destination registers would need to be supplied by the external pre-decoder.

The easiest way I could think of to implement this would be to expand the compressed instruction in the predecoder and feed the expanded data to the offloading core's decoder. This imposes some additional complexity on the pre-decoder and the requirements for an accelerator-agnostic core, but I think this should be doable with the least amount of overhead to the offloading core.

Again, we do not need to specify q_data_op to be able to carry compressed instructions, as we would expand them prior to offloading

Match indicator

The request channel needs to have a 'match' signal flowing from request channel target to request channel initiator (with validity signaled with 'ready'). The signal will indicate whether a connected accelerator can/will handle the provided q_data_op. (If no match is received by the RISC-V and also the RISC-V itself cannot decode the instruction, then it will be classified as an illegal instruction). I think that we should explicitly allow that 'match' combinatorially can depend on q_data_op/valid.

I do not think this is necessary. An instruction should never be offloaded to an accelerator which can not execute it.

In case of non-agnostic cores, the decoder stage should take care of that. An illegal instruction is one that cannot be offloaded nor executed internally. In the case of agnostic cores, the pre-decoder takes care of this. An analogue to your proposed 'match' signal should be implemented between core and pre-decoder. I am currently drafting a document specifying that interface. The corresponding signal I called offload_accept for now..

A/B/C operand

As discussed there is a need for the RISC-V core to know which operands are actually required for a specific accelerated instruction. The accelerator shall signal this via 3 sepearate signals with same validity/timing as above 'match' signal. I think that we should explicitly allow that these 3 signals combinatorially can depend on q_data_op/valid.

I don't think this is necessary either, because there is another stage of decoding in the subsystem for each accelerator. For any offloaded instruction, the accelerator-local decoder must know which operands are to be used. Unused operand signals should be silently ignored.

In other words, the information carried by the signal you propose is already present in the q_data_op signal.

Again, for the interface between agnostic core and pre-decoder we do need something like it. However, I don't think the core actually needs to know about which source registers are used. Rather the pre-decoder must know which are valid. I am currently drafting an according specification in the above mentioned document.

Operand destination

Just like you defined the 'operand origin', can you please add a section on 'operand destination' (and this needs to be defined for both compressed and uncompressed instructions). On top of the obvious choice of q_data_op[11:7] as the destination for 32-bit instructions, we would like to see the ability for dual-register writeback, similar to how this is done in the P extension (e.g. for even destination registers other than x0, a dual write back can be performed to Xn and Xn+1). The accelerator can signal its need for a dual writeback by another signal flowing from request channel target to request channel initiator (with validity signaled with 'ready').

I'll do that.

Operand destination

Just like you defined the 'operand origin', can you please add a section on 'operand destination' (and this needs to be defined for both compressed and uncompressed instructions). On top of the obvious choice of q_data_op[11:7] as the destination for 32-bit instructions, we would like to see the ability for dual-register writeback, similar to how this is done in the P extension (e.g. for even destination registers other than x0, a dual write back can be performed to Xn and Xn+1). The accelerator can signal its need for a dual writeback by another signal flowing from request channel target to request channel initiator (with validity signaled with 'ready').

I'll do that.

Maybe we should think about this some more: Instead of the scheme you propose, we could also let one (or more) of the source registers double as writeback locations. There are many possibilities how to do that.

Also: In our current scheme we can only write back one result at a time; two simultaneous writebacks from the accelerator would only be possible if we widen the response path. But sequential write-back probably doesn't yield much benefits in many cases.

I think we should discuss that again in the group. - I'll add it to the list of open questions.

Hi Arjan, Thanks a lot for your feedback!
ready/valid protocol
The initiator asserts valid. The assertion of valid must not depend on ready.
Once valid has been asserted all data must remain stable.
The receiver asserts ready whenever it is ready to receive the transaction.
When both valid and ready are high the transaction is successful.
Can you make it explicit that a default ready is allowed (i.e. ready = 1 while valid = 0) and that ready can be retracted at any time. You state that the handshake is according to the AMBA standard. I assume you are referring to AXI here. In AXI it is not allowed for outputs to combinatorially depend on inputs. In this case I would expect that we would actually allow ready to combinatorially depend on valid; if you agree can you please remove the 'AMBA standard' bit and state the allowed dependency explicitly?
I agree. Thanks.

Request channel (q)

I don't have an alternative naming proposal yet, but calling the request channel 'q channel' is going to lead to a lot of confusion as that is an ARM standard.

I see. We could call them req/rsp or something like that.

q_data_op

Can you state that whether q_data_op is a 32-bit instruction or a 16-bit instruction will be based on the existing RISC-V definition, so q_data_op[1:0] specify whether the instruction is 16-bit or not and q_data_op[6:2] further differentiate the instruction width as specified in the RISC-V unprivileged spec. (You might want to add an InstrWidth parameter as well.) Please also prescribe that instructions which are shorter than the width of q_data_op will always be present in the LSBs and that the MSBs should be ignored in that case. For example for 'compressed instructions' are always present in q_data_op[15:0] and if q-data_op[1:0] != 2'b11, then q_data_op[31:16] shall be ignored (it might not contain bits related to the next instruction).

I am not sure this makes sense. Here is why:
1. As far as I know, for compressed RISC-V instructions there always exists a corresponding non-compressed instruction. All the cores I have seen so far, the main decoder, which would emit an offload request, works with the expanded instruction data. So for non accelerator-agnostic cores it would make sense to do the same and offload the expanded instruction to the accelerator interconnect - No need for your proposed change to the spec.

This is not fully true. For illegal or reserved compressed instructions there is no RISC-V specified non-compressed instruction and all RISC-V implementations might handle this scenario in different manners. Therefore we need to feed the compressed instruction to the accelerator and not the uncompressed version (which would be very core specific for illegal/reserved compressed instructions, which are exactly the ones we want to use for possibly assigning to accelerators).

2. Compressed instructions are tricky to decode. Some decoding needs to be done by the offloading core, to identify the source registers. Unfortunately, there are a few different formats defined for compressed instructions:
For accelerator-agnostic cores there is a problem here. As far as I can tell, there is no way to distinguish instruction formats without knowing about the instruction explicitly (I might be wrong), and therefore no way to know where the source and destination register addresses are located.

You can not offload a compressed instruction from an accelerator-agnostic core without providing more information. The required source and destination registers would need to be supplied by the external pre-decoder.

Here I do agree. We could decide to for example state that illegal/reserved compressed instructions have no source or destination registers and as such an accelerator can only use them to for example implement 'hints' or some kind of control instructions (e.g. similar to WFI) that do no depends on read/write operands. Still the accelerator needs to get the uncompressed instruction (not the uncompressed variant as that does not carry any meaning for this scenario).

The easiest way I could think of to implement this would be to expand the compressed instruction in the predecoder and feed the expanded data to the offloading core's decoder. This imposes some additional complexity on the pre-decoder and the requirements for an accelerator-agnostic core, but I think this should be doable with the least amount of overhead to the offloading core.

Again, we do not need to specify q_data_op to be able to carry compressed instructions, as we would expand them prior to offloading

Match indicator

The request channel needs to have a 'match' signal flowing from request channel target to request channel initiator (with validity signaled with 'ready'). The signal will indicate whether a connected accelerator can/will handle the provided q_data_op. (If no match is received by the RISC-V and also the RISC-V itself cannot decode the instruction, then it will be classified as an illegal instruction). I think that we should explicitly allow that 'match' combinatorially can depend on q_data_op/valid.

I do not think this is necessary. An instruction should never be offloaded to an accelerator which can not execute it. In case of non-agnostic cores, the decoder stage should take care of that. An illegal instruction is one that cannot be offloaded nor executed internally. In the case of agnostic cores, the pre-decoder takes care of this. An analogue to your proposed 'match' signal should be implemented between core and pre-decoder. I am currently drafting a document specifying that interface. The corresponding signal I called offload_accept for now..

All my comments in this issue indeed relate to the interface between the Core and the 'Accelerator predecoder'. At that interface you need some kind of match indicator. I now see that you call it 'offload_accept'; I think that might serve the same goal as what I tried to describe here. It would be good to add a table which defines the 'accelerator agnostic core interface' as well.

A/B/C operand

As discussed there is a need for the RISC-V core to know which operands are actually required for a specific accelerated instruction. The accelerator shall signal this via 3 sepearate signals with same validity/timing as above 'match' signal. I think that we should explicitly allow that these 3 signals combinatorially can depend on q_data_op/valid.

I don't think this is necessary either, because there is another stage of decoding in the subsystem for each accelerator. For any offloaded instruction, the accelerator-local decoder must know which operands are to be used. Unused operand signals should be silently ignored.

In other words, the information carried by the signal you propose is already present in the q_data_op signal.

Again, for the interface between agnostic core and pre-decoder we do need something like it. However, I don't think the core actually needs to know about which source registers are used. Rather the pre-decoder must know which are valid. I am currently drafting an according specification in the above mentioned document.

Again my comments were for the 'accelerator agnostic core interface'. I do still think we need these signals. It is not good enough for the accelerator predecoder to know this as the core needs this info to prevent unnecessary cycles and stalls (otherwise it has to assume that 3 read operands are always used, which is much to pessimistic).

Operand destination

Just like you defined the 'operand origin', can you please add a section on 'operand destination' (and this needs to be defined for both compressed and uncompressed instructions). On top of the obvious choice of q_data_op[11:7] as the destination for 32-bit instructions, we would like to see the ability for dual-register writeback, similar to how this is done in the P extension (e.g. for even destination registers other than x0, a dual write back can be performed to Xn and Xn+1). The accelerator can signal its need for a dual writeback by another signal flowing from request channel target to request channel initiator (with validity signaled with 'ready').

I'll do that.

Operand destination

Just like you defined the 'operand origin', can you please add a section on 'operand destination' (and this needs to be defined for both compressed and uncompressed instructions). On top of the obvious choice of q_data_op[11:7] as the destination for 32-bit instructions, we would like to see the ability for dual-register writeback, similar to how this is done in the P extension (e.g. for even destination registers other than x0, a dual write back can be performed to Xn and Xn+1). The accelerator can signal its need for a dual writeback by another signal flowing from request channel target to request channel initiator (with validity signaled with 'ready').

I'll do that.

Maybe we should think about this some more: Instead of the scheme you propose, we could also let one (or more) of the source registers double as writeback locations. There are many possibilities how to do that.

Also: In our current scheme we can only write back one result at a time; two simultaneous writebacks from the accelerator would only be possible if we widen the response path. But sequential write-back probably doesn't yield much benefits in many cases.

I think we should discuss that again in the group. - I'll add it to the list of open questions.

Agreed that we would need simultaneous writeback (as opposed to sequential writeback). The data path only needs to be widened for accelerators that actually use this feature; the related signals can be tied off (and be optimized away) otherwise. If no accelerators use this feature in a given system, then of course the signals can be left out all together.

The Core itself will need to have a parameter so that it can know at compile time whether dual write back should be supported; similarly I would propose a compile-time parameter for the Core to know whether support for a 3rd register file read port is required or not.

ganoam / accelerator-interface

Feedback on the accelerator interface #3