Optimize square root instruction

Context

Currently fsqrt.d is one of the most slow instructions that can be called from userspace, for instance it is roughly 3x slower than other floating point instructions, we could improve it. For dapps running untrusted RISC-V code, this function could be abused to make the dapp validation intentionally slower.

Measurements:

fsqrt.d                                        60.309 MIPS    42852 ucycles
fmul.d                                        152.402 MIPS     6964 ucycles
mul                                           494.966 MIPS     6434 ucycles

I added other instructions speed as reference. Also sqrt seems to be taking a large number of microarchitecture cycles.

EDIT: Seems like fdiv.d causes a iterations of 128 loops in uarch due to our 128bit implementation, we could also optimize that.

Possible solutions

Our current implementation is using Newton's method to find the square root, with many iterations. Seems like "Berkeley Softfloat" gets away without for loops, using fast invert square root, possible inspired by the famous Quake's fast invert square root. We could investigate how this is done, removing for loops would be the ideal case for running in microarchitecture.

cartesi / machine-emulator

Optimize square root instruction #221

Context

Possible solutions