Annoyingly, RISC-V is really inconvenient when we have to deal with misaligned loads/stores. By default, LLVM generates very inefficient code which loads every byte separately and combines them into a 32/64-bit integer. The `ld` instruction "may" support misaligned loads, and for Linux user-space it's even guaranteed, but it can be (and, IIUC, often is in practice) "extremely slow", so we should not rely on it while writing performant code.
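For illustration, the byte-by-byte lowering LLVM falls back to is roughly equivalent to the following sketch (the function name is hypothetical, not from this PR):

```rust
/// Roughly what LLVM's default lowering of a misaligned 64-bit load
/// amounts to on little-endian RISC-V: eight separate byte loads (`lbu`),
/// each shifted into place and OR-ed together.
unsafe fn load_u64_bytewise(ptr: *const u8) -> u64 {
    let mut word = 0u64;
    for i in 0..8 {
        word |= (*ptr.add(i) as u64) << (8 * i);
    }
    word
}
```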
After asking around, it looks like this mess is here to stay, so we have no choice but to work around it. To do that, this PR introduces two separate paths for loading block data: aligned and misaligned. The aligned path should be the most common one. In the misaligned path we have to rely on inline assembly, since we have to load some bits outside of the block.
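As a rough sketch of the misaligned-path idea (names and exact code are illustrative, not the PR's actual implementation): for a misaligned word we load the two aligned doublewords that straddle it and splice them together. The second load can reach past the end of the block, which is UB to express in plain Rust, hence the inline assembly:

```rust
use core::arch::asm;

/// Illustrative sketch: load a `u64` from `ptr`, taking the fast path
/// when `ptr` is 8-byte aligned and splicing two aligned loads otherwise.
unsafe fn load_u64(ptr: *const u8) -> u64 {
    let offset_bytes = ptr as usize % 8;
    if offset_bytes == 0 {
        // Aligned fast path: a single `ld`.
        (ptr as *const u64).read()
    } else {
        // Misaligned path: two aligned `ld`s that straddle the wanted word.
        // The second load may read up to 7 bytes past `ptr + 8`, i.e.
        // potentially outside the block, so it must live in inline asm.
        let base = ((ptr as usize) & !7) as *const u64;
        let (lo, hi): (u64, u64);
        asm!(
            "ld {lo}, 0({base})",
            "ld {hi}, 8({base})",
            base = in(reg) base,
            lo = out(reg) lo,
            hi = out(reg) hi,
            options(nostack, readonly),
        );
        let shift = offset_bytes * 8;
        (lo >> shift) | (hi << (64 - shift))
    }
}
```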
Additionally, this PR makes inlining in the `riscv-zknh` backend less aggressive, which makes the generated binary code 3-4 times smaller at the cost of one additional branch.
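The inlining trade-off can be pictured with a minimal sketch (hypothetical code, not the backend's actual round function): keeping the round function out of line means it is emitted once and called, instead of being duplicated into every caller.

```rust
/// Placeholder standing in for a Zknh-accelerated SHA-2 round; the
/// attribute is the point: one out-of-line copy instead of 64 inlined ones.
#[inline(never)]
fn round(state: &mut [u32; 8], w: u32, k: u32) {
    state[7] = state[7].wrapping_add(w).wrapping_add(k);
    state.rotate_right(1);
}

fn compress(state: &mut [u32; 8], schedule: &[u32; 64], k: &[u32; 64]) {
    for i in 0..64 {
        // A call (one branch) per round, rather than 64 inlined bodies.
        round(state, schedule[i], k[i]);
    }
}
```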
Generated assembly for RV64: