aot compiler: use larger alignment for load/store when possible

consider the following wasm module:

(module
  (func (export "foo")
    i32.const 0x104
    i32.const 0x12345678
    i32.store
  )
  (memory 1 1)
)

the address (0x104) is 4-byte aligned, which is ideal for i32.store. however, because our aot compiler always uses 1-byte alignment for load/store LLVM IR instructions, LLVM often produces inefficient machine code, especially for alignment-sensitive targets.

for example, the above "foo" function is compiled into the following xtensa machine code.

0000002c <aot_func_internal#0>:
  2c:   004136          entry   a1, 32
  2f:   07a182          movi    a8, 0x107
  32:   828a            add.n   a8, a2, a8
  34:   291c            movi.n  a9, 18
  36:   004892          s8i     a9, a8, 0
  39:   06a182          movi    a8, 0x106
  3c:   828a            add.n   a8, a2, a8
  3e:   ffff91          l32r    a9, 3c <aot_func_internal#0+0x10> (ff91828a <aot_func_internal#0+0xff91825e>)
                        3e: R_XTENSA_SLOT0_OP   .literal+0x8
  41:   004892          s8i     a9, a8, 0
  44:   05a182          movi    a8, 0x105
  47:   828a            add.n   a8, a2, a8
  49:   ffff91          l32r    a9, 48 <aot_func_internal#0+0x1c> (ffff9182 <aot_func_internal#0+0xffff9156>)
                        49: R_XTENSA_SLOT0_OP   .literal+0xc
  4c:   41a890          srli    a10, a9, 8
  4f:   0048a2          s8i     a10, a8, 0
  52:   04a182          movi    a8, 0x104
  55:   828a            add.n   a8, a2, a8
  57:   004892          s8i     a9, a8, 0
  5a:   f01d            retw.n

note that each of the four bytes is stored separately using the one-byte store instruction, s8i.
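
in C terms, the difference between a 1-byte-aligned store and a 4-byte-aligned store looks roughly like the sketch below. this is only an illustration, not code from the aot compiler: the function names are made up, and `__builtin_assume_aligned` is a GCC/Clang builtin used here to stand in for the alignment attribute on an LLVM IR store.

```c
#include <stdint.h>
#include <string.h>

/* roughly what an align-1 i32.store turns into on a target without
 * unaligned access: four separate byte stores, lowest byte first
 * (wasm linear memory is little-endian). */
static void store_i32_align1(uint8_t *mem, uint32_t addr, uint32_t value)
{
    mem[addr + 0] = (uint8_t)(value);
    mem[addr + 1] = (uint8_t)(value >> 8);
    mem[addr + 2] = (uint8_t)(value >> 16);
    mem[addr + 3] = (uint8_t)(value >> 24);
}

/* once the compiler is told the address is 4-byte aligned, the copy can
 * become a single word store. assumes mem itself is 4-byte aligned and
 * addr is a multiple of 4; the byte order written is the host's. */
static void store_i32_align4(uint8_t *mem, uint32_t addr, uint32_t value)
{
    void *p = __builtin_assume_aligned(mem + addr, 4);
    memcpy(p, &value, sizeof(value));
}
```

on a little-endian target both functions leave the same bytes in memory; only the number of store instructions differs.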

this commit tries to use larger alignments for load/store LLVM IR instructions when possible. with this commit, the above example is compiled into the following machine code, which seems more reasonable to me.

0000002c <aot_func_internal#0>:
  2c:   004136          entry   a1, 32
  2f:   ffff81          l32r    a8, 2c <aot_func_internal#0> (81004136 <aot_func_internal#0+0x8100410a>)
                        2f: R_XTENSA_SLOT0_OP   .literal+0x8
  32:   416282          s32i    a8, a2, 0x104
  35:   f01d            retw.n
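
as a rough sketch of the idea (not the actual patch), the alignment that can be claimed for a constant effective address might be computed as below. `known_alignment` is a hypothetical helper name, and the sketch assumes that the base of the linear memory is itself aligned to at least the access size and that the access size is a power of two.

```c
#include <stdint.h>

/* hypothetical helper: return the largest power-of-two alignment, in
 * bytes and capped at access_size, that a constant effective address
 * (base operand plus static offset) is known to satisfy.
 * assumes access_size is a power of two (1, 2, 4, 8, ...). */
static uint32_t known_alignment(uint64_t const_addr, uint32_t access_size)
{
    uint32_t align = access_size;

    /* halve the candidate until it divides the address */
    while (align > 1 && (const_addr & (align - 1)) != 0)
        align >>= 1;
    return align;
}
```

for the example module, known_alignment(0x104, 4) is 4, so the store can carry 4-byte alignment; an address such as 0x106 would only allow 2, and 0x107 would fall back to 1.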

Note: this doesn't work well for --xip because aot_load_const_from_table() hides the constness of the value. maybe we need our own mechanism to propagate the constness and the value.

bytecodealliance/wasm-micro-runtime#3552