Remove the useless parameters of get_mask, which were misleading
For unit-stride ld/st, perform a dummy memory access to obtain the host address. On a host TLB miss or a page-crossing access, fall back to the slow path. This avoids redundant address translation.
Implement masked ld/st with bit operations instead of branches, which makes the code easier to vectorize
Always prefer bounded loops, which are easier to vectorize
Allow checking the fast vse path against the slow path and the store commit difftest
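
A minimal sketch of the unit-stride fast-path decision described above. This is not the code in this patch: guest_mem, tlb_hit, host_tlb_lookup and fast_unit_stride_load are hypothetical names, and the host TLB is simulated with a two-entry table so the example stays self-contained. It only illustrates the shape of the check: translate once, and give up on a miss or a page crossing.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u
#define PAGE_MASK (PAGE_SIZE - 1)

/* Hypothetical stand-in for guest memory and the host TLB. */
static uint8_t guest_mem[2 * PAGE_SIZE];
static bool tlb_hit[2] = { true, false };  /* page 1 simulates a TLB miss */

/* Returns the host pointer for a guest address, or NULL on a miss. */
static uint8_t *host_tlb_lookup(uint64_t gaddr) {
  uint64_t page = gaddr / PAGE_SIZE;
  if (page >= 2 || !tlb_hit[page]) return NULL;
  return &guest_mem[gaddr];
}

/* True when [gaddr, gaddr + len) stays inside a single page. */
static bool within_one_page(uint64_t gaddr, size_t len) {
  return (gaddr & PAGE_MASK) + len <= PAGE_SIZE;
}

/* Returns true if the fast path handled the load; false tells the caller
 * to fall back to the per-element slow path. */
static bool fast_unit_stride_load(uint8_t *dst, uint64_t gaddr, size_t len) {
  if (len == 0 || !within_one_page(gaddr, len)) return false;
  uint8_t *host = host_tlb_lookup(gaddr);  /* one translation for the whole access */
  if (host == NULL) return false;
  memcpy(dst, host, len);                  /* contiguous copy, easy to vectorize */
  return true;
}
```

A caller would try fast_unit_stride_load first and run the existing element-by-element path only when it returns false.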
This patch speeds up simulation of vectorized h264_sss by 3x, and simulation of its first 2G instructions by 9x.
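
For illustration, a branch-free masked store in a bounded loop might look like the sketch below. masked_store_u8 is a hypothetical helper, not a function from this patch. It assumes byte elements and one mask bit per element, with element i at bit i % 8 of byte i / 8, which is the layout RVV uses for v0.

```c
#include <stddef.h>
#include <stdint.h>

/* Store src[i] to dst[i] for every element whose mask bit is set.
 * The trip count n is known on entry and the body has no branches. */
static void masked_store_u8(uint8_t *dst, const uint8_t *src,
                            const uint8_t *mask_bits, size_t n) {
  for (size_t i = 0; i < n; i++) {
    /* 0x00 when the element is inactive, 0xFF when it is active. */
    uint8_t m = (uint8_t)-((mask_bits[i / 8] >> (i % 8)) & 1);
    /* Bitwise select: keep dst where m == 0x00, take src where m == 0xFF. */
    dst[i] = (uint8_t)((src[i] & m) | (dst[i] & (uint8_t)~m));
  }
}
```

Note that this sketch rewrites inactive elements with their old value, so it only suits a destination range that is already known to be accessible, such as one that has passed the fast-path check.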