I noticed this only recently: for the innermost `for` loop, a common performance practice is to split the iteration into two parts: 1) the border and 2) the inner area. The benefits are that 1) we can drop the unnecessary bounds checks for the inner area, and thus 2) we can enable `@simd` for such simple operations.
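
To make the idea concrete, here is a minimal sketch, assuming a 3-point moving average over a plain 1-based `Vector{Float64}` with replicated borders. The function names `blur3_naive!` and `blur3_split!` are hypothetical and only for illustration; they are not taken from any package.

```julia
# Naive version: every iteration clamps its neighbor indices,
# so the loop body carries border handling and bounds checks.
function blur3_naive!(out::Vector{Float64}, x::Vector{Float64})
    n = length(x)
    length(out) == n || throw(DimensionMismatch("out and x must have the same length"))
    for i in 1:n
        l = x[max(i - 1, 1)]   # replicate the left border
        r = x[min(i + 1, n)]   # replicate the right border
        out[i] = (l + x[i] + r) / 3
    end
    return out
end

# Split version: border elements are handled separately, so the
# inner loop has no clamping and every index is known to be valid.
function blur3_split!(out::Vector{Float64}, x::Vector{Float64})
    n = length(x)
    length(out) == n || throw(DimensionMismatch("out and x must have the same length"))
    n == 0 && return out

    # 1) border: the first and last elements use clamped neighbors
    out[1] = (x[1] + x[1] + x[min(2, n)]) / 3
    out[n] = (x[max(n - 1, 1)] + x[n] + x[n]) / 3

    # 2) inner area: i-1 and i+1 are always in 1:n here, so it is
    #    safe to skip bounds checks and let the compiler vectorize
    @inbounds @simd for i in 2:n-1
        out[i] = (x[i-1] + x[i] + x[i+1]) / 3
    end
    return out
end

x = [1.0, 2.0, 4.0, 8.0, 16.0]
blur3_split!(similar(x), x) ≈ blur3_naive!(similar(x), x)  # both versions should agree
```

Restricting the arguments to `Vector{Float64}` is a deliberate choice in this sketch: `@inbounds` is only safe because the index range `2:n-1` is known to be valid for 1-based arrays. Whether the split version is actually faster depends on the operation and the array size, so it is worth benchmarking in your own case.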
As mentioned here by @johnnychen94: