haampie opened this issue 6 years ago
I agree. To do this, we should support both built-in arrays and static arrays. It would be very interesting to measure and compare the performance of the functions.
The only problem is that static arrays are immutable, so the eval_basis! functions cannot mutate them. I didn't really look at how eval_basis! was being used, so I closed the issue in the end.
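If mutation is off the table, the basis evaluation can simply return a fresh immutable container instead. A minimal sketch, assuming a 4-node quadrilateral; the function name eval_basis_quad4 is illustrative, not JuliaFEM's actual API:

```julia
# A minimal sketch, not JuliaFEM's API: a non-mutating basis evaluation that
# returns a fresh immutable tuple instead of writing into a preallocated array.
# Shape functions of a 4-node quadrilateral in local coordinates (ξ, η).
function eval_basis_quad4(xi)
    u, v = xi
    return (0.25*(1 - u)*(1 - v),
            0.25*(1 + u)*(1 - v),
            0.25*(1 + u)*(1 + v),
            0.25*(1 - u)*(1 + v))
end

eval_basis_quad4((0.0, 0.0))  # (0.25, 0.25, 0.25, 0.25) at the element centroid
```

Since the tuple is built on the stack, nothing is allocated, and the compiler can often keep the whole evaluation in registers.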
Yes, but it's not a big problem: we can also generate non-mutating eval_basis and eval_dbasis versions; the question is the performance penalty. I actually tried to generate tuple versions, but for some reason didn't get a good speedup and then forgot the whole thing; maybe StaticArrays.jl would do the job better. What we need in practice is to calculate the determinant of the Jacobian and the gradient of some quantity (displacement, temperature) with respect to X. I have been trying to optimize the performance of these operations here. I'm quite sure that we can easily get more speed out of this, but I need some help to make it better. Here is how this is used in other code at the moment:
https://github.com/JuliaFEM/JuliaFEM.jl/blob/master/src/problems_elasticity.jl#L140
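The determinant-of-Jacobian and gradient-with-respect-to-X operations can be sketched with StaticArrays.jl roughly as follows; the element type (quad4), function names, and data layout here are assumptions for illustration, not JuliaFEM's actual code:

```julia
using StaticArrays, LinearAlgebra

# A sketch, not JuliaFEM's actual API: for a 4-node quadrilateral, build the
# Jacobian J = ∂X/∂ξ from the nodal coordinates X and the local basis
# gradients, then get det(J) and the gradient of a nodal field u w.r.t. X.
function dbasis_quad4(xi)
    u, v = xi
    # column i holds ∂N_i/∂(ξ, η)
    return @SMatrix [-0.25*(1-v)  0.25*(1-v)  0.25*(1+v)  -0.25*(1+v);
                     -0.25*(1-u) -0.25*(1+u)  0.25*(1+u)   0.25*(1-u)]
end

function detJ_and_grad(X::SMatrix{2,4}, u::SVector{4}, xi)
    dN = dbasis_quad4(xi)
    J  = X * dN'                        # 2×2 Jacobian, no heap allocation
    return det(J), inv(J)' * (dN * u)   # (det J, ∂u/∂X)
end

# Unit-square element with the nodal values of u(x, y) = x:
X  = @SMatrix [0.0 1.0 1.0 0.0;
               0.0 0.0 1.0 1.0]
uh = SVector(0.0, 1.0, 1.0, 0.0)
d, g = detJ_and_grad(X, uh, SVector(0.0, 0.0))  # d == 0.25, g ≈ (1, 0)
```

Because all sizes are known at compile time, det and inv on the 2×2 SMatrix unroll to a handful of scalar operations with zero allocations.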
I think using https://github.com/KristofferC/Tensors.jl/blob/master/src/Tensors.jl would be better for this, since it comes with many convenience functions that you want in FEM. It also has real support for operations between fourth-order and second-order tensors and avoids using e.g. Voigt notation. I can give it a try.
Tensors also tend to have better performance.
Scalar product between two rank-2 tensors:
# Tensors.jl
julia> @btime $t2 ⋅ $t2
4.397 ns (0 allocations: 0 bytes)
# StaticArrays.jl
julia> @btime $s2 * $s2
10.063 ns (0 allocations: 0 bytes)
4th-order tensor double-contracted with a 2nd-order tensor:
# Tensors.jl
julia> @btime $t4 ⊡ $t2
9.080 ns (0 allocations: 0 bytes)
# StaticArrays.jl
julia> @btime $s4 * $s2
11.577 ns (0 allocations: 0 bytes)
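For anyone who wants to reproduce the numbers above, the setup looks roughly like this (a sketch assuming Tensors.jl, StaticArrays.jl, and BenchmarkTools.jl are installed; timings vary by machine):

```julia
using Tensors, StaticArrays, BenchmarkTools

# Illustrative setup for benchmarks like the ones quoted above
# (variable names t2, t4, s2 match the transcript).
t2 = rand(Tensor{2,3})            # second-order tensor, 3×3
t4 = rand(Tensor{4,3})            # fourth-order tensor, 3×3×3×3
s2 = rand(SMatrix{3,3,Float64})   # static 3×3 matrix

@btime $t2 ⋅ $t2    # single contraction, i.e. the matrix product
@btime $s2 * $s2
@btime $t4 ⊡ $t2    # double contraction

# Both libraries compute the same underlying linear algebra:
Array(t2 ⋅ t2) ≈ Array(t2) * Array(t2)
```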
How did you achieve this exactly? Shouldn't StaticArrays.jl be on par?
For example:
Tensors.jl
julia> @code_native t2 ⋅ t2
.section __TEXT,__text,regular,pure_instructions
vmovupd (%rsi), %xmm0
vmovsd 16(%rsi), %xmm1 ## xmm1 = mem[0],zero
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vmovupd 24(%rsi), %xmm1
vmovsd 40(%rsi), %xmm2 ## xmm2 = mem[0],zero
vinsertf128 $1, %xmm2, %ymm1, %ymm1
vmovupd 48(%rsi), %xmm2
vmovsd 64(%rsi), %xmm3 ## xmm3 = mem[0],zero
vinsertf128 $1, %xmm3, %ymm2, %ymm2
vbroadcastsd (%rdx), %ymm3
vmulpd %ymm3, %ymm0, %ymm3
vbroadcastsd 8(%rdx), %ymm4
vfmadd213pd %ymm3, %ymm1, %ymm4
vbroadcastsd 16(%rdx), %ymm3
vfmadd213pd %ymm4, %ymm2, %ymm3
vbroadcastsd 24(%rdx), %ymm4
vmulpd %ymm4, %ymm0, %ymm4
vbroadcastsd 32(%rdx), %ymm5
vfmadd213pd %ymm4, %ymm1, %ymm5
vbroadcastsd 40(%rdx), %ymm4
vfmadd213pd %ymm5, %ymm2, %ymm4
vbroadcastsd 48(%rdx), %ymm5
vmulpd %ymm5, %ymm0, %ymm0
vbroadcastsd 56(%rdx), %ymm5
vfmadd213pd %ymm0, %ymm1, %ymm5
vbroadcastsd 64(%rdx), %ymm0
vfmadd213pd %ymm5, %ymm2, %ymm0
vbroadcastsd %xmm4, %ymm1
vblendpd $8, %ymm1, %ymm3, %ymm1 ## ymm1 = ymm3[0,1,2],ymm1[3]
vmovupd %ymm1, (%rdi)
vextractf128 $1, %ymm0, %xmm1
vinsertf128 $1, %xmm0, %ymm0, %ymm0
vpermpd $233, %ymm4, %ymm2 ## ymm2 = ymm4[1,2,2,3]
vblendpd $12, %ymm0, %ymm2, %ymm0 ## ymm0 = ymm2[0,1],ymm0[2,3]
vmovupd %ymm0, 32(%rdi)
vmovlpd %xmm1, 64(%rdi)
movq %rdi, %rax
vzeroupper
retq
nopw %cs:(%rax,%rax)
StaticArrays.jl:
julia> @code_native s2 * s2
.section __TEXT,__text,regular,pure_instructions
vmovsd (%rdx), %xmm0 ## xmm0 = mem[0],zero
vmovsd 8(%rdx), %xmm7 ## xmm7 = mem[0],zero
vmovsd 16(%rdx), %xmm6 ## xmm6 = mem[0],zero
vmovsd 16(%rsi), %xmm11 ## xmm11 = mem[0],zero
vmovsd 40(%rsi), %xmm12 ## xmm12 = mem[0],zero
vmovsd 64(%rsi), %xmm9 ## xmm9 = mem[0],zero
vmovsd 24(%rdx), %xmm3 ## xmm3 = mem[0],zero
vmovupd (%rsi), %xmm4
vmovupd 8(%rsi), %xmm10
vmovhpd (%rsi), %xmm11, %xmm5 ## xmm5 = xmm11[0],mem[0]
vinsertf128 $1, %xmm5, %ymm4, %ymm5
vunpcklpd %xmm3, %xmm0, %xmm1 ## xmm1 = xmm0[0],xmm3[0]
vmovddup %xmm0, %xmm0 ## xmm0 = xmm0[0,0]
vinsertf128 $1, %xmm1, %ymm0, %ymm0
vmulpd %ymm0, %ymm5, %ymm0
vmovsd 32(%rdx), %xmm5 ## xmm5 = mem[0],zero
vmovupd 24(%rsi), %xmm8
vmovhpd 24(%rsi), %xmm12, %xmm1 ## xmm1 = xmm12[0],mem[0]
vinsertf128 $1, %xmm1, %ymm8, %ymm1
vunpcklpd %xmm5, %xmm7, %xmm2 ## xmm2 = xmm7[0],xmm5[0]
vmovddup %xmm7, %xmm7 ## xmm7 = xmm7[0,0]
vinsertf128 $1, %xmm2, %ymm7, %ymm2
vmulpd %ymm2, %ymm1, %ymm1
vaddpd %ymm1, %ymm0, %ymm1
vmovsd 40(%rdx), %xmm0 ## xmm0 = mem[0],zero
vmovhpd 48(%rsi), %xmm9, %xmm2 ## xmm2 = xmm9[0],mem[0]
vmovupd 48(%rsi), %xmm13
vinsertf128 $1, %xmm2, %ymm13, %ymm2
vunpcklpd %xmm0, %xmm6, %xmm7 ## xmm7 = xmm6[0],xmm0[0]
vmovddup %xmm6, %xmm6 ## xmm6 = xmm6[0,0]
vinsertf128 $1, %xmm7, %ymm6, %ymm6
vmulpd %ymm6, %ymm2, %ymm2
vaddpd %ymm2, %ymm1, %ymm14
vmovsd 48(%rdx), %xmm1 ## xmm1 = mem[0],zero
vmovsd 56(%rdx), %xmm2 ## xmm2 = mem[0],zero
vmovsd 64(%rdx), %xmm6 ## xmm6 = mem[0],zero
vinsertf128 $1, %xmm4, %ymm10, %ymm4
vmovddup %xmm3, %xmm3 ## xmm3 = xmm3[0,0]
vmovddup %xmm1, %xmm7 ## xmm7 = xmm1[0,0]
vinsertf128 $1, %xmm7, %ymm3, %ymm3
vmulpd %ymm3, %ymm4, %ymm3
vmovupd 32(%rsi), %xmm4
vinsertf128 $1, %xmm8, %ymm4, %ymm4
vmovddup %xmm5, %xmm5 ## xmm5 = xmm5[0,0]
vmovddup %xmm2, %xmm7 ## xmm7 = xmm2[0,0]
vinsertf128 $1, %xmm7, %ymm5, %ymm5
vmulpd %ymm5, %ymm4, %ymm4
vaddpd %ymm4, %ymm3, %ymm3
vmovupd 56(%rsi), %xmm4
vinsertf128 $1, %xmm13, %ymm4, %ymm4
vmovddup %xmm0, %xmm0 ## xmm0 = xmm0[0,0]
vmovddup %xmm6, %xmm5 ## xmm5 = xmm6[0,0]
vinsertf128 $1, %xmm5, %ymm0, %ymm0
vmulpd %ymm0, %ymm4, %ymm0
vaddpd %ymm0, %ymm3, %ymm0
vmulsd %xmm1, %xmm11, %xmm1
vmulsd %xmm2, %xmm12, %xmm2
vaddsd %xmm2, %xmm1, %xmm1
vmulsd %xmm6, %xmm9, %xmm2
vaddsd %xmm2, %xmm1, %xmm1
vmovupd %ymm14, (%rdi)
vmovupd %ymm0, 32(%rdi)
vmovsd %xmm1, 64(%rdi)
movq %rdi, %rax
vzeroupper
retq
nop
Perhaps StaticArrays should use muladds?
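For context, muladd(a, b, c) computes a*b + c while permitting the compiler to fuse the multiply and add into a single FMA instruction (the vfmadd213pd calls in the first listing). A small illustration:

```julia
# muladd(a, b, c) computes a*b + c and permits the compiler to fuse the
# multiply and add into one FMA instruction where the hardware supports it.
dot3(a, b) = muladd(a[1], b[1], muladd(a[2], b[2], a[3] * b[3]))

dot3((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))  # 32.0
```

Unlike fma, muladd only allows fusion rather than requiring it, so it never falls back to a slow software FMA on hardware without the instruction.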
I like the idea of using Tensors.jl. The source code should be as close as possible to the mathematical equations found in scientific articles. Then domain-specific researchers can understand the source code as easily as possible.
It might be useful to use non-allocating static arrays all over the place.
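As a small illustration of why: arithmetic on SVectors stays on the stack, so a hot loop over small fixed-size vectors need not allocate at all. A sketch (the function name is illustrative):

```julia
using StaticArrays

# Hedged illustration: SVector arithmetic is stack-allocated, so this loop
# performs no heap allocation after compilation.
function accumulate_static(n)
    acc = @SVector zeros(3)
    for i in 1:n
        acc += SVector(1.0, 2.0, 3.0) * i
    end
    return acc
end

accumulate_static(10)  # SVector(55.0, 110.0, 165.0)
```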