ahojukka5 opened this issue 7 years ago
Interesting findings:
```julia
function get_integration_points_test_1(element::Element{Tet10}, ::Type{Val{3}})
    weights = (-2.0/15.0, 3.0/40.0, 3.0/40.0, 3.0/40.0, 3.0/40.0)
    coords = (
        (1.0/4.0, 1.0/4.0, 1.0/4.0),
        (1.0/6.0, 1.0/6.0, 1.0/6.0),
        (1.0/6.0, 1.0/6.0, 1.0/2.0),
        (1.0/6.0, 1.0/2.0, 1.0/6.0),
        (1.0/2.0, 1.0/6.0, 1.0/6.0))
    return zip(weights, coords)
end
```
```
27-Mar 10:36:08:INFO:root:1 problems, 200000 elements/problem -> 200000 operations in 27.406 seconds, 7297.772 operations per second.
27-Mar 10:36:10:INFO:root:Threading ON, using 16 threads.
27-Mar 10:36:12:INFO:root:Using 16 threads and 1 workers: 1 problems, 200000 elements/problem -> 200000 operations in 2.104 seconds, 95071.725 operations per second.
27-Mar 10:36:12:INFO:root:Speedup: 13.02749921459311 x
```
```julia
function get_integration_points_test_2(element::Element{Tet10}, ::Type{Val{3}})
    weights = [-2.0/15.0, 3.0/40.0, 3.0/40.0, 3.0/40.0, 3.0/40.0]
    coords = Vector{Float64}[
        [1.0/4.0, 1.0/4.0, 1.0/4.0],
        [1.0/6.0, 1.0/6.0, 1.0/6.0],
        [1.0/6.0, 1.0/6.0, 1.0/2.0],
        [1.0/6.0, 1.0/2.0, 1.0/6.0],
        [1.0/2.0, 1.0/6.0, 1.0/6.0]]
    return zip(weights, coords)
end
```
```
27-Mar 10:39:07:INFO:root:1 problems, 200000 elements/problem -> 200000 operations in 27.088 seconds, 7383.312 operations per second.
27-Mar 10:39:08:INFO:root:Threading ON, using 16 threads.
27-Mar 10:39:12:INFO:root:Using 16 threads and 1 workers: 1 problems, 200000 elements/problem -> 200000 operations in 3.344 seconds, 59813.263 operations per second.
27-Mar 10:39:12:INFO:root:Speedup: 8.101142401863232 x
```
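The gap between the two variants is consistent with how Julia stores the data: a tuple of tuples is an `isbits` value and involves no heap allocation, while each inner `Vector{Float64}` is a separate heap object whose garbage collection adds contention under threading. A minimal check (illustrative, not from the package):

```julia
# Tuples of floats are isbits: stack-allocated, no GC involvement.
coords_tuple = (
    (1.0/4.0, 1.0/4.0, 1.0/4.0),
    (1.0/6.0, 1.0/6.0, 1.0/6.0))
println(isbitstype(typeof(coords_tuple)))   # true

# Vectors of vectors are heap objects: every inner vector is a separate
# allocation, which creates GC pressure that hurts threaded scaling.
coords_vec = Vector{Float64}[[1.0/4.0, 1.0/4.0, 1.0/4.0],
                             [1.0/6.0, 1.0/6.0, 1.0/6.0]]
println(isbitstype(typeof(coords_vec)))     # false
```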
In the beginning, 1-core assembly takes:
```
──────────────────────────────────────────────────────────────────────────────
                                        Time                  Allocations
                                ──────────────────────  ───────────────────────
       Tot / % measured:            12555s / 99.9%         1664GiB / 100%

Section               ncalls     time   %tot      avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────────────────
assemble problem           1   12348s  98.4%   12348s   1633GiB  98.2%  1633GiB
read mesh from disk        1     119s  0.95%     119s   14.5GiB  0.87%  14.5GiB
create problem             1    79.0s  0.63%    79.0s   16.2GiB  0.97%  16.2GiB
──────────────────────────────────────────────────────────────────────────────
```
Now,
```
──────────────────────────────────────────────────────────────────────────────
                                        Time                  Allocations
                                ──────────────────────  ───────────────────────
       Tot / % measured:              779s / 99.9%          101GiB / 100%

Section               ncalls     time   %tot      avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────────────────
assemble problem           1     607s  78.0%     607s   78.0GiB  77.4%  78.0GiB
read mesh from disk        1     101s  13.0%     101s   13.5GiB  13.4%  13.5GiB
create problem             1    69.8s  8.98%    69.8s   9.30GiB  9.22%  9.30GiB
──────────────────────────────────────────────────────────────────────────────
```
It's 4077 elements/second on 1 core now; the speedup is 20x. Also take a look at the memory allocations: allocation for assembly dropped by a factor of 20.93, while the speedup is 20.34. Coincidence? SparseMatrixCOO
takes three vectors, and for 4950612 elements the needed space is 66.4 GB without taking symmetry into account. So by using symmetry we may gain another 2x in assembly time, saving ~30 GB and ~5 minutes in a 12.5 million dof model.
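A sketch of how the symmetry saving could look with a COO-style triplet structure (illustrative only; `SymCOO`, `add!`, and `to_sparse` are made-up names, not the actual SparseMatrixCOO API): store only the upper triangle during assembly and mirror it when converting to CSC.

```julia
using SparseArrays

# Hypothetical symmetric COO container: three triplet vectors, upper triangle only.
struct SymCOO
    I::Vector{Int}
    J::Vector{Int}
    V::Vector{Float64}
end
SymCOO() = SymCOO(Int[], Int[], Float64[])

function add!(A::SymCOO, i::Int, j::Int, v::Float64)
    # Skip the lower triangle: it is implied by symmetry, halving I/J/V storage.
    if i <= j
        push!(A.I, i); push!(A.J, j); push!(A.V, v)
    end
    return A
end

function to_sparse(A::SymCOO, n::Int)
    S = sparse(A.I, A.J, A.V, n, n)
    # Mirror the strict upper triangle to recover the full symmetric matrix.
    return S + transpose(triu(S, 1))
end
```

For example, assembling the 2x2 stencil `[2 -1; -1 2]` stores only three triplets instead of four, and `to_sparse` reconstructs the full matrix.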
Pre-allocating the whole sparse matrix at the beginning and making assembly more type stable using @code_warntype:
```
──────────────────────────────────────────────────────────────────────────────
                                        Time                  Allocations
                                ──────────────────────  ───────────────────────
       Tot / % measured:              492s / 99.9%          101GiB / 100%

Section               ncalls     time   %tot      avg     alloc   %tot      avg
──────────────────────────────────────────────────────────────────────────────
assemble problem           1     318s  64.6%     318s   77.9GiB  77.3%  77.9GiB
read mesh from disk        1     102s  20.7%     102s   13.5GiB  13.4%  13.5GiB
create problem             1    72.1s  14.7%    72.1s   9.30GiB  9.23%  9.30GiB
──────────────────────────────────────────────────────────────────────────────
```
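One way to realize the pre-allocation (a sketch under stated assumptions: a Tet10 element with 3 dofs per node contributes a 30x30 local matrix) is to `sizehint!` the COO triplet vectors to the expected entry count, so repeated `push!` during assembly never triggers a reallocation:

```julia
# Estimate the number of COO entries: one per local stiffness matrix entry.
dofs_per_element = 10 * 3          # Tet10: 10 nodes x 3 dofs/node (assumption)
nelements = 200_000
nnz_estimate = nelements * dofs_per_element^2

I = Int[]; J = Int[]; V = Float64[]
# Reserve capacity up front; push! then only writes, never grows the buffers.
sizehint!(I, nnz_estimate)
sizehint!(J, nnz_estimate)
sizehint!(V, nnz_estimate)
```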
Test assembly of 100000 elements:

```
3.755976 seconds (900.21 k allocations: 83.201 MB, 2.04% gc time)
```

Still a lot of allocations.
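A minimal illustration of the kind of instability `@code_warntype` catches (hypothetical types, not the package's): a field with an abstract element type makes the accessor's return type `Any`, while a concretely typed field infers cleanly.

```julia
# Field typed Dict{String,Any}: the accessor's return type is inferred as Any,
# which @code_warntype highlights in red.
struct UnstableElement
    fields::Dict{String,Any}
end
get_weights(e::UnstableElement) = e.fields["weights"]

# Concretely typed field: the return type infers as NTuple{5,Float64}.
struct StableElement
    weights::NTuple{5,Float64}
end
get_weights(e::StableElement) = e.weights

# In the REPL:
#   @code_warntype get_weights(StableElement((1.0, 2.0, 3.0, 4.0, 5.0)))
```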
We need to get threading working for the FE assembler. Some preliminary results follow.
Big model: 4189147 nodes, 2475306 elements, i.e. 12.5 million dofs. So we need to assemble maybe 5 million elements in a reasonable time. Simple performance test: create a model with 10000 elements and assemble.
We are not getting the expected scaling. Here is test code to study the problems that arise when using multithreading:
Lessons learned so far:

- When each thread runs its own `operation()`, we get 10-15x scaling.
- When threads share a common data structure inside `operation()`, e.g. one common `problem`, performance is reduced remarkably (common problem: speedup for 16 threads is 2x; separate sub-problems: speedup for 16 threads is 12x).

So basically we need to check the scalability of every elementary operation...
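The 2x-vs-12x difference suggests a pattern for the assembler (a sketch, not the package's implementation; names are illustrative): give each thread its own accumulation buffer and merge once at the end, so the hot loop never writes to shared mutable state.

```julia
using Base.Threads

# Split the element range into one chunk per thread; each chunk fills its own
# buffer, so threads never contend on a shared structure during assembly.
function assemble_threaded(nelements::Int, nchunks::Int = nthreads())
    buffers = [Float64[] for _ in 1:nchunks]
    @threads for c in 1:nchunks
        lo = div((c - 1) * nelements, nchunks) + 1
        hi = div(c * nelements, nchunks)
        for e in lo:hi
            push!(buffers[c], float(e))   # stand-in for the per-element work
        end
    end
    return reduce(vcat, buffers)          # single-threaded merge at the end
end
```

Indexing buffers by the chunk number (rather than `threadid()`) keeps the result deterministic regardless of how tasks are scheduled.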