jorgensd opened 4 months ago
dolfinx::fem::DofMap::collapse() is the problem. In particular, it is compute_reordering_map in build_dofmap_data that is slow (and does not scale well with polynomial order). I think we should turn off data reordering in serial. Running:
from mpi4py import MPI
from dolfinx import mesh, fem, la, log
from basix.ufl import element, mixed_element
import time
# log.set_log_level(log.LogLevel.INFO)
N = 5
P = 4
# Create unit cube mesh with hexahedral cells
start = time.perf_counter()
domain = mesh.create_unit_cube(MPI.COMM_WORLD, N, N, N, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), P)
W_el = mixed_element([Ve, Qe])
W = fem.functionspace(domain, W_el)
print(f"Meshing and creating mixed space: {time.perf_counter()-start}")
start2 = time.perf_counter()
V, WV_map = W.sub(0).collapse()
print(f"Collapsing V: {time.perf_counter()-start2}")
start3 = time.perf_counter()
Q, WQ_map = W.sub(1).collapse()
print(f"Collapsing Q: {time.perf_counter()-start3}")
returns the following in serial:
Meshing and creating mixed space: 0.013051556999926106
Collapsing V: 7.033872703999805
Collapsing Q: 0.0023294779998650483
and the following with 2 processes (each rank prints its own timings):
Meshing and creating mixed space: 0.020997950000037235
Collapsing V: 0.70820857300032
Collapsing Q: 0.003953482000270014
Meshing and creating mixed space: 0.020988049000152387
Collapsing V: 0.7082019069998751
Collapsing Q: 0.003954929999963497
i.e. a 10x speedup! With three processes:
Meshing and creating mixed space: 0.02864133799994306
Collapsing V: 0.34778579499970874
Collapsing Q: 0.003480805999970471
Meshing and creating mixed space: 0.027170727000338957
Collapsing V: 0.34781397200003994
Collapsing Q: 0.0034789139999702456
Meshing and creating mixed space: 0.03079914899990399
Collapsing V: 0.3478091659999336
Collapsing Q: 0.0034698189997470763
I still think it is remarkable that creating the collapsed space is more than 10x more expensive than creating the mixed space itself. Any comments @chrisrichardson @mscroggs @garth-wells?
We shouldn't turn off re-ordering in serial.
Can someone bisect to find out when the regression was introduced? @chrisrichardson could it be related to recent changes for mixed topology?
It is prior to the 0.7 release. I haven't tested it with 0.6 yet.
I don't think mixed topology changes would affect this.
list_timings() reveals:
GPS: create_level_structure | 4351 0.001285 5.590000
Gibbs-Poole-Stockmeyer ordering | 2 2.885000 5.770000
which seems strange. Looks like some kind of internal issue with the reordering. GPS is only called twice, but builds the level structure thousands of times.
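(Aside, for anyone reproducing this: the table above comes from DOLFINx's built-in timers. A minimal sketch of how to print it from Python, assuming the dolfinx.common.list_timings signature of the 0.7/0.8 series, which has changed between releases:)

from mpi4py import MPI
from dolfinx import common

# ... run the collapse benchmark from the MWE above first ...

# Print the internal wall-clock timers; the output contains the
# "Gibbs-Poole-Stockmeyer ordering" and "GPS: create_level_structure" rows.
common.list_timings(MPI.COMM_WORLD, [common.TimingType.wall])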
It is present in 0.6 as well.
P=3:
Meshing and creating mixed space: 0.5712962300000015
Collapsing V: 0.7795458059999874
Collapsing Q: 0.0005138619999911498
P=4:
Meshing and creating mixed space: 0.6977521319999767
Collapsing V: 7.012910768000012
Collapsing Q: 0.0005572910000068987
I added a line in DofMap.cpp:
if (!reorder_fn)
{
reorder_fn = [](const graph::AdjacencyList<std::int32_t>& g)
{
double nnz_per_row
= g.offsets().back() / (double)(g.offsets().size() - 1);
spdlog::info("DofMap collapse: reordering graph with {} nnz per row",
nnz_per_row);
return graph::reorder_gps(g);
};
}
[2024-06-26 11:30:44.174] [info] DofMap collapse: reordering graph with 106.171875 nnz per row
so it is quite a dense graph.
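As a rough sanity check of my own (not taken from the log above): the number of Lagrange nodes coupled within a single hexahedral cell grows cubically with the degree, so a connectivity graph at P=4 is bound to have on the order of a hundred entries per row.

# Back-of-envelope: a scalar degree-P Lagrange element on a hexahedron has
# (P + 1)**3 nodes, so a graph built from cell connectivity couples each node
# to roughly that many neighbours.
for P in range(1, 5):
    nodes_per_cell = (P + 1) ** 3
    print(f"P = {P}: {nodes_per_cell} nodes per hex cell")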
@jorgensd Re: the speedup in parallel - you should do a weak scaling study (e.g. double the mesh size with -np 2). I think you will then find it is not much different.
We probably build a sparsity graph for the dofs, which isn't a good use of compute effort for high-order elements. We should probably create a graph of vertices, facets, etc, and then lay the dofs for each entity 'on top'.
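For what it's worth, a weak-scaling version of the MWE could look like the sketch below; this is my own illustration (assuming "weak scaling" here means keeping the number of cells per rank roughly constant by growing N with the cube root of the process count), not code from the thread:

from mpi4py import MPI
from dolfinx import mesh, fem
from basix.ufl import element, mixed_element
import time

comm = MPI.COMM_WORLD
N0, P = 5, 4
# Grow the mesh with the communicator size so each rank keeps roughly N0**3 cells.
N = round(N0 * comm.size ** (1.0 / 3.0))

domain = mesh.create_unit_cube(comm, N, N, N, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), P)
W = fem.functionspace(domain, mixed_element([Ve, Qe]))

start = time.perf_counter()
V, _ = W.sub(0).collapse()
print(f"rank {comm.rank}: N = {N}, collapsing V: {time.perf_counter() - start:.3f} s")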
> @jorgensd Re: the speedup in parallel - you should do a weak scaling study (e.g. double the mesh size with -np 2). I think you will then find it is not much different.

As in not improving further?

> We probably build a sparsity graph for the dofs, which isn't a good use of compute effort for high-order elements. We should probably create a graph of vertices, facets, etc, and then lay the dofs for each entity 'on top'.
I think one of the main things could also be the block size, as you can see that an unblocked (scalar) element is orders of magnitude faster than its vector counterpart.
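One way to probe the block-size hypothesis (a sketch of my own, not from the thread) would be to time building the same vector space directly, where the dofmap is laid out blockwise, against collapsing it out of the mixed space:

from mpi4py import MPI
from dolfinx import mesh, fem
from basix.ufl import element, mixed_element
import time

domain = mesh.create_unit_cube(MPI.COMM_WORLD, 5, 5, 5, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), 4, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), 4)
W = fem.functionspace(domain, mixed_element([Ve, Qe]))

# Standalone blocked space (the hypothesis: dofmap work here is per node/block).
start = time.perf_counter()
V_direct = fem.functionspace(domain, Ve)
print(f"Direct vector space:       {time.perf_counter() - start:.3f} s")

# Collapsed subspace: the same dofs, extracted and renumbered from the mixed dofmap.
start = time.perf_counter()
V_collapsed, _ = W.sub(0).collapse()
print(f"Collapsed vector subspace: {time.perf_counter() - start:.3f} s")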
Summarize the issue
Collapsing a blocked subspace is orders of magnitude slower than collapsing the scalar version, and the cost increases dramatically with the polynomial order.
First reported at: https://fenicsproject.discourse.group/t/collapsing-mixed-space-takes-unreasonably-long/15071
How to reproduce the bug
Run MWE below
Minimal Example (Python)
Output (Python)
Version
main branch
DOLFINx git commit
9f33dabc270b8e3c31e815515ecac5eb2a5414b0
Installation
Docker, using ghcr.io/fenics/dolfinx/dolfinx:nightly
Additional information
No response