FEniCS / dolfinx

Next generation FEniCS problem solving environment
https://fenicsproject.org
GNU Lesser General Public License v3.0

Performance regression in collapsing blocked subspace #3282

Open. jorgensd opened this issue 4 months ago

jorgensd commented 4 months ago

Summarize the issue

Collapsing a blocked subspace is orders of magnitude slower than collapsing the scalar version, and the cost increases dramatically with the polynomial order.

First reported at: https://fenicsproject.discourse.group/t/collapsing-mixed-space-takes-unreasonably-long/15071

How to reproduce the bug

Run the MWE below.

Minimal Example (Python)

from mpi4py import MPI

from dolfinx import mesh, fem
from basix.ufl import element, mixed_element
import time

N = 5
P = 3

# Create a unit cube mesh with hexahedral cells
start = time.perf_counter()
domain = mesh.create_unit_cube(MPI.COMM_WORLD, N, N, N, mesh.CellType.hexahedron)
# Vector-valued (blocked) and scalar Lagrange elements of degree P
Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), P)
W_el = mixed_element([Ve, Qe])
W = fem.functionspace(domain, W_el)
print(f"Meshing and creating mixed space: {time.perf_counter()-start}")
start2 = time.perf_counter()

V, WV_map = W.sub(0).collapse()  # Collapse the blocked (vector) subspace
print(f"Collapsing V: {time.perf_counter()-start2}")
start3 = time.perf_counter()

Q, WQ_map = W.sub(1).collapse()  # Collapse the scalar subspace
print(f"Collapsing Q: {time.perf_counter()-start3}")

Output (Python)

Meshing and creating mixed space: 0.00762048000001414
Collapsing V: 0.8559996170000659
Collapsing Q: 0.0018857679999655375

Version

main branch

DOLFINx git commit

9f33dabc270b8e3c31e815515ecac5eb2a5414b0

Installation

Docker using: ghcr.io/fenics/dolfinx/dolfinx:nightly

Additional information

No response

jorgensd commented 4 months ago

dolfinx::fem::DofMap::collapse() is the problem. Specifically, it is compute_reordering_map in build_dofmap_data that is slow (and does not scale with polynomial order).
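
To make the growth with polynomial order concrete, a minimal sweep over P on a fixed mesh (a sketch using the same API as the MWE above) could look like:

from mpi4py import MPI

from dolfinx import mesh, fem
from basix.ufl import element, mixed_element
import time

# Fixed 5x5x5 hexahedral mesh; only the polynomial order P varies
domain = mesh.create_unit_cube(MPI.COMM_WORLD, 5, 5, 5, mesh.CellType.hexahedron)
for P in range(1, 5):
    Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
    Qe = element("Lagrange", domain.basix_cell(), P)
    W = fem.functionspace(domain, mixed_element([Ve, Qe]))
    start = time.perf_counter()
    W.sub(0).collapse()
    print(f"P={P}: collapsing V took {time.perf_counter() - start:.3f} s")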

jorgensd commented 4 months ago

I think we should turn off data reordering in serial. Running:

from mpi4py import MPI

from dolfinx import mesh, fem, la, log
from basix.ufl import element, mixed_element
import time

# log.set_log_level(log.LogLevel.INFO)
N = 5
P = 4

# Create a unit cube mesh with hexahedral cells
start = time.perf_counter()
domain = mesh.create_unit_cube(MPI.COMM_WORLD, N, N, N, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), P)
W_el = mixed_element([Ve, Qe])
W = fem.functionspace(domain, W_el)
print(f"Meshing and creating mixed space: {time.perf_counter()-start}")
start2 = time.perf_counter()
V, WV_map = W.sub(0).collapse()
print(f"Collapsing V: {time.perf_counter()-start2}")

start3 = time.perf_counter()

Q, WQ_map = W.sub(1).collapse()
print(f"Collapsing Q: {time.perf_counter()-start3}")

returns the following in serial:

Meshing and creating mixed space: 0.013051556999926106
Collapsing V: 7.033872703999805
Collapsing Q: 0.0023294779998650483

and

Meshing and creating mixed space: 0.020997950000037235
Collapsing V: 0.70820857300032
Collapsing Q: 0.003953482000270014
Meshing and creating mixed space: 0.020988049000152387
Collapsing V: 0.7082019069998751
Collapsing Q: 0.003954929999963497

with 2 processes, i.e. a ~10x speedup! With three processes:

Meshing and creating mixed space: 0.02864133799994306
Collapsing V: 0.34778579499970874
Collapsing Q: 0.003480805999970471
Meshing and creating mixed space: 0.027170727000338957
Collapsing V: 0.34781397200003994
Collapsing Q: 0.0034789139999702456
Meshing and creating mixed space: 0.03079914899990399
Collapsing V: 0.3478091659999336
Collapsing Q: 0.0034698189997470763

I still think it is remarkable that creating the collapsed space is more than 10x as expensive as creating the mixed space itself. Any comments @chrisrichardson @mscroggs @garth-wells?

garth-wells commented 4 months ago

We shouldn't turn off re-ordering in serial.

Can someone bisect to find out when the regression was introduced? @chrisrichardson could it be related to recent changes for mixed topology?

jorgensd commented 4 months ago

We shouldn't turn off re-ordering in serial.

Can someone bisect to find out when the regression was introduced? @chrisrichardson could it be related to recent changes for mixed topology?

It is prior to the 0.7 release. Haven't tested 0.6 yet.

chrisrichardson commented 4 months ago

I don't think mixed topology changes would affect this. list_timings() reveals:

GPS: create_level_structure                                                 |  4351  0.001285  5.590000
Gibbs-Poole-Stockmeyer ordering                                             |     2  2.885000  5.770000

which seems strange. Looks like some kind of internal issue with the reordering. GPS is only called twice, but builds the level structure thousands of times.
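
A table like this can be produced from Python; a minimal sketch, assuming the 0.8.x-era API where dolfinx.common.list_timings takes a communicator and a list of TimingType values:

from mpi4py import MPI

from dolfinx import common, fem, mesh
from basix.ufl import element, mixed_element

domain = mesh.create_unit_cube(MPI.COMM_WORLD, 5, 5, 5, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), 4, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), 4)
W = fem.functionspace(domain, mixed_element([Ve, Qe]))
W.sub(0).collapse()

# Print the accumulated wall-clock timers, including the GPS reordering entries
common.list_timings(MPI.COMM_WORLD, [common.TimingType.wall])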

jorgensd commented 4 months ago

It is present in 0.6 as well. P=3:

Meshing and creating mixed space: 0.5712962300000015
Collapsing V: 0.7795458059999874
Collapsing Q: 0.0005138619999911498

P=4:

Meshing and creating mixed space: 0.6977521319999767
Collapsing V: 7.012910768000012
Collapsing Q: 0.0005572910000068987

chrisrichardson commented 4 months ago

I added a logging line in DofMap.cpp:

if (!reorder_fn)
{
  reorder_fn = [](const graph::AdjacencyList<std::int32_t>& g)
  {
    double nnz_per_row
        = g.offsets().back() / (double)(g.offsets().size() - 1);

    spdlog::info("DofMap collapse: reordering graph with {} nnz per row",
                 nnz_per_row);

    return graph::reorder_gps(g);
  };
}

[2024-06-26 11:30:44.174] [info] DofMap collapse: reordering graph with 106.171875 nnz per row

So it is quite a dense graph.

chrisrichardson commented 4 months ago

@jorgensd Re: the speedup in parallel - you should do a weak scaling test (e.g. double the mesh size when running with -np 2). I think you will then find it is not much different.
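
A rough weak-scaling setup along those lines, a sketch assuming the same mesh and elements as the MWE, would grow the mesh with the process count so the number of cells per process stays roughly constant:

from mpi4py import MPI

from dolfinx import mesh, fem
from basix.ufl import element, mixed_element
import time

comm = MPI.COMM_WORLD
base_N, P = 5, 4
# Total cells ~ N^3, so scale N with the cube root of the process count
N = round(base_N * comm.size ** (1.0 / 3.0))

domain = mesh.create_unit_cube(comm, N, N, N, mesh.CellType.hexahedron)
Ve = element("Lagrange", domain.basix_cell(), P, shape=(domain.geometry.dim,))
Qe = element("Lagrange", domain.basix_cell(), P)
W = fem.functionspace(domain, mixed_element([Ve, Qe]))

start = time.perf_counter()
W.sub(0).collapse()
elapsed = comm.allreduce(time.perf_counter() - start, op=MPI.MAX)
if comm.rank == 0:
    print(f"N={N}, procs={comm.size}, collapsing V: {elapsed:.3f} s")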

garth-wells commented 4 months ago

We probably build a sparsity graph for the dofs, which isn't a good use of compute effort for high-order elements. We should probably create a graph of vertices, facets, etc, and then lay the dofs for each entity 'on top'.
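
A toy illustration of that idea (plain numpy, not DOLFINx internals): if every entity carries a fixed block of dofs, a reordering computed on the much smaller entity graph can be expanded to a dof permutation by laying each entity's dof block down in the permuted entity order:

import numpy as np

def expand_entity_permutation(entity_perm, dofs_per_entity):
    """Expand a permutation of entities to a permutation of dofs, assuming the
    dofs of entity e occupy the contiguous block
    [e * dofs_per_entity, (e + 1) * dofs_per_entity)."""
    blocks = entity_perm[:, None] * dofs_per_entity + np.arange(dofs_per_entity)
    return blocks.reshape(-1)

# 4 entities with 3 dofs each, reordered as (2, 0, 3, 1)
print(expand_entity_permutation(np.array([2, 0, 3, 1]), 3))
# [ 6  7  8  0  1  2  9 10 11  3  4  5]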

jorgensd commented 4 months ago

@jorgensd Re: the speedup in parallel - you should do a weak scaling (e.g. double the mesh size with -np 2). I think you will then find it is not much different.

As in not improving further?

jorgensd commented 4 months ago

We probably build a sparsity graph for the dofs, which isn't a good use of compute effort for high-order elements. We should probably create a graph of vertices, facets, etc, and then lay the dofs for each entity 'on top'.

I think one of the main factors could also be the block size, as you can see that an unblocked (scalar) element is orders of magnitude faster than its vector counterpart.
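
One way to probe the block-size hypothesis, a sketch assuming the same API as the MWE, is to replace the vector element with a second scalar element so that sub(0) has block size 1, and compare against the blocked timings above:

from mpi4py import MPI

from dolfinx import mesh, fem
from basix.ufl import element, mixed_element
import time

N, P = 5, 4
domain = mesh.create_unit_cube(MPI.COMM_WORLD, N, N, N, mesh.CellType.hexahedron)
# Two scalar Lagrange elements: sub(0) now has block size 1 instead of 3
Se = element("Lagrange", domain.basix_cell(), P)
W_scalar = fem.functionspace(domain, mixed_element([Se, Se]))

start = time.perf_counter()
W_scalar.sub(0).collapse()
print(f"Collapsing scalar sub(0): {time.perf_counter() - start}")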