SCOREC / EnGPar

dynamic load balancing
http://scorec.github.io/EnGPar/
BSD 3-Clause "New" or "Revised" License

Diffusive balancer stops #34

Open kktach opened 4 years ago

kktach commented 4 years ago

This is on branch kk_hg_bfs. The mesh is not in the pumi-meshes repo, but it can be copied from /lore/cwsmith/geometries/pumi-meshes/upright/1.6M/2p. The verbosity in engpar::balance was changed to 2.

Weights of 1.7 are applied to the mesh entities on process 1 to imbalance the mesh during splitting. The diffusive balancer stops after 1 step; the output says it stopped because no vertices were migrated. Why does this happen?

$ ctest -V -R splitAndBalanceMeshEnGPar_1M_2
UpdateCTestConfiguration  from :/lore/tachik/develop/build-engpar-rhel7-cuda/DartConfiguration.tcl
UpdateCTestConfiguration  from :/lore/tachik/develop/build-engpar-rhel7-cuda/DartConfiguration.tcl
Test project /lore/tachik/develop/build-engpar-rhel7-cuda
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 57
    Start 57: splitAndBalanceMeshEnGPar_1M_2

57: Test command: /opt/scorec/spack/install/linux-rhel7-x86_64/gcc-7.3.0/mpich-3.3-diz4f6ieln25ouifyc7ndtqlfksom6nb/bin/mpirun "-np" "2" "./splitAndBalanceMesh" "/lore/tachik/develop/EnGPar/pumi-meshes/upright/upright.dmg" "/lore/tachik/develop/EnGPar/pumi-meshes/upright/1.6M/2p/" "0" "1"
57: Test timeout computed to be: 10000000
57: No protocol specified
57: No protocol specified
57: ENGPAR Git hash ee4af38f7033e0a21be64fe98ff0b0c4b084d37a
57:
57:
57:
57: After Split
57: ENGPAR PARTITION : Empty Parts: 0
57: ENGPAR PARTITION : Disconnected Components: <max,tot> 0.000 0.000
57: ENGPAR PARTITION : Neighbors: <max,min,avg,imb> 1.000 1.000 1.000 1.000
57: ENGPAR PARTITION : Local Vertex: <max,min,avg,imb> 1144060.000 436551.000 790305.500 1.448
57: ENGPAR PARTITION : Total Vertex: <max,min,avg,imb> 1156494.000 448227.000 802360.500 1.441
57: ENGPAR PARTITION : Edges type 0: <max,min,avg,imb> 244036.000 94105.000 169070.500 1.443
57: ENGPAR PARTITION : Edges type 0 Cut: <max,tot> 1926.000 3852.000
57: Split from 2 to 2 parts in 16.4818 seconds
57:
57: ENGPAR Starting diffusion on edge type 0 with imbalances: 1.4476 1.4434
57: ENGPAR Side Tolerance is: 1926
57: ENGPAR   Step took 0.943926 seconds
57: ENGPAR   Imbalances <v, e0, ...>: 1.4372 1.4334
57: ENGPAR     Migrating 8270 vertices took 1.574674 seconds
57: ENGPAR   Step took 0.952459 seconds
57: ENGPAR   Imbalances <v, e0, ...>: 1.4372 1.4334
57: ENGPAR     Migrating 0 vertices took 1.574674 seconds
57: ENGPAR Completed diffusion for edge type 0 in 1 steps and 5.673170 seconds due to nothing was migrated.
57: ENGPAR Starting diffusion on vertices with imbalances: 1.4372 1.4334
57: ENGPAR Side Tolerance is: 2009
57: ENGPAR   Plan was trimmed from 8114 to 8114 vertices
57: ENGPAR   Step took 2.519220 seconds
57: ENGPAR   Imbalances <v, e0, ...>: 1.4269 1.4238
57: ENGPAR     Migrating 8114 vertices took 3.118081 seconds
57: ENGPAR   Plan was trimmed from 0 to 0 vertices
57: ENGPAR   Step took 1.493873 seconds
57: ENGPAR   Imbalances <v, e0, ...>: 1.4269 1.4238
57: ENGPAR     Migrating 0 vertices took 3.118081 seconds
57: ENGPAR Completed diffusion for vertices in 1 steps and 5.920164 seconds due to nothing was migrated.
57: ENGPAR Diffusion completed in 2 iterations in 11.593342 seconds
57: ENGPAR max migration time (s) <total, setup, comm, build> = <3.121249, 0.920186, 0.806553, 1.972501>
57: ENGPAR max migration ratios <setup/total, comm/total, build/total, (setup+comm+build)/total> = <0.294813, 0.258670, 0.632601, 0.999881>
57: ENGPAR min migration ratios <setup/total, comm/total, build/total, (setup+comm+build)/total> = <0.108611, 0.074056, 0.630897, 0.999766>
57: ENGPAR Migration took 3.121249 s, 26.922771% of the total time
57: ENGPAR Planning took 7.760832 s, 66.942141% of the total time
57: ENGPAR Distance Computation (part of Planning) took 3.281604 seconds, 28.305934% of the total time
57:
57: After Balancing
57: ENGPAR PARTITION : Empty Parts: 0
57: ENGPAR PARTITION : Disconnected Components: <max,tot> 0.000 0.000
57: ENGPAR PARTITION : Neighbors: <max,min,avg,imb> 1.000 1.000 1.000 1.000
57: ENGPAR PARTITION : Local Vertex: <max,min,avg,imb> 1127676.000 452935.000 790305.500 1.427
57: ENGPAR PARTITION : Total Vertex: <max,min,avg,imb> 1138026.000 465650.000 801838.000 1.419
57: ENGPAR PARTITION : Edges type 0: <max,min,avg,imb> 240784.000 97453.000 169118.500 1.424
57: ENGPAR PARTITION : Edges type 0 Cut: <max,tot> 2022.000 4044.000
57: Total time to split and balance is: 31.2927 seconds
57:
1/1 Test #57: splitAndBalanceMeshEnGPar_1M_2 ...   Passed   34.85 sec

The following tests passed:
        splitAndBalanceMeshEnGPar_1M_2

100% tests passed, 0 tests failed out of 1

Total Test time (real) =  34.94 sec

diamog commented 4 years ago

I poked around and didn't find anything specific that would cause this. Does this only occur in this split mesh test? Can you check whether it also occurs on the master branch for the exact same test?

Is the current version of the test pushed? I didn't see where the weights are set, but it's been a while since I looked at the EnGPar code, so I could have just missed it.

diamog commented 4 years ago

I've replicated the test on master and found the same behavior there. Digging in, the check on the growth of the part boundary (sideTol) is being triggered, which prevents further balancing. This is mostly due to the heavy initial imbalance and to using only two processes.

In the above commit to master, I added an exception that ignores the side tolerance when running on two processes.
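To illustrate the idea, here is a rough sketch of such a check. It is not the code from the commit: the function name `boundaryAllowsMigration` and its parameters are hypothetical, and both the form of the side tolerance test (boundary size compared against `sideTol`) and the "exactly two parts" condition are assumptions based on the description above.

```cpp
#include <cassert>

// Hypothetical helper, not EnGPar's API: decides whether the part
// boundary size still permits migration in a diffusion step.
//   cutSize  - current size of the part boundary (e.g. max cut edges)
//   sideTol  - allowed boundary size, as in "Side Tolerance is: ..."
//   numParts - number of processes/parts in the run
bool boundaryAllowsMigration(long cutSize, long sideTol, int numParts) {
  assert(numParts >= 1);
  // Assumed form of the exception: with two parts there is only one
  // shared boundary, so the side tolerance is ignored.
  if (numParts == 2) {
    return true;
  }
  // Otherwise (assumed) migration is allowed only while the boundary
  // has not grown past the tolerance.
  return cutSize <= sideTol;
}
```

In the log above, the first side tolerance is 1926, which matches the max cut after the split, and the max cut after balancing is 2022. Under the assumptions in this sketch, a call like `boundaryAllowsMigration(2022, 1926, 4)` would block further migration, while the same values with `numParts == 2` would let the balancer continue.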