charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

megatest and megacon should work for large node counts #1183

Open jcphill opened 7 years ago

jcphill commented 7 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/1183


Some test runtimes appear to scale as O(P^2) or worse, which makes the tests useless on large machines. Total test work should scale as O(P^2) only for small P and then be capped at O(P), so that runtime stays roughly constant as P increases. If this is not possible, skip the test for large P.
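One way to picture the requested cap is the rough sketch below. It is not code from megatest or megacon. The names `peersPerPe`, `totalMessages`, and `maxPeers` are hypothetical, and it assumes an all-to-all style test in which each PE exchanges one message with each of its peers. Below the cap every PE talks to every other PE (O(P^2) messages in total); above it each PE talks to at most `maxPeers` peers, so total work grows as O(P) and per-PE work stays constant.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of megatest: how many peers each PE
// exchanges messages with. maxPeers is an assumed tunable cap.
std::int64_t peersPerPe(std::int64_t numPes, std::int64_t maxPeers) {
  assert(numPes >= 1 && maxPeers >= 0);
  // Small P: talk to all other PEs. Large P: talk to at most maxPeers.
  return std::min(numPes - 1, maxPeers);
}

// Total messages sent by the whole test: O(P^2) while numPes - 1 <= maxPeers,
// O(P) once the cap applies.
std::int64_t totalMessages(std::int64_t numPes, std::int64_t maxPeers) {
  return numPes * peersPerPe(numPes, maxPeers);
}
```

With an assumed cap of 64 peers, a run on 3840 PEs would send 3840 * 64 = 245760 messages instead of the 3840 * 3839 = 14741760 of a full all-to-all, while a run on 8 PEs is unchanged at 56.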

jcphill commented 5 years ago

Original date: 2016-08-24 20:44:15


There are some extremely slow tests even for 64 nodes with +ppn 60:

/home/jphillip/charm/gni-crayxc-persistent-smp-knl-debug/tests/charm++/megatest/pgm
Charm++> Running on Gemini (GNI) with 64 processes
Charm++> static SMSG
Charm++> SMSG memory: 316.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 64,  60 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.7.0-296-g8ce70e0
Warning> using Isomalloc in SMP mode, you may need to run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 1-63:16.15+64+128+192
Charm++> set comm 0 on node 0 to core #0
Charm++> Running on 16 unique compute nodes (256-way SMP).
Megatest is running on 64 nodes 3840 processors.
test 0: initiated [groupring (milind)]
test 0: completed (5.53 sec)
...
test 7: initiated [groupsectiontest (ebohm)]
test 7: completed (52.06 sec)
test 8: initiated [multisectiontest (ebohm)]
test 8: completed (19.97 sec)
...
test 16: initiated [migration (jackie)]
test 16: completed (481.20 sec)
...
test 26: initiated [immediatering (gengbin)]
test 26: completed (2.79 sec)
...
test 30: completed (5.71 sec)
test 31: initiated [multi nodering (milind)]
...
test 37: initiated [multi groupsectiontest (ebohm)]
test 37: completed (513.88 sec)
test 38: initiated [multi multisectiontest (ebohm)]
test 38: completed (118.11 sec)
...
test 45: initiated [multi migration (jackie)]
...job times out...