NREL / ESIFHPC3

The authoritative collection of benchmarks for NREL's ESIF-HPC-3 system procurement is available here.

[NALU] mesh 256 wl fails on 6144 MPI ranks #6

Open teabagk7 opened 3 years ago

teabagk7 commented 3 years ago

what(): 107: <....>/Trilinos_2/packages/zoltan2/core/src/problems/Zoltan2_PartitioningSolution.hpp,1572
107: error: Value for num_global_parts is different on different processes

Runs on 192, 384, 768, 1536, and 3072 ranks work fine, with no such error.

mesh 512 works on 768, 1536, 3072 and 6144!

chchang6 commented 3 years ago

@teabagk7 Thanks for reporting this, I've forwarded to the developer and will update ASAP.

chchang6 commented 3 years ago

@teabagk7 The 256 mesh, 6144-rank run tests out on our system with the reference commit (see the Nalu README for hashes). We will accept results from different commits, since we recognize how much work is required to generate the results you already have. We suggest building the older version of the code to generate the 256 mesh 6144-rank results.

teabagk7 commented 3 years ago

I've built exactly the same hashes of Trilinos and Nalu. This problem appears only on the mesh 256 test with 96 nodes.
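For reference, 6144 ranks on 96 nodes works out to 64 ranks per node. A quick sketch of that arithmetic follows. The helper name `ranks_per_node` is made up for illustration, and the node counts for the smaller runs assume the same per-node density, which this thread does not state.

```shell
#!/bin/sh
# ranks_per_node (hypothetical helper): integer MPI ranks per node
# for a given total rank count ($1) and node count ($2).
ranks_per_node() {
    echo $(( $1 / $2 ))
}

# The failing case reported here: 6144 ranks on 96 nodes.
rpn=$(ranks_per_node 6144 96)
echo "ranks per node: $rpn"

# ASSUMPTION: the other runs used the same ranks-per-node density.
for ranks in 192 384 768 1536 3072 6144; do
    echo "$ranks ranks -> $(( ranks / rpn )) nodes"
done
```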

chchang6 commented 3 years ago

I didn't mention a Trilinos hash. The two hashes we mention in the README are of Nalu code. Runs at 6144 ranks and the 256 mesh run to completion on our reference hardware.

What Nalu hash are you working with?

teabagk7 commented 3 years ago

Nalu-Wind Version: v1.2.0
Nalu-Wind GIT Commit SHA: c7c3723261cf1eebe73ef969396d08d342a01644-DIRTY
Trilinos Version: 13.1-g53550bee94b
TPLs: Boost, HDF5, netCDF, STK, Trilinos, yaml-cpp and zlib

chchang6 commented 3 years ago

Try Nalu-Wind commit 1d3ee2e62ecdd4745d0339a5bf9c5194a07bc93a for the 256 mesh, 6144-rank test.

gcstoianowski commented 3 years ago

Try Nalu-Wind commit 1d3ee2e62ecdd4745d0339a5bf9c5194a07bc93a [...]

[gerardo@login01 build-test]$ git checkout 1d3ee2e62ecdd4745d0339a5bf9c5194a07bc93a
fatal: reference is not a tree: 1d3ee2e62ecdd4745d0339a5bf9c5194a07bc93a

chchang6 commented 3 years ago

[cchang@el1 cchang]$ git clone https://github.com/Exawind/nalu-wind.git
Cloning into 'nalu-wind'...
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 25671 (delta 22), reused 36 (delta 13), pack-reused 25602
Receiving objects: 100% (25671/25671), 17.46 MiB | 14.71 MiB/s, done.
Resolving deltas: 100% (20518/20518), done.
[cchang@el1 cchang]$ cd nalu-wind/
[cchang@el1 nalu-wind]$ git checkout 1d3ee2
Note: checking out '1d3ee2'.

You are in 'detached HEAD' state. You can look around, make experimental changes and commit them, and you can discard any commits you make in this state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may do so (now or later) by using -b with the checkout command again. Example:

git checkout -b new_branch_name

HEAD is now at 1d3ee2e... Updating golds in response to #692.

gcstoianowski commented 3 years ago

Thank you. I was using 'git clone https://github.com/exawind/build-test.git', which I got from Step 4 of https://nalu-wind.readthedocs.io/en/latest/source/user/build_spack.html
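The "reference is not a tree" error earlier in this thread is what git reports when the requested commit is not present in the clone, which fits a checkout attempted inside build-test rather than nalu-wind. A minimal sketch for checking this before a checkout follows. `commit_exists` is a hypothetical helper name and is not part of the benchmark instructions.

```shell
#!/bin/sh
# commit_exists (hypothetical helper): succeeds only if the given SHA
# names a commit object that is present in the clone at <repo-dir>.
# Usage: commit_exists <repo-dir> <sha>
commit_exists() {
    # `git cat-file -e` exits zero when the object exists and is valid;
    # the ^{commit} suffix additionally requires it to resolve to a commit.
    git -C "$1" cat-file -e "$2^{commit}" 2>/dev/null
}
```

For example, `git -C nalu-wind remote get-url origin` shows which repository a clone came from, and `commit_exists nalu-wind 1d3ee2e62ecdd4745d0339a5bf9c5194a07bc93a && echo present || echo missing` reports whether the commit can be checked out there.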

chchang6 commented 3 years ago

OK, thanks @gcstoianowski. I'll forward this to the benchmark steward to see if we can clarify the instructions on our end a bit.