E4S-Project / testsuite

E4S test suite with validation tests
MIT License

slepc test fails on perlmutter #53

Closed: wspear closed this issue 1 year ago

wspear commented 1 year ago

@balay @joseeroman

The slepc test defined here: https://github.com/E4S-Project/testsuite/tree/master/validation_tests/slepc

Fails for the e4s 22.11 deployment of slepc on perlmutter with this variant:

-- linux-sles15-zen3 / gcc@11.2.0 -------------------------------
5puydjf slepc@3.18.1+arpack~blopex~cuda~rocm build_system=generic

With this console output:

MPICH ERROR [Rank 0] [job id 3727136.93] [Mon Nov 21 12:28:07 2022] [nid001032] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(171).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(816):
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)

aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(171).......:
MPID_Init(495)..............:
MPIDI_OFI_mpi_init_hook(816):
create_endpoint(1353).......: OFI EP enable failed (ofi_init.c:1353:create_endpoint:Address already in use)
srun: error: nid001032: task 0: Exited with exit code 255
srun: launch/slurm: _step_signal: Terminating StepId=3727136.93
slurmstepd: error: *** STEP 3727136.93 ON nid001032 CANCELLED AT 2022-11-21T20:28:08 ***
srun: error: nid001032: tasks 1-7: Terminated
srun: Force Terminated StepId=3727136.93

balay commented 1 year ago

Not sure what the issue is.

Presumably the PETSc test builds and runs fine?

Can you change the slepc makefile to the following - and see if it makes a difference? [mpi/arpack should already be set up in the default targets - they shouldn't need to be respecified.]

include ${SLEPC_ROOT}/lib/slepc/conf/slepc_common

Also - if you replace the slepc test code with petsc test code [in the slepc test dir] - does it build/run?

balay commented 1 year ago

And I see:

#!/bin/bash
. ../../setup.sh
spackLoadUnique slepc
export SLEPC_DIR=$SLEPC_ROOT

Can you also set PETSC_DIR env variable to the correct location here? [was it not needed before?]

wspear commented 1 year ago

It looks like this was caused by a poisoned runtime environment. This error doesn't appear on a fresh run node.

balay commented 1 year ago

I think it would still be good to fix the makefile as mentioned above

wspear commented 1 year ago

@balay PETSC_DIR didn't seem to make a difference, though I have added it now.

I only see a dash for your suggested change to the makefile.

Is this what it should look like? This seems to work, anyway.

hello: hello.o
        -${CLINKER} -o hello hello.o ${SLEPC_SYS_LIB} 
        ${RM} hello.o

include ${SLEPC_ROOT}/lib/slepc/conf/slepc_common

balay commented 1 year ago

The suggested change is the single include line - as the default targets can handle compile/link any single source file to a binary.

For ex: https://gitlab.com/slepc/slepc/-/blob/main/src/eps/tutorials/makefile

balay commented 1 year ago

BTW: What you have now is also fine (one of the supported usages). I guess it's more descriptive than the current default usage.

balay commented 1 year ago

But I would replace ${SLEPC_SYS_LIB} with ${SLEPC_LIB} for this usage.
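
For reference, a minimal sketch of the explicit-rule makefile with that substitution applied. It assumes the test source is a single file named hello.c (inferred from the hello.o target posted above), that SLEPC_ROOT is set by the test's setup script, and that the recipe lines are indented with a tab, as make requires:

```makefile
# Explicit link rule for the test binary. hello.o is expected to be built
# by the default compile rules pulled in by the include below.
hello: hello.o
	-${CLINKER} -o hello hello.o ${SLEPC_LIB}
	${RM} hello.o

# Pulls in the SLEPc variables and default rules.
include ${SLEPC_ROOT}/lib/slepc/conf/slepc_common
```

The shorter alternative discussed in this thread is to drop the explicit rule and keep only the include line, leaving the default targets to compile and link the single source file.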

wspear commented 1 year ago

I hadn't realized slepc_common would automatically find the source files. I'll just go with that. Thanks!