TriBITSPub / TriBITS

TriBITS: Tribal Build, Integrate, and Test System,
http://tribits.org
Other
36 stars 47 forks source link

Tribits: MPI_EXEC if unaccessible doesn't produce an error related to MPI but fails obscurely #594

Open jjellio opened 10 months ago

jjellio commented 10 months ago

I believe this falls under Tribits, since I think Trilinos doesn't use the official Kitware MPI module.

What happened is a user on the El Capitan systems was using my build setup, which set `-DMPI_EXEC=/some/path/they/cant/read'

The user's configure only gave this as an error:

CMake Error at cmake/tribits/core/package_arch/TribitsAddTestHelpers.cmake:688 (add_test):
  add_test given test NAME "TeuchosCore_GlobalMPISessionUnitTests_MPI_1"
  which already exists in this directory.
Call Stack (most recent call first):
  cmake/tribits/core/package_arch/TribitsAddTestHelpers.cmake:851 (tribits_add_test_add_test)
  cmake/tribits/core/package_arch/TribitsAddTest.cmake:1009 (tribits_add_test_add_test_all)
  packages/tconfigure_trilinos_performance.logeuchos/core/test/UnitTest/CMakeLists.txt:175 (TRIBITS_ADD_TEST)

Yet, if I took the same configure+environment, it worked.... so I diffed the CMakeCache... and we see:

//Where can one of the /p/lustre1/jjellio/EMPIRE-work/flux-wrap,    
  // mpiexec or mpirun libraries be found
 // mpiexec or mpirun libraries be found                             
  MPI_EXEC:FILEPATH=MPI_EXEC-NOTFOUND  

Looking in their configure log (saved cmake output), this is reported, but it is not an error:

Probing the environment ...

-- USE_XSDK_DEFAULTS='FALSE'
-- BUILD_SHARED_LIBS='OFF'
-- CMAKE_BUILD_TYPE='Release'
-- MPI_USE_COMPILER_WRAPPERS='OFF'
-- MPI_EXEC='/p/lustre1/jjellio/EMPIRE-work/flux-wrap'
-- MPI_EXEC='MPI_EXEC-NOTFOUND'
-- The C compiler identification is Clang 16.0.0

Instead, the user got the Teuchos unit test error which would send most users looking in the wrong direction.

We could probably produce this failure anywhere. Just set MPI_EXEC to a location you can't read.

bartlettroscoe commented 10 months ago

@jjellio, so we just want an explicit check that if TPL_ENABLE_MPI=ON then MPI_EXEC must evaluate to true? (Note, any value ending in -NOTFOUND evaluates to false in CMake, see here).

jjellio commented 10 months ago

I think so - or a clearer error/warning that MPI didn't setup correctly inside CMake. The user came to me with this obscure unit test configure error. I'm guessing Tribits requires that MPI_EXEC be defined by that point, but I guess no where is that assumption enforced?

jjellio commented 10 months ago

It seems in a perfect world,


-- USE_XSDK_DEFAULTS='FALSE'
-- BUILD_SHARED_LIBS='OFF'
-- CMAKE_BUILD_TYPE='Release'
-- MPI_USE_COMPILER_WRAPPERS='OFF'
-- MPI_EXEC='/p/lustre1/jjellio/EMPIRE-work/flux-wrap'
-- MPI_EXEC='MPI_EXEC-NOTFOUND'
Error: /p/lustre1/jjellio/EMPIRE-work/flux-wrap was not found but was set.
jjellio commented 10 months ago

Just throw an error there, so that latter stuff can assume MPI_EXEC must be legal

bartlettroscoe commented 10 months ago

I think so - or a clearer error/warning that MPI didn't setup correctly inside CMake. The user came to me with this obscure unit test configure error. I'm guessing Tribits requires that MPI_EXEC be defined by that point, but I guess no where is that assumption enforced?

@jjellio, it is interesting that it has taken 15 years for this issue to come up. (I don't ever remember hearing about a use case like this.)

bartlettroscoe commented 10 months ago

This is trivial to address, except for perhaps an automated test to ensure that a "false" MPI_EXEC throws an error correctly. (Most of these types of issues take about 10x the amount of test code to library code to test.)

bartlettroscoe commented 10 months ago

@jjellio, how urgent is this? Sounds like other users might hit this on that system?

jjellio commented 10 months ago

Not urgent. I did want to report it though.

I'm guessing this will come up again though. If you remember we wrote a trilinos_jsrun for the ATS-2 testing.

Well on El Cap, we don't need that type of tool - but to get faster testing throughput, I wrote a tool I shared with the FLUX team (the job manager on El Cap), that allows for much higher job throughput. But that script was in folder that my colleague couldn't see. Hopefully we won't need middle-man scripts like that on ATS-4 (El Cap), but in the stand-up process they are nice.

What stinks is how ambiguous the error was. I guess I have enough intuition now to know how to diagnose things better. But anyone else that would have seen this likely would have done alot of wrong things trying to resolve it.

bartlettroscoe commented 10 months ago

What stinks is how ambiguous the error was.

@jjellio, many errors often are, depending who sees them. That is why it is good to be paranoid and check everything you can (unless there is a significant performance hit, which is often not the case for issues like this).