Open jjellio opened 10 months ago
@jjellio, so we just want an explicit check that if `TPL_ENABLE_MPI=ON` then `MPI_EXEC` must evaluate to true? (Note, any value ending in `-NOTFOUND` evaluates to false in CMake, see here).
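A minimal sketch of what such a check could look like, written as a standalone script that can be run with `cmake -P`. This is not the TriBITS implementation: the `set()` calls only stand in for whatever the real configure would have produced, and the error wording is made up for illustration.

```cmake
# Sketch only: the set() calls below simulate configure state; they are
# assumptions for illustration, not values taken from TriBITS.

# Show how if() treats a failed lookup versus a plausible path.
foreach(value "MPI_EXEC-NOTFOUND" "NOTFOUND" "/usr/bin/mpiexec")
  set(MPI_EXEC "${value}")
  if (MPI_EXEC)
    message(STATUS "MPI_EXEC='${MPI_EXEC}' evaluates to true")
  else()
    message(STATUS "MPI_EXEC='${MPI_EXEC}' evaluates to false")
  endif()
endforeach()

# The explicit check being discussed: MPI is enabled but MPI_EXEC is "false".
set(TPL_ENABLE_MPI ON)
set(MPI_EXEC "MPI_EXEC-NOTFOUND")
if (TPL_ENABLE_MPI AND NOT MPI_EXEC)
  message(FATAL_ERROR
    "TPL_ENABLE_MPI=ON but MPI_EXEC='${MPI_EXEC}' does not name a usable program.")
endif()
```

Because `if(NOT MPI_EXEC)` also fires for an empty or undefined variable, the same condition would cover the case where `MPI_EXEC` was never set at all.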
I think so - or a clearer error/warning that MPI wasn't set up correctly inside CMake. The user came to me with this obscure unit test configure error. I'm guessing Tribits requires that MPI_EXEC be defined by that point, but I guess nowhere is that assumption enforced?
It seems in a perfect world,
-- USE_XSDK_DEFAULTS='FALSE'
-- BUILD_SHARED_LIBS='OFF'
-- CMAKE_BUILD_TYPE='Release'
-- MPI_USE_COMPILER_WRAPPERS='OFF'
-- MPI_EXEC='/p/lustre1/jjellio/EMPIRE-work/flux-wrap'
-- MPI_EXEC='MPI_EXEC-NOTFOUND'
Error: /p/lustre1/jjellio/EMPIRE-work/flux-wrap was not found but was set.
Just throw an error there, so that later stuff can assume MPI_EXEC is legal.
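The suggestion above could be sketched roughly as follows. This is a hypothetical check, not existing TriBITS code. It assumes it runs while `MPI_EXEC` still holds the path the user passed with `-D`, before any lookup has a chance to replace it with a `-NOTFOUND` value, and that the path is a full path.

```cmake
# Hypothetical early check (not part of TriBITS): fail at the point where
# MPI_EXEC is first examined if the user set it to a path that cannot be seen.
if (TPL_ENABLE_MPI AND DEFINED MPI_EXEC AND NOT EXISTS "${MPI_EXEC}")
  message(FATAL_ERROR
    "Error: ${MPI_EXEC} was not found but was set.")
endif()
```

Failing here would put the message next to the `MPI_EXEC` lines in the configure output, rather than surfacing later as a unit test configure error.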
@jjellio, it is interesting that it has taken 15 years for this issue to come up. (I don't ever remember hearing about a use case like this.)
This is trivial to address, except for perhaps an automated test to ensure that a "false" `MPI_EXEC` throws an error correctly. (Most of these types of issues take about 10x the amount of test code to library code to test.)
@jjellio, how urgent is this? Sounds like other users might hit this on that system?
Not urgent. I did want to report it though.
I'm guessing this will come up again though. If you remember, we wrote a `trilinos_jsrun` for the ATS-2 testing.
Well, on El Cap we don't need that type of tool, but to get faster testing throughput I wrote a tool, which I shared with the FLUX team (the job manager on El Cap), that allows for much higher job throughput. But that script was in a folder that my colleague couldn't see. Hopefully we won't need middle-man scripts like that on ATS-4 (El Cap), but in the stand-up process they are nice.
What stinks is how ambiguous the error was. I guess I have enough intuition now to know how to diagnose things better. But anyone else who saw this likely would have done a lot of wrong things trying to resolve it.
> What stinks is how ambiguous the error was.
@jjellio, many errors are, depending on who sees them. That is why it is good to be paranoid and check everything you can (unless there is a significant performance hit, which is often not the case for issues like this).
I believe this falls under Tribits, since I think Trilinos doesn't use the official Kitware MPI module.
What happened is a user on the El Capitan systems was using my build setup, which set `-DMPI_EXEC=/some/path/they/cant/read`.
The user's configure only gave this as an error:
Yet, if I took the same configure+environment, it worked.... so I diffed the CMakeCache... and we see:
Looking in their configure log (saved cmake output), this is reported, but it is not an error:
Instead, the user got the Teuchos unit test error which would send most users looking in the wrong direction.
We could probably produce this failure anywhere. Just set `MPI_EXEC` to a location you can't read.