@fmuelle4711:
Thank you for catching this. Presumably, you discovered this as you built and installed Flux on NCSU's Linux cluster? Do you think Subhendu can post a PR for this issue and have his first exposure to our development workflow?
Not quite, I'm taking care of the package dependencies. Once it's got simple things running, I'll turn it over. But yes, that's the idea.
BTW, there's something more broken in the test harness, I might let him file that one later :-)
Sounds good to me. I just responded to your question on launching MVAPICH and OpenMPI via email as well.
@garlick / @grondo: do we need to include mpich-devel as well? I noticed it in the Dockerfile for centos7 (https://github.com/flux-framework/flux-core/blob/9fbb1c6275b74ba247a937e7737c60688dacec86/src/test/docker/centos7/Dockerfile#L42) but not in the README. In the README, we just have plain mpich listed.
It doesn't need to be included -- if an MPI compiler isn't found during ./configure, then the only side effect is that the t/mpi/hello program isn't built during make check and the t3000-mpi-basic.t tests are skipped.
However, if we're already listing mpich, we should include mpich-devel, because I assume the mpich runtime alone won't do anything.
Looking at the README again, I also think we should make the "only required for make check" packages in the table more explicit, rather than just "note 3". Maybe split them into a separate table?
Maybe so; I'm running OpenHPC, which has its own MPI packages but otherwise builds on top of CentOS. I have mvapich2-gnu-ohpc, which includes the headers (i.e., combined regular package + devel). Hence, I can't tell.
I have mvapich2-gnu-ohpc, which includes the headers (i.e., combined regular package + devel).
Flux's configure should be able to happily find and use that version (I think it is just looking for an mpicc and checking to make sure mpicc can compile an MPI program?).
That's the problem with saying flux-core "requires" mpich/mpich-devel. I'd hate for users that already have a perfectly fine mpich-based mpi installed to feel like they have to install the distro's mpich-devel package...
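Roughly, that probe amounts to something like this (an illustrative stand-in, not the actual autoconf macro; conftest.c is a throwaway file):
cat > conftest.c <<'EOF'
#include <mpi.h>
int main(int argc, char **argv) { MPI_Init(&argc, &argv); MPI_Finalize(); return 0; }
EOF
mpicc -o conftest conftest.c \
  && echo "usable MPI compiler found" \
  || echo "no usable mpicc; the MPI tests will be skipped"
Since mvapich2-gnu-ohpc ships mpicc and the headers, it should pass a check like that.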
@fmuelle4711 is having issues with launching an MPI hello world with his installed MPI.
flux mini run -N 1 -n 2 ./mpi_hello
Assertion failed in file src/mpid/ch3/src/mpid_vc.c at line 1333: val != NULL
Assertion failed in file src/mpid/ch3/src/mpid_vc.c at line 1333: val != NULL
[cli_1]: aborting job:
[cli_0]: aborting job:
internal ABORT - process 0
internal ABORT - process 0
flux-job: task(s) exited with exit code 1
I suspect the installed MVAPICH is configured to use a different bootstrapper than PMI. (Maybe PMIx?)
We should open a separate issue on the MVAPICH bootstrap.
I'm guessing this problem will come up a lot in the early days, so we should come up with a good set of data for users to collect when it does.
For now, try running with flux mini run -o verbose=2 -N 1 -n 2 ./mpi_hello to see if the PMI server in the shell is even active.
Yes, I already suggested that.
@fmuelle4711: Given https://www.open-mpi.org/doc/v3.1/man1/prun.1.php, it is likely your MVAPICH is configured to use PMIx. We don't support that, so you probably want to build a new MVAPICH for use with Flux; they should have a configuration option for the bootstrap method.
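One quick way to confirm is to dump the build configuration of the installed MVAPICH and look for the PMI setting (mpiname ships with MVAPICH2, though whether OpenHPC packages it is an assumption on my part):
mpiname -a | grep -i pmi
If that shows PMIx, a rebuild configured for PMI-1 (see the MVAPICH2 user guide for the exact configure flag on your version) should let Flux bootstrap it.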
Yes, PMIx seems to be the issue. Since I don't want to purge OpenHPC, I decided to install mpich from CentOS. Due to OpenHPC packages, I need to work around dependency problems:
yum -y install yum-utils libgfortran.so.3 libhwloc.so.5
yumdownloader mpich-3.0
yumdownloader mpich-3.0-devel
rpm -i --nodeps mpich-3.0-*
cp /etc/modulefiles/mpi/mpich-3.0-x86_64 /opt/ohpc/pub/modulefiles
After that, and as a user, it works for a single node w/ OpenHPC's srun:
srun -w c[93-94] -X --pty ~/projects/flux-core/src/cmd/flux start --size=2
module switch mvapich2 mpich-3.0-x86_64
flux mini run -N 1 -n 2 ./mpi_hello
Hello from task 1 on c93! Rank 1 slept for 0 secs, woken up now
Hello from task 0 on c93! MASTER: Number of MPI tasks is: 2
But this does not:
flux mini run -N 2 -n 2 ./mpi_hello
Hello from task 0 on c93! MASTER: Number of MPI tasks is: 2
Hello from task 1 on c93!
It should pick up c94. Somehow flux kvs does not know about c94.
Any hints? How can I specify hostnames / a hostfile? Or add them to the kvs?
srun -w c[93-94] -X --pty ~/projects/flux-core/src/cmd/flux start --size=2
Can you change this to the following to see if this works better?
srun -w c[93-94] -X --pty --mpi=none ~/projects/flux-core/src/cmd/flux start
Notice this is without --size=2
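Putting the earlier commands together, the suggested sequence would be something like:
srun -w c[93-94] -X --pty --mpi=none ~/projects/flux-core/src/cmd/flux start
flux mini run -N 2 -n 2 ./mpi_hello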
Yes, that did the trick! Thanks
One more comment about flux-core README.md:
PYTHON_VERSION=3.6 ./configure
is inconsistent w/ flux-sched README.md:
PYTHON_VERSION=3.6 ./configure --prefix=$HOME/local
I'd suggest using one or the other throughout the READMEs; mixing them causes problems during building/testing.
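For example, both READMEs could show the same invocation (the $HOME/local prefix below is just the value already used in the flux-sched README, not a requirement):
PYTHON_VERSION=3.6 ./configure --prefix=$HOME/local
make -j$(nproc)
make install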
Very good observation, thanks!
I noticed that in our fluxrm/testenv:centos7 docker image, zeromq4-devel doesn't exist. It is just zeromq-devel (version 3 has the explicit version number). The same is true for centos8.
❯ sudo docker run -ti fluxrm/testenv:centos7 yum search zeromq
<snip>
=========================================================== N/S matched: zeromq ============================================================
zeromq-devel.x86_64 : Development files for zeromq
zeromq3-devel.x86_64 : Development files for zeromq3
amavisd-new-snmp-zeromq.noarch : Exports amavisd SNMP data and communicates through 0MQ sockets
amavisd-new-zeromq.noarch : Support for communicating through 0MQ sockets
czmq.x86_64 : High-level C binding for 0MQ (ZeroMQ)
python-txzmq.noarch : Twisted bindings for ZeroMQ
zeromq.x86_64 : Software library for fast, message-based applications
zeromq3.x86_64 : Software library for fast, message-based applications
❯ sudo docker run -ti fluxrm/testenv:centos8 yum search zeromq
CentOS-8 - AppStream <snip>
======================================================= Name Exactly Matched: zeromq =======================================================
zeromq.x86_64 : Software library for fast, message-based applications
zeromq.x86_64 : Software library for fast, message-based applications
====================================================== Name & Summary Matched: zeromq ======================================================
zeromq-devel.x86_64 : Development files for zeromq
zeromq-devel.x86_64 : Development files for zeromq
========================================================= Summary Matched: zeromq ==========================================================
czmq.x86_64 : High-level C binding for 0MQ (ZeroMQ)
czmq.x86_64 : High-level C binding for 0MQ (ZeroMQ)
We also don't put version numbers next to Redhat or Ubuntu in our README. I doubt the package names will be guaranteed to be exact (or even exist) across a broad range of versions. Maybe we should explicitly label the dependency table for Redhat/Centos 7 and Ubuntu 18.04 LTS.
As a summary of the previous discussion in this thread, we need to add to the redhat/centos copy-and-pasteable list and the dependency table:
aspell-en
valgrind-devel
mpich-devel
The devel packages can probably just replace the normal ones (since they presumably depend on and pull in the normal packages).
The testing-only dependencies should be broken out into a separate table.
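For example, the copy-and-pasteable line could gain something like the following (package names taken from this thread; they should be double-checked against the table):
sudo yum install -y aspell-en valgrind-devel mpich-devel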
We already addressed the ./configure --prefix symmetry over in the flux-sched repo with https://github.com/flux-framework/flux-sched/pull/595
One more thought: why don't you develop a configure test to detect the presence of PMIx? It would save you (fewer posts about it) and anyone installing Flux a lot of time.
Please keep the suggestions coming. I don't think we could do that one though, since some early adopters of Flux on our sierra system at LLNL (at least) have worked pretty hard to get Flux working in that environment. Plus often configure time != runtime, so it probably wouldn't be a very effective check.
Working well with PMIx has been a challenge for us. So far our strategy has been to rely on the PMIx-provided PMI-1 compat libs to bootstrap Flux, but that has run into trouble because various versions of those libraries have bugs (not well tested, apparently). We spent some time trying to integrate OpenPMIx with Flux a while back and found the server-side API doesn't work with concurrent, event-driven programs. It expects the server to plug into a traditional pthreads-based program. We thought about a from-scratch implementation based on the standard, but there was no reasonable subset we could implement just for bootstrapping MPI without a large effort. So we've got @SteVwonder and other LLNL people participating in PMIx "upstream" standard development to try to make things eventually work for us, but it is slow....
Anyway, just some backstory on why we haven't done the obvious thing and "just support PMIx".
One more brush-up that the README now needs: there are still lots of references to Python 2, but we just switched to Python 3.6+. The README should be updated accordingly.
I think with #2887 merged, this issue can be closed. Please open new issues for any remaining README problems.
For "make check" under Rehat distros, the following packages should also be installed: aspell-en valgrind-devel