flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

README.md: missing Redhat dependencies #2737

Closed: fmuelle4711 closed this issue 4 years ago

fmuelle4711 commented 4 years ago

For "make check" under Redhat distros, the following packages should also be installed: aspell-en valgrind-devel
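A minimal install line for these (a sketch, assuming stock CentOS repos; aspell itself should be pulled in as a dependency of aspell-en):

sudo yum install -y aspell-en valgrind-devel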

dongahn commented 4 years ago

@fmuelle4711:

Thank you for catching this. Presumably, you discovered this as you built and installed Flux on NCSU's Linux cluster? Do you think Subhendu can post a PR for this issue and have his first exposure to our development workflow?

fmuelle4711 commented 4 years ago

Not quite, I'm taking care of the package dependencies. Once it's got simple things running, I'll turn it over. But yes, that's the idea.

BTW, there's something more broken in the test harness, I might let him file that one later :-)


dongahn commented 4 years ago

Sounds good to me. I just responded to your question on launching MVAPICH and OpenMPI via email as well.

SteVwonder commented 4 years ago

@garlick / @grondo: do we need to include mpich-devel as well? I noticed it in the Dockerfile for centos7 https://github.com/flux-framework/flux-core/blob/9fbb1c6275b74ba247a937e7737c60688dacec86/src/test/docker/centos7/Dockerfile#L42 but not in the README. In the README, we just have plain mpich listed.

grondo commented 4 years ago

@garlick / @grondo: do we need to include mpich-devel as well? I noticed it in the Dockerfile for centos7

It doesn't need to be included -- if an mpi compiler isn't found during ./configure then the only side effect is that the t/mpi/hello program isn't built during make check and t3000-mpi-basic.t tests are skipped.
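One quick way to confirm which case applies (a sketch, assuming a standard autotools build of flux-core):

./configure 2>&1 | grep -i mpicc
cd t && make check TESTS=t3000-mpi-basic.t

If no MPI compiler was found, the second command should report the MPI tests as skipped rather than failed.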

However, if we're already listing mpich we should include mpich-devel because I assume the mpich runtime alone won't do anything.

Looking at the README again, I also think we should make the "only required for make check" packages in the table more explicit, rather than just "note 3". Maybe split into a separate table?

fmuelle4711 commented 4 years ago

Maybe so. I'm running OpenHPC, which has its own MPI packages but otherwise builds on top of CentOS. I have mvapich2-gnu-ohpc, which includes the headers (i.e., combined regular package + devel). Hence, I can't tell.


grondo commented 4 years ago

I have mvapich2-gnu-ohpc, which includes the headers (i.e., combined regular package + devel).

Flux's configure should be able to happily find and use that version (I think it is just looking for an mpicc and checking to make sure mpicc can compile an MPI program?)
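Roughly the kind of probe an autotools configure performs, hand-rolled as a shell sketch (not the project's actual macro):

cat > conftest.c <<'EOF'
#include <mpi.h>
int main (int argc, char **argv)
{
    /* minimal program: if this compiles and links, mpicc is usable */
    MPI_Init (&argc, &argv);
    MPI_Finalize ();
    return 0;
}
EOF
mpicc -o conftest conftest.c && echo "mpicc can build MPI programs"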

That's the problem with saying flux-core "requires" mpich/mpich-devel. I'd hate for users that already have a perfectly fine mpich-based mpi installed to feel like they have to install the distro's mpich-devel package...

dongahn commented 4 years ago

@fmuelle4711 is having issues with launching an MPI hello world with his installed MPI.

flux mini run -N 1 -n 2 ./mpi_hello
Assertion failed in file src/mpid/ch3/src/mpid_vc.c at line 1333: val != NULL
Assertion failed in file src/mpid/ch3/src/mpid_vc.c at line 1333: val != NULL
[cli_1]: aborting job:
[cli_0]: aborting job:
internal ABORT - process 0
internal ABORT - process 0
flux-job: task(s) exited with exit code 1

I suspect the installed MVAPICH is configured to use a different bootstrapper than PMI. (Maybe PMIx?)

grondo commented 4 years ago

We should open a separate issue on the MVAPICH bootstrap.

I'm guessing this problem will come up a lot in the early days, so we should come up with a good set of data for users to collect when it does.

For now, try running with flux mini run -o verbose=2 -N 1 -n 2 ./mpi_hello to see if the PMI server in the shell is even active.

dongahn commented 4 years ago

Yes, I already suggested that.

dongahn commented 4 years ago

@fmuelle4711: Given https://www.open-mpi.org/doc/v3.1/man1/prun.1.php, it is likely your MVAPICH is configured to use PMIx. We don't support that, so you probably want to build a new MVAPICH to use with Flux. They should have a configuration option.
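Something along these lines (illustrative MPICH-style flags, since MVAPICH2 follows MPICH's configure conventions; check the MVAPICH user guide for the exact option names and the device appropriate to your fabric):

./configure --with-pm=none --with-pmi=simple --prefix=$HOME/mvapich2-flux
make -j 4 && make install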

fmuelle4711 commented 4 years ago

Yes, PMIx seems to be the issue. Since I don't want to purge OpenHPC, I decided to install mpich from CentOS. Due to OpenHPC packages, I need to work around dependency problems:

yum -y install yum-utils libgfortran.so.3 libhwloc.so.5
yumdownloader mpich-3.0
yumdownloader mpich-3.0-devel
rpm -i --nodeps mpich-3.0-*
cp /etc/modulefiles/mpi/mpich-3.0-x86_64 /opt/ohpc/pub/modulefiles

After that and as a user, it works for a single node w/ OpenHPC's srun:

srun -w c[93-94] -X --pty ~/projects/flux-core/src/cmd/flux start --size=2
module switch mvapich2 mpich-3.0-x86_64
flux mini run -N 1 -n 2 ./mpi_hello
Hello from task 1 on c93! Rank 1 slept for 0 secs, woken up now
Hello from task 0 on c93! MASTER: Number of MPI tasks is: 2

But this does not:

flux mini run -N 2 -n 2 ./mpi_hello
Hello from task 0 on c93! MASTER: Number of MPI tasks is: 2
Hello from task 1 on c93!

It should pick up c94. Somehow flux kvs does not know about c94.

Any hints? How can I specify hostnames / a hostfile? Or add them to the kvs?

dongahn commented 4 years ago

srun -w c[93-94] -X --pty ~/projects/flux-core/src/cmd/flux start --size=2

Can you change this to the following to see if this works better?

srun -w c[93-94] -X --pty --mpi=none ~/projects/flux-core/src/cmd/flux start

Notice this is without --size=2: with --size=N, flux start launches all N brokers on the local node as a test instance (which is why everything ran on c93), whereas without it the instance size comes from the srun allocation.

fmuelle4711 commented 4 years ago

Yes, that did the trick! Thanks

fmuelle4711 commented 4 years ago

One more comment about flux-core README.md:

PYTHON_VERSION=3.6 ./configure

is inconsistent w/ flux-sched README.md:

PYTHON_VERSION=3.6 ./configure --prefix=$HOME/local

I'd suggest using one or the other throughout the READMEs; mixing them causes problems during building/testing.
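For instance, both READMEs could standardize on the prefixed form (prefix value illustrative):

PYTHON_VERSION=3.6 ./configure --prefix=$HOME/local
make -j 4 && make check && make install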

grondo commented 4 years ago

Very good observation, thanks!

SteVwonder commented 4 years ago

I noticed that in our fluxrm/testenv:centos7 docker image, zeromq4-devel doesn't exist. It is just zeromq-devel (only the version 3 package, zeromq3-devel, carries an explicit version number). The same holds for centos8.

❯ sudo docker run -ti fluxrm/testenv:centos7 yum search zeromq
<snip> 
=========================================================== N/S matched: zeromq ============================================================
zeromq-devel.x86_64 : Development files for zeromq
zeromq3-devel.x86_64 : Development files for zeromq3
amavisd-new-snmp-zeromq.noarch : Exports amavisd SNMP data and communicates through 0MQ sockets
amavisd-new-zeromq.noarch : Support for communicating through 0MQ sockets
czmq.x86_64 : High-level C binding for 0MQ (ZeroMQ)
python-txzmq.noarch : Twisted bindings for ZeroMQ
zeromq.x86_64 : Software library for fast, message-based applications
zeromq3.x86_64 : Software library for fast, message-based applications

❯ sudo docker run -ti fluxrm/testenv:centos8 yum search zeromq
CentOS-8 - AppStream                                                                                         <snip>
======================================================= Name Exactly Matched: zeromq =======================================================
zeromq.x86_64 : Software library for fast, message-based applications
zeromq.x86_64 : Software library for fast, message-based applications
====================================================== Name & Summary Matched: zeromq ======================================================
zeromq-devel.x86_64 : Development files for zeromq
zeromq-devel.x86_64 : Development files for zeromq
========================================================= Summary Matched: zeromq ==========================================================
czmq.x86_64 : High-level C binding for 0MQ (ZeroMQ)
czmq.x86_64 : High-level C binding for 0MQ (ZeroMQ)

We also don't put version numbers next to Redhat or Ubuntu in our README. I doubt the package names are guaranteed to be exact (or even to exist) across a broad range of versions. Maybe we should explicitly label the dependency table for Redhat/CentOS 7 and Ubuntu 18.04 LTS.


As a summary of the previous discussion in this thread, we need to add to the redhat/centos copy-and-pasteable list and the dependency table:

aspell-en
valgrind-devel
mpich-devel

The devel packages can probably just replace the normal ones (since they presumably depend on and pull in the normal packages).

The testing-only dependencies should be broken out into a separate table.
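Concretely, the additions to the copy-and-pasteable yum line might look like (a sketch; the final list depends on the table split above):

sudo yum install -y aspell-en valgrind-devel mpich-devel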

We already addressed the ./configure --prefix symmetry over in the flux-sched repo with https://github.com/flux-framework/flux-sched/pull/595

fmuelle4711 commented 4 years ago

One more thought: why don't you develop a configure test to detect the presence of PMIx? It would save you (fewer posts about it) and anyone installing a lot of time.

garlick commented 4 years ago

Please keep the suggestions coming. I don't think we could do that one, though, since some early adopters of Flux on our sierra system at LLNL (at least) have worked pretty hard to get Flux working in that environment. Plus, configure time often != runtime, so it probably wouldn't be a very effective check.

Working well with PMIx has been a challenge for us. So far our strategy has been to rely on the PMIx-provided PMI-1 compat libs to bootstrap Flux, but that has run into trouble because various versions of those libraries have bugs (not well tested, apparently). We spent some time trying to integrate OpenPMIx with Flux a while back and found the server-side API doesn't work with concurrent, event-driven programs. It expects the server to plug into a traditional pthreads-based program. We thought about a from-scratch implementation based on the standard, but there was no reasonable subset we could implement just for bootstrapping MPI without a large effort. So we've got @SteVwonder and other LLNL people participating in PMIx "upstream" standard development to try to make things eventually work for us, but it is slow....

Anyway, just some backstory on why we haven't done the obvious thing and "just support PMIx".

SteVwonder commented 4 years ago

One more brush-up the README now needs: there are still lots of references to Python 2, but we just switched to Python 3.6+. The README should be updated accordingly.

garlick commented 4 years ago

I think with #2887 merged, this issue can be closed. Please open new issues for any remaining README problems.