NVIDIA / hpc-container-maker

HPC Container Maker
Apache License 2.0

Is the Intel building block working for Singularity? #240

Closed: mmiesch closed this issue 4 years ago

mmiesch commented 4 years ago

First, let me take this opportunity to thank you for providing this extremely useful tool. I'm a big fan; we use it to generate a variety of containers.

However, I am currently focusing on Intel Singularity containers and I cannot get them to work. After running into problems with more sophisticated applications, I went back to a simple recipe file that includes a "hello world" MPI application:

"""Intel/impi Development container
"""

import os

# Base image
Stage0.baseimage('ubuntu:18.04')

Stage0 += apt_get(ospackages=['build-essential','tcsh','csh','ksh','git',
                              'openssh-server','libncurses-dev','libssl-dev',
                              'libx11-dev','less','man-db','tk','tcl','swig',
                              'bc','file','flex','bison','libexpat1-dev',
                              'libxml2-dev','unzip','wish','curl','wget',
                              'libcurl4-openssl-dev','nano','screen', 'libasound2',
                              'libgtk2.0-common','software-properties-common',
                              'libpango-1.0.0','xserver-xorg','dirmngr',
                              'gnupg2','lsb-release','vim'])

# Install Intel compilers, mpi, and mkl 
Stage0 += intel_psxe(eula=True,
                     license=os.getenv('INTEL_LICENSE_FILE',default='intel_license/****.lic'),
                     tarball=os.getenv('INTEL_TARBALL',default='intel_tarballs/parallel_studio_xe_2019_update5_cluster_edition.tgz'))

# Install application
Stage0 += copy(src='hello_world_mpi.c', dest='/root/jedi/hello_world_mpi.c')
Stage0 += shell(commands=['export COMPILERVARS_ARCHITECTURE=intel64',
                          '. /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh',
                          'cd /root/jedi','mpiicc hello_world_mpi.c -o /usr/local/bin/hello_world_mpi -lstdc++'])

Stage0 += runscript(commands=['/bin/bash -l'])

If I build a Docker image with this, it works fine:

CNAME=intel19-impi-hello
hpccm --recipe $CNAME.py --format docker > Dockerfile.$CNAME
sudo docker image build -f Dockerfile.${CNAME} -t jedi-${CNAME} .
ubuntu@ip-172-31-87-130:~/jedi$ sudo docker run --rm -it jedi-intel19-impi-hello:latest
root@1dfdbccc1110:/# mpirun -np 4 hello_world_mpi
Hello from rank 1 of 4 running on 1dfdbccc1110
Hello from rank 2 of 4 running on 1dfdbccc1110
Hello from rank 0 of 4 running on 1dfdbccc1110
Hello from rank 3 of 4 running on 1dfdbccc1110

But if I try to build a Singularity image, I get this:

hpccm --recipe $CNAME.py --format singularity > Singularity.$CNAME
sudo singularity build $CNAME.sif Singularity.$CNAME
ubuntu@ip-172-31-87-130:~/jedi$ singularity shell -e intel19-impi-hello.sif
Singularity intel19-impi-hello.sif:~/jedi> source /etc/bash.bashrc
ubuntu@ip-172-31-87-130:~/jedi$ mpirun -np 4 hello_world_mpi
[mpiexec@ip-172-31-87-130] enqueue_control_fd (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:70): assert (!closed) failed
[mpiexec@ip-172-31-87-130] launch_bstrap_proxies (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:517): error enqueuing control fd
[mpiexec@ip-172-31-87-130] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:714): unable to launch bstrap proxy
[mpiexec@ip-172-31-87-130] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1919): error setting up the boostrap proxies

I get the same thing if I create the Singularity image from the Docker image:

sudo singularity build intel19-impi-hello.sif docker-daemon:jedi-intel19-impi-hello:latest

For much more detail, please see the corresponding issue on the Sylabs GitHub site.

I just wanted to see if anyone here has any tips on building a working Singularity container with the intel_psxe building block. Thanks!

samcmill commented 4 years ago

I'll need to dig up an Intel license, but in the meantime I tried using the intel_mpi building block:

Stage0 += baseimage(image='ubuntu:18.04')
Stage0 += gnu()
Stage0 += intel_mpi(eula=True)

Stage0 += copy(src='sources/mpi-hello.c', dest='/mpi-hello.c')
Stage0 += shell(commands=['. /opt/intel/compilers_and_libraries/linux/mpi/intel64/bin/mpivars.sh intel64',
                          'mpicc -o /mpi-hello-c /mpi-hello.c'])

With this, I am able to run in Docker and in Singularity, using either a converted Docker image or a Singularity-built one.

$ singularity shell intel_mpi.sif 
Singularity> source /etc/bash.bashrc 
smcmillan@smcmillan-dev:~$ mpirun -np 4 /mpi-hello-c
rank 0 of 4 on smcmillan-dev.client.nvidia.com
rank 2 of 4 on smcmillan-dev.client.nvidia.com
rank 3 of 4 on smcmillan-dev.client.nvidia.com
rank 1 of 4 on smcmillan-dev.client.nvidia.com

Can you please give this recipe a try and let me know how it works for your environment?

I am using Singularity 3.5.3 and the image has Intel MPI 2019 Update 6 Build 20191024.

mmiesch commented 4 years ago

Thanks @samcmill for the quick response. You have nicely managed to get to the crux of the problem without having to deal with the time-consuming install of Intel PSXE. I built a Singularity image with your recipe file and I'm getting this; it is a little different than before, but still a problem:

ubuntu@ip-172-31-87-130:/$ mpirun -np 4 mpi-hello-c
[proxy:0:0@ip-172-31-87-130] HYD_spawn (../../../../../src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:129): [proxy:0:0@ip-172-31-87-130] HYD_spawn (../../../../../src/pm/i_hydra/libhydra/spawn/intel/hydra_spawn.c:129): execvp error on file mpi-hello-c (No such file or directory)
execvp error on file mpi-hello-c (No such file or directory)
[mpiexec@ip-172-31-87-130] wait_proxies_to_terminate (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:532): downstream from host ip-172-31-87-130 was killed by signal 9 (Killed)
[mpiexec@ip-172-31-87-130] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:2084): assert (exitcodes != NULL) failed

However, I just noticed a potential problem: I'm running on an Ubuntu 16.04 base system and it looks like it's using the default Python, namely 2.7. Could this be an issue?

mmiesch commented 4 years ago

Sorry, silly mistake: '.' wasn't in my PATH, so mpi-hello-c wasn't found when I ran it by name from the root directory. I take that back; this works.

mmiesch commented 4 years ago

Inspired by your answer, I'm trying to do a multi-stage build where I install intel_mpi in the second stage. How do I copy /usr/local/bin/hello_world_mpi from Stage0 to Stage1?

samcmill commented 4 years ago

The multi-stage recipe would be (entered manually, so there may be some typos):

Stage0 += baseimage(image='ubuntu:18.04', _as='build')
Stage0 += gnu()
Stage0 += intel_mpi(eula=True)

Stage0 += copy(src='sources/mpi-hello.c', dest='/mpi-hello.c')
Stage0 += shell(commands=['. /opt/intel/compilers_and_libraries/linux/mpi/intel64/bin/mpivars.sh intel64',
                          'mpicc -o /mpi-hello-c /mpi-hello.c'])

Stage1 += baseimage(image='ubuntu:18.04')
Stage1 += Stage0.runtime()
Stage1 += copy(_from='build', src='/mpi-hello-c', dest='/mpi-hello-c')

It's good that the intel_mpi building block is working, but it would be nice to understand what's happening with the intel_psxe-based install. Can you try bisecting some of the differences to help root-cause it? The first thing might be to use the same version of Intel MPI.

mmiesch commented 4 years ago

@samcmill - I've gotten a lot of suggestions on this problem - yours was the only one that worked!

Here is my recipe file. I know the baselibs stuff is extraneous, and I'm not sure I entered the copy bit right; before I got your reply, I generated a Dockerfile and then edited it manually to copy hello_world_mpi over. I'm also having trouble with the _as= bit, probably because I'm using Python 2.7. I need to update that to Python 3 and clean up this recipe file a bit.

But the point is, it worked! Using the intel_mpi() building block in the second stage was the key:

import os

# Base image
Stage0.name = 'devel'
Stage0.baseimage(image='ubuntu:18.04')

baselibs = apt_get(ospackages=['build-essential','tcsh','csh','ksh','git',
                              'openssh-server','libncurses-dev','libssl-dev',
                              'libx11-dev','less','man-db','tk','tcl','swig',
                              'bc','file','flex','bison','libexpat1-dev',
                              'libxml2-dev','unzip','wish','curl','wget',
                              'libcurl4-openssl-dev','nano','screen', 'libasound2',
                              'libgtk2.0-common','software-properties-common',
                              'libpango-1.0.0','xserver-xorg','dirmngr',
                              'gnupg2','lsb-release','vim'])
Stage0 += baselibs

# Install Intel compilers, mpi, and mkl 
ilibs = intel_psxe(eula=True, license=os.getenv('INTEL_LICENSE_FILE',default='intel_license/COM_L___LXMW-67CW6CHW.lic'),
                     tarball=os.getenv('INTEL_TARBALL',default='intel_tarballs/parallel_studio_xe_2019_update5_cluster_edition.tgz'))
Stage0 += ilibs

# Install application
Stage0 += copy(src='hello_world_mpi.c', dest='/root/jedi/hello_world_mpi.c')
Stage0 += shell(commands=['export COMPILERVARS_ARCHITECTURE=intel64',
                          '. /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh',
                          'cd /root/jedi','mpiicc hello_world_mpi.c -o /usr/local/bin/hello_world_mpi -lstdc++'])

# Runtime container
Stage1.baseimage(image='ubuntu:18.04')
Stage1 += baselibs
Stage1 += intel_mpi(eula=True)
Stage1 += copy(_from='devel', src='/usr/local/bin/hello_world_mpi', dest='/usr/local/bin/hello_world_mpi')

ubuntu@ip-172-31-87-130:~/jedi/charliecloud$ singularity shell -e $CNAME
Singularity intel19-multi-hello:~/jedi/charliecloud> source /etc/bash.bashrc
ubuntu@ip-172-31-87-130:~/jedi/charliecloud$ mpirun -np 4 hello_world_mpi
Hello from rank 1 of 4 running on ip-172-31-87-130
Hello from rank 2 of 4 running on ip-172-31-87-130
Hello from rank 3 of 4 running on ip-172-31-87-130
Hello from rank 0 of 4 running on ip-172-31-87-130

mmiesch commented 4 years ago

Oh, one more thing: I was also having a problem with the .runtime() method for intel_psxe, getting an error like this:

Warning: apt-key output should not be parsed (stdout is not a terminal)
gpg: no valid OpenPGP data found.  

I had to do something more like this:

mkdir -p /root/tmp
cd /root/tmp
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
rm GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list'
sh -c 'echo deb https://apt.repos.intel.com/mpi all main > /etc/apt/sources.list.d/intel-mpi.list'
sh -c 'echo deb https://apt.repos.intel.com/tbb all main > /etc/apt/sources.list.d/intel-tbb.list'
sh -c 'echo deb https://apt.repos.intel.com/ipp all main > /etc/apt/sources.list.d/intel-ipp.list'
apt-get update
apt-get install intel-mpi-rt-2019.6-166
apt-get install intel-mkl-2019.6-166
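
For reference, here is a sketch of how that manual workaround might be folded back into the recipe's runtime stage using the shell primitive (untested; it assumes wget is already available in the stage, e.g. via the baselibs apt_get above, and it keeps only the repositories for the two packages actually installed):

Stage1 += shell(commands=[
    # Intel apt repository GPG key (same key fetched in the manual steps above)
    'wget -q https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB',
    'apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB',
    'rm GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB',
    # Intel MKL and Intel MPI runtime repositories
    'echo "deb https://apt.repos.intel.com/mkl all main" > /etc/apt/sources.list.d/intel-mkl.list',
    'echo "deb https://apt.repos.intel.com/mpi all main" > /etc/apt/sources.list.d/intel-mpi.list',
    'apt-get update -y',
    'apt-get install -y --no-install-recommends intel-mpi-rt-2019.6-166 intel-mkl-2019.6-166',
    'rm -rf /var/lib/apt/lists/*'])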

mmiesch commented 4 years ago

Yeah - I'm getting this, even though I did a pip3 install:

ubuntu@ip-172-31-87-130:~/jedi/charliecloud$ python --version
Python 3.5.2
ubuntu@ip-172-31-87-130:~/jedi/charliecloud$ hpccm --recipe itest.py --format docker > Dockerfile.itest
ERROR: baseimage() got an unexpected keyword argument '_as'

This is the command it's complaining about:

Stage0.baseimage(image='ubuntu:18.04',_as='devel')

I'm using an up-to-date version of hpccm:

ubuntu@ip-172-31-87-130:~/jedi/hpc-container-maker$ git branch
* (HEAD detached at v20.2.0)
  master
ubuntu@ip-172-31-87-130:~/jedi/hpc-container-maker$ sudo -H pip3 install hpccm
Requirement already satisfied: hpccm in /usr/local/lib/python3.5/dist-packages (20.2.0)
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from hpccm) (1.10.0)
Requirement already satisfied: enum34 in /usr/local/lib/python3.5/dist-packages (from hpccm) (1.1.10)

samcmill commented 4 years ago

Ah - I see. This isn't a Python 2 vs 3 issue.

There are two ways to specify the base image:

Stage0.name = 'devel'
Stage0.baseimage(image='ubuntu:18.04')

and

Stage0 += baseimage(image='ubuntu:18.04', _as='devel')

These two are equivalent, but you can't use the _as option, or any of the other baseimage primitive options, with the first form.

I mostly default to the second one nowadays, but the first is valid too.

mmiesch commented 4 years ago

Thanks @samcmill, but I'm not using both; just this on its own fails:

ubuntu@ip-172-31-87-130:~/jedi/charliecloud$ hpccm --recipe itest2.py --format docker > Dockerfile.itest2
ERROR: baseimage() got an unexpected keyword argument '_as'
ubuntu@ip-172-31-87-130:~/jedi/charliecloud$ cat itest2.py 

# Base image
Stage0.baseimage(image='ubuntu:18.04',_as='devel')

Stage0 += gnu()

samcmill commented 4 years ago

Sorry for not being clearer. You can't use the _as option with Stage0.baseimage(), only with Stage0 += baseimage(). Use Stage0.name = '...' when using Stage0.baseimage().
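
For completeness, a minimal corrected version of the itest2.py recipe above could look like this (a sketch using the two equivalent forms described earlier; use only one of them):

# Base image: either name the stage separately...
Stage0.name = 'devel'
Stage0.baseimage(image='ubuntu:18.04')

# ...or use the baseimage primitive, which accepts _as:
# Stage0 += baseimage(image='ubuntu:18.04', _as='devel')

Stage0 += gnu()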

mmiesch commented 4 years ago

Ahhh, got it. No, I should apologize for not looking closely at what you wrote. Thanks again.

samcmill commented 4 years ago

The HPCCM related aspects of this issue seem to be resolved, so closing. Please reopen or start a new issue if there are further questions.

VI-gha commented 3 years ago

@samcmill

I am resurrecting this thread because I wonder if there is a more elegant way to automatically add the command

'. /opt/intel/compilers_and_libraries/linux/mpi/intel64/bin/mpivars.sh intel64'

so that all the blocks following the installation of the Intel compiler can use it?

(At the moment I manually edit the Dockerfile or Singularity .def file to add it every time the compiler is needed.)

samcmill commented 3 years ago

See the documentation for the mpivars option in the intel_mpi building block.
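
For example, something along these lines should work (a sketch based on that documentation; check the intel_mpi building block docs for the exact semantics of the mpivars option):

# With mpivars=False, HPCCM sets the Intel MPI environment variables directly
# in the image instead of relying on sourcing mpivars.sh, so subsequent build
# steps and the container runtime can find mpicc/mpirun without manual edits.
Stage0 += intel_mpi(eula=True, mpivars=False)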