cnr-ibf-pa / hbp-bsp-issues

Ticketing system for developers/testers and power users of the Brain Simulation Platform of the Human Brain Project

Summarising new deployment on Jureca, Piz-Daint and Cineca systems #533

Closed pramodk closed 4 years ago

pramodk commented 4 years ago

This ticket is a placeholder to summarise the new deployment and module structure on the different systems.

@jorblancoa : this is the ticket.

Related tasks: #522 and #525

jorblancoa commented 4 years ago

Module structure on the different systems. Everything is Python 3 except the packages with the -python2 suffix.

Jureca jureca-booster

> module use /p/project/cvsk25/software-deployment/HBP/jureca-booster/25-03-2020/modules/tcl/linux-centos7-haswell
> module av
-- /p/project/cvsk25/software-deployment/HBP/jureca-booster/25-03-2020/modules/tcl/linux-centos7-haswell ---
   brion/3.1.0                       neurodamus-neocortex/0.3-knl    neuron/7.8.0b                     (D)
   libxslt/1.1.33                    neuron/7.8.0b-serial            py-bluepy/0.14.6
   neurodamus-hippocampus/0.4-knl    neuron/7.8.0b-python2-serial    py-bluepyopt/1.9.12
   neurodamus-mousify/0.3-knl        neuron/7.8.0b-python2           py-sonata-network-reduction/0.0.5

Example of usage:

module load Architecture/KNL
module load Intel/2019.5.281-GCC-8.3.0 IntelMPI/2019.6.154
module load Python/3.6.8 SciPy-Stack/2019a-Python-3.6.8
module load HDF5/1.10.5
module use /p/project/cvsk25/software-deployment/HBP/jureca-booster/25-03-2020/modules/tcl/linux-centos7-haswell

module load neurodamus-hippocampus/0.4-knl
module load neuron/7.8.0b
module load py-bluepyopt/1.9.12
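
As an optional sanity check (a suggestion only, not part of the deployment instructions), one can verify that the main packages resolve from the loaded modules:

python -c "from neuron import h; print('NEURON', h.nrnversion())"
python -c "import bluepyopt; print('BluePyOpt from', bluepyopt.__file__)"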

jureca-cluster

> module use /p/project/cvsk25/software-deployment/HBP/jureca-cluster/25-03-2020/modules/tcl/linux-centos7-haswell
> module av
--- /p/project/cvsk25/software-deployment/HBP/jureca-cluster/25-03-2020/modules/tcl/linux-centos7-haswell ---
   brion/3.1.0                   neuron/7.8.0b-serial                py-bluepy/0.14.6
   neurodamus-hippocampus/0.4    neuron/7.8.0b-python2-serial        py-bluepyopt/1.9.12
   neurodamus-mousify/0.3        neuron/7.8.0b-python2               py-sonata-network-reduction/0.0.5
   neurodamus-neocortex/0.3      neuron/7.8.0b                (D)

Example of usage:

module load Architecture/Haswell
module load Intel/2019.5.281-GCC-8.3.0 IntelMPI/2019.6.154
module load Python/3.6.8 SciPy-Stack/2019a-Python-3.6.8
module load HDF5/1.10.5
module use /p/project/cvsk25/software-deployment/HBP/jureca-cluster/25-03-2020/modules/tcl/linux-centos7-haswell

module load neurodamus-hippocampus/0.4
module load neuron/7.8.0b
module load py-bluepyopt/1.9.12

Piz-Daint

$ module use /apps/hbp/ich002/hbp-spack-deployments/softwares/25-03-2020/install/modules/tcl/cray-cnl7-haswell
$ module av
---- /apps/hbp/ich002/hbp-spack-deployments/softwares/25-03-2020/install/modules/tcl/cray-cnl7-haswell -----
brion/3.1.0                       neurodamus-mousify/0.3            neuron/7.8.0b                     neuron/7.8.0b-python2-serial      py-bluepy/0.14.6                  py-sonata-network-reduction/0.0.6
neurodamus-hippocampus/0.4        neurodamus-neocortex/0.3          neuron/7.8.0b-python2             neuron/7.8.0b-serial              py-bluepyopt/1.9.12

Example of usage:

module load daint-mc cray-python/3.6.5.7 PyExtensions/3.6.5.7-CrayGNU-19.10
module use /apps/hbp/ich002/hbp-spack-deployments/softwares/25-03-2020/install/modules/tcl/cray-cnl7-haswell

# always load the neuron module
module load neuron/7.8.0b

# load only relevant modules needed for your job
module load py-bluepyopt/1.9.12
module load py-sonata-network-reduction/0.0.6
module load neurodamus-hippocampus/0.4

GALILEO

$ module use /gpfs/work/HBP_CDP21_it_1/pkumbhar/HBP/galileo/25-03-2020/modules/tcl/linux-centos7-broadwell
$ module av
------------ /gpfs/work/HBP_CDP21_it_1/pkumbhar/HBP/galileo/25-03-2020/modules/tcl/linux-centos7-broadwell ------------
neuron/7.8.0b                py-bluepyopt/1.9.12
neuron/7.8.0b-python2-serial py-matplotlib/2.2.3
neuron/7.8.0b-serial

Example of usage with BluepyOpt:

module purge
module load intel/pe-xe-2018--binary gnu/7.3.0
module load intelmpi/2018--binary
module load python/3.6.4

module use /gpfs/work/HBP_CDP21_it_1/pkumbhar/HBP/galileo/25-03-2020/modules/tcl/linux-centos7-broadwell
module load neuron/7.8.0b py-bluepyopt/1.9.12 py-matplotlib/2.2.3

Sample job script and test run output are in:

/gpfs/work/HBP_CDP21_it_1/pkumbhar/test/bluepyopt_test/job.sh
/gpfs/work/HBP_CDP21_it_1/pkumbhar/test/bluepyopt_test/slurm-3935950.out

Marconi

$ module av
------------ /marconi_work/HBP_CDP2_it/pkumbhar/HBP/marconi/25-03-2020/modules/tcl/linux-centos7-broadwell ------------
neuron/7.8.0b                py-matplotlib/3.1.1
neuron/7.8.0b-python2-serial py-pandas/0.25.1
neuron/7.8.0b-serial         zlib/1.2.11

Example of usage with BluepyOpt:

module purge
module load intel/pe-xe-2018--binary gnu/7.3.0
module load intelmpi/2018--binary
module load python/3.6.4

module use /marconi_work/HBP_CDP2_it/pkumbhar/HBP/marconi/25-03-2020/modules/tcl/linux-centos7-broadwell
module load neuron/7.8.0b py-bluepyopt/1.9.12 py-matplotlib/3.1.1

Sample job script and test run output are in:

/marconi_work/HBP_CDP2_it/pkumbhar/test/bluepyopt_test/job.sh
/marconi_work/HBP_CDP2_it/pkumbhar/test/bluepyopt_test/slurm-6753203.out

Note that I was only able to test with 1 core (serial job) on Marconi

antonelepfl commented 4 years ago

Can we add/update the Spack wiki with this so we have a centralized place to look?

jmbudd commented 4 years ago

Tried the Piz-Daint example usage from the login node and encountered the following error when trying to load neurodamus:

daint104:~> module load daint-mc cray-python/3.6.5.7 PyExtensions/3.6.5.7-CrayGNU-19.10
daint104:~> module use /apps/hbp/ich002/hbp-spack-deployments/softwares/25-03-2020/install/modules/tcl/cray-cnl7-haswell
daint104:~> module list
Currently Loaded Modulefiles:
  1) modules/3.2.11.3(default)                       15) dmapp/7.1.1-7.0.1.1_4.8__g38cf134.ari
  2) cray-mpich/7.7.10(default)                      16) gni-headers/5.0.12.0-7.0.1.1_6.7__g3b1768f.ari
  3) slurm/19.05.3-2                                 17) xpmem/2.2.19-7.0.1.1_3.7__gdcf436c.ari
  4) xalt/2.7.24                                     18) job/2.2.4-7.0.1.1_3.8__g36b56f4.ari
  5) daint-mc                                        19) dvs/2.12_2.2.151-7.0.1.1_5.6__g7eb5e703
  6) cray-python/2.7.15.7                            20) alps/6.6.56-7.0.1.1_4.10__g2e60a7e4.ari
  7) gcc/8.3.0(default)                              21) rca/2.2.20-7.0.1.1_4.9__g8e3fb5b.ari
  8) craype-broadwell                                22) atp/2.1.3(default)
  9) craype-network-aries                            23) perftools-base/7.1.1(default)
 10) craype/2.6.1(default)                           24) PrgEnv-gnu/6.0.5
 11) cray-libsci/19.06.1(default)                    25) cdt/19.10
 12) udreg/2.3.2-7.0.1.1_3.9__g8175d3d.ari           26) CrayGNU/.19.10
 13) ugni/6.0.14.0-7.0.1.1_7.10__ge78e5b0.ari        27) cray-python/3.6.5.7(default)
 14) pmi/5.0.14(default)                             28) PyExtensions/3.6.5.7-CrayGNU-19.10
daint104:~> module load neurodamus-hippocampus/0.4
neuron/7.8.0b(21):ERROR:105: Unable to locate a modulefile for 'mpich/7.7.10'
neurodamus-hippocampus/0.4(33):ERROR:105: Unable to locate a modulefile for 'mpich/7.7.10'
daint104:~> 
mmigliore commented 4 years ago

what about galileo (Cineca)?

jorblancoa commented 4 years ago

Can we add / update the spack wiki with it so we have a centralized place where to look?

Before making those modules the default, we wanted to inform everyone, hence this ticket. If there are no issues, we plan to make them the default and update the Spack wiki pages in a day or two.

Tried piz-daint example usage from login and encountered the following error on trying to load neurodamus:

We observed that and we are going to fix it. Nonetheless, everything should work because the mpich module is pre-loaded on the Cray system (cray-mpich/7.7.10).

what about galileo (Cineca)?

Until today we were fixing and testing issues on Jureca and Piz-Daint. We will start updating Galileo on Monday.

clupascu commented 4 years ago

The new Neuron module works perfectly on Jureca and Piz Daint. The new BluePyOpt module works perfectly on Piz Daint, but on Jureca booster I get:

File "opt_neuron.py", line 87, in from ipyparallel import Client ModuleNotFoundError: No module named 'ipyparallel'

jorblancoa commented 4 years ago

I have made some changes on Jureca to install ipyparallel from source instead of trying to use the external package. Could you try again and let me know if you have any other issues?

Thanks!
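
For reference, a quick (hypothetical) check from a login shell, assuming the same module setup as in the example above:

python -c "import ipyparallel; print(ipyparallel.__version__)"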

clupascu commented 4 years ago

@jorblancoa now it works. Thanks.

antonelepfl commented 4 years ago

On Jureca booster I'm trying to use BluePy, which requires h5py, and I'm getting this error:

Traceback (most recent call last):
  File "/p/project/cvsk25/vsk2512/analysis/create_replay_jureca_0.19.0b.py", line 11, in <module>
    from bluepy.v2 import Simulation, Circuit
  File "/p/project/cvsk25/software-deployment/HBP/jureca-booster/25-03-2020/install/linux-centos7-haswell/gcc-8.3.0/py-bluepy-0.14.6-e3jlhr/lib/python3.6/site-packages/bluepy/__init__.py", line 2, in <module>
    from bluepy.api import load_circuit, release_circuit, Circuit, Simulation
  File "/p/project/cvsk25/software-deployment/HBP/jureca-booster/25-03-2020/install/linux-centos7-haswell/gcc-8.3.0/py-bluepy-0.14.6-e3jlhr/lib/python3.6/site-packages/bluepy/api.py", line 14, in <module>
    import h5py
  File "/p/project/cvsk25/software-deployment/HBP/jureca-booster/25-03-2020/install/linux-centos7-haswell/gcc-8.3.0/py-h5py-2.10.0-hhiclw/lib/python3.6/site-packages/h5py/__init__.py", line 26, in <module>
    from . import _errors
ImportError: libhdf5.so.103: cannot open shared object file: No such file or directory

But when I try to load an h5py module that I found among the available modules:

#!/bin/sh -l
. /etc/profile
module --force purge all
module use /p/project/cvsk25/software-deployment/HBP/jureca-booster/25-03-2020/modules/tcl/linux-centos7-haswell
module load Architecture/KNL
module load Intel/2019.5.281-GCC-8.3.0 IntelMPI/2019.6.154
module load Python/3.6.8 SciPy-Stack/2019a-Python-3.6.8
export NFRAME=1000
module load py-bluepy/0.14.6
module load h5py/2.9.0-serial-Python-3.6.8  <---- load this
module load neurodamus-hippocampus/0.4-knl
python /p/project/cvsk25/vsk2512/analysis/create_replay_jureca_0.19.0b.py
python /p/project/cvsk25/vsk2512/analysis/simulation_launch_jureca_0.19.0b.py

I get:

NEURON -- VERSION 7.8.0-2-g92a208b+ HEAD (92a208b+) 2019-10-29
Duke, Yale, and the BlueBrain Project -- Copyright 1984-2018
See http://neuron.yale.edu/neuron/credits

dlopen failed -
/p/project/cvsk25/software-deployment/HBP/jureca-booster/25-03-2020/install/linux-centos7-haswell/gcc-8.3.0/synapsetool-0.5.8-ddppen/lib/libsyn2.so: undefined symbol: H5Pset_fapl_mpio
dlopen failed -

Is there any other module that I need to load so bluepy can read .bbp files? (I have tried loading Brion and I also get an issue with that.)

jorblancoa commented 4 years ago

Hi @antonelepfl, could you share the example you are trying to run so I can debug it?

h5py should be included when you load the py-bluepy module.

module load Architecture/KNL
module load Intel/2019.5.281-GCC-8.3.0 IntelMPI/2019.6.154
module load Python/3.6.8 SciPy-Stack/2019a-Python-3.6.8
module use /p/project/cvsk25/software-deployment/HBP/jureca-booster/25-03-2020/modules/tcl/linux-centos7-haswell
module load py-bluepy/0.14.6

→ python
Python 3.6.8 (default, Apr  6 2019, 13:11:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import h5py
>>>

From the first error you are getting, it looks like the HDF5 library is not found, so loading HDF5/1.10.5 or HDF5/1.10.5-serial should be enough.
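
For example, a minimal sketch of the Jureca booster sequence with the HDF5 module added (assembled from the example above; HDF5/1.10.5-serial can be substituted for serial use):

module load Architecture/KNL
module load Intel/2019.5.281-GCC-8.3.0 IntelMPI/2019.6.154
module load Python/3.6.8 SciPy-Stack/2019a-Python-3.6.8
module load HDF5/1.10.5
module use /p/project/cvsk25/software-deployment/HBP/jureca-booster/25-03-2020/modules/tcl/linux-centos7-haswell
module load py-bluepy/0.14.6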

antonelepfl commented 4 years ago

Thank you @jorblancoa. As you mentioned, I had to load the module HDF5/1.10.5 instead of the h5py/... module I was loading before, and it has worked so far. I'll run some more tests for simulations and analysis on Piz Daint and Jureca and let you know if I have any other issues.

mmigliore commented 4 years ago

I tried the deployment on Jureca-booster and it failed. Please look at the two log files in /p/scratch/cvsk25/vsk2500/test_x_sim/

It is not clear to me if they both failed for the same reason or not.

jorblancoa commented 4 years ago

I think you have the same issue as Stefano.

dlopen failed -
libhdf5.so.103: cannot open shared object file: No such file or directory

Loading HDF5/1.10.5 should do the trick. I will update the examples, since it is a module required by most of our software.

mmigliore commented 4 years ago

Oh, I thought that you had already tried a test sim to check for these dependencies. So, should I add module load HDF5/1.10.5 to the simulation script?

jorblancoa commented 4 years ago

So, should I add module load HDF5/1.10.5

Exactly

pramodk commented 4 years ago

@jorblancoa : Can you update the Spack wiki pages with these instructions? (Keep the explicit path to the modules rather than symlinks.)

I am starting the update of Galileo.

mmigliore commented 4 years ago

This is very frustrating. Now I got:

. . .
Currently Loaded Modules:
  1) Architecture/KNL (S)
  2) GCCcore/.8.3.0 (H)
  3) binutils/.2.32 (H)
  4) icc/.2019.5.281-GCC-8.3.0 (H)
  5) ifort/.2019.5.281-GCC-8.3.0 (H)
  6) Intel/2019.5.281-GCC-8.3.0
  7) numactl/2.0.12
  8) UCX/1.6.1
  9) IntelMPI/2019.6.154
 10) bzip2/.1.0.6 (H)
 11) zlib/.1.2.11 (H)
 12) ncurses/.6.1 (H)
 13) libreadline/.8.0 (H)
 14) Tcl/8.6.9
 15) SQLite/.3.27.2 (H)
 16) expat/.2.2.6 (H)
 17) libpng/.1.6.36 (H)
 18) freetype/.2.10.0 (H)
 19) gperf/.3.1 (H)
 20) util-linux/.2.33.1 (H)
 21) fontconfig/.2.13.1 (H)
 22) X11/20190311
 23) Tk/.8.6.9 (H)
 24) GMP/6.1.2
 25) XZ/.5.2.4 (H)
 26) libxml2/.2.9.9 (H)
 27) libxslt/.1.1.33 (H)
 28) libffi/.3.2.1 (H)
 29) libyaml/.0.2.2 (H)
 30) Java/1.8
 31) PostgreSQL/11.2
 32) protobuf/.3.7.1 (H)
 33) gflags/.2.2.2 (H)
 34) libspatialindex/.1.9.0 (H)
 35) NASM/.2.14.02 (H)
 36) libjpeg-turbo/.2.0.2 (H)
 37) Python/3.6.8
 38) imkl/.2019.3.199 (H)
 39) SciPy-Stack/2019a-Python-3.6.8
 40) Szip/.2.1.1 (H)
 41) HDF5/1.10.5
 42) neuron/7.8.0b
 43) neurodamus-hippocampus/0.4-knl

Where: H: Hidden Module S: Module is Sticky, requires --force to unload or purge

kvsprovider[16733]: Timeout: Not all clients called pmi_init(): init=22396 left=11604 round=1
PSI: handleAnswer: spawn to node 4420 failed: "alarmHandler: changeToWorkDir(): Timer expired PSI: /ddn/ime/bin"
PSI: handleAnswer: spawn to node 4420 failed: "alarmHandler: changeToWorkDir(): Timer expired PSI: /ddn/ime/bin"
PSI: handleAnswer: spawn to node 4420 failed: "alarmHandler: changeToWorkDir(): Timer expired PSI: /ddn/ime/bin"
PSI: handleAnswer: spawn to node 4420 failed: "alarmHandler: changeToWorkDir(): Timer expired PSI: /ddn/ime/bin"

...and so on until the job was terminated. I thought this was caused by a hardware problem with one or more nodes, so I have just relaunched the job. I'll keep you posted.

pramodk commented 4 years ago
kvsprovider[16733]: Timeout: Not all clients called pmi_init(): init=22396
left=11604 round=1
PSI: handleAnswer: spawn to node 4420 failed: "alarmHandler:
changeToWorkDir(): Timer expired

I looked at this error and it seems more like a system/hardware issue than a software one. I am looking at the currently running job:

]$ squeue -u migliore2
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           8161548   booster  test_l3 migliore  R      43:27    500 jrc[5004-5279,5287-5326,5836-5866,5884-5903,5905-5968,6285-6301,6303-6322,6579-6610]

Which seems to be running fine:

$ tail -f /p/scratch/cvsk25/vsk2500/test_x_sim/slurm-8161548.out
[DEBUG] Memusage [MB]: Max=149.97, Min=82.52, Mean(Stdev)=125.96(8.59)
setpvec 8           Enable Modifications 0.0100002
Enable Reports
Adding report soma for CoreNEURON with 456292 gids
[DEBUG] Memusage [MB]: Max=149.97, Min=82.52, Mean(Stdev)=125.96(8.59)
setpvec 9                 Enable Reports 0.0699999
accum 4                        stdinit 0.24
MemUsage after stdinit
[DEBUG] Memusage [MB]: Max=149.54, Min=80.08, Mean(Stdev)=125.62(8.80)
Starting dataset generation for CoreNEURON
...

(it seems like I/O is slow, but we will see the timings after the job finishes)

pramodk commented 4 years ago

what about galileo (Cineca)?

@mmigliore @clupascu @ElisabettaGiacalone : I have also updated the BluePyOpt and Neuron installations on Galileo as well as Marconi. I provided the path to a sample job script and test job output in the first comment.

If I am not mistaken, neurodamus is not being used on Marconi, so I didn't generate a module for it.

If you see any issue, please let me know.

mmigliore commented 4 years ago

The simulation on Jureca-booster ended after 8 hrs for timeout without any output. It seems that it did not even start the simulation:

setpvec 79                 Enable Reports 0.0599999
accum 46                        stdinit 0.27
MemUsage after stdinit
[DEBUG] Memusage [MB]: Max=168.49, Min=99.64, Mean(Stdev)=143.06(8.92)
Starting dataset generation for CoreNEURON
srun: Job step aborted: Waiting up to 6 seconds for job step to finish.
error: step 8161548:0 CANCELLED DUE TO TIME LIMIT
srun: error: jrc5032: tasks 1904-1971: Terminated
srun: error: jrc5026: tasks 1496-1563: Terminated

So we wasted 272000 core/hrs. The same simulation ran without problems on Piz Daint and used 32000 core/hrs.

mmigliore commented 4 years ago

BTW, a sim with the previous modules /p/scratch/cvsk25/vsk2505/simulations/CA1.20190306/test20201703/ was ok.

clupascu commented 4 years ago

When trying to use BluePyOpt on Galileo I get

Traceback (most recent call last):
  File "opt_neuron.py", line 70, in <module>
    from model.analysis import *
  File "/galileo/home/userexternal/mmiglior/testBPOPTPYTHON3/model/analysis.py", line 100, in <module>
    @set_rcoptions
  File "/galileo/home/userexternal/mmiglior/testBPOPTPYTHON3/model/analysis.py", line 69, in set_rcoptions
    import matplotlib
  File "/gpfs/work/HBP_CDP21_it_1/pkumbhar/HBP/galileo/25-03-2020/install/linux-centos7-broadwell/gcc-7.3.0/py-matplotlib-3.1.1-ob5dzp/lib/python3.6/site-packages/matplotlib/__init__.py", line 205, in <module>
    _check_versions()
  File "/gpfs/work/HBP_CDP21_it_1/pkumbhar/HBP/galileo/25-03-2020/install/linux-centos7-broadwell/gcc-7.3.0/py-matplotlib-3.1.1-ob5dzp/lib/python3.6/site-packages/matplotlib/__init__.py", line 190, in _check_versions
    from . import ft2font
ImportError: /lib64/libz.so.1: version `ZLIB_1.2.9' not found (required by /gpfs/work/HBP_CDP21_it_1/pkumbhar/HBP/galileo/25-03-2020/install/linux-centos7-broadwell/gcc-7.3.0/libpng-1.6.37-dqtdji/lib/libpng16.so.16)

pramodk commented 4 years ago

@clupascu :

When trying to use BluePyOpt on Galileo I get
Traceback (most recent call last):
File "opt_neuron.py", line 70, in 
from model.analysis import *
File "/galileo/home/userexternal/mmiglior/testBPOPTPYTHON3/model/analysis.py", line 100, in 
@set_rcoptions
File "/galileo/home/userexternal/mmiglior/testBPOPTPYTHON3/model/analysis.py", line 69, in set_rcoptions

I can't access the above directory.

Did you check the test I ran in:

/marconi_work/HBP_CDP2_it/pkumbhar/test/bluepyopt_test/job.sh
/marconi_work/HBP_CDP2_it/pkumbhar/test/bluepyopt_test/slurm-6753203.out

Note that in the above test I used a single processor because I cannot submit jobs to other partitions. I can redo the test if you point me to the right Slurm parameters.

clupascu commented 4 years ago

@pramodk the test you ran in /marconi_work/HBP_CDP2_it/pkumbhar/test was run on Marconi. I have problems on Galileo. On Galileo I use the Slurm parameters below (a rough sketch of a full job body follows the directives):

#SBATCH --nodes=6
#SBATCH --ntasks-per-node=24
#SBATCH --job-name=bluepyopt_ipyparallel
#SBATCH --time=1-00:00:00
#SBATCH --error=logs/ipyparallel_%j.log
#SBATCH --output=logs/ipyparallel_%j.log
#SBATCH -p gll_usr_prod
#SBATCH -A HBP_CDP21_it_1
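
For context, a rough sketch of the job body such directives might wrap on Galileo, combining the module loads from the first comment; the ipcontroller/ipengine lines are hypothetical placeholders and the reference job.sh under /gpfs/work/HBP_CDP21_it_1/pkumbhar/test/bluepyopt_test/ remains the authoritative example:

module purge
module load intel/pe-xe-2018--binary gnu/7.3.0 intelmpi/2018--binary python/3.6.4
module use /gpfs/work/HBP_CDP21_it_1/pkumbhar/HBP/galileo/25-03-2020/modules/tcl/linux-centos7-broadwell
module load neuron/7.8.0b py-bluepyopt/1.9.12 py-matplotlib/2.2.3

# hypothetical ipyparallel startup for a multi-node BluePyOpt run
ipcontroller --ip='*' &
sleep 30
srun ipengine &
sleep 30
python opt_neuron.py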

clupascu commented 4 years ago

I tested BluePyOpt on PizDaint and Jureca and it works perfectly.

pramodk commented 4 years ago

@pramodk the test you ran in /marconi_work/HBP_CDP2_it/pkumbhar/test was ran on Marconi. I have problems on Galileo.

@clupascu : my mistake, but I meant to point out the Galileo directory and job script, which is also mentioned there:

/gpfs/work/HBP_CDP21_it_1/pkumbhar/test/bluepyopt_test/job.sh

I ran tests there again and worked without any issue:

/gpfs/work/HBP_CDP21_it_1/pkumbhar/test/bluepyopt_test/slurm-3939245.out
/gpfs/work/HBP_CDP21_it_1/pkumbhar/test/bluepyopt_test/slurm-3939323.out

clupascu commented 4 years ago

@pramodk I keep getting that error. It has to do with matplotlib. I gave you access to the folder /galileo/home/userexternal/mmiglior/testBPOPTPYTHON3

pramodk commented 4 years ago

@pramodk I keep getting that error. It has to do with matplotlib. I gave you access to the folder /galileo/home/userexternal/mmiglior/testBPOPTPYTHON3

@clupascu : it's true that the BluePyOpt tests we have don't use matplotlib, and hence we didn't see the issue before. I have fixed the issue and your example is tested in:

/gpfs/work/HBP_CDP21_it_1/pkumbhar/testBPOPTPYTHON3

The only thing that you have to change is to replace py-matplotlib/3.1.1 with py-matplotlib/2.2.3, on Galileo only (see the updated instructions).

Note: this issue arises because the Python module provided by the system admins on Galileo ships matplotlib, but that is not the case on Marconi, hence the confusion.
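
Concretely, the only line that changes in the Galileo example from the first comment is the final module load (a sketch of the already-updated instruction):

module load neuron/7.8.0b py-bluepyopt/1.9.12 py-matplotlib/2.2.3   # py-matplotlib/2.2.3 instead of 3.1.1 on Galileo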

pramodk commented 4 years ago

The simulation on Jureca-booster ended after 8hrs for timeout without any output. It seems that it > did not even start the simulation: setpvec 79 Enable Reports 0.0599999 accum 46 stdinit 0.27 MemUsage after stdinit [DEBUG] Memusage [MB]: Max=168.49, Min=99.64, Mean(Stdev)=143.06(8.92) Starting dataset generation for CoreNEURON srun: Job step aborted: Waiting up to 6 seconds for job step to finish. error: step 8161548:0 CANCELLED DUE TO TIME LIMIT

So we wasted 272000 core/hrs. The same simulation ran without problems on Piz Daint and used 32000core/hrs.

BTW, a sim with the previous modules /p/scratch/cvsk25/vsk2505/simulations/CA1.20190306/test20201703/ was ok.

@mmigliore : @jorblancoa and I went through various checks and a detailed analysis today. We can set up a Skype call tomorrow to discuss how we should tackle this issue. Here is a brief summary already:

  • After the new deployment, Jorge ran validation and sanity checks for the new modules provided above (i.e. 25-03-2020).
  • We also compared the behaviour of the new modules with the modules used in vsk2505/simulations/CA1.20190306/test20201703. We ran the 40k cells target with old vs new modules and they have similar runtime, so there is nothing different between the new modules and the ones used in that earlier run. Below are the two jobs:

/p/scratch/cvsk25/blancoalonso1/tickets/BBPP42-445/michele/40kCell_1003
/p/scratch/cvsk25/blancoalonso1/tickets/BBPP42-445/michele/40kCell_2503

So what is the issue?

  • Two months ago, when we were fixing morphology loading issues on the booster, we noticed that the I/O performance on KNLs is significantly worse: KNLs are slower than Xeon (e.g. Piz Daint), but even when we compared our BBP KNL system with the Jureca booster, we saw significantly lower I/O performance on the Jureca booster.
  • We compared yesterday's simulation with the one from simulations/CA1.20190306/test20201703/ and, as we can see below, the file I/O performance is suddenly very bad: the I/O that took ~1 min a month ago is now taking an hour.

$ grep corewrite /p/scratch/cvsk25/vsk2500/test_x_sim/slurm-8161548.out
accum 5                      corewrite 3794.11
accum 11                      corewrite 3882.58
accum 17                      corewrite 3353.79
accum 23                      corewrite 3562.41
accum 29                      corewrite 3501.89
accum 35                      corewrite 1902.11
accum 41                      corewrite 3505.21

$ grep corewrite /p/scratch/cvsk25/vsk2505/simulations/CA1.20190306/test20201703/slurm-8112285.out
accum 4                      corewrite 1566.56
accum 10                      corewrite 148.04
accum 16                      corewrite 54.13
accum 22                      corewrite 22.11
accum 28                      corewrite 25.11
accum 34                      corewrite 54.58
accum 40                      corewrite 43.78
accum 46                      corewrite 24.57

This is using the same version of neurodamus. This indicates an I/O issue on the Jureca booster and unpredictable performance behaviour.

  • During testing today, we also saw the random failure that you experienced:

kvsprovider[16733]: Timeout: Not all clients called pmi_init(): init=22396 left=11604 round=1
PSI: handleAnswer: spawn to node 4420 failed: "alarmHandler: changeToWorkDir(): Timer expired
PSI: /ddn/ime/bin"
PSI: handleAnswer: spawn to node 4420 failed: "alarmHandler: changeToWorkDir(): Timer expired
PSI: /ddn/ime/bin"
PSI: handleAnswer: spawn to node 4420 failed: "alarmHandler: changeToWorkDir(): Timer expired
PSI: /ddn/ime/bin"
PSI: handleAnswer: spawn to node 4420 failed: "alarmHandler: changeToWorkDir(): Timer expired
PSI: /ddn/ime/bin"

This again looks like an issue with the booster or its related software stack.

In summary, it seems like this is not something that we can easily control in our software modules. We are going to write to Juelich support about these I/O issues as well as the random failures above, and see how we can resolve this soon.

cc: @jamesgking

mmigliore commented 4 years ago

I agree with you that this may be an issue with the I/O system. Please go ahead and open a ticket with Juelich. It may be worth asking for a 1M core/hrs allocation on the normal partition to see if it has the same problem. Always put me in cc so I can follow up with them if needed.

However, please also note that this occurs systematically when using simulations longer than about 500 ms (with the full system). The one I ran was supposed to run for 1000 ms of sim time. Please rerun my simulation after setting its duration to 500 ms and you will most likely see normal timings. If that is the case, then it may not be only a hardware/firmware I/O problem.

We can Skype today at 12, if that is OK for you.

clupascu commented 4 years ago

@clupascu : it's true that the bluepyopt tests we have dont use matplotlib and hence didn't see the issue before. I have fixed the issue.

@pramodk I tested and now it works. Thanks.

clupascu commented 4 years ago

@pramodk I have a problem with BluePyOpt on Jureca booster. When using 2 generations, 128 offspring, 2 nodes and 64 tasks per node, for example, I get:

/usr/local/software/jurecabooster/Stages/2019a/software/Python/3.6.8-GCCcore-8.3.0/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224

/usr/local/software/jurecabooster/Stages/2019a/software/Python/3.6.8-GCCcore-8.3.0/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
    276     from .popen_fork import Popen
--> 277     return Popen(process_obj)
    278

/usr/local/software/jurecabooster/Stages/2019a/software/Python/3.6.8-GCCcore-8.3.0/lib/python3.6/multiprocessing/popen_fork.py in _launch(self, process_obj)
     65     parent_r, child_w = os.pipe()
---> 66     self.pid = os.fork()
     67     if self.pid == 0:

OSError: [Errno 12] Cannot allocate memory

I tried with 4 nodes and 32 tasks per node, but I get the same error. Do you know what this might be due to?

clupascu commented 4 years ago

@pramodk We don't get the above error on PizDaint and Galileo.

clupascu commented 4 years ago

@pramodk One more issue I noticed: when running the analysis step (after the optimization process is finished) I get the error TypeError: unsupported operand type(s) for +: 'int' and 'odict_values'. I solved this by converting the odict_values into a list, but I am not getting this error when using the BluePyOpt installation from the Collab, and I was not getting it when using Werner's module (export MODULEPATH=/users/bp000178/ich002/software/daint/local-20191210122932/share/modules:$MODULEPATH; module load bpopt). Any suggestion why this is happening with the new stack? (It happens on all the systems.)

pramodk commented 4 years ago

@jorblancoa : could you run the BluePyOpt test on Jureca? (referring to this comment)

jorblancoa commented 4 years ago

Related to @clupascu's memory issue on Jureca booster when running BluePyOpt, a separate ticket has been created: https://github.com/cnr-ibf-pa/hbp-bsp-issues/issues/540

Regarding the deployment of our software stack, I have updated the wiki with the usage of the software on Jureca, Piz-Daint, Galileo and Marconi. It can be found here: https://github.com/BlueBrain/spack/wiki

alex4200 commented 4 years ago

Any news on this item?

ElisabettaGiacalone commented 4 years ago

Hi @pramodk @jorblancoa , I tested again one simulation on Jureca's cluster partition. It worked well and ended after 1:43:00 (simtime=1000 ms, reports = soma voltage and AllCompartmentsMembrane currents). I think the total runtime is comparable with that on the other systems, at least for simulations with one report. The output files are correct, but at the beginning of the slurm-*.out I get these messages (/p/scratch/cvsk25/vsk2505/simulations/CA1.20190306/type_l_test/slurm-8217527.out):

...
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
MPI startup(): Imported environment partly inaccesible. Map=0 Info=0
[0] MPI startup(): libfabric version: 1.9.0a1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): Rank  Pid    Node name  Pin cpu
[0] MPI startup(): 0     19719  jrc0057    0
[0] MPI startup(): 1     19712  jrc0057    1
[0] MPI startup(): 2     19716  jrc0057    2
[0] MPI startup(): 3     19717  jrc0057    3
[0] MPI startup(): 4     19713  jrc0057    4
[0] MPI startup(): 5     19715  jrc0057    5
[0] MPI startup(): 6     19714  jrc0057    6
[0] MPI startup(): 7     19721  jrc0057    7
...

Do you think this is normal?

However, the computing performance is much better on the cluster than on the booster partition, and for this reason Michele is asking to move the allocation time from the booster to the cluster partition.

clupascu commented 4 years ago

For the "Non 002-account users have no access to application folder" issue I opened a ticket on the Joint Infrastructure coordination ticketing system at https://gitlab.humanbrainproject.org/joint_infrastructure_coordination/Coordination/issues/163

clupascu commented 4 years ago

Hi @pramodk @jorblancoa,

it seems the recent maintenance at CSCS broke some modules on Piz Daint:

bp000028@daint101:~> module load daint-mc cray-python/3.6.5.7 PyExtensions/3.6.5.7-CrayGNU-19.10
cray-python(3):ERROR:105: Unable to locate a modulefile for 'cray-python/3.6.5.7'
PyExtensions(3):ERROR:105: Unable to locate a modulefile for 'PyExtensions/3.6.5.7-CrayGNU-19.10'

Can you take a look please? Thank you.

clupascu commented 4 years ago

@pramodk @jorblancoa It seems

Python is updated from 3.6 to 3.8. The module cray-python/3.6.5.7 is removed and replaced with cray-python/3.8.2.1. Any virtual environments created using cray-python/3.6.5.7 will need to be recreated with the updated version of Python.
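
A minimal sketch of what recreating such a virtual environment might look like after the maintenance (the module name is taken from the announcement above; the venv path and requirements file are placeholders):

module load daint-mc cray-python/3.8.2.1
python -m venv --system-site-packages $HOME/venv-py38       # placeholder location
source $HOME/venv-py38/bin/activate
pip install --upgrade pip
pip install -r requirements.txt                             # reinstall whatever the old 3.6 venv contained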

pramodk commented 4 years ago

@clupascu : with the last maintenance the above errors are expected, because we need to recompile all the software. I suggest creating a new ticket for this instead of reopening this old one. @jorblancoa will be back on Monday and then we will redeploy the new stack.

clupascu commented 4 years ago

Ok. I will close this one and open a new one.