LSSTDESC / CCL

DESC Core Cosmology Library: cosmology routines with validated numerical accuracy
BSD 3-Clause "New" or "Revised" License

Illegal Instruction on AMD CPUs #694

Closed · nkbhan closed this issue 4 years ago

nkbhan commented 4 years ago

I was trying to use the 1.0.0 version of ccl on a CentOS cluster. I'm using an Anaconda environment with Python 3. I installed cmake and swig using conda and then installed pyccl from pip, all on the login node of this cluster. I can import pyccl in Python without a problem on this node:

(base) nbhandar@coma Fisher (naren) λ conda activate ccl3
(ccl3) nbhandar@coma Fisher (naren) λ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyccl
>>> quit()
(ccl3) nbhandar@coma Fisher (naren) λ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
CPU socket(s):         2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Stepping:              7
CPU MHz:               1999.923
BogoMIPS:              3999.44
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15

However, on certain compute nodes, namely the ones running AMD CPUs, I get an illegal instruction error when trying to import pyccl, and Python crashes:

(base) nbhandar@compute-2-6 Fisher (naren) λ conda activate ccl3
(ccl3) nbhandar@compute-2-6 Fisher (naren) λ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyccl
Illegal instruction
(ccl3) nbhandar@compute-2-6 Fisher (naren) λ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
CPU socket(s):         2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            16
Model:                 9
Stepping:              1
CPU MHz:               2400.143
BogoMIPS:              4800.46
Virtualization:        AMD-V
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              5118K
NUMA node0 CPU(s):     0,2,4,6
NUMA node1 CPU(s):     8,10,12,14
NUMA node2 CPU(s):     9,11,13,15
NUMA node3 CPU(s):     1,3,5,7

On the compute nodes with Intel CPUs, I do not get this error:

(base) nbhandar@compute-1-17 Fisher (naren) λ conda activate ccl3
(ccl3) nbhandar@compute-1-17 Fisher (naren) λ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyccl
>>> quit()
(ccl3) nbhandar@compute-1-17 Fisher (naren) λ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
CPU socket(s):         2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Stepping:              2
CPU MHz:               2400.043
BogoMIPS:              4799.33
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15
beckermr commented 4 years ago

How did you compile and build pyccl?

My guess is that you built on one set of CPUs but are running on the others. You might try reversing the roles, so you always build on the older CPUs.

Also, upgrade to v2! You can install it via conda and never have this issue.
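
For reference, that "build on the oldest node" workflow might look roughly like this on a Slurm cluster. This is only a sketch: the scheduler commands mirror the ones shown later in this thread, and the environment name is just an example.

$ salloc --ntasks=1 --time=00:30:00              # grab a compute node, ideally an AMD one
$ srun --jobid=<allocation id> --pty /bin/bash   # attach to the allocated node
$ grep -i "model name" /proc/cpuinfo             # confirm you landed on the oldest CPU family
$ conda activate ccl3                            # example environment name
$ pip uninstall -y pyccl                         # remove the copy built on the login node
$ CC=gcc pip install --no-cache-dir pyccl        # rebuild from source for this CPU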

tilmantroester commented 4 years ago

I had a similar problem and tracked the cause down to angpow. Removing a -march=native compiler flag there solved the issue. I never got around to making that PR. That was a while ago, though. What's the current status of angpow in CCL?

nkbhan commented 4 years ago

I used pip to install pyccl

CC=gcc pip install pyccl

which handled the building and compiling.

I did this on the login node of the cluster I am working on (an Intel one). I'll try what you suggested later today: install it on an AMD compute node instead and see if that fixes it.

nkbhan commented 4 years ago

@beckermr I did a fresh install of pyccl in a new conda environment on one of the AMD compute nodes I mentioned, and it looks like I can use this pyccl install on any node (AMD or Intel) without getting the 'illegal instruction' error. Thanks for the tip!

tilmantroester commented 4 years ago

It might be good to keep this open since it's a common problem. For example, we have a heterogeneous cluster with over a dozen different architectures here. The scheduler can put your job on any of those, and it's non-trivial to figure out which architecture is the "oldest".
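
One rough way to find the lowest common denominator is to intersect the CPU flag sets advertised by the different node types. A sketch, where the hostnames are placeholders and you may need srun instead of ssh depending on the cluster:

$ ssh nodeA "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort -u" > flags_A.txt
$ ssh nodeB "grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | sort -u" > flags_B.txt
$ comm -12 flags_A.txt flags_B.txt    # instruction-set flags common to both node types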

beckermr commented 4 years ago

Use the conda package. This is built to handle this situation.

nkbhan commented 4 years ago

I tested the conda package to see if it could handle this situation, but doing the installation on an Intel node of the cluster I was using still led to an illegal instruction error on the AMD nodes.

EDIT: I was using version 2.0.1

beckermr commented 4 years ago

Can you post more info? All conda packages are compiled with flags that restrict them to very old instruction sets.

beckermr commented 4 years ago

here is an example from the build logs:

$BUILD_PREFIX/bin/x86_64-conda_cos6-linux-gnu-cc -I$BUILD_PREFIX/include -I$SRC_DIR/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -I$PREFIX/include -fdebug-prefix-map=$SRC_DIR=/usr/local/src/conda/pyccl-2.0.1 -fdebug-prefix-map=$PREFIX=/usr/local/src/conda-prefix -O3 -fomit-frame-pointer -fno-common -fPIC -std=gnu99 -DHAVE_ANGPOW -fopenmp -o CMakeFiles/objlib.dir/src/ccl_background.c.o -c $SRC_DIR/src/ccl_background.c

beckermr commented 4 years ago

notice the -march=nocona -mtune=haswell

beckermr commented 4 years ago

As a general point, you should always post actual details of the errors when this happens.

beckermr commented 4 years ago

Can you double check you are using the right version of CCL and not the old one you installed?

tilmantroester commented 4 years ago

Does this also apply to the angpow build? The CMake file still has -march=native there: https://github.com/LSSTDESC/Angpow4CCL/blob/131b280ef7a551baa128f01e4257c83b1d775ae1/CMakeLists.txt#L19
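
If someone does need to build angpow by hand in the meantime, one hypothetical local workaround is to patch that flag out before configuring. This is an untested sketch, and the replacement baseline is a choice, not something CCL prescribes:

$ sed -i 's/-march=native/-march=x86-64/' CMakeLists.txt   # drop the host-specific flag
$ mkdir -p build && cd build && cmake .. && make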

beckermr commented 4 years ago

We don't build angpow right now

beckermr commented 4 years ago

Are these AMD CPUs exceptionally old?

nkbhan commented 4 years ago

Regarding the AMD CPUs, they are "AMD Opteron(tm) Processor 6136" according to /proc/cpuinfo, which were launched in 2010 if I am not mistaken.

Here are the steps I took earlier:

  1. On the login node of my cluster, I made a fresh conda environment

    $ grep -i "model name" /proc/cpuinfo
    model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
    $ conda create --name ccl
    $ conda activate ccl
  2. Install ccl from conda-forge

    $ conda install -c conda-forge pyccl
  3. Check the version number:

    $ conda list | grep -i pyccl
    pyccl                     2.0.1            py37h174e469_0    conda-forge 
  4. Test pyccl

    $ python
    Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 21:52:21)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pyccl
    >>> quit()
    $
  5. Switch to the compute node

    $ salloc --ntasks=1 --time=00:30:00
    salloc: Granted job allocation 321023
    $ srun --jobid=321023 --pty /bin/bash
  6. Check the cpu type:

    $ grep -i "model name" /proc/cpuinfo
    model name      : AMD Opteron(tm) Processor 6136
  7. test pyccl:

    $ conda activate ccl
    $ python
    Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 21:52:21)
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pyccl
    Illegal instruction
    $

    python crashes and I'm back to the prompt

I'm not sure where to find the build logs, but if there is anything else that might be helpful, do let me know.

beckermr commented 4 years ago

Can you send me the output of lscpu on the AMD nodes?

nkbhan commented 4 years ago

Sure

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
CPU socket(s):         2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            16
Model:                 9
Stepping:              1
CPU MHz:               2400.052
BogoMIPS:              4800.46
Virtualization:        AMD-V
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              5118K
NUMA node0 CPU(s):     0,2,4,6
NUMA node1 CPU(s):     8,10,12,14
NUMA node2 CPU(s):     9,11,13,15
NUMA node3 CPU(s):     1,3,5,7
beckermr commented 4 years ago

More questions I am getting from conda-forge devs. Can you print out

gcc -march=native -Q --help=target

?

nkbhan commented 4 years ago

On the AMD node or the node where I installed pyccl?

beckermr commented 4 years ago

an AMD node

nkbhan commented 4 years ago

For the AMD node

$ gcc -march=native -Q --help=target
The following options are target specific:
  -m128bit-long-double                  [disabled]
  -m32                                  [disabled]
  -m3dnow                               [disabled]
  -m3dnowa                              [disabled]
  -m64                                  [enabled]
  -m80387                               [enabled]
  -m96bit-long-double                   [enabled]
  -mabm                                 [enabled]
  -maccumulate-outgoing-args            [disabled]
  -maes                                 [disabled]
  -malign-double                        [disabled]
  -malign-functions=
  -malign-jumps=
  -malign-loops=
  -malign-stringops                     [enabled]
  -march=                               amdfam10
  -masm=
  -mavx                                 [disabled]
  -mbmi                                 [disabled]
  -mbranch-cost=
  -mcld                                 [disabled]
  -mcmodel=
  -mcrc32                               [disabled]
  -mcx16                                [enabled]
  -mf16c                                [disabled]
  -mfancy-math-387                      [enabled]
  -mfma                                 [disabled]
  -mfma4                                [disabled]
  -mforce-drap                          [disabled]
  -mfp-ret-in-387                       [enabled]
  -mfpmath=
  -mfsgsbase                            [disabled]
  -mfused-madd                          [enabled]
  -mglibc                               [enabled]
  -mhard-float                          [enabled]
  -mieee-fp                             [enabled]
  -mincoming-stack-boundary=
  -minline-all-stringops                [disabled]
  -minline-stringops-dynamically        [disabled]
  -mintel-syntax                        [disabled]
  -mlarge-data-threshold=
  -mlwp                                 [disabled]
  -mmmx                                 [disabled]
  -mmovbe                               [disabled]
  -mms-bitfields                        [disabled]
  -mno-align-stringops                  [disabled]
  -mno-fancy-math-387                   [disabled]
  -mno-push-args                        [disabled]
  -mno-red-zone                         [disabled]
  -mno-sse4                             [enabled]
  -momit-leaf-frame-pointer             [disabled]
  -mpc
  -mpclmul                              [disabled]
  -mpopcnt                              [enabled]
  -mpreferred-stack-boundary=
  -mpush-args                           [enabled]
  -mrdrnd                               [disabled]
  -mrecip                               [disabled]
  -mred-zone                            [enabled]
  -mregparm=
  -mrtd                                 [disabled]
  -msahf                                [enabled]
  -msoft-float                          [disabled]
  -msse                                 [disabled]
  -msse2                                [disabled]
  -msse2avx                             [disabled]
  -msse3                                [disabled]
  -msse4                                [disabled]
  -msse4.1                              [disabled]
  -msse4.2                              [disabled]
  -msse4a                               [disabled]
  -msseregparm                          [disabled]
  -mssse3                               [disabled]
  -mstack-arg-probe                     [disabled]
  -mstackrealign                        [enabled]
  -mstringop-strategy=
  -mtbm                                 [disabled]
  -mtls-dialect=
  -mtls-direct-seg-refs                 [enabled]
  -mtune=                               amdfam10
  -muclibc                              [disabled]
  -mveclibabi=
  -mxop                                 [disabled]
beckermr commented 4 years ago

So I just learned a bunch of stuff from the conda-forge dev who was helping me. Here we go!

So if you run gcc -march=nocona -Q --help=target, then you can see what instructions the code was compiled with. These include SSE instructions which are apparently disabled on your AMD CPUs. Thus code from conda-forge won't ever work on these CPUs.

This situation is rather rare and I have not seen it before. Also, googling your CPU model indicates it should have these instructions, so I am confused by that. However, I think this is what is going on. You might ask your local IT people what is happening there, or ask them for the actual docs on your CPUs; I might have found the wrong one.

This also explains what happened before with the versions compiled by hand. Compiling pyccl from source on the AMD CPUs worked because the compiler there doesn't emit SSE instructions, so of course the Intel CPUs can execute that code. Going the other way won't work, because on the Intel CPUs the compiler emits SSE instructions and the AMD CPUs choke on them.
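
A quick way to see the mismatch for yourself (a sketch; run both commands on the AMD node):

$ gcc -march=nocona -Q --help=target | grep sse                 # SSE variants the conda-forge baseline assumes
$ grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep '^sse'   # SSE variants the CPU actually reports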

nkbhan commented 4 years ago

Interesting, thanks for letting me know! I'll reach out to my local IT folks to ask about the SSE instructions on the AMD nodes. In the meantime, installing pyccl on the AMD nodes seems to be the way to go to ensure that I can run it on any compute node of the cluster I'm on.