Closed: jat255 closed this issue 5 years ago.
@jat255 I'm not familiar with working on Power9, but I have a good bit of experience compiling on clusters. Happy to help where possible
Okay, I appear to have gotten it compiled using the following Makefile. I did not use OpenBLAS specifically, since it looks like the PGI compiler ships a libblas.so file, so I'm hoping that's good enough.
Makefile:
PGF_FLAGS= -c -g -O3 -Mpreprocess -Mbackslash -Mconcur -Mextend -Mcuda=cuda10.1 -Dsingle_precision -ta=tesla -Mfree -Mrecursive -DGPU
Hn0_tcmp: intermediate
pgf90 -Mcuda=cuda10.1 -o MU_STEM.out *.o -L${CUDA_HOME}/lib -lcufft $(THREAD_LINK)
modules:
pgf90 $(PGF_FLAGS) quadpack.f90
pgf90 $(PGF_FLAGS) m_precision.f90
pgf90 $(PGF_FLAGS) m_string.f90
pgf90 $(PGF_FLAGS) m_numerical_tools.f90
pgf90 $(PGF_FLAGS) mod_global_variables.f90
pgf90 $(PGF_FLAGS) m_crystallography.f90
pgf90 $(PGF_FLAGS) m_electron.f90
pgf90 $(PGF_FLAGS) m_user_input.f90
pgf90 $(PGF_FLAGS) mod_cufft.f90
pgf90 $(PGF_FLAGS) mod_CUFFT_wrapper.f90
pgf90 $(PGF_FLAGS) mod_output.f90
pgf90 $(PGF_FLAGS) m_multislice.f90
pgf90 $(PGF_FLAGS) m_lens.f90
pgf90 $(PGF_FLAGS) m_tilt.f90
pgf90 $(PGF_FLAGS) m_absorption.f90
pgf90 $(PGF_FLAGS) mod_cuda_array_library.f90
pgf90 $(PGF_FLAGS) mod_cuda_potential.f90
pgf90 $(PGF_FLAGS) m_potential.f90
pgf90 $(PGF_FLAGS) MS_utilities.f90
pgf90 $(PGF_FLAGS) mod_cuda_setup.f90
pgf90 $(PGF_FLAGS) mod_cuda_ms.f90
pgf90 $(PGF_FLAGS) s_absorptive_stem.f90
pgf90 $(PGF_FLAGS) s_qep_tem.f90
pgf90 $(PGF_FLAGS) s_qep_stem.f90
pgf90 $(PGF_FLAGS) s_absorptive_tem.f90
pgf90 $(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
pgf90 $(PGF_FLAGS) *.f90
clean:
rm -f *.o *.mod *.tmp *.TMP *.out
Running the code, however, it does not appear to want to run non-interactively, so I'm getting output like this with one of the tutorial files:
$ mustem ./STEM_Al3Li_ABF_driver.txt
|----------------------------------------------------------------------------|
| Melbourne University (scanning) transmission electron |
| microscopy computing suite |
| __ __ __ __ ______ ________ ________ __ __ |
| | \ / \| \ | \ / \| \| \| \ / \ |
| | $$\ / $$| $$ | $$| $$$$$$\\$$$$$$$$| $$$$$$$$| $$\ / $$ |
| | $$$\ / $$$| $$ | $$| $$___\$$ | $$ | $$__ | $$$\ / $$$ |
| | $$$$\ $$$$| $$ | $$ \$$ \ | $$ | $$ \ | $$$$\ $$$$ |
| | $$\$$ $$ $$| $$ | $$ _\$$$$$$\ | $$ | $$$$$ | $$\$$ $$ $$ |
| | $$ \$$$| $$| $$__/ $$| \__| $$ | $$ | $$_____ | $$ \$$$| $$ |
| | $$ \$ | $$ \$$ $$ \$$ $$ | $$ | $$ \| $$ \$ | $$ |
| \$$ \$$ \$$$$$$ \$$$$$$ \$$ \$$$$$$$$ \$$ \$$ |
| |
| Copyright (C) 2017 L.J. Allen, H.G. Brown, A.J. D'Alfonso, |
| S.D. Findlay, B.D. Forbes |
| email: hamish.brown@monash.edu |
| This program comes with ABSOLUTELY NO WARRANTY; |
| |
| This program is licensed to you under the terms of the GNU |
| General Public License Version 3 as published by the Free |
| Software Foundation. |
| |
| GPU Version 5.3 |
| |
| Note: pass the argument "nopause" (without quotation marks) |
| e.g. muSTEM.exe nopause |
| to avoid pauses. |
| |
|----------------------------------------------------------------------------|
Press enter to continue.
|----------------------------|
| CPU multithreading |
|----------------------------|
The number of threads being used on the CPU is: 4
|----------------------------------|
| GPU selection |
|----------------------------------|
You have one CUDA-capable device, with the following properties:
Device Number: 0
Device name: Tesla V100-SXM2-16GB
Memory Clock Rate (MHz): 877
Memory Bus Width (bits): 4096
Peak Memory Bandwidth (GB/s): 898.05
Total Global Memory (MB): 16911.43
Compute capability: 7.0
Enter <0> to continue.
Wrong input string:
Output filename
Expected:
Device used for calculation
On line number: 1
Have you seen this behavior before?
@jat255 Glad to hear you got it compiled. I figured PGI probably had something included. Not sure how the performance compares, but it's probably not too far off. The screenshot you posted is expected behavior - if you have recorded your driving file using the CPU version of the code, you'll need to add 2 or 4 lines for it to work on the GPU. The first 2 lines at the top of the file should be:
Device used for calculation
0
This selects the device you want to use. It is really only necessary if you have a system with more than 1 available GPU, but that's fairly common on HPC systems. This, for example, means that if you submit a batch job on your cluster and have multiple GPUs, you could in theory submit one job to Device 0, one to Device 1, etc. and run them at the same time (though you would likely take a hit on performance by sharing CPU resources).
Depending on the type of simulation you are running - QEP only I believe, but don't quote me on that - you will also be asked whether or not you want to pre-calculate the potentials and hold them all in GPU memory, or calculate them on-the-fly if you don't have enough GPU memory for the entire simulation. Those lines are usually the last two lines of the driving file (or nearly there) and look like this:
<0> Precalculated potentials <1> On-the-fly calculation
0
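Putting those pieces together, a sketch of what the top of a GPU driving file would look like (the "Output filename" line is taken from the error message above, which suggests it was the first line of the original CPU file; everything after it stays as recorded):

```
Device used for calculation
0
Output filename
...
```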
Thank you for the help! I'm very new to MuSTEM and trying to get it running on our cluster for a colleague, so I'm unfamiliar with how the input files work (although learning). Your tips got me to the running stage (I'm trying to use the STEM_Al3Li_ABF_driver.txt example from the Tutorials directory). Adding those two lines, it successfully completed the "Pre-calculation setup", but bombed as soon as the calculation started with: line 175: cudaLaunchKernel returned status 98: invalid device function
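For context, cudaLaunchKernel status 98 ("invalid device function") usually means the binary contains no device code for the card's compute capability. A V100 is compute capability 7.0, so with PGI you can target it explicitly via -ta=tesla:cc70. The helper below is a hypothetical sketch (the card names and cc values are NVIDIA's published figures, not something from this thread):

```shell
# Hypothetical helper: pick the -ta=tesla:ccXX flag for a few common
# Tesla cards (cc values are NVIDIA's published compute capabilities).
gpu="Tesla V100-SXM2-16GB"   # e.g. as reported by nvidia-smi
case "$gpu" in
  *V100*) cc=cc70 ;;
  *P100*) cc=cc60 ;;
  *K80*)  cc=cc37 ;;
  *)      cc="" ;;
esac
echo "suggested flag: -ta=tesla:${cc:-unknown}"
```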
I think this had something to do with using the wrong CUDA options at compile time, so I changed the Makefile to:
PGF_FLAGS= -c -g -O3 -Mpreprocess -Mbackslash -Mconcur -Mextend -Mcuda=cuda10.1 -Dsingle_precision -ta=tesla:cuda10.1 -Mfree -Mrecursive -DGPU
Hn0_tcmp: intermediate
pgf90 -Mcuda=cuda10.1 -ta=tesla:cuda10.1 -o MU_STEM.out *.o -L${CUDA_HOME}/lib -lcufft
modules:
pgf90 $(PGF_FLAGS) quadpack.f90
pgf90 $(PGF_FLAGS) m_precision.f90
pgf90 $(PGF_FLAGS) m_string.f90
pgf90 $(PGF_FLAGS) m_numerical_tools.f90
pgf90 $(PGF_FLAGS) mod_global_variables.f90
pgf90 $(PGF_FLAGS) m_crystallography.f90
pgf90 $(PGF_FLAGS) m_electron.f90
pgf90 $(PGF_FLAGS) m_user_input.f90
pgf90 $(PGF_FLAGS) mod_cufft.f90
pgf90 $(PGF_FLAGS) mod_CUFFT_wrapper.f90
pgf90 $(PGF_FLAGS) mod_output.f90
pgf90 $(PGF_FLAGS) m_multislice.f90
pgf90 $(PGF_FLAGS) m_lens.f90
pgf90 $(PGF_FLAGS) m_tilt.f90
pgf90 $(PGF_FLAGS) m_absorption.f90
pgf90 $(PGF_FLAGS) mod_cuda_array_library.f90
pgf90 $(PGF_FLAGS) mod_cuda_potential.f90
pgf90 $(PGF_FLAGS) m_potential.f90
pgf90 $(PGF_FLAGS) MS_utilities.f90
pgf90 $(PGF_FLAGS) mod_cuda_setup.f90
pgf90 $(PGF_FLAGS) mod_cuda_ms.f90
pgf90 $(PGF_FLAGS) s_absorptive_stem.f90
pgf90 $(PGF_FLAGS) s_qep_tem.f90
pgf90 $(PGF_FLAGS) s_qep_stem.f90
pgf90 $(PGF_FLAGS) s_absorptive_tem.f90
pgf90 $(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
pgf90 $(PGF_FLAGS) *.f90
clean:
rm -f *.o *.mod *.tmp *.TMP *.out
And I had success! I was able to get the following output from the Al3Li ABF example from the tutorial:
Are there any benchmarks for these tutorial examples? I'm curious how fast/slow this simulation was in relation to other systems. The total time elapsed for this example was 42 seconds.
@jat255 That's a good start! I can't say I know of any benchmarks out there, especially since there are so many different system combos. What GPU hardware do you have access to?
For me, the Al3Li ABF example (by itself, none of the others) took:
** The above times are TOTAL times to run the entire program, not the time that muSTEM spits out. I measure this using time /path/to/muSTEM/code. For reference, running all 5 examples on my 2080Ti took 62s.
The difference on the CPU side between Intel and PGI seems (in my limited experience) to be related at least in part to the way they both treat threading on the CPU. That's important even if you're running the GPU accelerated version of the code because a lot of the routines are still run on the CPU.
I would imagine that you may be able to tweak some flags and get a bit more performance out of your system, but that's just something you'll have to play with. Take a look at the PGI man page or these (somewhat dated) descriptions from Dartmouth and Mines, and then just rerun the same simulation with different compiler options set to see what gives you the best performance. Speed differences may be more apparent if you compile the CPU-only version while you're sorting it out, and then add in -DGPU once you know which makefile works best.
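A hedged sketch of that flag sweep, assuming the makefile from this thread and that PGF_FLAGS can be overridden on the make command line (the build and run lines are commented placeholders for your own paths):

```shell
# Sketch: loop over optimization levels, rebuilding and timing each run.
# Uncomment the make/time lines and point them at your own makefile and
# driving file; the loop itself just enumerates the candidate flags.
for opt in -O0 -fast -O3 -O4; do
  echo "testing $opt"
  # make clean && make PGF_FLAGS="-c -g $opt -Mpreprocess -Mbackslash -Mconcur -Mextend -Dsingle_precision -Mfree -Mrecursive"
  # time ./MU_STEM.out nopause
done
```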
Thanks for the timings. Looks like I have some work to do optimizing. This was on one node of a cluster system, set up with one Nvidia Tesla V100 GPU (it can access up to four of them, but it seems like there's not a benefit to multiple GPUs with MuSTEM). Each node has two IBM POWER9 SMT4 CPUs, each with 20 cores (80 threads) and clocked at 2.25 GHz, so I'm sure I can get some good performance with some more tweaking.
Although... it's not promising that in your benchmarks the PGI compilations are the slowest of all of them!
Is there anything I should be doing in relation to MPI?
@bryandesser I'm having some trouble compiling the CPU version, with errors related to:
./builds/MuSTEM/source/CPU/mod_CUFFT_wrapper.f90:664: undefined reference to `sfftw_plan_dft_2d_'
I'm trying to use your CPU Makefile example, but customizing to my system:
PGF_FLAGS= -c -g -O3 -Mpreprocess -Mbackslash -Mconcur -Mextend -Mcuda=cuda10.1 -Dsingle_precision -ta=tesla:cuda10.1 -Mfree -Mrecursive -I/home/jat/.local/include
CUDA_HOME=/share/sw/pgi/linuxpower/2019/cuda/10.1/
Hn0_tcmp: intermediate
pgf90 -Mcuda=cuda10.1 -ta=tesla:cuda10.1 -o MU_STEM.CPU.out *.o -I/home/jat/.local/include -L${CUDA_HOME}/lib -lcufft
modules:
pgf90 $(PGF_FLAGS) quadpack.f90
pgf90 $(PGF_FLAGS) mod_CUFFT_wrapper.f90
pgf90 $(PGF_FLAGS) m_precision.f90
pgf90 $(PGF_FLAGS) m_string.f90
pgf90 $(PGF_FLAGS) m_numerical_tools.f90
pgf90 $(PGF_FLAGS) mod_global_variables.f90
pgf90 $(PGF_FLAGS) m_crystallography.f90
pgf90 $(PGF_FLAGS) m_electron.f90
pgf90 $(PGF_FLAGS) m_user_input.f90
pgf90 $(PGF_FLAGS) mod_output.f90
pgf90 $(PGF_FLAGS) m_multislice.f90
pgf90 $(PGF_FLAGS) m_lens.f90
pgf90 $(PGF_FLAGS) m_tilt.f90
pgf90 $(PGF_FLAGS) m_absorption.f90
pgf90 $(PGF_FLAGS) m_potential.f90
pgf90 $(PGF_FLAGS) MS_utilities.f90
pgf90 $(PGF_FLAGS) s_absorptive_stem.f90
pgf90 $(PGF_FLAGS) s_qep_tem.f90
pgf90 $(PGF_FLAGS) s_qep_stem.f90
pgf90 $(PGF_FLAGS) s_absorptive_tem.f90
pgf90 $(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
pgf90 $(PGF_FLAGS) *.f90
clean:
rm -f *.o *.mod *.tmp *.TMP *.out
Any thoughts on what could be wrong? I'm a little unsure of why I need the mod_CUFFT_wrapper.f90 file, since there shouldn't be any CUDA in the CPU version, right?
@jat255 mod_CUFFT_wrapper.f90 has a preprocessor command that allows it to easily switch between GPU and CPU-only versions of the code (#ifdef GPU ... #else). It helps to plan the different types of FFTs using an interface function that basically allows the code to use one call like fft() or ifft() and still work whether you are using -Dsingle_precision or -Ddouble_precision. This also relies on having the proper FFTW3 files linked on your system.
A couple of things:
- Remove the -Mcuda and -ta flags and the CUDA_HOME variable for a CPU-only compilation
- Make sure the FFTW libraries are available and loaded on your cluster (e.g. module load pgi fftw)

I could replicate your errors using your makefile (and have seen this before on my own machine). It's simply a case of not linking correctly to the FFTW libraries. I slightly modified your makefile (below) and could compile and run just fine without the use of MKL.
The only thing you'll need to change is the directory associated with FFTW_DIR based on your machine. Just make sure the directory you select has both include/ and lib/ subdirectories, where include/ contains files such as fftw3.f, and lib/ contains the files libfftw3.a, libfftw3f.a, libfftw3_threads.a, and libfftw3f_threads.a. These are the files that contain the FFT routines that were throwing the errors during linking.
It's probably also worth checking the PGI manual page for optimization options: it's ~2.25x slower without MKL, and pgf90 was already almost 2x slower than ifort. I would just systematically cycle through combinations of them to see what's best.
FFTW_DIR = /usr/local/fftw/3.3.5-gcc/
PGF_FLAGS = -c -g -O3 -Mpreprocess -mp -Mbackslash -Mconcur -Mextend -Dsingle_precision -Mfree -Mrecursive -I${FFTW_DIR}include
FFTW_LIBS = -L${FFTW_DIR}lib -lfftw3f -lfftw3f_threads -lfftw3 -lfftw3_threads
Hn0_tcmp: intermediate
pgf90 -o MU_STEM.CPU.out *.o ${FFTW_LIBS}
rm -f *.o *.mod *.tmp *.TMP
modules:
pgf90 $(PGF_FLAGS) quadpack.f90
pgf90 $(PGF_FLAGS) mod_CUFFT_wrapper.f90
pgf90 $(PGF_FLAGS) m_precision.f90
pgf90 $(PGF_FLAGS) m_string.f90
pgf90 $(PGF_FLAGS) m_numerical_tools.f90
pgf90 $(PGF_FLAGS) mod_global_variables.f90
pgf90 $(PGF_FLAGS) m_crystallography.f90
pgf90 $(PGF_FLAGS) m_electron.f90
pgf90 $(PGF_FLAGS) m_user_input.f90
pgf90 $(PGF_FLAGS) mod_output.f90
pgf90 $(PGF_FLAGS) m_multislice.f90
pgf90 $(PGF_FLAGS) m_lens.f90
pgf90 $(PGF_FLAGS) m_tilt.f90
pgf90 $(PGF_FLAGS) m_absorption.f90
pgf90 $(PGF_FLAGS) m_potential.f90
pgf90 $(PGF_FLAGS) MS_utilities.f90
pgf90 $(PGF_FLAGS) s_absorptive_stem.f90
pgf90 $(PGF_FLAGS) s_qep_tem.f90
pgf90 $(PGF_FLAGS) s_qep_stem.f90
pgf90 $(PGF_FLAGS) s_absorptive_tem.f90
pgf90 $(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
pgf90 $(PGF_FLAGS) *.f90
clean:
rm -f *.o *.mod *.tmp *.TMP *.out
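Before pointing the makefile at a given FFTW_DIR, it can save a round of link errors to confirm the four libraries are actually present. A small sketch (the default path is just the example value from the makefile above; adjust it for your machine):

```shell
# Check that an FFTW prefix has the single/double precision libraries
# plus the threaded variants that the makefile links against.
FFTW_DIR=${FFTW_DIR:-/usr/local/fftw/3.3.5-gcc}
missing=0
for f in libfftw3.a libfftw3f.a libfftw3_threads.a libfftw3f_threads.a; do
  if [ -e "${FFTW_DIR}/lib/${f}" ]; then
    echo "ok: $f"
  else
    echo "missing: $f"
    missing=$((missing+1))
  fi
done
echo "$missing libraries missing"
```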
Thanks @bryandesser. Since this system does not have a module for FFTW, I had to compile that manually, but with that completed, I was able to compile the CPU version of the code successfully. I tried a few different optimization flags (O0, fast, O3, and O4), which resulted in the following timings for the Al3Li example (as measured by time):
Flags (Fri Aug 23 17:14:28 EDT 2019) | Time
---|---
O0 | 5m26.671s
fast | 6m3.855s
O3 | 6m3.928s
O4 | 6m9.060s
So it looks like none of the compiler optimizations actually help anything with respect to the CPU routines. MuSTEM always says it's using 4 CPU threads. Is there a way to control that number? I don't see anything in the manual about it.
I've run into another issue though when trying to run the GPU code again. I've compiled with the same Makefile I posted above that worked for me, but now when I'm running the code, it's not detecting any GPUs, and telling me I have 0 MB of memory in the GPU memory section. When I run nvidia-smi, however, I can see two GPUs attached that are completely free. Any idea why this could happen?
EDIT: Nevermind on that issue, I think I changed one of the source files when trying to compile for CPU, which was causing issues. I re-downloaded the GPU sources, and it seems to be working now.
@jat255 I'll throw out a few thoughts/suggestions, though they may not be worth much given that I'm not familiar with the Power architecture:
- When you ran ./configure for the FFTW3 build, did you add in flags for OpenMP (--enable-openmp) and threading (--enable-threads)? I don't believe you need the OpenMP one since FFTW directly spawns its own threads using fftw3_threads, but I enabled it for good measure. I also assume this was all built correctly since you got it to compile/run, but it's worth a mention.
- In htop, do you see all 80 threads? It may be the case that htop is not installed by default; you can also check with grep -c ^processor /proc/cpuinfo.
- You can also verify what OpenMP itself sees with this small test program:
program thread_test
integer(4) :: omp_get_max_threads, omp_get_num_procs, omp_get_thread_num
write(*,*) 'omp_get_max_threads = ', omp_get_max_threads()
write(*,*) 'omp_get_num_procs = ', omp_get_num_procs()
!$omp parallel
write(*,*) 'thread #',omp_get_thread_num()
!$omp end parallel
end program
For my Xeon W2195 (4 cores, 8 threads) this gave the following (expected) output:
$ pgf90 -mp thread_test.f90
$ ./a.out
omp_get_max_threads = 8
omp_get_num_procs = 8
thread # 3
thread # 7
thread # 1
thread # 4
thread # 6
thread # 0
thread # 5
thread # 2
Thanks again for the feedback. I came to some of the same conclusions myself earlier today, noticing that I could provide the number of nodes (-N), tasks (-n), or CPUs per task (-c) to SLURM via srun. In htop, I can see all 160 cores (two CPUs with 80 cores each) on a node when I connect to an interactive session via bash. If I run the example code you provided with increasing -c, omp_max_threads increases in multiples of 4 accordingly:
-c | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
---|---|---|---|---|---|---|---|---|---|---|---
omp_max_threads | 1 | 2 | 4 | 4 | 8 | 8 | 8 | 8 | 12 | 12 | 12 | 12
I did not enable OpenMP in the FFTW3 compilation, but I did do threading; I'll try that to see if it makes a difference. The environment I used for the configuration was:
env CC=pgcc CFLAGS="-fast -Minfo -fPIC" F77=pgfortran \
FFLAGS="-fast -Minfo" ./configure --enable-threads \
--enable-shared --enable-vsx --prefix=${HOME}/.local # for double precision
and
env CC=pgcc CFLAGS="-fast -Minfo -fPIC" F77=pgfortran \
FFLAGS="-fast -Minfo" ./configure --enable-threads \
--enable-shared --enable-vsx --enable-single \
--prefix=${HOME}/.local # for single precision
Without OpenMP enabled for FFTW3, I obtained the following results when systematically increasing the -c parameter for SLURM on the CPU code:
-c | Real time
---|---
1 | 5m32.664s |
2 | 5m36.368s |
4 | 5m30.853s |
8 | 5m17.383s |
16 | 5m17.430s |
32 | 5m44.462s |
64 | 5m27.305s |
128 | 5m22.341s |
I'll recompile FFTW with OpenMP support to see if that makes a difference.
@jat255 That's an important point about what environment you're requesting via SLURM. If you see omp_max_threads increasing with increasing -c values, you should also see the value muSTEM spits out increase accordingly. It's a bit surprising that the timing doesn't then also scale with it. I'll be interested to hear if you get it to improve. At the end of the day, though, it may not really be worth it if the simulations you're looking to run can all fit into GPU memory, since that timing looked reasonably fast at 42s.
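One more knob worth trying, on the assumption (mine, not confirmed in this thread) that muSTEM built with -mp just uses the default OpenMP runtime: the standard OMP_NUM_THREADS variable should steer the thread count it reports at startup, independent of what SLURM hands you:

```shell
# Sketch: request a specific OpenMP thread count before launching muSTEM.
export OMP_NUM_THREADS=16
# ./MU_STEM.CPU.out nopause   # placeholder path for the build from this thread
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```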
I noticed something interesting when trying to recompile FFTW: even though I was loading the PGI compilers and setting my environment, make was using xlc_r as the C compiler! I apparently have run into this bug. Using the workaround there, I'm getting some errors related to altivec.h when trying to actually compile with PGI:
PGC-F-0249-#error -- Use the "-maltivec" flag to enable PowerPC AltiVec support (/usr/lib/gcc/ppc64le-redhat-linux/4.8.5/include/altivec.h: 34)
PGC/power Linux 19.4-0: compilation aborted
accompanied by warnings during ./configure
:
configure: WARNING: altivec.h: present but cannot be compiled
configure: WARNING: altivec.h: check for missing prerequisite headers?
configure: WARNING: altivec.h: see the Autoconf documentation
configure: WARNING: altivec.h: section "Present But Cannot Be Compiled"
configure: WARNING: altivec.h: proceeding with the compiler's result
configure: WARNING: ## ---------------------------- ##
configure: WARNING: ## Report this to fftw@fftw.org ##
configure: WARNING: ## ---------------------------- ##
I've posted on the PGI forums in hopes of more help, but I'm thinking this might be why it was so slow.
FYI, the total time (as measured by time) for the GPU calculation was 1m27s when I redid it; I reported 42 seconds earlier since that was what MuSTEM itself showed.
> If you see omp_max_threads increasing with increasing -c values, you should also see the value muSTEM spits out increase accordingly.
This is indeed the case:
$ find . -name "mustem_CPU*" | sort -V | xargs grep "number of threads"
./mustem_CPU.1.out: The number of threads being used on the CPU is: 4
./mustem_CPU.2.out: The number of threads being used on the CPU is: 4
./mustem_CPU.4.out: The number of threads being used on the CPU is: 4
./mustem_CPU.8.out: The number of threads being used on the CPU is: 8
./mustem_CPU.16.out: The number of threads being used on the CPU is: 16
./mustem_CPU.32.out: The number of threads being used on the CPU is: 32
./mustem_CPU.64.out: The number of threads being used on the CPU is: 64
./mustem_CPU.128.out: The number of threads being used on the CPU is: 128
So I think I've about tapped out my optimizations, and I'm getting pretty similar timings as before. One more question...
I'm getting a PGF90-S-0034-Syntax error on line 145 of muSTEM.f90:
open(6,carriagecontrol ='fortran')
If I comment this whole block out, I can compile the CPU version, but I'm not sure what effect that might have. Any idea why this would be failing?
@jat255 I see that error, too, and commenting it out is not a problem at all. I'm sure it's an easy fix, but I've never tried to work it out. The only thing it does is make the output during the simulation show up on one line vs printing all of the lines.
RE performance, the only other suggestion I have at the moment is to watch htop while running muSTEM in an interactive session and see how/if it is truly spreading across the number of threads that it displays at startup. That could give you an idea of how it's interacting with the architecture.
@bryandesser gotcha. Glad my workaround was not a problem.
I think I'll close out this issue, but I took a look at htop during a run with 32 cores, and it looks like the actual calculation part is multi-threaded (the part after "Calculation running"), but the calculation of the absorptive scattering factors appears to only use a single thread. Any idea why this might be?
(Sorry for the huge gif, but wanted to show you what it looks like while it's running) During this run, the 4 cores at the top are a different user. My allocation was CPUs ~9 through 40, it looks like.
For final "posterity", I'll leave my build configuration here in case it's of use to someone in the future. The SLURM commands are specific to my system, of course, but I think this should help anyone that's working on a PowerPC system:
The CPU version of the code requires a math library (FFTW was used in this example). Change FFTW_DIR to wherever you installed the FFTW libraries.
$ git clone https://github.com/HamishGBrown/MuSTEM.git
$ cd MuSTEM/Source
$ srun --pty --partition=debug --time=1:00:00 bash
At this point, I had to replace line 145 of muSTEM.f90 with the statement continue, because pgf90 complained about the syntax of this line.
$ module load pgi
$ echo -e "# change this to wherever your FFTW is installed
FFTW_DIR = /home/jat/install/fftw_gcc920/
PGF_FLAGS = -c -g -O3 -Mpreprocess -mp -Mbackslash -Mconcur -Mextend -Dsingle_precision -Mfree -Mrecursive -I\${FFTW_DIR}include
FFTW_LIBS = -L\${FFTW_DIR}lib -lfftw3f -lfftw3f_threads -lfftw3 -lfftw3_threads
Hn0_tcmp: intermediate
\t# Change this path to control where the executable is written
\tpgf90 -o ../MU_STEM.CPU.out *.o \${FFTW_LIBS}
\trm -f *.o *.mod *.tmp *.TMP
modules:
\tpgf90 \$(PGF_FLAGS) quadpack.f90
\tpgf90 \$(PGF_FLAGS) mod_CUFFT_wrapper.f90
\tpgf90 \$(PGF_FLAGS) m_precision.f90
\tpgf90 \$(PGF_FLAGS) m_string.f90
\tpgf90 \$(PGF_FLAGS) m_numerical_tools.f90
\tpgf90 \$(PGF_FLAGS) mod_global_variables.f90
\tpgf90 \$(PGF_FLAGS) m_crystallography.f90
\tpgf90 \$(PGF_FLAGS) m_electron.f90
\tpgf90 \$(PGF_FLAGS) m_user_input.f90
\tpgf90 \$(PGF_FLAGS) mod_output.f90
\tpgf90 \$(PGF_FLAGS) m_multislice.f90
\tpgf90 \$(PGF_FLAGS) m_lens.f90
\tpgf90 \$(PGF_FLAGS) m_tilt.f90
\tpgf90 \$(PGF_FLAGS) m_absorption.f90
\tpgf90 \$(PGF_FLAGS) m_potential.f90
\tpgf90 \$(PGF_FLAGS) MS_utilities.f90
\tpgf90 \$(PGF_FLAGS) s_absorptive_stem.f90
\tpgf90 \$(PGF_FLAGS) s_qep_tem.f90
\tpgf90 \$(PGF_FLAGS) s_qep_stem.f90
\tpgf90 \$(PGF_FLAGS) s_absorptive_tem.f90
\tpgf90 \$(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
\tpgf90 \$(PGF_FLAGS) *.f90
clean:
\trm -f *.o *.mod *.tmp *.TMP *.out" > Makefile
$ make
This puts an executable named MU_STEM.CPU.out into the μSTEM source directory.
Since the CUDA libraries include the FFT code, we do not need to link to FFTW in order to build. From the head node, in the MuSTEM/Source directory:
$ srun --pty --partition=debug --time=1:00:00 --gres=gpu:1 bash
$ ln -s GPU_routines/* . # required to put GPU code into same folder as the rest of the code;
# these links will conflict with a CPU-only compilation, so remove them if you
# need to compile the CPU version again
$ module load pgi
$ CUDA_HOME=/share/sw/pgi/linuxpower/2019/cuda/10.1/
$ echo -e "PGF_FLAGS= -c -g -O3 -Mpreprocess -Mbackslash -Mconcur -Mextend -Mcuda=cuda10.1 -Dsingle_precision -ta=tesla:cuda10.1 -Mfree -Mrecursive -DGPU
Hn0_tcmp: intermediate
\t pgf90 -Mcuda=cuda10.1 -ta=tesla:cuda10.1 -o ../MU_STEM.GPU.out *.o -L${CUDA_HOME}/lib -lcufft
modules:
\t pgf90 \$(PGF_FLAGS) quadpack.f90
\t pgf90 \$(PGF_FLAGS) m_precision.f90
\t pgf90 \$(PGF_FLAGS) m_string.f90
\t pgf90 \$(PGF_FLAGS) m_numerical_tools.f90
\t pgf90 \$(PGF_FLAGS) mod_global_variables.f90
\t pgf90 \$(PGF_FLAGS) m_crystallography.f90
\t pgf90 \$(PGF_FLAGS) m_electron.f90
\t pgf90 \$(PGF_FLAGS) m_user_input.f90
\t pgf90 \$(PGF_FLAGS) mod_cufft.f90
\t pgf90 \$(PGF_FLAGS) mod_CUFFT_wrapper.f90
\t pgf90 \$(PGF_FLAGS) mod_output.f90
\t pgf90 \$(PGF_FLAGS) m_multislice.f90
\t pgf90 \$(PGF_FLAGS) m_lens.f90
\t pgf90 \$(PGF_FLAGS) m_tilt.f90
\t pgf90 \$(PGF_FLAGS) m_absorption.f90
\t pgf90 \$(PGF_FLAGS) mod_cuda_array_library.f90
\t pgf90 \$(PGF_FLAGS) mod_cuda_potential.f90
\t pgf90 \$(PGF_FLAGS) m_potential.f90
\t pgf90 \$(PGF_FLAGS) MS_utilities.f90
\t pgf90 \$(PGF_FLAGS) mod_cuda_setup.f90
\t pgf90 \$(PGF_FLAGS) mod_cuda_ms.f90
\t pgf90 \$(PGF_FLAGS) s_absorptive_stem.f90
\t pgf90 \$(PGF_FLAGS) s_qep_tem.f90
\t pgf90 \$(PGF_FLAGS) s_qep_stem.f90
\t pgf90 \$(PGF_FLAGS) s_absorptive_tem.f90
\t pgf90 \$(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
\t pgf90 \$(PGF_FLAGS) *.f90
clean:
\t rm -f *.o *.mod *.tmp *.TMP *.out" > Makefile_GPU
$ make -f Makefile_GPU clean
$ make -f Makefile_GPU
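For completeness, running the GPU build through SLURM would look something like the command below (the partition, time, and gres values are the ones used on this particular cluster, so treat them as placeholders; nopause suppresses the interactive prompts, per the startup banner):

```shell
# Placeholder SLURM invocation for the GPU build; adjust partition, time,
# and gres to your cluster. 'nopause' avoids the "Press enter" pauses.
cmd="srun --partition=debug --time=1:00:00 --gres=gpu:1 ../MU_STEM.GPU.out nopause"
echo "$cmd"   # run from the MuSTEM/Source directory
```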
I have access to an IBM Power9 GPU cluster. Obviously, the MKL libraries are not available for this architecture, but I was planning on trying to use the Makefile posted by @bryandesser. I'm relatively inexperienced when it comes to compiling things for clusters and fortran in general, but I was hoping to get this working on this system, since it has many powerful GPUs.
I'll update with how I make out, but I figured I should ask to see if there's any known limitations in the code that would prevent it from working without access to MKL.