Closed: jat255 closed this issue 5 years ago.
@jat255 I'm not familiar with working on Power9, but I have a good bit of experience compiling on clusters. Happy to help where possible
Okay, I appear to have gotten it compiled using the following Makefile. I did not use OpenBLAS specifically, since it looks like the PGI compiler ships a libblas.so file, so I'm hoping that's good enough.
Makefile:
PGF_FLAGS= -c -g -O3 -Mpreprocess -Mbackslash -Mconcur -Mextend -Mcuda=cuda10.1 -Dsingle_precision -ta=tesla -Mfree -Mrecursive -DGPU
Hn0_tcmp: intermediate
pgf90 -Mcuda=cuda10.1 -o MU_STEM.out *.o -L${CUDA_HOME}/lib -lcufft $(THREAD_LINK)
modules:
pgf90 $(PGF_FLAGS) quadpack.f90
pgf90 $(PGF_FLAGS) m_precision.f90
pgf90 $(PGF_FLAGS) m_string.f90
pgf90 $(PGF_FLAGS) m_numerical_tools.f90
pgf90 $(PGF_FLAGS) mod_global_variables.f90
pgf90 $(PGF_FLAGS) m_crystallography.f90
pgf90 $(PGF_FLAGS) m_electron.f90
pgf90 $(PGF_FLAGS) m_user_input.f90
pgf90 $(PGF_FLAGS) mod_cufft.f90
pgf90 $(PGF_FLAGS) mod_CUFFT_wrapper.f90
pgf90 $(PGF_FLAGS) mod_output.f90
pgf90 $(PGF_FLAGS) m_multislice.f90
pgf90 $(PGF_FLAGS) m_lens.f90
pgf90 $(PGF_FLAGS) m_tilt.f90
pgf90 $(PGF_FLAGS) m_absorption.f90
pgf90 $(PGF_FLAGS) mod_cuda_array_library.f90
pgf90 $(PGF_FLAGS) mod_cuda_potential.f90
pgf90 $(PGF_FLAGS) m_potential.f90
pgf90 $(PGF_FLAGS) MS_utilities.f90
pgf90 $(PGF_FLAGS) mod_cuda_setup.f90
pgf90 $(PGF_FLAGS) mod_cuda_ms.f90
pgf90 $(PGF_FLAGS) s_absorptive_stem.f90
pgf90 $(PGF_FLAGS) s_qep_tem.f90
pgf90 $(PGF_FLAGS) s_qep_stem.f90
pgf90 $(PGF_FLAGS) s_absorptive_tem.f90
pgf90 $(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
pgf90 $(PGF_FLAGS) *.f90
clean:
rm -f *.o *.mod *.tmp *.TMP *.out
Running the code, however, it does not appear to want to run non-interactively, so I'm getting output like this with one of the tutorial files:
$ mustem ./STEM_Al3Li_ABF_driver.txt
|----------------------------------------------------------------------------|
| Melbourne University (scanning) transmission electron |
| microscopy computing suite |
| __ __ __ __ ______ ________ ________ __ __ |
| | \ / \| \ | \ / \| \| \| \ / \ |
| | $$\ / $$| $$ | $$| $$$$$$\\$$$$$$$$| $$$$$$$$| $$\ / $$ |
| | $$$\ / $$$| $$ | $$| $$___\$$ | $$ | $$__ | $$$\ / $$$ |
| | $$$$\ $$$$| $$ | $$ \$$ \ | $$ | $$ \ | $$$$\ $$$$ |
| | $$\$$ $$ $$| $$ | $$ _\$$$$$$\ | $$ | $$$$$ | $$\$$ $$ $$ |
| | $$ \$$$| $$| $$__/ $$| \__| $$ | $$ | $$_____ | $$ \$$$| $$ |
| | $$ \$ | $$ \$$ $$ \$$ $$ | $$ | $$ \| $$ \$ | $$ |
| \$$ \$$ \$$$$$$ \$$$$$$ \$$ \$$$$$$$$ \$$ \$$ |
| |
| Copyright (C) 2017 L.J. Allen, H.G. Brown, A.J. D'Alfonso, |
| S.D. Findlay, B.D. Forbes |
| email: hamish.brown@monash.edu |
| This program comes with ABSOLUTELY NO WARRANTY; |
| |
| This program is licensed to you under the terms of the GNU |
| General Public License Version 3 as published by the Free |
| Software Foundation. |
| |
| GPU Version 5.3 |
| |
| Note: pass the argument "nopause" (without quotation marks) |
| e.g. muSTEM.exe nopause |
| to avoid pauses. |
| |
|----------------------------------------------------------------------------|
Press enter to continue.
|----------------------------|
| CPU multithreading |
|----------------------------|
The number of threads being used on the CPU is: 4
|----------------------------------|
| GPU selection |
|----------------------------------|
You have one CUDA-capable device, with the following properties:
Device Number: 0
Device name: Tesla V100-SXM2-16GB
Memory Clock Rate (MHz): 877
Memory Bus Width (bits): 4096
Peak Memory Bandwidth (GB/s): 898.05
Total Global Memory (MB): 16911.43
Compute capability: 7.0
Enter <0> to continue.
Wrong input string:
Output filename
Expected:
Device used for calculation
On line number: 1
Have you seen this behavior before?
@jat255 Glad to hear you got it compiled. I figured PGI probably had something included. Not sure how the performance compares, but it's probably not too far off. The screenshot you posted is expected behavior - if you have recorded your driving file using the CPU version of the code, you'll need to add 2 or 4 lines for it to work on the GPU. The first 2 lines at the top of the file should be:
Device used for calculation
0
This selects the device you want to use. It is really only necessary if you have a system with more than 1 available GPU, but that's fairly common on HPC systems. This, for example, means that if you submit a batch job on your cluster and have multiple GPUs, you could in theory submit one job to Device 0, one to Device 1, etc. and run them at the same time (though you would likely take a hit on performance by sharing CPU resources).
Depending on the type of simulation you are running - QEP only I believe, but don't quote me on that - you will also be asked whether or not you want to pre-calculate the potentials and hold them all in GPU memory, or calculate them on-the-fly if you don't have enough GPU memory for the entire simulation. Those lines are usually the last two lines of the driving file (or nearly there) and look like this:
<0> Precalculated potentials <1> On-the-fly calculation
0
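Putting those pieces together, a sketch of what the top of a GPU driving file would look like (the "Output filename" line is taken from the error message above, which suggests it was the first line of the original CPU file; everything after it stays as recorded):

```
Device used for calculation
0
Output filename
...
```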
Thank you for the help! I'm very new to MuSTEM and trying to get it running on our cluster for a colleague, so I'm unfamiliar with how the input files work (although learning). Your tips got me to the running stage (I'm trying to use the STEM_Al3Li_ABF_driver.txt example from the Tutorials directory). Adding those two lines, it successfully completed the "Pre-calculation setup", but bombed as soon as the calculation started with: line 175: cudaLaunchKernel returned status 98: invalid device function
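For context, cudaLaunchKernel status 98 ("invalid device function") usually means the binary contains no device code for the card's compute capability. A V100 is compute capability 7.0, so with PGI you can target it explicitly via -ta=tesla:cc70. The helper below is a hypothetical sketch (the card names and cc values are NVIDIA's published figures, not something from this thread):

```shell
# Hypothetical helper: pick the -ta=tesla:ccXX flag for a few common
# Tesla cards (cc values are NVIDIA's published compute capabilities).
gpu="Tesla V100-SXM2-16GB"   # e.g. as reported by nvidia-smi
case "$gpu" in
  *V100*) cc=cc70 ;;
  *P100*) cc=cc60 ;;
  *K80*)  cc=cc37 ;;
  *)      cc="" ;;
esac
echo "suggested flag: -ta=tesla:${cc:-unknown}"
```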
I think this had something to do with using the wrong CUDA options at compile time, so I changed the Makefile to:
PGF_FLAGS= -c -g -O3 -Mpreprocess -Mbackslash -Mconcur -Mextend -Mcuda=cuda10.1 -Dsingle_precision -ta=tesla:cuda10.1 -Mfree -Mrecursive -DGPU
Hn0_tcmp: intermediate
pgf90 -Mcuda=cuda10.1 -ta=tesla:cuda10.1 -o MU_STEM.out *.o -L${CUDA_HOME}/lib -lcufft
modules:
pgf90 $(PGF_FLAGS) quadpack.f90
pgf90 $(PGF_FLAGS) m_precision.f90
pgf90 $(PGF_FLAGS) m_string.f90
pgf90 $(PGF_FLAGS) m_numerical_tools.f90
pgf90 $(PGF_FLAGS) mod_global_variables.f90
pgf90 $(PGF_FLAGS) m_crystallography.f90
pgf90 $(PGF_FLAGS) m_electron.f90
pgf90 $(PGF_FLAGS) m_user_input.f90
pgf90 $(PGF_FLAGS) mod_cufft.f90
pgf90 $(PGF_FLAGS) mod_CUFFT_wrapper.f90
pgf90 $(PGF_FLAGS) mod_output.f90
pgf90 $(PGF_FLAGS) m_multislice.f90
pgf90 $(PGF_FLAGS) m_lens.f90
pgf90 $(PGF_FLAGS) m_tilt.f90
pgf90 $(PGF_FLAGS) m_absorption.f90
pgf90 $(PGF_FLAGS) mod_cuda_array_library.f90
pgf90 $(PGF_FLAGS) mod_cuda_potential.f90
pgf90 $(PGF_FLAGS) m_potential.f90
pgf90 $(PGF_FLAGS) MS_utilities.f90
pgf90 $(PGF_FLAGS) mod_cuda_setup.f90
pgf90 $(PGF_FLAGS) mod_cuda_ms.f90
pgf90 $(PGF_FLAGS) s_absorptive_stem.f90
pgf90 $(PGF_FLAGS) s_qep_tem.f90
pgf90 $(PGF_FLAGS) s_qep_stem.f90
pgf90 $(PGF_FLAGS) s_absorptive_tem.f90
pgf90 $(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
pgf90 $(PGF_FLAGS) *.f90
clean:
rm -f *.o *.mod *.tmp *.TMP *.out
And I had success! I was able to get the following output from the Al3Li ABF example from the tutorial:
Are there any benchmarks for these tutorial examples? I'm curious how fast/slow this simulation was in relation to other systems. The total time elapsed for this example was 42 seconds.
@jat255 That's a good start! I can't say I know of any benchmarks out there, especially since there are so many different system combos. What GPU hardware do you have access to?
For me, the Al3Li ABF example (by itself, none of the others) took:
** The above times are TOTAL times to run the entire program, not the time that muSTEM spits out. I measure this using time /path/to/muSTEM/code. For reference, running all 5 examples on my 2080Ti took 62s.
The difference on the CPU side between Intel and PGI seems (in my limited experience) to be related at least in part to the way they both treat threading on the CPU. That's important even if you're running the GPU accelerated version of the code because a lot of the routines are still run on the CPU.
I would imagine that you may be able to tweak some flags and get a bit more performance out of your system, but that's just something you'll have to play with. Take a look at the PGI man page or these (somewhat dated) descriptions from Dartmouth and Mines, and then just rerun the same simulation with different compiler options set to see what gives you the best performance. Speed differences may be more apparent if you compile the CPU-only version while you're sorting it out, and then add in -DGPU once you know which makefile works best.
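A hedged sketch of that flag sweep, assuming the makefile from this thread and that PGF_FLAGS can be overridden on the make command line (the build and run lines are commented placeholders for your own paths):

```shell
# Sketch: loop over optimization levels, rebuilding and timing each run.
# Uncomment the make/time lines and point them at your own makefile and
# driving file; the loop itself just enumerates the candidate flags.
for opt in -O0 -fast -O3 -O4; do
  echo "testing $opt"
  # make clean && make PGF_FLAGS="-c -g $opt -Mpreprocess -Mbackslash -Mconcur -Mextend -Dsingle_precision -Mfree -Mrecursive"
  # time ./MU_STEM.out nopause
done
```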
Thanks for the timings. Looks like I have some work to do optimizing. This was on one node of a cluster system, set up with one Nvidia Tesla V100 GPU (it can access up to four of them, but it seems like there's not a benefit to multiple GPUs with MuSTEM). Each node has two IBM POWER9 SMT4 CPUs, each with 20 cores (80 threads) and clocked at 2.25 GHz, so I'm sure I can get some good performance with some more tweaking.
Although... it's not promising that in your benchmarks the PGI compilations are the slowest of all of them!
Is there anything I should be doing in relation to MPI?
@bryandesser I'm having some trouble compiling the CPU version, with errors related to:
./builds/MuSTEM/source/CPU/mod_CUFFT_wrapper.f90:664: undefined reference to `sfftw_plan_dft_2d_'
I'm trying to use your CPU Makefile example, but customizing to my system:
PGF_FLAGS= -c -g -O3 -Mpreprocess -Mbackslash -Mconcur -Mextend -Mcuda=cuda10.1 -Dsingle_precision -ta=tesla:cuda10.1 -Mfree -Mrecursive -I/home/jat/.local/include
CUDA_HOME=/share/sw/pgi/linuxpower/2019/cuda/10.1/
Hn0_tcmp: intermediate
pgf90 -Mcuda=cuda10.1 -ta=tesla:cuda10.1 -o MU_STEM.CPU.out *.o -I/home/jat/.local/include -L${CUDA_HOME}/lib -lcufft
modules:
pgf90 $(PGF_FLAGS) quadpack.f90
pgf90 $(PGF_FLAGS) mod_CUFFT_wrapper.f90
pgf90 $(PGF_FLAGS) m_precision.f90
pgf90 $(PGF_FLAGS) m_string.f90
pgf90 $(PGF_FLAGS) m_numerical_tools.f90
pgf90 $(PGF_FLAGS) mod_global_variables.f90
pgf90 $(PGF_FLAGS) m_crystallography.f90
pgf90 $(PGF_FLAGS) m_electron.f90
pgf90 $(PGF_FLAGS) m_user_input.f90
pgf90 $(PGF_FLAGS) mod_output.f90
pgf90 $(PGF_FLAGS) m_multislice.f90
pgf90 $(PGF_FLAGS) m_lens.f90
pgf90 $(PGF_FLAGS) m_tilt.f90
pgf90 $(PGF_FLAGS) m_absorption.f90
pgf90 $(PGF_FLAGS) m_potential.f90
pgf90 $(PGF_FLAGS) MS_utilities.f90
pgf90 $(PGF_FLAGS) s_absorptive_stem.f90
pgf90 $(PGF_FLAGS) s_qep_tem.f90
pgf90 $(PGF_FLAGS) s_qep_stem.f90
pgf90 $(PGF_FLAGS) s_absorptive_tem.f90
pgf90 $(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
pgf90 $(PGF_FLAGS) *.f90
clean:
rm -f *.o *.mod *.tmp *.TMP *.out
Any thoughts on what could be wrong? I'm a little unsure of why I need the mod_CUFFT_wrapper.f90 file, since there shouldn't be any CUDA in the CPU version, right?
@jat255 mod_CUFFT_wrapper.f90 has a preprocessor command that allows it to easily switch between GPU and CPU-only versions of the code (#ifdef GPU ... #else). It helps to plan the different types of FFTs using an interface function that basically allows the code to use one call like fft() or ifft() and still work whether you are using -Dsingle_precision or -Ddouble_precision. This also relies on having the proper FFTW3 files linked on your system.
A couple of things:
- Remove the -Mcuda and -ta flags and the CUDA_HOME variable for a CPU-only compilation
- Make sure the FFTW libraries are available and loaded on your cluster (e.g. module load pgi fftw)

I could replicate your errors using your makefile (and have seen this before on my own machine). It's simply a case of not linking correctly to the FFTW libraries. I slightly modified your makefile (below) and could compile and run just fine without the use of MKL.
The only thing you'll need to change is the directory associated with FFTW_DIR based on your machine. Just make sure the directory you select has both include/ and lib/ subdirectories, where include/ contains files such as fftw3.f, and lib/ contains the files libfftw3.a, libfftw3f.a, libfftw3_threads.a, and libfftw3f_threads.a. These are the files that contain the FFT routines that were throwing the errors during linking.
It's probably also worth checking the PGI manual page for optimization options: it's ~2.25x slower without MKL, and pgf90 was already almost 2x slower than ifort. I would just systematically cycle through combinations of them to see what's best.
FFTW_DIR = /usr/local/fftw/3.3.5-gcc/
PGF_FLAGS = -c -g -O3 -Mpreprocess -mp -Mbackslash -Mconcur -Mextend -Dsingle_precision -Mfree -Mrecursive -I${FFTW_DIR}include
FFTW_LIBS = -L${FFTW_DIR}lib -lfftw3f -lfftw3f_threads -lfftw3 -lfftw3_threads
Hn0_tcmp: intermediate
pgf90 -o MU_STEM.CPU.out *.o ${FFTW_LIBS}
rm -f *.o *.mod *.tmp *.TMP
modules:
pgf90 $(PGF_FLAGS) quadpack.f90
pgf90 $(PGF_FLAGS) mod_CUFFT_wrapper.f90
pgf90 $(PGF_FLAGS) m_precision.f90
pgf90 $(PGF_FLAGS) m_string.f90
pgf90 $(PGF_FLAGS) m_numerical_tools.f90
pgf90 $(PGF_FLAGS) mod_global_variables.f90
pgf90 $(PGF_FLAGS) m_crystallography.f90
pgf90 $(PGF_FLAGS) m_electron.f90
pgf90 $(PGF_FLAGS) m_user_input.f90
pgf90 $(PGF_FLAGS) mod_output.f90
pgf90 $(PGF_FLAGS) m_multislice.f90
pgf90 $(PGF_FLAGS) m_lens.f90
pgf90 $(PGF_FLAGS) m_tilt.f90
pgf90 $(PGF_FLAGS) m_absorption.f90
pgf90 $(PGF_FLAGS) m_potential.f90
pgf90 $(PGF_FLAGS) MS_utilities.f90
pgf90 $(PGF_FLAGS) s_absorptive_stem.f90
pgf90 $(PGF_FLAGS) s_qep_tem.f90
pgf90 $(PGF_FLAGS) s_qep_stem.f90
pgf90 $(PGF_FLAGS) s_absorptive_tem.f90
pgf90 $(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
pgf90 $(PGF_FLAGS) *.f90
clean:
rm -f *.o *.mod *.tmp *.TMP *.out
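Before pointing the makefile at a given FFTW_DIR, it can save a round of link errors to confirm the four libraries are actually present. A small sketch (the default path is just the example value from the makefile above; adjust it for your machine):

```shell
# Check that an FFTW prefix has the single/double precision libraries
# plus the threaded variants that the makefile links against.
FFTW_DIR=${FFTW_DIR:-/usr/local/fftw/3.3.5-gcc}
missing=0
for f in libfftw3.a libfftw3f.a libfftw3_threads.a libfftw3f_threads.a; do
  if [ -e "${FFTW_DIR}/lib/${f}" ]; then
    echo "ok: $f"
  else
    echo "missing: $f"
    missing=$((missing+1))
  fi
done
echo "$missing libraries missing"
```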
Thanks @bryandesser. Since this system does not have a module for FFTW, I had to compile that manually, but with that completed, I was able to compile the CPU version of the code successfully. I tried a few different optimization flags (O0, fast, O3, and O4), which resulted in the following timings for the Al3Li example (as measured by time):
Flags (Fri Aug 23 17:14:28 EDT 2019) | Time
---|---
O0 | 5m26.671s
fast | 6m3.855s
O3 | 6m3.928s
O4 | 6m9.060s
So it looks like none of the compiler optimizations actually help anything with respect to the CPU routines. MuSTEM always says it's using 4 CPU threads. Is there a way to control that number? I don't see anything in the manual about it.
I've run into another issue though when trying to run the GPU code again. I've compiled with the same Makefile I posted above that worked for me, but now when I'm running the code, it's not detecting any GPUs, and telling me I have 0 MB of memory in the GPU memory section. When I run nvidia-smi, however, I can see two GPUs attached that are completely free. Any idea why this could happen?
EDIT: Nevermind on that issue, I think I changed one of the source files when trying to compile for CPU, which was causing issues. I re-downloaded the GPU sources, and it seems to be working now.
@jat255 I'll throw out a few thoughts/suggestions, though they may not be worth much given that I'm not familiar with the Power architecture:
- When you ran ./configure for the FFTW3 build, did you add in flags for OpenMP (--enable-openmp) and threading (--enable-threads)? I don't believe you need the OpenMP one since FFTW directly spawns its own threads using fftw3_threads, but I enabled it for good measure. I also assume this was all built correctly since you got it to compile/run, but it's worth a mention.
- In htop, do you see all 80 threads? It may be the case that htop is not installed by default; you can also check with grep -c ^processor /proc/cpuinfo.
- You can also verify what OpenMP itself sees with this small test program:
program thread_test
integer(4) :: omp_get_max_threads, omp_get_num_procs, omp_get_thread_num
write(*,*) 'omp_get_max_threads = ', omp_get_max_threads()
write(*,*) 'omp_get_num_procs = ', omp_get_num_procs()
!$omp parallel
write(*,*) 'thread #',omp_get_thread_num()
!$omp end parallel
end program
For my Xeon W2195 (4 cores, 8 threads) this gave the following (expected) output:
$ pgf90 -mp thread_test.f90
$ ./a.out
omp_get_max_threads = 8
omp_get_num_procs = 8
thread # 3
thread # 7
thread # 1
thread # 4
thread # 6
thread # 0
thread # 5
thread # 2
Thanks again for the feedback. I came to some of the same conclusions myself earlier today, noticing that I could provide the number of nodes (-N), tasks (-n), or CPUs per task (-c) to SLURM via srun. In htop, I can see all 160 cores (two CPUs with 80 cores each) on a node when I connect to an interactive session via bash. If I run the example code you provided with increasing -c, omp_max_threads increases in multiples of 4 accordingly:
-c | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
---|---|---|---|---|---|---|---|---|---|---|---
omp_max_threads | 1 | 2 | 4 | 4 | 8 | 8 | 8 | 8 | 12 | 12 | 12 | 12
I did not enable OpenMP in the FFTW3 compilation, but I did do threading; I'll try that to see if it makes a difference. The environment I used for the configuration was:
env CC=pgcc CFLAGS="-fast -Minfo -fPIC" F77=pgfortran \
FFLAGS="-fast -Minfo" ./configure --enable-threads \
--enable-shared --enable-vsx --prefix=${HOME}/.local # for double precision
and
env CC=pgcc CFLAGS="-fast -Minfo -fPIC" F77=pgfortran \
FFLAGS="-fast -Minfo" ./configure --enable-threads \
--enable-shared --enable-vsx --enable-single \
--prefix=${HOME}/.local # for single precision
Without OpenMP enabled for FFTW3, I obtained the following results when systematically increasing the -c parameter for SLURM on the CPU code:
-c | Real time
---|---
1 | 5m32.664s |
2 | 5m36.368s |
4 | 5m30.853s |
8 | 5m17.383s |
16 | 5m17.430s |
32 | 5m44.462s |
64 | 5m27.305s |
128 | 5m22.341s |
I'll recompile FFTW with OpenMP support to see if that makes a difference.
@jat255 That's an important point about what environment you're requesting via SLURM. If you see omp_max_threads increasing with increasing -c values, you should also see the value muSTEM spits out increase accordingly. It's a bit surprising that the timing doesn't then also scale with it. I'll be interested to hear if you get it to improve. At the end of the day, though, it may not really be worth it if the simulations you're looking to run can all fit into GPU memory, since that timing looked reasonably fast at 42s.
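One more knob worth trying, on the assumption (mine, not confirmed in this thread) that muSTEM built with -mp just uses the default OpenMP runtime: the standard OMP_NUM_THREADS variable should steer the thread count it reports at startup, independent of what SLURM hands you:

```shell
# Sketch: request a specific OpenMP thread count before launching muSTEM.
export OMP_NUM_THREADS=16
# ./MU_STEM.CPU.out nopause   # placeholder path for the build from this thread
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```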
I noticed something interesting when trying to recompile FFTW: even though I was loading the PGI compilers and setting my environment, make was using xlc_r as the C compiler! I apparently have run into this bug. Using the workaround there, I'm getting some errors related to altivec.h when trying to actually compile with PGI:
PGC-F-0249-#error -- Use the "-maltivec" flag to enable PowerPC AltiVec support (/usr/lib/gcc/ppc64le-redhat-linux/4.8.5/include/altivec.h: 34)
PGC/power Linux 19.4-0: compilation aborted
accompanied by warnings during ./configure
:
configure: WARNING: altivec.h: present but cannot be compiled
configure: WARNING: altivec.h: check for missing prerequisite headers?
configure: WARNING: altivec.h: see the Autoconf documentation
configure: WARNING: altivec.h: section "Present But Cannot Be Compiled"
configure: WARNING: altivec.h: proceeding with the compiler's result
configure: WARNING: ## ---------------------------- ##
configure: WARNING: ## Report this to fftw@fftw.org ##
configure: WARNING: ## ---------------------------- ##
I've posted on the PGI forums in hopes of more help, but I'm thinking this might be why it was so slow.
FYI, the total time (as measured by time) for the GPU calculation was 1m27s when I redid it; I reported 42 seconds earlier since that was what MuSTEM itself showed.
> If you see omp_max_threads increasing with increasing -c values, you should also see the value muSTEM spits out increase accordingly.
This is indeed the case:
$ find . -name "mustem_CPU*" | sort -V | xargs grep "number of threads"
./mustem_CPU.1.out: The number of threads being used on the CPU is: 4
./mustem_CPU.2.out: The number of threads being used on the CPU is: 4
./mustem_CPU.4.out: The number of threads being used on the CPU is: 4
./mustem_CPU.8.out: The number of threads being used on the CPU is: 8
./mustem_CPU.16.out: The number of threads being used on the CPU is: 16
./mustem_CPU.32.out: The number of threads being used on the CPU is: 32
./mustem_CPU.64.out: The number of threads being used on the CPU is: 64
./mustem_CPU.128.out: The number of threads being used on the CPU is: 128
So I think I've about tapped out my optimizations, and I'm getting pretty similar timings as before. One more question...
I'm getting a PGF90-S-0034-Syntax error on line 145 of muSTEM.f90:
open(6,carriagecontrol ='fortran')
If I comment this whole block out, I can compile the CPU version, but I'm not sure what effect that might have. Any idea why this would be failing?
@jat255 I see that error, too, and commenting it out is not a problem at all. I'm sure it's an easy fix, but I've never tried to work it out. The only thing it does is make the output during the simulation show up on one line vs printing all of the lines.
RE performance, the only other suggestion I have at the moment is to watch htop while running muSTEM in an interactive session and see how/if it is truly spreading across the number of threads that it displays at startup. That could give you an idea of how it's interacting with the architecture.
@bryandesser gotcha. Glad my workaround was not a problem.
I think I'll close out this issue, but I took a look at htop during a run with 32 cores, and it looks like the actual calculation part is multi-threaded (the part after "Calculation running"), but the calculation of the absorptive scattering factors appears to only use a single thread. Any idea why this might be?
(Sorry for the huge gif, but wanted to show you what it looks like while it's running) During this run, the 4 cores at the top are a different user. My allocation was CPUs ~9 through 40, it looks like.
For final "posterity", I'll leave my build configuration here in case it's of use to someone in the future. The SLURM commands are specific to my system, of course, but I think this should help anyone that's working on a PowerPC system:
The CPU version of the code requires a math library (FFTW was used in this example). Change FFTW_DIR to wherever you installed the FFTW libraries.
$ git clone https://github.com/HamishGBrown/MuSTEM.git
$ cd MuSTEM/Source
$ srun --pty --partition=debug --time=1:00:00 bash
At this point, I had to replace line 145 of muSTEM.f90 with the statement continue, because pgf90 complained about the syntax of this line.
$ module load pgi
$ echo -e "# change this to wherever your FFTW is installed
FFTW_DIR = /home/jat/install/fftw_gcc920/
PGF_FLAGS = -c -g -O3 -Mpreprocess -mp -Mbackslash -Mconcur -Mextend -Dsingle_precision -Mfree -Mrecursive -I\${FFTW_DIR}include
FFTW_LIBS = -L\${FFTW_DIR}lib -lfftw3f -lfftw3f_threads -lfftw3 -lfftw3_threads
Hn0_tcmp: intermediate
\t# Change this path to control where the executable is written
\tpgf90 -o ../MU_STEM.CPU.out *.o \${FFTW_LIBS}
\trm -f *.o *.mod *.tmp *.TMP
modules:
\tpgf90 \$(PGF_FLAGS) quadpack.f90
\tpgf90 \$(PGF_FLAGS) mod_CUFFT_wrapper.f90
\tpgf90 \$(PGF_FLAGS) m_precision.f90
\tpgf90 \$(PGF_FLAGS) m_string.f90
\tpgf90 \$(PGF_FLAGS) m_numerical_tools.f90
\tpgf90 \$(PGF_FLAGS) mod_global_variables.f90
\tpgf90 \$(PGF_FLAGS) m_crystallography.f90
\tpgf90 \$(PGF_FLAGS) m_electron.f90
\tpgf90 \$(PGF_FLAGS) m_user_input.f90
\tpgf90 \$(PGF_FLAGS) mod_output.f90
\tpgf90 \$(PGF_FLAGS) m_multislice.f90
\tpgf90 \$(PGF_FLAGS) m_lens.f90
\tpgf90 \$(PGF_FLAGS) m_tilt.f90
\tpgf90 \$(PGF_FLAGS) m_absorption.f90
\tpgf90 \$(PGF_FLAGS) m_potential.f90
\tpgf90 \$(PGF_FLAGS) MS_utilities.f90
\tpgf90 \$(PGF_FLAGS) s_absorptive_stem.f90
\tpgf90 \$(PGF_FLAGS) s_qep_tem.f90
\tpgf90 \$(PGF_FLAGS) s_qep_stem.f90
\tpgf90 \$(PGF_FLAGS) s_absorptive_tem.f90
\tpgf90 \$(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
\tpgf90 \$(PGF_FLAGS) *.f90
clean:
\trm -f *.o *.mod *.tmp *.TMP *.out" > Makefile
$ make
This puts an executable named MU_STEM.CPU.out into the μSTEM source directory.
Since the CUDA libraries include the FFT code, we do not need to link to FFTW in order to build. From the head node, in the MuSTEM/Source directory:
$ srun --pty --partition=debug --time=1:00:00 --gres=gpu:1 bash
$ ln -s GPU_routines/* . # required to put GPU code into same folder as the rest of the code;
# these links will conflict with a CPU-only compilation, so remove them if you
# need to compile the CPU version again
$ module load pgi
$ CUDA_HOME=/share/sw/pgi/linuxpower/2019/cuda/10.1/
$ echo -e "PGF_FLAGS= -c -g -O3 -Mpreprocess -Mbackslash -Mconcur -Mextend -Mcuda=cuda10.1 -Dsingle_precision -ta=tesla:cuda10.1 -Mfree -Mrecursive -DGPU
Hn0_tcmp: intermediate
\t pgf90 -Mcuda=cuda10.1 -ta=tesla:cuda10.1 -o ../MU_STEM.GPU.out *.o -L${CUDA_HOME}/lib -lcufft
modules:
\t pgf90 \$(PGF_FLAGS) quadpack.f90
\t pgf90 \$(PGF_FLAGS) m_precision.f90
\t pgf90 \$(PGF_FLAGS) m_string.f90
\t pgf90 \$(PGF_FLAGS) m_numerical_tools.f90
\t pgf90 \$(PGF_FLAGS) mod_global_variables.f90
\t pgf90 \$(PGF_FLAGS) m_crystallography.f90
\t pgf90 \$(PGF_FLAGS) m_electron.f90
\t pgf90 \$(PGF_FLAGS) m_user_input.f90
\t pgf90 \$(PGF_FLAGS) mod_cufft.f90
\t pgf90 \$(PGF_FLAGS) mod_CUFFT_wrapper.f90
\t pgf90 \$(PGF_FLAGS) mod_output.f90
\t pgf90 \$(PGF_FLAGS) m_multislice.f90
\t pgf90 \$(PGF_FLAGS) m_lens.f90
\t pgf90 \$(PGF_FLAGS) m_tilt.f90
\t pgf90 \$(PGF_FLAGS) m_absorption.f90
\t pgf90 \$(PGF_FLAGS) mod_cuda_array_library.f90
\t pgf90 \$(PGF_FLAGS) mod_cuda_potential.f90
\t pgf90 \$(PGF_FLAGS) m_potential.f90
\t pgf90 \$(PGF_FLAGS) MS_utilities.f90
\t pgf90 \$(PGF_FLAGS) mod_cuda_setup.f90
\t pgf90 \$(PGF_FLAGS) mod_cuda_ms.f90
\t pgf90 \$(PGF_FLAGS) s_absorptive_stem.f90
\t pgf90 \$(PGF_FLAGS) s_qep_tem.f90
\t pgf90 \$(PGF_FLAGS) s_qep_stem.f90
\t pgf90 \$(PGF_FLAGS) s_absorptive_tem.f90
\t pgf90 \$(PGF_FLAGS) muSTEM.f90
intermediate: *.f90 modules
\t pgf90 \$(PGF_FLAGS) *.f90
clean:
\t rm -f *.o *.mod *.tmp *.TMP *.out" > Makefile_GPU
$ make -f Makefile_GPU clean
$ make -f Makefile_GPU
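For completeness, running the GPU build through SLURM would look something like the command below (the partition, time, and gres values are the ones used on this particular cluster, so treat them as placeholders; nopause suppresses the interactive prompts, per the startup banner):

```shell
# Placeholder SLURM invocation for the GPU build; adjust partition, time,
# and gres to your cluster. 'nopause' avoids the "Press enter" pauses.
cmd="srun --partition=debug --time=1:00:00 --gres=gpu:1 ../MU_STEM.GPU.out nopause"
echo "$cmd"   # run from the MuSTEM/Source directory
```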
I have access to an IBM Power9 GPU cluster. Obviously, the MKL libraries are not available for this architecture, but I was planning on trying to use the Makefile posted by @bryandesser. I'm relatively inexperienced when it comes to compiling things for clusters and fortran in general, but I was hoping to get this working on this system, since it has many powerful GPUs.
I'll update with how I make out, but I figured I should ask to see if there's any known limitations in the code that would prevent it from working without access to MKL.