PrincetonUniversity / tristan-mp-v2

Tristan-MP v2 [public]
https://princetonuniversity.github.io/tristan-v2/
BSD 3-Clause "New" or "Revised" License

Problem compiling/running on an NCAR AMD EPYC machine #5

Closed haykh closed 3 months ago

haykh commented 2 years ago

Copying from an emailed issue.

What computer/cluster are you using? I mean what CPUs does it have?

It’s a typical Cray system with AMD Rome CPUs, similar to the ones described in this article about NCAR: AMD, Cray, Nvidia Behind Massive NCAR Supercomputer Upgrade (nextplatform.com)

And what compiler and flags do you use in the Makefile?

I tried the following ones:

1:

FC := h5pfc
FFLAGS := -O3 -DSoA -ipo -qopenmp-simd -qopt-report=5 -qopt-streaming-stores auto
PFLAGS := -DHDF5 -DMPI -DtwoD -DNGHOST=3

The above flags give me the errors in the links I described (in the email I sent earlier today). The errors occur during compilation, so with these flags I never get to run the code.

But I also tried,

2:

FC := ftn
LD := h5pfc
FFLAGS := -Ofast -O3 -DSoA #-qopenmp-simd -qopt-report=5 -qopt-streaming-stores auto
PFLAGS := -DHDF5 -DMPI -DtwoD -DNGHOST=3

Note that with this set, the code compiles (it does print the m_name.mod files, and it creates the executable tristan-mp2d; is this correct?). By the way, ftn is the Fortran compiler in our programming environment.

And,

3:

FC := ftn
FFLAGS := -Ofast -O3 -DSoA -ipo -qopenmp-simd -qopt-report=5 -qopt-streaming-stores auto
-I/../hdf5-parallel/*/include -dynamic -Wl,-rpath
-Wl,/../hdf5-parallel/*/INTEL/19.1/lib
-Wl,/../mpich/8.1.9/ofi/intel/19.0/lib/libmpi_intel.so
PFLAGS := -DHDF5 -DMPI -DtwoD -DNGHOST=3

Note that with this set, the code also compiles (it doesn’t print any m_name.mod files, but it does create the executable tristan-mp2d; is this correct?). However, it fails when running. The message, after hundreds of timesteps, is:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
tristan-mp2d       00000000004AF0AA  for__signal_handl     Unknown  Unknown
libpthread-2.26.s  00001534C199F310  Unknown               Unknown  Unknown
libc-2.26.so       00001534C1659311  cfree                 Unknown  Unknown

I remember vectorization flags are a bit different for Intel and AMD machines, even when using intel compilers on both.

It may be the vectorization flags. For example, when using the -ipo flag I get the following warning (not an error) during compilation: ipo: warning #11021: unresolved PMI2_Finalize

        Referenced in /…/mpi/8.1.9/ofi/intel/19.0/lib/libmpi_intel.so

However, I still get the runtime error below (actually, even if I don’t include the -ipo flag):

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
tristan-mp2d       00000000004AF0AA  for__signal_handl     Unknown  Unknown
libpthread-2.26.s  00001534C199F310  Unknown               Unknown  Unknown
libc-2.26.so       00001534C1659311  cfree                 Unknown  Unknown
haykh commented 2 years ago

I updated the configure script adding a few safeguards and AMD support. Could you please try the following configuration:

python configure.py ... -mpi08 -amd --debug=1

This way we can at least localize the problem, since it disables hdf5, vectorization, and ipo optimization.

If this works, we can move on to turning vectorization on (notice that AMD EPYCs do not support avx512):

python configure.py ... -mpi08 -amd -avx2
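As a quick sanity check (a standard Linux command, nothing Tristan-specific), you can list which AVX-family extensions the node's CPU actually advertises; on AMD EPYC Rome this should show avx and avx2 but no avx512 entries:

```shell
# List the unique AVX-family flags the CPU reports. Run this on a compute
# node, since login nodes can have different CPUs than compute nodes.
grep -o 'avx[0-9_]*' /proc/cpuinfo | sort -u
```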

If it fails, we might also try an older MPI standard (just replace -mpi08 with the -mpi flag when configuring).

For now I would not change anything manually in the Makefile. Let's go with the specified defaults.

What modules do you load on the cluster? I mean, which compilers are actually being used? The code relies on intel compilers for linear indexing of multi-D arrays, so it most likely will not run if you compile with gcc.

RSoto28 commented 2 years ago

Hi, I followed your instructions and used the -mpi08 -amd --debug=1 flags. The user files I used were langmuir and 2d_rec. I get warnings during compilation (see below), but the code does create the executable. However, I get the following error when running (with both user files): forrtl: error (72): floating overflow

RSoto28 commented 2 years ago

The warnings I get when compiling 'langmuir' are:

...
ftn -traceback -fpe0 -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 -module obj/ -c src_/exchangeparticles.F90 -o obj/exchangeparticles.o
src_/exchangeparticles.F90(341): warning #8100: The actual argument is an array section or assumed-shape array, corresponding dummy argument that has either the VOLATILE or ASYNCHRONOUS attribute shall be an assumed-shape array.   [ENROUTE]
call MPI_ISEND(enroute_bot%get(ind1,ind2,ind3)%enroute(1:enroute_bot%get(ind1,ind2,ind3)%cnt),&
-----------------------------------------------------------^
ftn -traceback -fpe0 -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 -module obj/ -c src_/exchangefields.F90 -o obj/exchangefields.o
ftn -traceback -fpe0 -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 -module obj/ -c src_/exchangecurrents.F90 -o obj/exchangecurrents.o
...
ftn -traceback -fpe0 -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 -module obj/ -c src_/userlangmuir.F90 -o obj/userlangmuir.o
src_/userlangmuir.F90(27): warning #6178: The return value of this FUNCTION has not been defined.   [USERSPATIALDISTRIBUTION]
function userSpatialDistribution(x_glob, y_glob, z_glob,&
-----------^
src_/userlangmuir.F90(34): warning #6178: The return value of this FUNCTION has not been defined.   [USERSLBLOAD]
function userSLBload(x_glob, y_glob, z_glob,&
-----------^
ftn -traceback -fpe0 -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 -module obj/ -c src_/writeslice.F90 -o obj/writeslice.o
...
ftn -traceback -fpe0 -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 -o exec/tristan-mp2d obj/globalnamespace.o obj/outputnamespace.o obj/auxiliary_.o obj/errorhandling.o obj/readinput.o obj/domain.o obj/fields.o obj/particles.o obj/helpers.o obj/fieldlogistics.o obj/particlebinning.o obj/binarycoupling.o obj/particlelogistics.o obj/particledownsampling.o obj/thermalplasma.o obj/powerlawplasma.o obj/exchangeparticles.o obj/exchangefields.o obj/exchangecurrents.o obj/exchangearray.o obj/outputlogistics.o obj/writeusroutput.o obj/restart_.o obj/staticlb.o obj/adaptivelb.o obj/loadbalancing_.o obj/userlangmuir.o obj/writeslice.o obj/writeparams.o obj/write_totflds.o obj/write_totprtl.o obj/writespectra.o obj/writediagnostics.o obj/writetot.o obj/writehistory.o obj/particlemover.o obj/fieldsolver.o obj/currentdeposit.o obj/filtering.o obj/initialize.o obj/finalize.o obj/tristanmainloop.o obj/testsite.o obj/tristan.o

RSoto28 commented 2 years ago

The errors I get when running 'langmuir' are:

........................................................................

== Fiducial physical parameters ========================================
  skin depth [dx]: 10.000
  plasma oscillation period [dt]: 139.626
  gyroradius [dx]: 4.472
  gyration period [dt]: 62.443
........................................................................


Timestep: 0......................................................[DONE]
[ROUTINE]        [TIME, ms]   [MIN / MAX, ms]        [FRACTION, %]
Full_step:       50.354       48.341     53.857
move_step:       6.015        5.964      6.300       11.945
deposit_step:    4.560        4.350      4.798       9.055
filter_step:     3.509        3.067      4.200       6.969
fld_exchange:    26.318       25.303     27.798      52.267
prtl_exchange:   4.766        3.706      5.294       9.465
fld_solver:      0.036        6.342E-03  0.783       0.072
usr_funcs:       0.108        0.099      0.311       0.214
output_step:     5.038        3.494      8.272       10.004
[NPART per S]    [AVERAGE]    [MIN/MAX per CPU]      [TOTAL]
species # 1      1.600E+05    1.599E+05  1.601E+05   4.480E+06
species # 2      0.000        0.000      0.000       0.000
.......................................................................

forrtl: error (72): floating overflow
Image              PC                Routine            Line        Source
tristan-mp2d       000000000049260B  Unknown               Unknown  Unknown
libpthread-2.26.s  00001506685DC310  Unknown               Unknown  Unknown
tristan-mp2d       000000000046C895  m_mover_mp_movepa          11  boris_push.F
tristan-mp2d       0000000000475DC7  m_mainloop_mp_mai         101  tristanmainloop.F90
tristan-mp2d       0000000000478DFA  MAIN                        9  tristan_.F90
tristan-mp2d       000000000040B912  Unknown               Unknown  Unknown
libc-2.26.so       00001506682323EA  __libc_start_main     Unknown  Unknown
tristan-mp2d       000000000040B82A  Unknown               Unknown  Unknown
forrtl: error (72): floating overflow
Image              PC                Routine            Line        Source
tristan-mp2d       000000000049260B  Unknown               Unknown  Unknown
libpthread-2.26.s  000015501D3BF310  Unknown               Unknown  Unknown
tristan-mp2d       000000000046C895  m_mover_mp_movepa          11  boris_push.F
tristan-mp2d       0000000000475DC7  m_mainloop_mp_mai         101  tristanmainloop.F90
tristan-mp2d       0000000000478DFA  MAIN                        9  tristan_.F90
tristan-mp2d       000000000040B912  Unknown               Unknown  Unknown
libc-2.26.so       000015501D0153EA  __libc_start_main     Unknown  Unknown
tristan-mp2d       000000000040B82A  Unknown               Unknown  Unknown
forrtl: error (72): floating overflow

RSoto28 commented 2 years ago

The warning I get when compiling with user file 2d_rec is:

make all
mkdir -p obj/ exec/ src_/
cpp -nostdinc -C -P -w -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 src/globalnamespace.F90 > src_/globalnamespace.F90;
cpp -nostdinc -C -P -w -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 src/outputnamespace.F90 > src_/outputnamespace.F90;
cpp -nostdinc -C -P -w -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 src/tools/auxiliary.F90 > src_/auxiliary.F90;
...
ftn -traceback -fpe0 -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 -module obj/ -c src_/powerlawplasma.F90 -o obj/powerlawplasma.o
ftn -traceback -fpe0 -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 -module obj/ -c src_/exchangeparticles.F90 -o obj/exchangeparticles.o
src_/exchangeparticles.F90(341): warning #8100: The actual argument is an array section or assumed-shape array, corresponding dummy argument that has either the VOLATILE or ASYNCHRONOUS attribute shall be an assumed-shape array.   [ENROUTE]
call MPI_ISEND(enroute_bot%get(ind1,ind2,ind3)%enroute(1:enroute_bot%get(ind1,ind2,ind3)%cnt),&
-----------------------------------------------------------^
ftn -traceback -fpe0 -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 -module obj/ -c src_/exchangefields.F90 -o obj/exchangefields.o
ftn -traceback -fpe0 -DMPI08 -DDEBUG -DtwoD -DNGHOST=3 -module obj/ -c src_/exchangecurrents.F90 -o obj/exchangecurrents.o

RSoto28 commented 2 years ago

The error I get when running with 2d_rec is essentially the same as before, but in this case not even timestep 0 completed:

........................................................................

== Fiducial physical parameters ========================================
  skin depth [dx]: 5.000
  plasma oscillation period [dt]: 69.813
  gyroradius [dx]: 0.500
  gyration period [dt]: 6.981
........................................................................

forrtl: error (72): floating overflow
Image              PC                Routine            Line        Source
tristan-mp2d       0000000000497BEB  Unknown               Unknown  Unknown
libpthread-2.26.s  000014871DABF310  Unknown               Unknown  Unknown
tristan-mp2d       0000000000471F0D  m_mover_mp_movepa          18  boris_push.F
tristan-mp2d       000000000047B3A7  m_mainloop_mp_mai         101  tristanmainloop.F90
tristan-mp2d       000000000047E3DA  MAIN                        9  tristan_.F90
tristan-mp2d       000000000040B992  Unknown               Unknown  Unknown
libc-2.26.so       000014871D7153EA  __libc_start_main     Unknown  Unknown
tristan-mp2d       000000000040B8AA  Unknown               Unknown  Unknown
forrtl: error (72): floating overflow
Image              PC                Routine            Line        Source
tristan-mp2d       0000000000497BEB  Unknown               Unknown  Unknown
libpthread-2.26.s  000014B026723310  Unknown               Unknown  Unknown
tristan-mp2d       0000000000418033  m_helpers_mp_inte         576  helpers.F90
tristan-mp2d       00000000004719DC  m_mover_mp_movepa          89  particlemover.F90
tristan-mp2d       000000000047B3A7  m_mainloop_mp_mai         101  tristanmainloop.F90
tristan-mp2d       000000000047E3DA  MAIN                        9  tristan_.F90
tristan-mp2d       000000000040B992  Unknown               Unknown  Unknown
libc-2.26.so       000014B0263793EA  __libc_start_main     Unknown  Unknown
tristan-mp2d       000000000040B8AA  Unknown               Unknown  Unknown

haykh commented 2 years ago

The warnings are actually fine, the MPI_ISEND is never used for your case.

As for the errors, that is very weird. My bet would be that the interpolation is doing something wrong. I have seen this kind of behavior before on different compilers. The reason behind it is that the interpolation uses 1d indexing of 3d arrays (which is supported by intel compilers and not by, e.g., gcc). In other words, instead of doing, say, ex(i, j, k) we do ex(lind, 0, 0) where lind is something like lind = i + (NGHOST + j) * (sx + 2 * NGHOST) + (NGHOST + k) * (sx + 2 * NGHOST) * (sy + 2 * NGHOST). This is a little bit faster for certain compilers, but evidently it fails for the others.
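To make the indexing trick concrete, here is a small sketch in Python/NumPy (not the actual Tristan-MP source; NGHOST, sx, sy, sz and the exact ghost-cell offsets are illustrative assumptions) verifying that such a linear index picks out the same element as ordinary (i, j, k) indexing in a column-major (Fortran-ordered) array, which is what ex(lind, 0, 0) exploits:

```python
import numpy as np

# Illustrative stand-ins for the code's ghost-zone width and tile sizes
NGHOST, sx, sy, sz = 3, 8, 6, 4
nx, ny, nz = sx + 2 * NGHOST, sy + 2 * NGHOST, sz + 2 * NGHOST

# A field array including ghost cells, laid out column-major like Fortran
ex = np.arange(nx * ny * nz, dtype=np.float64).reshape((nx, ny, nz), order="F")
flat = ex.ravel(order="F")  # the flat view that ex(lind, 0, 0) implicitly relies on

# Pick an interior cell; 0-based here, with the ghost offset written explicitly
i, j, k = 2, 3, 1
lind = (NGHOST + i) + (NGHOST + j) * nx + (NGHOST + k) * nx * ny

# Same element either way. In Fortran this is an out-of-bounds first index,
# which intel's flat layout tolerates but bounds-checking compilers reject.
assert flat[lind] == ex[NGHOST + i, NGHOST + j, NGHOST + k]
print("lind =", lind, "value =", flat[lind])
```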

I am unsure whether the ftn compiler you are using is a native intel compiler, an intel/gcc wrapper, or something like that. Is it possible to see what exactly that compiler is sourcing? E.g., you could run ftn --version or something like that.

As an alternative, I can change the interpolation routine to support regular ijk indexing. That should take about 30 min to an hour.

RSoto28 commented 2 years ago

Hi haykh, yes, this is the Fortran compiler:

ftn --version
ifort (IFORT) 2021.3.0 20210609
Copyright (C) 1985-2021 Intel Corporation.  All rights reserved.


Also, I wanted to mention what happened before, i.e., without your -debug and -amd flags above, using instead the flags given on the Tristan wiki page: the code would run for a few timesteps and then I would get the following error:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
tristan-mp2d       00000000004AF0AA  for__signal_handl     Unknown  Unknown
libpthread-2.26.s  00001534C199F310  Unknown               Unknown  Unknown
libc-2.26.so       00001534C1659311  cfree                 Unknown  Unknown

I tried different FFLAGS and the error does not go away. Finally, I found this webpage:

https://community.intel.com/t5/Intel-Fortran-Compiler/forrtl-severe-174-SIGSEGV-segmentation-fault-occurred/td-p/746775

There, some people propose adding the -heap-arrays flag, because the problem could be a stack overflow in the code. I did that, and the code ran a bit longer, but it still gives the same severe (174) error (above) after some time.
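For reference, the two standard knobs for this class of stack overflow with Intel Fortran (generic ifort/Linux mechanisms, not something from the Tristan docs) can be sketched as:

```shell
# 1. Compile-time: put automatic and temporary arrays on the heap instead
#    of the stack (standard ifort flag), e.g. appended to FFLAGS:
FFLAGS="$FFLAGS -heap-arrays"

# 2. Run-time: raise the stack limit in the job script before the launcher
#    (srun/mpirun), so every MPI rank inherits it:
ulimit -s unlimited
ulimit -s   # print the limit now in effect
```

Note that neither fixes an actual out-of-bounds write; if the segfault comes from a bound violation, these only delay it.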

I don't know if the two errors are related, but what caught my attention is that people there also talk about overflows when addressing the other error (severe (174)). I'm not an expert, so I wouldn't know.

-Thanks.

haykh commented 2 years ago

looks like that's indeed the bound violation caused by the 1d indexing. the problem is that I am unable to reproduce it on our amd machines with local intel compilers. i'll try to fix it so the code runs on gcc too; that'll hopefully cover the intel compilers you are using.

haykh commented 2 years ago

hi @RSoto28. i have updated the code to v2.5 and also tested it on gcc compilers (there's a docker image available for testing). in theory it should now work on anything. could you please test? it's pretty hard to debug things without first-hand access.

ps. as a bonus, there's a bunch of new physics included in the new version. so feel free to pull and play around.