SeisSol / Training


add arm support #35

Closed wangyinz closed 3 months ago

wangyinz commented 1 year ago

This branch is a work in progress of adding an arm64 native build of the container. This version simply uses the MPICH library, so it will not run on HPC systems, unlike the build from #33. We should eventually build two parallel versions to support the different architectures.

wangyinz commented 1 year ago

The last build failed after 4 hours... The error is from SeisSol:

#54 3337.4 [ 34%] Building CXX object CMakeFiles/SeisSol-lib.dir/src/generated_code/subroutine.cpp.o
#54 3340.8 g++: error: unrecognized command-line option ‘-mno-red-zone’
#54 3340.8 make[2]: *** [CMakeFiles/SeisSol-lib.dir/build.make:1382: CMakeFiles/SeisSol-lib.dir/src/generated_code/subroutine.cpp.o] Error 1

More details can be found in the log: https://github.com/SeisSol/Training/actions/runs/4899728175/jobs/8749902315.

The problem seems to be from this line in SeisSol: https://github.com/SeisSol/SeisSol/blob/9b1b0ec970af4ad79a155c63035234b660838476/generated_code/SConscript#LL82C66-L82C77

The -mno-red-zone option is added deliberately, but it is not recognized by gcc on the arm architecture, because the red zone is an x86-specific ABI feature.
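For reference, a quick way to probe whether a given toolchain accepts the flag (my own sketch, not part of the build):

# Does the local gcc accept -mno-red-zone? (fails on aarch64, succeeds on x86-64)
echo 'int main(){return 0;}' > /tmp/red_zone_test.c
gcc -mno-red-zone -c /tmp/red_zone_test.c -o /tmp/red_zone_test.o && echo "flag supported" || echo "flag not supported"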

Any thoughts on how to get over this? @sebwolf-de @Thomas-Ulrich

wangyinz commented 1 year ago

The build took more than 6 hours, so it was cancelled by GitHub...

wangyinz commented 1 year ago

It took me a few hours to build on my laptop, and I have pushed the container to Docker Hub here

Note that because this build is compiled with noarch, the binary name is different. You might want to double-check all of the notebook content and change the binary names to SeisSol_Release_dnoarch_4_elastic or SeisSol_Release_dnoarch_4_viscoelastic2.
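A hypothetical one-liner for that rename (the dhsw names come from the x86 build; adjust the path to wherever the notebooks live):

# Swap the x86 (dhsw) binary names for the noarch ones in all notebooks
sed -i 's/SeisSol_Release_dhsw_4_elastic/SeisSol_Release_dnoarch_4_elastic/g; s/SeisSol_Release_dhsw_4_viscoelastic2/SeisSol_Release_dnoarch_4_viscoelastic2/g' *.ipynb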

I tested it in the emulator on my laptop using the tpv13 notebook. The gmsh, pumgen, and vtk steps went through, but it failed when running SeisSol, with the following error:

!OMP_NUM_THREADS=4 mpirun -n 1 SeisSol_Release_dnoarch_4_elastic parameters.par

Sat May 06 22:24:23, Info:  Welcome to SeisSol 
Sat May 06 22:24:23, Info:  Copyright (c) 2012-2021, SeisSol Group 
Sat May 06 22:24:23, Info:  Built on: May  6 2023 18:10:48 
Sat May 06 22:24:23, Info:  Version: 9b1b0ec (modified) 
Sat May 06 22:24:23, Info:  Running on: "bd90741a4f80" 
Sat May 06 22:24:23, Info:  Using MPI with #ranks: 1 
Sat May 06 22:24:23, Info:  Using OMP with #threads/rank: 4 
Sat May 06 22:24:23, Info:  OpenMP worker affinity (this process): "01--45--89|--23--" 
Sat May 06 22:24:23, Info:  OpenMP worker affinity (this node)   : "01--45--89|--23--" 
Sat May 06 22:24:23, Info:  The stack size ulimit is  8192 [kb]. 
Sat May 06 22:24:23, Warn:  Stack size of 8192 [kb] is lower than recommended minimum of 2097152 [kb]. You can increase the stack size by running the command: ulimit -Ss unlimited. 
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <                SeisSol MPI initialization               >
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    |  Double precision used for real.
Rank:        0 | Info    | <--------------------------------------------------------->
 INFORMATION: The assumed unit number is           6 for stdout and           0 for stderr.
              If no information follows, please change the value.
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <     Start ini_SeisSol ...                               >
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <  Parameters read from file: parameters.par              >
Rank:        0 | Info    | <                                                         >
Rank:        0 | Info    | (Drucker-Prager) plasticity assumed .
Rank:        0 | Info    | Plastic relaxation Tv is set to:    2.9999999999999999E-002
Rank:        0 | Info    | Use averaging to sample material values, when implemented.
Rank:        0 | Info    | No attenuation assumed. 
Rank:        0 | Info    | No adjoint wavefield generated. 
Rank:        0 | Info    | Isotropic material is assumed. 
Rank:        0 | Info    | Read a PUML mesh file
Rank:        0 | Warning | Ignoring space order from parameter file, using           4
Rank:        0 | Info    | Volume output is in XDMF format (new implementation)
Rank:        0 | Info    | Output data are generated at delta T=    5.0000000000000000     
Rank:        0 | Info    | Use HDF5 XdmfWriter backend
Rank:        0 | Info    | Refinement strategy for volume output is Face Extraction :  4 subcells per cell
Sat May 06 22:24:23, Info:  Reading PUML mesh tpv13_training.puml.h5 
Sat May 06 22:24:23, Info:  Found 37074 cells 
Sat May 06 22:24:23, Info:  Found 6977 vertices 
Sat May 06 22:24:24, Info:  Computing LTS weights. 
Sat May 06 22:24:25, Info:  Limiting number of clusters to 2147483646 
Sat May 06 22:24:25, Info:  Computing LTS weights. Done.  (688 reductions.)
Sat May 06 22:24:26, Info:  Reading mesh. Done. 
Sat May 06 22:24:26, Info:  Extracting fault information 
Sat May 06 22:24:26, Info:  Mesh initialized in: 2.82334 (min: 2.82334, max: 2.82334)
Sat May 06 22:24:26, Warn:  Material Averaging is not implemented for plastic materials. Falling back to material properties sampled from the element barycenters instead. 
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

It seems to me that this could be an issue related to the qemu emulator and might go away when running natively on arm64. Could you please confirm? Thank you!

wangyinz commented 1 year ago

Grabbed an arm instance on AWS to test the container. However, SeisSol fails at the same step with a segfault. Maybe it has something to do with the compile flags? Note that I simply removed -mno-red-zone in this build; I am not sure how that could lead to a segfault though.

wangyinz commented 1 year ago

Tried another build natively on the arm64 node on AWS, but the run still fails with the same error. At this point, I believe there is something wrong with SeisSol itself. Not sure how to proceed... Below is the error message:

Sun May 07 02:49:24, Info:  Reading mesh. Done. 
Sun May 07 02:49:24, Info:  Extracting fault information 
Sun May 07 02:49:24, Info:  Mesh initialized in: 1.33762 (min: 1.33762, max: 1.33762)
Sun May 07 02:49:24, Warn:  Material Averaging is not implemented for plastic materials. Falling back to material properties sampled from the element barycenters instead. 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 90 RUNNING AT c040cefdbefd
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Note that the proxy code runs fine:

root@c040cefdbefd:/home/tools/bin# SeisSol_proxy_Release_dnoarch_4_elastic 10000 100 all
Allocating fake data...
...done

=================================================
===            PERFORMANCE SUMMARY            ===
=================================================
seissol proxy mode                  : all
time for seissol proxy              : 10.687574
cycles                              : 0.000000

GFLOP (libxsmm)                     : 0.000000
GFLOP (pspamm)                      : 0.000000
GFLOP (libxsmm + pspamm)            : 0.000000
GFLOP (non-zero) for seissol proxy  : 50.672430
GFLOP (hardware) for seissol proxy  : 112.950000
GiB (estimate) for seissol proxy    : 21.860003

FLOPS/cycle (non-zero)              : inf
FLOPS/cycle (hardware)              : inf
Bytes/cycle (estimate)              : inf

GFLOPS (non-zero) for seissol proxy : 4.741247
GFLOPS (hardware) for seissol proxy : 10.568348
GiB/s (estimate) for seissol proxy  : 2.045366
=================================================

root@c040cefdbefd:/home/tools/bin# SeisSol_proxy_Release_dnoarch_4_viscoelastic2 10000 100 all
Allocating fake data...
...done

=================================================
===            PERFORMANCE SUMMARY            ===
=================================================
seissol proxy mode                  : all
time for seissol proxy              : 18.992046
cycles                              : 0.000000

GFLOP (libxsmm)                     : 0.000000
GFLOP (pspamm)                      : 0.000000
GFLOP (libxsmm + pspamm)            : 0.000000
GFLOP (non-zero) for seissol proxy  : 115.187430
GFLOP (hardware) for seissol proxy  : 222.960000
GiB (estimate) for seissol proxy    : 46.111643

FLOPS/cycle (non-zero)              : inf
FLOPS/cycle (hardware)              : inf
Bytes/cycle (estimate)              : inf

GFLOPS (non-zero) for seissol proxy : 6.065035
GFLOPS (hardware) for seissol proxy : 11.739651
GiB/s (estimate) for seissol proxy  : 2.427945
=================================================

btw, the build from #33 fails even with the proxy code, because of invalid AVX2 instructions:

root@36278a11a601:/home/training/tpv13# SeisSol_proxy_Release_dhsw_4_elastic 1000 100 all
qemu: uncaught target signal 4 (Illegal instruction) - core dumped
Illegal instruction (core dumped)

So, the image built here does run properly on arm64. The error is likely due to SeisSol itself.

wangyinz commented 1 year ago

I should probably take the above back. I further ran the three other cases in the container and found that they failed at different steps (all at the very beginning, though). This reminds me that the issue might be memory related - the arm64 instance I got only has 4 GB of memory, which may not be enough for the run. Is that true? Do you have an estimate of the memory requirement for these runs? Closely monitoring with the top command does reveal a memory spike right before the failure, so that is probably the case. Can someone with access to an arm-based Mac please verify the container?
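For anyone reproducing this, two hedged commands to watch memory from outside and inside the container (assuming free is available in the image and the entrypoint accepts a plain command; the tag is the one pushed later in this thread):

# Live memory usage of running containers
docker stats --no-stream
# Memory visible inside the container
docker run --rm wangyinz/seissoltraining:test_arm free -h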

krenzland commented 1 year ago

-DHOST_ARCH=noarch is really going to hurt the performance of the code. Maybe with the LIBXSMM_JIT backend this could be mitigated a bit. (Not a priority.)

Note that the arch "thunderx2t99" may also work on M1/M2 chips. Definitely not optimal, but it should at least activate vectorization. We can also add similar settings for M1/M2, but this is definitely not a priority for us.

I'm also not surprised that building a container with QEMU is taking a long time...

wangyinz commented 1 year ago

I did not use libxsmm in this build, as I thought libxsmm does not support arm. Then I found that this is not accurate: there is no support in any of the released versions, but the development version does seem to have it. Still, I wanted to play it safe, so I used Eigen instead.
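Roughly how this build was configured (a sketch from memory; GEMM_TOOLS_LIST is SeisSol's CMake option for selecting the GEMM backend, and the exact values in the Dockerfile may differ):

cmake .. -DHOST_ARCH=noarch -DGEMM_TOOLS_LIST=Eigen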

I am not sure noarch will actually hurt the performance much on Apple Silicon, because the chip does not have SVE anyway. I think the compiler should enable SIMD optimizations by default. I don't have the time to test it out, but since this version of the container is meant to make the training material accessible to the majority, I don't think performance is the priority anyway.

I am back in my office, so I was able to test it on an M1 arm MacBook. It turns out that the tpv13 run still segfaults at the same place, but I was able to get the Kaikoura case running. Not sure what the expected performance should be, but below is a run with 4 OMP threads:

Mon May 08 14:13:51, Info:  Writing energy output at time 0.6 
Mon May 08 14:13:52, Info:  Writing energy output at time 0.6 Done. 
Mon May 08 14:13:52, Info:  Performance since the start: 0.00769377 TFLOP/s (rank 0: 7.69377 GFLOP/s, average over ranks: 7.69377 GFLOP/s) 
Mon May 08 14:13:52, Info:  Performance since last sync point: 0.00801944 TFLOP/s (rank 0: 8.01944 GFLOP/s, average over ranks: 8.01944 GFLOP/s) 

I also tested the three other cases and found that the Sulawesi case failed at:

Mon May 08 14:17:26, Info:  Reading mesh. Done. 
Mon May 08 14:17:26, Info:  Extracting fault information 
Mon May 08 14:17:26, Info:  Mesh initialized in: 3.62567 (min: 3.62567, max: 3.62567)
Mon May 08 14:17:26, Warn:  Material Averaging is not implemented for plastic materials. Falling back to material properties sampled from the element barycenters instead. 
Mon May 08 14:17:26, Warn:  ASAGI: NUMA communication could not be enabled because the ASAGI is not compiled with NUMA support. 
Mon May 08 14:17:26, Warn:  ASAGI: NUMA communication could not be enabled because the ASAGI is not compiled with NUMA support. 
Mon May 08 14:17:26, Warn:  ASAGI: NUMA communication could not be enabled because the ASAGI is not compiled with NUMA support. 
Mon May 08 14:17:26, Warn:  ASAGI: NUMA communication could not be enabled because the ASAGI is not compiled with NUMA support. 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 213 RUNNING AT 2bfa7cb22d3b
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

The Northridge case started the calculation, but failed with the Inf/NaN error:

Mon May 08 14:21:50, Info:  Writing free surface at time 0.
Mon May 08 14:21:50, Info:  Writing free surface at time 0. Done.
Mon May 08 14:21:50, Info:  Writing energy output at time 0 
Mon May 08 14:21:50, Info:  Writing energy output at time 0 Done. 
Mon May 08 14:22:07, Info:  Writing energy output at time 0.5 
Mon May 08 14:22:08, Info:  Elastic energy (total, % kinematic, % potential):  nan  , nan  , nan 
Mon May 08 14:22:08, Error: Detected Inf/NaN in energies. Aborting. 
Backtrace:
SeisSol_Release_dnoarch_4_elastic(+0x6acc4) [0xaaaaca43acc4]
SeisSol_Release_dnoarch_4_elastic(+0x118008) [0xaaaaca4e8008]
SeisSol_Release_dnoarch_4_elastic(+0x1b17c4) [0xaaaaca5817c4]
SeisSol_Release_dnoarch_4_elastic(+0x608bc) [0xaaaaca4308bc]
SeisSol_Release_dnoarch_4_elastic(+0x5e658) [0xaaaaca42e658]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0x4000177f73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0x4000177f74cc]
SeisSol_Release_dnoarch_4_elastic(+0x65e30) [0xaaaaca435e30]
Abort(134) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 134) - process 0
Assertion failed in file src/binding/c/coll/barrier.c at line 36: 0
/lib/aarch64-linux-gnu/libmpich.so.12(+0x2202e4) [0x4000152d02e4]
/lib/aarch64-linux-gnu/libmpich.so.12(MPI_Barrier+0x24c) [0x4000150f289c]
SeisSol_Release_dnoarch_4_elastic(+0x89cad4) [0xaaaacac6cad4]
SeisSol_Release_dnoarch_4_elastic(+0x89dccc) [0xaaaacac6dccc]
SeisSol_Release_dnoarch_4_elastic(+0x5dcae0) [0xaaaaca9acae0]
SeisSol_Release_dnoarch_4_elastic(+0x6623a0) [0xaaaacaa323a0]
SeisSol_Release_dnoarch_4_elastic(+0x665794) [0xaaaacaa35794]
SeisSol_Release_dnoarch_4_elastic(+0x665c40) [0xaaaacaa35c40]
SeisSol_Release_dnoarch_4_elastic(+0x872b8c) [0xaaaacac42b8c]
SeisSol_Release_dnoarch_4_elastic(+0x858fc8) [0xaaaacac28fc8]
SeisSol_Release_dnoarch_4_elastic(+0x661c38) [0xaaaacaa31c38]
SeisSol_Release_dnoarch_4_elastic(+0x6ce600) [0xaaaacaa9e600]
SeisSol_Release_dnoarch_4_elastic(+0x662f50) [0xaaaacaa32f50]
SeisSol_Release_dnoarch_4_elastic(+0x5cb898) [0xaaaaca99b898]
/lib/aarch64-linux-gnu/libc.so.6(+0x3cde8) [0x40001780cde8]
/lib/aarch64-linux-gnu/libc.so.6(+0x3cf0c) [0x40001780cf0c]
/lib/aarch64-linux-gnu/libmpich.so.12(+0x21fe60) [0x4000152cfe60]
/lib/aarch64-linux-gnu/libmpich.so.12(+0x2053b0) [0x4000152b53b0]
/lib/aarch64-linux-gnu/libmpich.so.12(MPI_Abort+0x1c8) [0x400015182878]
SeisSol_Release_dnoarch_4_elastic(+0x6ac48) [0xaaaaca43ac48]
SeisSol_Release_dnoarch_4_elastic(+0x118008) [0xaaaaca4e8008]
SeisSol_Release_dnoarch_4_elastic(+0x1b17c4) [0xaaaaca5817c4]
SeisSol_Release_dnoarch_4_elastic(+0x608bc) [0xaaaaca4308bc]
SeisSol_Release_dnoarch_4_elastic(+0x5e658) [0xaaaaca42e658]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0x4000177f73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0x4000177f74cc]
SeisSol_Release_dnoarch_4_elastic(+0x65e30) [0xaaaaca435e30]
Abort(1) on node 0: Internal error

So, there are still issues with the SeisSol build, but the container is built properly for the arm architecture.

wangyinz commented 1 year ago

Just another update: I ran the Kaikoura case again, and the performance is significantly improved. It seems to make more sense now.

Mon May 08 19:33:09, Info:  Writing energy output at time 0.4 
Mon May 08 19:33:09, Info:  Writing energy output at time 0.4 Done. 
Mon May 08 19:33:09, Info:  Performance since the start: 0.042381 TFLOP/s (rank 0: 42.381 GFLOP/s, average over ranks: 42.381 GFLOP/s) 
Mon May 08 19:33:09, Info:  Performance since last sync point: 0.0430451 TFLOP/s (rank 0: 43.0451 GFLOP/s, average over ranks: 43.0451 GFLOP/s) 
Mon May 08 19:33:42, Info:  Writing energy output at time 0.6 
Mon May 08 19:33:43, Info:  Writing energy output at time 0.6 Done. 
Mon May 08 19:33:43, Info:  Performance since the start: 0.0426642 TFLOP/s (rank 0: 42.6642 GFLOP/s, average over ranks: 42.6642 GFLOP/s) 
Mon May 08 19:33:43, Info:  Performance since last sync point: 0.0432421 TFLOP/s (rank 0: 43.2421 GFLOP/s, average over ranks: 43.2421 GFLOP/s) 
krenzland commented 1 year ago

> I did not use libxsmm in this build, as I thought libxsmm does not support arm. Then I found that this is not accurate: there is no support in any of the released versions, but the development version does seem to have it. Still, I wanted to play it safe, so I used Eigen instead.

The latest release has (undocumented) support for Arm but only for selected CPUs. It may not work for Apple silicon.

> I am not sure noarch will actually hurt the performance much on Apple Silicon, because the chip does not have SVE anyway. I think the compiler should enable SIMD optimizations by default. I don't have the time to test it out, but since this version of the container is meant to make the training material accessible to the majority, I don't think performance is the priority anyway.

They should have NEON support at least. I don't know what code the compiler is going to emit for Arm architectures without a specified tuning target; it is likely going to be suboptimal.
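A quick hedged check of what an aarch64 gcc enables by default (NEON is part of the base AArch64 ISA, so the macro should be predefined):

# Should print __ARM_NEON on an aarch64 toolchain
gcc -dM -E - < /dev/null | grep __ARM_NEON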

sebwolf-de commented 1 year ago

@wangyinz could you please rebase this onto the current main branch? IMHO this makes the review easier :D

krenzland commented 1 year ago

Try setting: https://github.com/SeisSol/SeisSol/blob/70232f83f1e57d79da2b2cdea1afff7713c7568d/cmake/cpu_arch_flags.cmake#LL41C8-L41C42

set(HAS_REDZONE OFF PARENT_SCOPE)

(Might break some configurations on Intel hardware but might help on Arm)
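If that needs to be applied non-interactively, e.g. in the Dockerfile, a hypothetical patch could look like this (the exact original line in cpu_arch_flags.cmake is assumed here):

# Force HAS_REDZONE off before configuring (pattern assumed from the link above)
sed -i 's/set(HAS_REDZONE ON PARENT_SCOPE)/set(HAS_REDZONE OFF PARENT_SCOPE)/' cmake/cpu_arch_flags.cmake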

wangyinz commented 1 year ago

I had that set already to get to the successful build: https://github.com/SeisSol/Training/blob/69c4df8313c645aac46f787e9df15582a8f3bcc6/Dockerfile_jupyterlab#L105-L108 Also, the one in the SConscript needs to be removed.

krenzland commented 1 year ago

The only other thing that arch setting does is specify the alignment: https://github.com/SeisSol/SeisSol/blob/1a7fcd18c4eb30fd3b2f7d026fa3d001030c33db/cmake/process_users_input.cmake#L41 (We should actually align to 64 bytes on most systems anyway, to match the cache line.)

The value in the SConscript doesn't matter, it isn't used anymore.

wangyinz commented 1 year ago

So, thunderx2t99 and noarch are both aligned to 16? Do you think I should add a line with sed to manually set that to 64?
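The sed line I have in mind would be something like this (pattern assumed; see the process_users_input.cmake link above):

# Hypothetical: bump the noarch alignment from 16 to 64 bytes before configuring
sed -i 's/set(ALIGNMENT 16)/set(ALIGNMENT 64)/' cmake/process_users_input.cmake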

Somehow I thought the one in the SConscript gave me an error, but maybe I remembered it wrong.

AliceGabriel commented 1 year ago

Here is an overview of my private M1 testing of the current PR:

sebwolf-de commented 1 year ago

My few remarks:

krenzland commented 1 year ago

Isn't Northridge the only scenario that doesn't use dynamic rupture? The small alignment of noarch might break the DR code. Maybe the larger alignment hides slightly incorrect memory accesses?

sebwolf-de commented 1 year ago

Indeed, it's the only scenario without DR, but Kaikoura works for a few minutes, so DR is not completely broken.

krenzland commented 1 year ago

I'm also not sure why Northridge runs only with attenuation. In the current implementation, viscoelasticity uses the same wave propagation kernels as the elastic code.

krenzland commented 11 months ago

I can reproduce the segfaults, even when using a specific Apple M2 arch setting. I have no idea why; it seems to run well without Docker. I'll investigate.

davschneller commented 11 months ago

A small side comment: the "no redzone" fixes should not be necessary anymore; noarch no longer adds that flag by default (at least when using the latest master; v1.1.0 does not have that change in yet. EDIT: v1.1.1 contains that patch).

davschneller commented 10 months ago

The segfaults with tpv13 could be due to ASAGI, or the SeisSol ASAGI reader—even though ASAGI is not even used there. But it is compiled into the binary. Thus, ASAGI is called here https://github.com/SeisSol/SeisSol/blob/master/src/Reader/AsagiReader.h which in turn is called by https://github.com/SeisSol/SeisSol/blob/313c4e4c459b1ea67302b8887650f51d1ebbf9e7/src/Initializer/ParameterDB.cpp#L626 when initializing an easi model. And the last message you'd see before ending up there is exactly a warning like "falling back to materials sampled from cell barycenters".

krenzland commented 10 months ago

I can reproduce the crashes but somehow have a very hard time debugging them due to an unrelated issue :( I'm working on it!

wangyinz commented 3 months ago

I tried a few different combinations, and here is what we learned:

  1. PSpaMM + neon: fails with Inf/NaN
  2. PSpaMM + noarch: works
  3. LIBXSMM_JIT + neon or noarch: segfault:
    
    Tue May 21 18:57:21, Info:  Initialize Memory layout. 
    Tue May 21 18:57:21, Info:  Initialize cell-local matrices. 

LIBXSMM_VERSION: feature_mxfp4_bf16_avx2_gemms-1.17-3727 (25693839)
AARCH64/DP    TRY     JIT     STA     COL
     0..13     45       0       0       0
    14..23      5       0       0       0
    24..64      2       0       0       0
Registry and code: 13 MB
Command: SeisSol_Release_dhsw_4_elastic parameters.par
Uptime: 2.325351 s

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 137 RUNNING AT 0494e8febeb4
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

The latest PSpaMM also has issues, and we have to use the `@davschneller/compile-fixes` branch for now. The LIBXSMM issue might relate to their latest update, but we don't know for sure. There is no official stable release of libxsmm that supports arm.
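If the fix branch lives on davschneller's fork, installing it might look like this (purely illustrative; the actual repo and branch layout may differ):

# Hypothetical: install the patched PSpaMM from the fix branch
pip install git+https://github.com/davschneller/PSpaMM.git@compile-fixes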

In any case, arm users can use the latest build pushed to Docker Hub: https://hub.docker.com/r/wangyinz/seissoltraining/tags with the command:

docker pull wangyinz/seissoltraining:test_arm
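It can then be started like any other JupyterLab-based image (per Dockerfile_jupyterlab; the port and exact entrypoint are assumed here):

# Sketch: run the training container and expose the (assumed) Jupyter port
docker run --rm -p 8888:8888 wangyinz/seissoltraining:test_arm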

The tpv13 case ran successfully at almost 40 GFLOP/s on an M1 MacBook:

Tue May 21 17:35:25, Info: Total time spent in compute kernels: 146.643 s ( = 2 min 26.6432 s )
Tue May 21 17:35:25, Info: Total calculated HW-FLOP: 5.8350 TFLOP
Tue May 21 17:35:25, Info: Total calculated NZ-FLOP: 2.9176 TFLOP
Tue May 21 17:35:25, Info: Total calculated HW-FLOP/s: 39.1719 GFLOP/s
Tue May 21 17:35:25, Info: Total calculated NZ-FLOP/s: 19.5867 GFLOP/s

davschneller commented 3 months ago

It should be noted that using PSpaMM together with noarch as the architecture will cause Yateto to generate plain C++ loops for the matrix multiplications and avoid the explicit code generation (i.e., inline assembly) entirely.

wangyinz commented 3 months ago

Quite surprisingly, the multi-arch docker build finished!

Note that this branch has the setup to build both amd64 and arm64 architectures in the same image (which is available here already). Previously, the arm64 build ran too slowly to finish within the 6-hour limit of the runners. I guess GitHub has upgraded the runners, and now the workflow finishes in less than 4 hours. Still slow, but we can now ask the attendees to pull the same image regardless of the architecture they need.
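For reference, the multi-arch build boils down to a buildx invocation along these lines (illustrative; the actual workflow and tag may differ):

# Build and push a single image manifest covering both architectures
docker buildx build --platform linux/amd64,linux/arm64 -t wangyinz/seissoltraining:latest --push .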

davschneller commented 3 months ago

Great to see that! ... That reminds me... There was a doubling of the cores for the runners recently: https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/

AliceGabriel commented 3 months ago

Hi, I am still making my way to Seattle - does this mean we don't need my M1 test build anymore?

wangyinz commented 3 months ago

I have tested on my M1 MacBook and can confirm this build does not have the NaN error (at least not in the tpv13 case).

davschneller commented 3 months ago

As a note, I can reproduce PSpaMM+neon failing with an Inf/NaN while emulating the system with QEMU on an x86-64 machine.

Maybe it is indeed possible to debug the ARM container (albeit slowly, and with crashing Python) for us non-Mac users.