E3SM-Project / EKAT

Tools and libraries for writing Kokkos-enabled HPC C++ in E3SM ecosystem

Potential fix for Apple silicon build #272

Open mjs271 opened 1 year ago

mjs271 commented 1 year ago

For the past few months (since roughly the beginning of November), I haven't been able to build EKAT successfully on my M1 Mac laptop running macOS Monterey. For context, the EKAT version I've been using (661dbc52) is the one used in the EAGLES project by haero and, by extension, mam4xx.

I've been attempting to build with the following configuration flags:

-DCMAKE_CXX_COMPILER=mpic++
-DCMAKE_Fortran_COMPILER=mpifort
-DKokkos_ENABLE_DEPRECATED_CODE=OFF
-DKokkos_ENABLE_DEBUG=TRUE
-DKokkos_ENABLE_AGGRESSIVE_VECTORIZATION=OFF
-DKokkos_ENABLE_CUDA=OFF
-DKokkos_ENABLE_SERIAL=ON
-DEKAT_ENABLE_FPE=OFF

Here, mpic++ is built with Apple clang v14 and mpifort with gfortran 12.2. When I run make, the errors I see are of this type:

EKAT/src/ekat/util/ekat_feutils.hpp:xx:yy: error: no member named '__control' in 'fenv_t' [...]
EKAT/src/ekat/util/ekat_feutils.hpp:xx:yy: error: no member named '__mxcsr' in 'fenv_t' [...]

The apparent fix is to wrap an #include in ekat_arch.cpp in an #ifdef guard, namely:

#ifdef EKAT_ENABLE_FPE
  #include "ekat/util/ekat_feutils.hpp"
#endif
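One possible knock-on effect of guarding only the include is that any code calling into the header must be guarded by the same macro, or it will fail to compile once the header is gone. The sketch below illustrates that pattern in isolation. It is not EKAT code: SKETCH_ENABLE_FPE is a hypothetical stand-in for EKAT_ENABLE_FPE, <cfenv> stands in for ekat_feutils.hpp, and both function names are made up for illustration.

```cpp
// SKETCH_ENABLE_FPE is a hypothetical stand-in for EKAT_ENABLE_FPE.
// Leave it undefined to mimic a build configured with FPE support off.
// #define SKETCH_ENABLE_FPE

#ifdef SKETCH_ENABLE_FPE
  #include <cfenv>  // stands in for "ekat/util/ekat_feutils.hpp"
#endif

// Reports whether the FPE code path was compiled in.
inline bool fpe_compiled_in() {
#ifdef SKETCH_ENABLE_FPE
  return true;
#else
  return false;
#endif
}

// Clears pending floating-point flags when FPE support is compiled in;
// otherwise it is a no-op. Returns 0 on success in both cases.
inline int clear_fp_flags() {
#ifdef SKETCH_ENABLE_FPE
  return std::feclearexcept(FE_ALL_EXCEPT);
#else
  return 0;
#endif
}
```

Callers can use clear_fp_flags() unconditionally, since the guard lives inside the function body. Whether the real macro is left undefined when the CMake option is OFF depends on how the project's configuration header is generated, which this sketch only assumes.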

After a successful build, I get the following from make test:

95% tests passed, 4 tests failed out of 75

Label Time Summary:
MustFail    =   0.64 sec*proc (3 tests)

Total Test time (real) =  22.83 sec

The following tests FAILED:
     53 - comm_np1 (Failed)
     54 - comm_np2 (Failed)
     55 - comm_np3 (Failed)
     56 - comm_np4 (Failed)

The failure log for these tests indicates that this is expected on macOS:

A request was made to bind a process, but at least one node does NOT
support binding processes to cpus.

Node: <node>

Open MPI uses the "hwloc" library to perform process and memory
binding. This error message means that hwloc has indicated that
processor binding support is not available on this machine.

On OS X, processor and memory binding is not available at all (i.e.,
the OS does not expose this functionality).

Given all of this, I am not sure if this is a tenable fix or whether there may be knock-on effects. I did want to put it on the EKAT team's radar, though.

@jeff-cohere

welcome[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template!

bartgol commented 1 year ago

Binding to cores is usually a good thing, which is why we do it by default. However, you are free to change the extra MPI arguments used when launching executables during tests. The CMake variable EKAT_TEST_MPI_EXTRA_ARGS can be set to an empty string in your config file, so that CMake won't use the default binding options.

bartgol commented 1 year ago

It looks like CMake has a tool to check if a class/struct has a member. This would allow us to check that fenv_t exists, and that it contains the expected members.

Perhaps where we do

check_cxx_symbol_exists(feenableexcept "fenv.h" EKAT_HAVE_FEENABLEEXCEPT)

we should also do

check_struct_has_member(fenv_t __member "fenv.h" EKAT_FENV_HAS_MEMBER LANGUAGE CXX)

and enable FENV stuff only if EKAT_FENV_HAS_MEMBER=TRUE.
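The same question ("does fenv_t have this member?") can also be asked inside C++ at compile time. The sketch below is an alternative illustration, not the CMake check suggested above and not EKAT code. The names make_void, has_control_member, FakeX86Env, and fenv_has_control are all hypothetical. FakeX86Env only mimics the field names from the error messages so that the trait can be exercised on any platform.

```cpp
#include <cfenv>
#include <type_traits>
#include <utility>

// C++11-compatible stand-in for std::void_t.
template <typename...>
struct make_void { using type = void; };

// Primary template: assume the member is absent.
template <typename T, typename = void>
struct has_control_member : std::false_type {};

// Selected only when T has a member named __control.
template <typename T>
struct has_control_member<
    T, typename make_void<decltype(std::declval<T&>().__control)>::type>
    : std::true_type {};

// Hypothetical struct that mimics the x86 field names from the error log.
struct FakeX86Env {
  unsigned short __control;
  unsigned int __mxcsr;
};

static_assert(has_control_member<FakeX86Env>::value,
              "trait should detect the member when it exists");
static_assert(!has_control_member<int>::value,
              "trait should report false for types without the member");

// Result for the real fenv_t on whatever platform compiles this file.
constexpr bool fenv_has_control() {
  return has_control_member<std::fenv_t>::value;
}
```

On Apple silicon, fenv_has_control() would presumably evaluate to false, consistent with the compiler errors reported above. A configure-time check has the advantage that the result can also drive which source files and tests are built at all.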

mahf708 commented 3 months ago

> Thanks for opening your first issue here! Be sure to follow the issue template!

I love this hearty welcome.

Any update on this issue?

@bartgol, this is the first (not necessarily the last or only) problem that appears when trying to build pyscream on macOS M-series machines: https://github.com/mahf708/experimental-scream-feedstock

mjs271 commented 3 months ago

@mahf708 The short answer is, "no updates/progress from me."

Expanding on that slightly: I was successfully building EKAT on my M1 machine until earlier this year using the above hack. However, once I started work on SCREAM, I ran out of time to squash the other build issues I was running into and moved my work exclusively to a Linux box.

Once I'm through this push on mam4xx, I'd be up for comparing notes and working on getting SCREAM/pyscream building on Apple silicon. I'd personally love to be able to run simple test cases locally for the sake of faster/offline development.

mahf708 commented 3 months ago

Yeah, sounds good!

I think the best way to iterate is through non-local setups, e.g., GitHub Actions. That's how I've been building and publishing the pyscream packages for Linux (two MPI implementations and four Python versions, for a total of 8 builds). See an example run here: https://github.com/mahf708/experimental-scream-feedstock/actions/runs/10024430474, which automatically uploads the Python packages here: https://anaconda.org/mahf708/pyscream. We are undertaking this Python effort to make it really simple to do some basic SCREAM science testing (on the fly, without compilation) in Python. I left an item in the meeting notes for today's call with an example and more information on Linux :)

bartgol commented 3 months ago

@mahf708 Have you tried implementing the suggestion in my last comment? I don't have a macos machine, so I can't test it quickly (and not quick => not doing it, in this case, sorry).

mahf708 commented 3 months ago

> @mahf708 Have you tried implementing the suggestion in my last comment? I don't have a macos machine, so I can't test it quickly (and not quick => not doing it, in this case, sorry).

That's precisely why I was suggesting the non-local setup :) I will set up a GitHub Actions workflow for this on the repo linked above (with jobs running on macOS machines) and start iterating with your fix above.

bartgol commented 3 months ago

Is it really a priority though?

mahf708 commented 3 months ago

No, Linux is enough for the foreseeable future.

mahf708 commented 2 months ago

Update: I was able to build this with either of the edits above, but only with static linking. For shared builds, I get the following error:

[ 52%] Linking CXX shared library libekat_test_main.dylib
2 warnings generated.
[ 53%] Linking CXX executable tridiag
Undefined symbols for architecture arm64:
  "ekat_finalize_test_session()", referenced from:
      _main in ekat_catch_main.cpp.o
  "ekat_initialize_test_session(int, char**, bool)", referenced from:
      _main in ekat_catch_main.cpp.o
ld: symbol(s) not found for architecture arm64
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)