ecmwf / multio

MultIO is a runtime-configurable multiplexer for Weather Model output of GRIB data
Apache License 2.0
Add github actions to be able to run ci with MULTIO_SERVER on #3

Closed suvarchal closed 1 year ago

suvarchal commented 1 year ago

I am using the develop branch of multio with both the Fortran API and the multio server enabled (-DENABLE_MULTIO_SERVER=ON), as that configuration is the most relevant to us. The only multio test that still fails is test_multio_replay_nemo_fapi; see the following log:

test.log

@dsidoren also hit a similar segfault while testing the multio integration with FESOM2. So I wonder whether there is a bug somewhere, and I would like you to confirm that fapi works with the io-server.

Moreover, I see that the CI tests pass for the develop branch, but I am not sure whether -DENABLE_MULTIO_SERVER is used in those tests.

At the very least, if you could confirm that test_multio_replay_nemo_fapi works in CI, it would help us narrow down the issue.

geier1993 commented 1 year ago

Hey @suvarchal, thank you for taking the time to test these things. All these developments are recent, so we are really thankful to have external users and developers taking a look at them.

Indeed, the GitHub Actions don't seem to run the tests yet. However, I can confirm that our other CI, running on Bamboo, performs the tests successfully.

How are you building multio and its dependencies? Judging from your log, I assume that you are linking against the wrong ECKIT version. Please try to build ECKIT from develop as well.

The backtrace shows that during init, the fckit call comm%communicator() ends up in group() at eckit/mpi/Parallel.cc:648... This is really not expected. But the problem might be very simple:

comm%communicator() is supposed to call eckit::mpi::comm().communicator(), which should end up in Parallel.cc:683 via a VIRTUAL function call. The group functionality was added to eckit only recently. So my assumption is that if you are linking against an older eckit version, the virtual function call might be failing, i.e. the vtable now has a different layout and the wrong entry is looked up: instead of communicator(), group() gets called.

At least I have no other explanation so far. As suggested, please try to build ECKIT from the develop branch (https://github.com/ecmwf/eckit) and let us know.

geier1993 commented 1 year ago

Sorry, I have to correct my answer. The MULTIO_SERVER tests are in fact not covered by Bamboo yet either. However, the tests have been run manually on ATOS (the center's supercomputer in Bologna). Moreover, the NEMO developments have been tested with the Fortran API on ATOS as well.

suvarchal commented 1 year ago

Thanks. I can be sure that I used eckit develop, because otherwise multio would not build successfully. But I can never be fully certain, because other dependencies such as metkit (I used 1.9.2) also needed to be rebuilt against eckit develop, and with such cross-dependencies something may have linked to the wrong version somewhere. I will redo all compilations with the develop branches and test again.

suvarchal commented 1 year ago

Thanks, the tests work now!! Yippee!! Unfortunately I can't tell specifically what changed: for all dependencies (ecbuild, eccodes, eckit, metkit, fckit, fdb) I made sure to git-pull the latest changes on develop before rebuilding them, in case things had changed since I last cloned two weeks ago. Some had, and the tests now work.

suvarchal commented 1 year ago

You may repurpose this issue to add a CI/GitHub Actions test for the Fortran API (I guess it comes down to having the MPI libraries and runtime and using mpiexec --oversubscribe; by the way, mpi4py may also have some ready-made actions that they use), or you can close this.

geier1993 commented 1 year ago

Yay, glad to hear it's working. Yes, let's keep this issue open. Thank you for your support.

dsarmany commented 1 year ago

The CI for develop now runs all multio-server tests on various platforms and with different compilers.