Builds/Tests now failing on aarch64, ppc64le architectures (Fedora 36=Rawhide)

mefuller commented 2 years ago

Problem description

All builds on the ppc64le architecture with F36/Rawhide now fail. This was not the case four days ago (see https://copr.fedorainfracloud.org/coprs/fuller/Cantera/builds/).

Concurrently, the kinetics: KineticsAddSpecies3.add_species_sequential test now fails on "successful" builds on F34/35 for ppc64le architectures (and also for aarch64, i686 and s390x on all three Fedoras 34, 35, and 36/Rawhide, but not x86_64). This may not be a new problem, as I only just added test automation to the build automation for all architectures. (Yes, I know, I should have done that earlier.)

Steps to reproduce

scons build && scons test

Behavior

Error message: Build failure (ppc64le:F36):

In file included from /usr/include/eigen3/Eigen/Core:210,
                 from /usr/include/eigen3/Eigen/SparseCore:11,
                 from /usr/include/eigen3/Eigen/Sparse:26,
                 from include/cantera/numerics/eigen_sparse.h:10,
                 from include/cantera/kinetics/StoichManager.h:11,
                 from include/cantera/kinetics/Kinetics.h:14,
                 from src/base/Solution.cpp:12:
/usr/include/eigen3/Eigen/src/Core/arch/AltiVec/PacketMath.h:78:8: internal compiler error: Segmentation fault
   78 | static _EIGEN_DECLARE_CONST_FAST_Packet4i(ZERO, 0); //{ 0, 0, 0, 0,}
      |        ^

Test failure (all other affected systems):

[----------] 2 tests from KineticsAddSpecies3
[ RUN      ] KineticsAddSpecies3.add_species_sequential
test/kinetics/kineticsFromScratch3.cpp:371: Failure
Expected equality of these values:
  k_ref[i]
    Which is: 591054161.41004908
  k[i]
    Which is: 591054161.41004813
i = 0; N = 4
test/kinetics/kineticsFromScratch3.cpp:377: Failure
Expected equality of these values:
  k_ref[i]
    Which is: 150.5822178080069
  k[i]
    Which is: 150.58221780800667
i = 0; N = 4
test/kinetics/kineticsFromScratch3.cpp:384: Failure
Expected equality of these values:
  w_ref[iref]
    Which is: 150.58221780800866
  w[i]
    Which is: 150.58221780800844
sp = O; N = 4
[  FAILED  ] KineticsAddSpecies3.add_species_sequential (5 ms)

System information

Cantera version: fcff5929225a0728dce33e44acf21149f6fb928f
OS: Fedora Linux 36 (Rawhide), Fedora 35, Fedora 34
Python/MATLAB/other software versions: Python 3.9, 3.10 (see logs)

Attachments

Additional context

While the build processes are not failing on F34 and F35, one kinetics test is failing on both. I suspect that these problems are related. The test failures look like excessive precision being requested - or is this a truncation/rounding error?
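One way to judge whether this is rounding error is to express the reported differences relative to the magnitude of the values. The sketch below uses only the numbers printed in the test output above; the helper name `ulp_distance` is made up for illustration, and `math.ulp` requires Python 3.9 or newer.

```python
import math

# (reference, computed) pairs copied from the failing test output above
pairs = [
    (591054161.41004908, 591054161.41004813),
    (150.5822178080069, 150.58221780800667),
    (150.58221780800866, 150.58221780800844),
]


def ulp_distance(ref, val):
    """Distance between two doubles in units of ulp(ref) (hypothetical helper)."""
    return abs(ref - val) / math.ulp(ref)


for ref, val in pairs:
    rel = abs(ref - val) / abs(ref)
    print(f"relative diff = {rel:.2e}, about {ulp_distance(ref, val):.0f} ulp")
```

If the relative differences come out around 1e-15, that is, a handful of units in the last place, this points to accumulated rounding rather than truncation of the stored values.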

Logs

ppc64le/Rawhide - failed build
ppc64le/F35 - failed test
ppc64le/F34 - failed test

aarch64/Rawhide
aarch64/F35
aarch64/F34

Additional information and build logs at: 1) https://copr.fedorainfracloud.org/coprs/fuller/Cantera/build/3192999/ 2) https://koji.fedoraproject.org/koji/taskinfo?taskID=81397194

ischoegl commented 2 years ago

Hi @mefuller … thanks for reporting. One thing that would help narrow down the offending commit is knowing when the last successful build was triggered (ideally, at which commit hash).

mefuller commented 2 years ago

Four days ago we were in good shape: https://koji.fedoraproject.org/koji/tasks?owner=fuller&state=all and https://copr.fedorainfracloud.org/coprs/fuller/Cantera/build/3163517/

I'm looking, but not finding a corresponding commit hash. The good news is, the builds on COPR pull the main branch of the official Cantera repo at the time they are run.

ischoegl commented 2 years ago

Ok. #1089 is the likely culprit for this then (sigh). It was merged 4 days ago and the last build likely passed just hours before that merge.

mefuller commented 2 years ago

If it helps, I would be willing to work with you and @bryanwweber (and anyone else) on setting up automated builds for testing with Fedora/EL and multiple architectures - I believe I can provide you with URLs to add as webhooks to trigger builds when you push to main (have not tested this yet).

ischoegl commented 2 years ago

Regarding the build failure, it almost looks like this is due to some upstream issue, as it is triggered for an #include statement. You're using system Eigen (3.4.0), whereas the last successful build used 3.3.9. So I am not sure that this has to do with recent changes in Cantera. (#1089 heavily relies on Eigen's sparse matrices, which made me think of this for a moment.)

Regarding the other issues, these happen to be in a part unaffected by recent changes and mainly look like issues related to machine precision. Still curious that this happens all of a sudden.

mefuller commented 2 years ago

Regarding the test, I hadn't been testing on anything other than x86_64 previously, so I can't say for how long the precision issues have been present. Would it be acceptable to modify the tests such that there's more leniency? I'd like to retain the current structure where my builds are marked as failed if the tests fail.

ischoegl commented 2 years ago

> Regarding the test, I hadn't been testing on anything other than x86_64 previously, so I can't say for how long the precision issues have been present.

That would explain this!

> Would it be acceptable to modify the tests such that there's more leniency?

I think changing offending lines to ASSERT_NEAR may be appropriate in this case.

speth commented 2 years ago

A compiler segfault that seems to have something to do with including one of our dependencies' header files is definitely an upstream issue, not a problem that we have any chance of fixing.

I agree that changing those failing comparisons to ASSERT_NEAR would probably be fine, although I would keep the tolerance fairly tight, as the differences should just be the result of a little bit of accumulated rounding error.
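To illustrate what a "fairly tight" tolerance could look like: GoogleTest's ASSERT_NEAR takes an absolute error bound, so one option is to scale that bound by the reference value. The Python sketch below mimics that check on the values from the failing test; the function name `near` and the relative tolerance of 1e-13 are assumptions chosen for illustration, not what the Cantera test suite uses.

```python
def near(ref, val, rtol=1e-13):
    """Return True if val is within rtol * |ref| of ref.

    The bound abs_error plays the role of the third argument of ASSERT_NEAR;
    rtol=1e-13 is an illustrative choice, not a project setting.
    """
    abs_error = rtol * abs(ref)
    return abs(ref - val) <= abs_error


# Values from the failing test output: these differ only by rounding noise.
assert near(591054161.41004908, 591054161.41004813)
assert near(150.5822178080069, 150.58221780800667)
assert near(150.58221780800866, 150.58221780800844)

# A genuine discrepancy is still caught.
assert not near(150.58, 150.59)
```

A bound of this size leaves roughly two orders of magnitude of headroom over the observed differences while still rejecting any change in the leading twelve digits.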

bryanwweber commented 2 years ago

@mefuller Thanks for volunteering! I'd been thinking about how to add a Fedora job to our CI here on GitHub Actions. You can specify a container in which the job should run, so I think it should be possible to add a job that pulls a Fedora container from Quay and runs the build and tests inside that. I've been working on other things lately, but it's on my to-do list. If you want to try to figure it out, you can edit .github/workflows/main.yml to add a new job. Thanks!

mefuller commented 2 years ago

@bryanwweber I will definitely take a look. It's also possible to add a webhook in the repo settings for Cantera to trigger builds in my (test) repo for Fedora, but that's less desirable.

speth commented 2 years ago

> It's also possible to add a webhook in the repo settings for Cantera to trigger builds in my (test) repo for Fedora, but that's less desirable

Given that the failures being identified here are related to architectures other than x86_64, I wonder if the most useful thing would actually be to trigger these builds elsewhere -- I don't think GitHub Actions currently provides runners on architectures other than x86_64.

mefuller commented 2 years ago

ok, I took care of the original test errors and now have a few more to deal with - I'll work on a larger PR aimed at getting things working across architectures

bryanwweber commented 2 years ago

> I don't think GitHub Actions currently provides runners on architectures other than x86_64.

This is true, but it can use emulated architectures, as is done for the PyPI packages. That said, if COPR provides the resources, it'd probably be worth having architectures other than x86_64 running over there. I still think it'd be worth having a Fedora build on our GH actions here though.

mefuller commented 2 years ago

I need to ask for a bit more help: On s390x architecture, I get the following block of errors:

----------------------------- Captured stdout call -----------------------------
Solution saved to file /builddir/build/BUILD/cantera-reduce_precision/test/work/python/impingingjet1.yaml as solution 'solution'.
- generated xml file: /builddir/build/BUILD/cantera-reduce_precision/test/work/pytest.xml -
=========================== short test summary info ============================
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:246: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:316: pandas is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:327: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:387: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:373: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:562: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:513: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:521: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:517: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_kinetics.py:142: scipy is not installed
SKIPPED [1] ../../build/python/cantera/test/test_onedim.py:791: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_onedim.py:1350: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_reaction.py:346: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_reactor.py:1490: Integration of sensitivity ODEs is unreliable
XFAIL ../../build/python/cantera/test/test_equilibrium.py::MultiphaseEquilTest::test_equil_gri_lean
  reason: 
XFAIL ../../build/python/cantera/test/test_equilibrium.py::MultiphaseEquilTest::test_equil_gri_stoichiometric
  reason: 
XFAIL ../../build/python/cantera/test/test_equilibrium.py::EquilExtraElements::test_element_potential
  reason: 
XFAIL ../../build/python/cantera/test/test_mixture.py::TestMixture::test_equilibrate2
  reason: 
ERROR ../../build/python/cantera/test/test_composite.py::TestModels::test_load_thermo_models
ERROR ../../build/python/cantera/test/test_composite.py::TestModels::test_restore_thermo_models
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_outunits1
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_outunits2
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_simple
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_surface
FAILED ../../build/python/cantera/test/test_convert.py::ck2yamlTest::test_extra
FAILED ../../build/python/cantera/test/test_convert.py::ck2yamlTest::test_sri_zero
= 6 failed, 1384 passed, 14 skipped, 4 xfailed, 11 warnings, 2 errors in 66.66s (0:01:06) =

e.g. https://download.copr.fedorainfracloud.org/results/fuller/cantera-test/fedora-35-s390x/03195208-cantera/builder-live.log.gz

I don't see any useful output. Am I looking in the wrong place and/or are there options I should pass to the tests to get more out?

ischoegl commented 2 years ago

@mefuller ... could you run tests with the SCons flag verbose_tests=y?

ischoegl commented 2 years ago

On second look. The existing log already points to

>   ???
E   ruamel.yaml.reader.ReaderError: unacceptable character #x0000: control characters are not allowed
E     in "/builddir/build/BUILD/cantera-reduce_precision/test/work/python/gri30_extra-from-ck.yaml", position 16384

ruamel.yaml.clib/_ruamel_yaml.pyx:904: ReaderError

meaning that the generated output contains some problematic characters. Tracking this down would likely involve extracting gri30_extra-from-ck.yaml from your build environment.
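Once the file has been extracted, a quick check is to scan it for NUL bytes directly, independent of any YAML parser. This is a minimal sketch; the function name `find_null_bytes` and its `limit` parameter are made up for illustration.

```python
from pathlib import Path


def find_null_bytes(path, limit=10):
    """Return the byte offsets of the first `limit` NUL bytes in the file."""
    data = Path(path).read_bytes()
    offsets = []
    start = 0
    while len(offsets) < limit:
        idx = data.find(b"\x00", start)
        if idx == -1:
            break  # no further NUL bytes
        offsets.append(idx)
        start = idx + 1
    return offsets
```

Calling `find_null_bytes("test/work/python/gri30_extra-from-ck.yaml")` inside the build tree would return an empty list if the file itself is clean, which would suggest the problem lies in how the file is read rather than in how it was written.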

mefuller commented 2 years ago

@ischoegl thanks - I feel pretty dumb now for not seeing all that output above where I was looking. I'll see what I can do.

ischoegl commented 2 years ago

No worries. Fwiw, I just retracted a PR as I didn't realize that the change would force a complete rebuild of Cantera after each commit :cry: ... hindsight (sigh)

mefuller commented 2 years ago

I guess today's a move fast and break things kind of day.

I ran the verbose tests: https://download.copr.fedorainfracloud.org/results/fuller/cantera-test/fedora-rawhide-s390x/03196717-cantera/builder-live.log.gz (just in case anyone else wants to take a peek)

speth commented 2 years ago

This error:

_____________ ERROR at setup of TestModels.test_load_thermo_models _____________

cls = <class 'cantera.test.test_composite.TestModels'>

    @classmethod
    def setUpClass(cls):
        utilities.CanteraTest.setUpClass()
        cls.yml_file = cls.test_data_path / "thermo-models.yaml"
>       cls.yml = utilities.load_yaml(cls.yml_file)

../../build/python/cantera/test/test_composite.py:18: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../build/python/cantera/test/utilities.py:35: in load_yaml
    return yaml_.load(stream)
/usr/lib/python3.10/site-packages/ruamel/yaml/main.py:341: in load
    return constructor.get_single_data()
/usr/lib/python3.10/site-packages/ruamel/yaml/constructor.py:111: in get_single_data
    node = self.composer.get_single_node()
ruamel.yaml.clib/_ruamel_yaml.pyx:701: in _ruamel_yaml.CParser.get_single_node
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   ruamel.yaml.reader.ReaderError: unacceptable character #x0000: control characters are not allowed
E     in "/builddir/build/BUILD/cantera-reduce_precision/build/python/cantera/test/data/thermo-models.yaml", position 16384

ruamel.yaml.clib/_ruamel_yaml.pyx:904: ReaderError

just looks like an internal problem with the ruamel.yaml.clib. The file test/data/thermo-models.yaml, which is part of our Git repo, does not contain any null bytes, or even any non-printable characters.

mefuller commented 2 years ago

> A compiler segfault that seems to have something to do with including one of our dependencies' header files is definitely an upstream issue, not a problem that we have any chance of fixing.

I have filed a bug report with eigen at https://gitlab.com/libeigen/eigen/-/issues/2422

mefuller commented 2 years ago

> just looks like an internal problem with the ruamel.yaml.clib. The file test/data/thermo-models.yaml, which is part of our Git repo, does not contain any null bytes, or even any non-printable characters.

I opened a ticket regarding this issue: https://sourceforge.net/p/ruamel-yaml/tickets/417/

AvdN commented 2 years ago

@mefuller If this is an issue, it is in ruamel-yaml-clib. The 16384 (2^14) is the input buffer size ( https://sourceforge.net/p/ruamel-yaml-clib/code/ci/default/tree/yaml_private.h#l57 ), so maybe this is some issue with reading past the buffer that only shows up on s390x.

I assume you compile ruamel.yaml.clib yourself (as I don't provide any wheels for that architecture), so maybe you can patch a larger number in there.
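The arithmetic behind the buffer-size observation can be spelled out as follows. The sketch assumes the position reported by the ReaderError is a zero-based offset into the input, which is not confirmed anywhere in this thread.

```python
BUFFER_SIZE = 2 ** 14      # 16384, the input buffer size referred to above
error_position = 16384     # "position 16384" from both ReaderError messages

# Which buffer refill the offending position falls in, and the offset inside it,
# ASSUMING the reported position is a zero-based offset into the input.
refill_index, offset_in_buffer = divmod(error_position, BUFFER_SIZE)
print(refill_index, offset_in_buffer)  # 1 0
```

Under that assumption, the error lands on the very first element after one full buffer, in two unrelated files, which is consistent with a buffer-boundary problem in the reader rather than bad data in either file.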

mefuller commented 2 years ago

@AvdN I've opened a bug report to have the buffer patch tested: https://bugzilla.redhat.com/show_bug.cgi?id=2042422

mefuller commented 2 years ago

A Red Hat ticket has also been opened regarding the Eigen / ppc64le build failure: https://bugzilla.redhat.com/show_bug.cgi?id=2042432

speth commented 2 years ago

I think that last bug should be filed against GCC, not Eigen - an error in Eigen should at worst result in the compiler reporting an error of some sort, not segfaulting.

Cantera / cantera

Builds/Tests now failing on aarch64, ppc64le architectures (Fedora 36=Rawhide) #1174