Closed mefuller closed 2 years ago
Hi @mefuller … thanks for reporting. One thing that would help narrow down the offending commit is knowing when the last successful build was triggered (ideally, at which commit hash).
Four days ago we were in good shape: https://koji.fedoraproject.org/koji/tasks?owner=fuller&state=all and https://copr.fedorainfracloud.org/coprs/fuller/Cantera/build/3163517/
I'm looking, but not finding a corresponding commit hash. The good news is, the builds on COPR pull the main branch of the official Cantera repo at the time they are run.
Ok. #1089 is the likely culprit for this then (sigh). It was merged 4 days ago and the last build likely passed just hours before that merge.
If it helps, I would be willing to work with you and @bryanwweber (and anyone else) on setting up automated builds for testing with Fedora/EL and multiple architectures. I believe I can provide you with URLs to add as webhooks to trigger builds when you push to main (I have not tested this yet).
Regarding the build failure, it almost looks like this is due to some upstream issue, as it is triggered for an #include statement. You're using system Eigen (3.4.0), whereas the last successful build used 3.3.9. So I am not sure that this has to do with recent changes in Cantera. (#1089 heavily relies on Eigen's sparse matrices, which made me think of this for a moment.)
Regarding the other issues, these happen to be in a part unaffected by recent changes and mainly look like issues related to machine precision. Still curious that this happens all of a sudden.
Regarding the test, I hadn't been testing on anything other than x86_64 previously, so I can't say for how long the precision issues have been present. Would it be acceptable to modify the tests such that there's more leniency? I'd like to retain the current structure where my builds are marked as failed if the tests fail.
Regarding the test, I hadn't been testing on anything other than x86_64 previously, so I can't say for how long the precision issues have been present.
That would explain this!
Would it be acceptable to modify the tests such that there's more leniency?
I think changing offending lines to ASSERT_NEAR may be appropriate in this case.
A compiler segfault that seems to have something to do with including one of our dependencies' header files is definitely an upstream issue, not a problem that we have any chance of fixing.
I agree that changing those failing comparisons to ASSERT_NEAR would probably be fine, although I would keep the tolerance fairly tight, as the differences should just be the result of a little bit of accumulated rounding error.
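For reference, GoogleTest's ASSERT_NEAR(val1, val2, abs_error) passes when the absolute difference between the two values does not exceed abs_error. The Python sketch below mirrors that check so the effect of a tight tolerance is easy to see. The helper name assert_near and the numbers are hypothetical and are not taken from the failing Cantera tests.

```python
import sys


def assert_near(val1: float, val2: float, abs_error: float) -> None:
    """Raise AssertionError unless |val1 - val2| <= abs_error."""
    diff = abs(val1 - val2)
    if not diff <= abs_error:
        raise AssertionError(
            f"|{val1!r} - {val2!r}| = {diff!r} exceeds tolerance {abs_error!r}"
        )


# Hypothetical values: a reference result and one that is off by a few
# units of machine epsilon, as accumulated rounding error might produce.
reference = 1.0
computed = reference + 4 * sys.float_info.epsilon

# An exact comparison would fail here, but a tight absolute tolerance passes.
assert computed != reference
assert_near(computed, reference, 1e-12)
```

A tolerance this tight still catches real regressions: a difference of 1e-3 against the same reference would raise AssertionError.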
@mefuller Thanks for volunteering! I'd been thinking about how to add a Fedora job to our CI here on GitHub Actions. You can specify a container in which the job should run, so I think it should be possible to add a job that pulls a Fedora container from Quay and runs the build and tests inside that. I've been working on other things lately, but it's on my to-do list. If you want to try to figure it out, you can edit .github/workflows/main.yml to add a new job. Thanks!
@bryanwweber I will definitely take a look. It's also possible to add a webhook in the repo settings for Cantera to trigger builds in my (test) repo for Fedora, but that's less desirable.
It's also possible to add a webhook in the repo settings for Cantera to trigger builds in my (test) repo for Fedora, but that's less desirable.
Given that the failures that are being identified here are related to architectures other than x86_64, I wonder if the most useful thing would actually be to trigger these builds elsewhere -- I don't think GitHub Actions currently provides runners on architectures other than x86_64.
ok, I took care of the original test errors and now have a few more to deal with - I'll work on a larger PR aimed at getting things working across architectures
I don't think GitHub Actions currently provides runners on architectures other than x86_64.
This is true, but it can use emulated architectures, as is done for the PyPI packages. That said, if COPR provides the resources, it'd probably be worth having architectures other than x86_64 running over there. I still think it'd be worth having a Fedora build on our GH actions here though.
I need to ask for a bit more help: On s390x architecture, I get the following block of errors:
----------------------------- Captured stdout call -----------------------------
Solution saved to file /builddir/build/BUILD/cantera-reduce_precision/test/work/python/impingingjet1.yaml as solution 'solution'.
- generated xml file: /builddir/build/BUILD/cantera-reduce_precision/test/work/pytest.xml -
=========================== short test summary info ============================
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:246: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:316: pandas is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:327: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:387: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:373: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_composite.py:562: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:513: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:521: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_jacobian.py:517: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_kinetics.py:142: scipy is not installed
SKIPPED [1] ../../build/python/cantera/test/test_onedim.py:791: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_onedim.py:1350: h5py is not installed
SKIPPED [1] ../../build/python/cantera/test/test_reaction.py:346: change of reaction enthalpy is not considered
SKIPPED [1] ../../build/python/cantera/test/test_reactor.py:1490: Integration of sensitivity ODEs is unreliable
XFAIL ../../build/python/cantera/test/test_equilibrium.py::MultiphaseEquilTest::test_equil_gri_lean
reason:
XFAIL ../../build/python/cantera/test/test_equilibrium.py::MultiphaseEquilTest::test_equil_gri_stoichiometric
reason:
XFAIL ../../build/python/cantera/test/test_equilibrium.py::EquilExtraElements::test_element_potential
reason:
XFAIL ../../build/python/cantera/test/test_mixture.py::TestMixture::test_equilibrate2
reason:
ERROR ../../build/python/cantera/test/test_composite.py::TestModels::test_load_thermo_models
ERROR ../../build/python/cantera/test/test_composite.py::TestModels::test_restore_thermo_models
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_outunits1
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_outunits2
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_simple
FAILED ../../build/python/cantera/test/test_composite.py::TestSolutionSerialization::test_yaml_surface
FAILED ../../build/python/cantera/test/test_convert.py::ck2yamlTest::test_extra
FAILED ../../build/python/cantera/test/test_convert.py::ck2yamlTest::test_sri_zero
= 6 failed, 1384 passed, 14 skipped, 4 xfailed, 11 warnings, 2 errors in 66.66s (0:01:06) =
I don't see any useful output. Am I looking in the wrong place and/or are there options I should pass to the tests to get more out?
@mefuller ... could you run tests with the SCons flag verbose_tests=y?
On second look, the existing log already points to
> ???
E ruamel.yaml.reader.ReaderError: unacceptable character #x0000: control characters are not allowed
E in "/builddir/build/BUILD/cantera-reduce_precision/test/work/python/gri30_extra-from-ck.yaml", position 16384
ruamel.yaml.clib/_ruamel_yaml.pyx:904: ReaderError
meaning that the generated output contains some problematic characters. Tracking this down would likely involve extracting gri30_extra-from-ck.yaml from your build environment.
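Once the file has been pulled out of the build environment, a quick way to confirm whether it really contains the reported #x0000 character is to scan its raw bytes. The following is a minimal sketch using only the standard library. The function name find_nul_bytes is hypothetical, and the path passed to it is whatever location the file was copied to.

```python
from pathlib import Path


def find_nul_bytes(path, limit=10):
    """Return the byte offsets of up to `limit` NUL (0x00) bytes in a file."""
    data = Path(path).read_bytes()
    offsets = []
    start = 0
    while len(offsets) < limit:
        idx = data.find(b"\x00", start)
        if idx == -1:
            break  # no further NUL bytes in the file
        offsets.append(idx)
        start = idx + 1
    return offsets


# Example (assumed local copy of the generated file):
#     find_nul_bytes("gri30_extra-from-ck.yaml")
# An empty list means the file on disk has no NUL bytes, which would point
# at the reader rather than at the generated output.
```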
@ischoegl thanks - I feel pretty dumb now for not seeing all that output above where I was looking. I'll see what I can do.
No worries. Fwiw, I just retracted a PR as I didn't realize that the change would force a complete rebuild of Cantera after each commit :cry: ... hindsight (sigh)
I guess today's a move fast and break things kind of day.
I ran the verbose tests: https://download.copr.fedorainfracloud.org/results/fuller/cantera-test/fedora-rawhide-s390x/03196717-cantera/builder-live.log.gz (just in case anyone else wants to take a peek)
This error:
_____________ ERROR at setup of TestModels.test_load_thermo_models _____________
cls = <class 'cantera.test.test_composite.TestModels'>
@classmethod
def setUpClass(cls):
utilities.CanteraTest.setUpClass()
cls.yml_file = cls.test_data_path / "thermo-models.yaml"
> cls.yml = utilities.load_yaml(cls.yml_file)
../../build/python/cantera/test/test_composite.py:18:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../build/python/cantera/test/utilities.py:35: in load_yaml
return yaml_.load(stream)
/usr/lib/python3.10/site-packages/ruamel/yaml/main.py:341: in load
return constructor.get_single_data()
/usr/lib/python3.10/site-packages/ruamel/yaml/constructor.py:111: in get_single_data
node = self.composer.get_single_node()
ruamel.yaml.clib/_ruamel_yaml.pyx:701: in _ruamel_yaml.CParser.get_single_node
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E ruamel.yaml.reader.ReaderError: unacceptable character #x0000: control characters are not allowed
E in "/builddir/build/BUILD/cantera-reduce_precision/build/python/cantera/test/data/thermo-models.yaml", position 16384
ruamel.yaml.clib/_ruamel_yaml.pyx:904: ReaderError
just looks like an internal problem with the ruamel.yaml.clib. The file test/data/thermo-models.yaml, which is part of our Git repo, does not contain any null bytes, or even any non-printable characters.
A compiler segfault that seems to have something to do with including one of our dependencies' header files is definitely an upstream issue, not a problem that we have any chance of fixing.
I have filed a bug report with eigen at https://gitlab.com/libeigen/eigen/-/issues/2422
just looks like an internal problem with the ruamel.yaml.clib. The file test/data/thermo-models.yaml, which is part of our Git repo, does not contain any null bytes, or even any non-printable characters.
I opened a ticket regarding this issue: https://sourceforge.net/p/ruamel-yaml/tickets/417/
@mefuller If it is an issue, it is in ruamel-yaml-clib. The 16384 (2^14) is the input buffer size ( https://sourceforge.net/p/ruamel-yaml-clib/code/ci/default/tree/yaml_private.h#l57 ), so maybe this is some issue with reading past the buffer that only shows up on s390x.
I assume you compile ruamel.yaml.clib yourself (as I don't provide any wheels for that architecture), so maybe you can patch a larger number in there.
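The arithmetic behind that suspicion is simple: both ReaderError messages in the logs report position 16384, which is exactly one full buffer of 2^14. A small sketch of the check follows. It assumes the reported position is a 0-based offset counted in the same units as the buffer size, which the logs do not confirm, and the names INPUT_BUFFER_SIZE and on_buffer_boundary are hypothetical.

```python
INPUT_BUFFER_SIZE = 2 ** 14  # 16384, the buffer size discussed above


def on_buffer_boundary(position, buffer_size=INPUT_BUFFER_SIZE):
    """True if `position` is the first offset after a whole number of buffers."""
    return position > 0 and position % buffer_size == 0


# Position reported by both ReaderError messages in the build logs.
reported_position = 16384
print(on_buffer_boundary(reported_position))  # True
```

An error that lands exactly on a buffer boundary in two unrelated files is consistent with a problem in how the next chunk is read, rather than with the contents of either file.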
@AvdN I've opened a bug report to have the buffer patch tested: https://bugzilla.redhat.com/show_bug.cgi?id=2042422
A Red Hat ticket has also been opened regarding the Eigen / ppc64le build failure: https://bugzilla.redhat.com/show_bug.cgi?id=2042432
I think that last bug should be filed against GCC, not Eigen - an error in Eigen should at worst result in the compiler reporting an error of some sort, not segfaulting.
Problem description
All builds on the ppc64le architecture with F36/Rawhide now fail. This was not the case four days ago (see https://copr.fedorainfracloud.org/coprs/fuller/Cantera/builds/).
Concurrently, the kinetics: KineticsAddSpecies3.add_species_sequential test now fails on "successful" builds on F34/35 for ppc64le architectures (and also for aarch64, i686 and s390x on all three Fedoras 34, 35, and 36/Rawhide, but not x86_64). This may not be a new problem, as I only just added testing automation to the build automation for all architectures. (Yes, I know, I should have done that earlier.)
Steps to reproduce
scons build && scons test
Behavior
Error message: Build failure (ppc64le:F36):
Test failure (all other affected systems):
System information
Attachments
Additional context
While the build processes are not failing, one test pertaining to kinetics is on both F34 and F35. I suspect that these problems are related. The test failures look like excessive precision being requested - or is this a truncation/rounding error?
Logs
ppc64le/Rawhide - failed build
ppc64le/F35 - failed test
ppc64le/F34 - failed test
aarch64/Rawhide
aarch64/F35
aarch64/F34
Additional information and build logs at: 1) https://copr.fedorainfracloud.org/coprs/fuller/Cantera/build/3192999/ 2) https://koji.fedoraproject.org/koji/taskinfo?taskID=81397194