Closed grantfirl closed 2 months ago
@scrasmussen @mkavulich @dustinswales There are several CI issues outstanding:
Do we want to try to fix any of these in this PR or do it separately?
Also @ligiabernardet @scrasmussen @dustinswales @mkavulich UFS recently updated their modulefiles for Hera: https://github.com/ufs-community/ufs-weather-model/pull/2093. In particular GNU version went from 9 to 13! I'm guessing that we should follow suit. I can do this in this PR. What do you think?
@grantfirl We can move to ubuntu24.04 for the RT test, which has GNU 13. I can help with debugging the CI tests.
@grantfirl I'm still working through the CI issues. Don't let me hold this PR up. I will follow up with a CI PR once it's all working again.
I'm going to update the Hera modulefiles at least before merging.
@mkavulich We'll need to upload the single-precision artifact to the FTP server so that the single-precision RTs don't fail the RT CI script.
@mkavulich Could you double-check my changes to the Hera module files? I tried to make them compatible with https://github.com/ufs-community/ufs-weather-model/pull/2093. Was there a reason that we needed cmake 3.28.1 or to separately load miniconda? It looks like this is already included in spack-stack 1.6.0.
FYI, when using Hera GNU, I'm getting cmake warnings about policy CMP0074 for ccpp-framework and ccpp-physics. We may need to add an issue to resolve this warning in those repos.
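For context, CMP0074 governs whether find_package honors <PackageName>_ROOT variables; CMake warns when a project's cmake_minimum_required predates 3.12 and such a variable is set. Until ccpp-framework and ccpp-physics set the policy themselves, one possible workaround (a sketch, not a tested fix for these repos; the source path is a placeholder) is to force the policy default from the command line:

```shell
# Workaround sketch: force CMP0074 to NEW for the whole build without
# editing ccpp-framework/ccpp-physics, via CMake's policy-default variable.
# Replace the placeholder path with the actual SCM source directory.
cmake -DCMAKE_POLICY_DEFAULT_CMP0074=NEW /path/to/ccpp-scm/scm/src
```

The proper fix would still be an issue in those repos to raise cmake_minimum_required (or add cmake_policy(SET CMP0074 NEW)) so downstream builds don't need the flag.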
@grantfirl cmake should be set to version 3.23.1; the newer version is the machine default, but we should load 3.23.1, which is the spack-stack version, just for consistency across platforms. I don't think there was a need to also load miniconda; I might have copied that incorrectly from an older module file. If the tests are all working without it, I'd leave it out.
Is there a way to get the artifacts from these failed tests? Right now it's failing with an error because it can't download the data. My thinking is I could create a fake baseline file that's just a copy of the double-precision tests for it to download and compare, which should give us a "failed" test but it should complete without an error, which should get us a real Single-Precision artifact we can upload. Does that sound like a good plan?
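The workaround could look something like the sketch below. The archive names follow the workflow's rt-baselines-${{matrix.build-type}}.zip pattern, but "SP" is an assumed build-type string, and the placeholder file stands in for the real double-precision baseline archive:

```shell
# Sketch of the "fake baseline" idea: stage a copy of the double-precision
# (release) baseline under the single-precision name so the CI download step
# succeeds. The comparison will then fail, but the job completes and produces
# a real single-precision artifact that can be uploaded as the true baseline.
workdir=$(mktemp -d)
cd "$workdir"
echo "placeholder baseline" > rt-baselines-release.zip   # stand-in for the real archive
cp rt-baselines-release.zip rt-baselines-SP.zip          # fake SP baseline to upload
ls rt-baselines-SP.zip
```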
I'm thinking that your proposal would work. I don't know how else to do it.
Okay the "fake" artifacts are in place (baseline and plots). Let me know if you need anything else.
@grantfirl Can you make a quick change before this PR is merged? I got an email from Lara Ziady suggesting I move our staged artifacts to a new location on the web server. It's a one-line change, and shouldn't need any additional testing assuming the tests still pass after this change (I already copied the artifacts to the new location):
diff --git a/.github/workflows/ci_run_scm_rts.yml b/.github/workflows/ci_run_scm_rts.yml
index 7ce607c..319576f 100644
--- a/.github/workflows/ci_run_scm_rts.yml
+++ b/.github/workflows/ci_run_scm_rts.yml
@@ -208,7 +208,7 @@ jobs:
- name: Download SCM RT baselines
run: |
cd ${dir_bl}
- wget https://dtcenter.ucar.edu/ccpp/users/rt/rt-baselines-${{matrix.build-type}}.zip
+ wget https://dtcenter.ucar.edu/ccpp/rt/rt-baselines-${{matrix.build-type}}.zip
unzip rt-baselines-${{matrix.build-type}}.zip
@mkavulich When I first created the baseline artifacts for the CI, I ran the CI script with the comparison commented out. Then I moved the uncompared artifact to the FTP server, uncommented the comparison step, and reran the CI. For the SP tests you can do the same thing to create the baselines. Unfortunately, there is a subset of SDFs for SP, so we can't use the same baselines for both the SP and the Release/Debug tests. To get around this, one could subset the existing baseline files to include only the tests used by the SP tests, and upload that to the FTP server.
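That subsetting step could be sketched as below. The case/suite file names are illustrative only; the authoritative SP test list lives in test/rt_test_cases.py:

```shell
# Build an SP-only baseline directory by copying just the outputs the SP
# tests exercise, then zip and upload that directory. File names here are
# made up for illustration, not the real baseline contents.
mkdir -p bl_full bl_sp
touch bl_full/twpice_SCM_GFS_v16.nc \
      bl_full/twpice_SCM_GFS_v17_p8.nc \
      bl_full/arm_sgp_summer_1997_A_SCM_GFS_v16.nc
sp_files="twpice_SCM_GFS_v16.nc arm_sgp_summer_1997_A_SCM_GFS_v16.nc"  # assumed SP subset
for f in $sp_files; do cp "bl_full/$f" bl_sp/; done
ls bl_sp   # only the SP subset remains, ready to zip and stage on the server
```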
Done.
@dustinswales @mkavulich @scrasmussen Unfortunately, there are some runtime failures in the CI RTs for some cases/suites. I cannot replicate the failures locally, so I don't know how to debug. Any ideas?
@grantfirl Looking into the CI failures now.
@grantfirl I also cannot replicate this failure on Hera. Some observations:
- All failed tests are unsupported for v7. https://github.com/NCAR/ccpp-scm/blob/main/test/rt_test_cases.py#L31
- The RELEASE mode tests fail (2) with error code 1.
- The DEBUG mode tests fail (7) with error code 136.
- The SP mode tests fail (1) with error code 139. The CI tests use GNU v11, whereas I used GNU v13 on Hera. I'm testing the CI scripts with GNU v12 to see if that's the problem; if that fails, I will update the CI script to GNU v13, using ubuntu-24.04.
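As a side note on those numbers: shell exit codes above 128 conventionally encode a fatal signal as 128 + signal number, which can be decoded like this:

```shell
# Decode the CI exit codes: 128 + N means the process died on signal N.
# On Linux, 136 -> signal 8 (SIGFPE, e.g. a floating-point trap, which fits
# DEBUG builds that enable FP exception trapping), and 139 -> signal 11
# (SIGSEGV, a segmentation fault).
for code in 136 139; do
  sig=$((code - 128))
  echo "exit $code -> signal $sig ($(kill -l $sig))"
done
```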
@grantfirl Some improvement going from GNU 11 -> GNU 12.
I'm going to try GNU 13 using Ubuntu 24.04 and see what happens.
@grantfirl Same story with GNU 13: RELEASE and SP pass, but DEBUG mode still errors with error code 136. I will look into this more later today.
@grantfirl For some reason unknown to me, if you apply this change, all the tests run w/o error.
@dustinswales @mkavulich @scrasmussen This doesn't seem to be the case. With https://github.com/NCAR/ccpp-scm/pull/486/commits/93732dbe9633174decb7591b4aaff2049af2d9f3, I'm still seeing some status 136s and 139s except the output is more verbose. My hunch is that there is an MPI issue causing this with the GitHub workflow somehow. I think that we can debug this more effectively using containers after the release. I don't think that this should hold up anything since we can't replicate failures on any other machines.
@dustinswales I removed the extra verbosity flag for the RTs for now since it was just adding length and made it harder to find failures.
Contains changes from:
https://github.com/NOAA-EMC/fv3atm/pull/816
https://github.com/NOAA-EMC/fv3atm/pull/831
https://github.com/NOAA-EMC/fv3atm/pull/807
Plus:
cdata%thread_cnt initialization