NCAR / ccpp-scm

CCPP Single Column Model

UFS-dev PR#189 #486

Closed: grantfirl closed this 2 months ago

grantfirl commented 3 months ago

Contains changes from:

https://github.com/NOAA-EMC/fv3atm/pull/816
https://github.com/NOAA-EMC/fv3atm/pull/831
https://github.com/NOAA-EMC/fv3atm/pull/807

Plus:

  1. Update to the SCM CMakeLists.txt to require MPI (should have been implemented in #482; see the sketch after this list)
  2. Add cdata%thread_cnt initialization
  3. Combined with #478
  4. Combined with #480
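
As a reference for item 1, requiring MPI at configure time in CMake generally looks like the following. This is a minimal sketch, and the `scm` target name is only illustrative of how ccpp-scm's CMakeLists.txt might consume it:

```cmake
# Fail the configure step if no MPI installation is found, rather than
# treating MPI as optional.
find_package(MPI REQUIRED COMPONENTS Fortran)

# Link the MPI Fortran bindings into the executable (target name illustrative).
target_link_libraries(scm PRIVATE MPI::MPI_Fortran)
```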
grantfirl commented 3 months ago

@scrasmussen @mkavulich @dustinswales There are several CI issues outstanding:

  1. Nvidia builds need changes for MPI (previous failures caused by the physics bug should be fixed in this PR - see https://github.com/NCAR/ccpp-physics/pull/1075)
  2. My hypothesis about the DEPHY CI test fix didn't pan out. I can't replicate this failure on other platforms, and there isn't much debugging information to go on. It has also been intermittent in previous PRs, so I don't know what's causing it.
  3. The Dockerfile needs tweaking for MPI.

Do we want to try to fix any of these in this PR or do it separately?

grantfirl commented 3 months ago

Also @ligiabernardet @scrasmussen @dustinswales @mkavulich UFS recently updated their modulefiles for Hera: https://github.com/ufs-community/ufs-weather-model/pull/2093. In particular, the GNU version went from 9 to 13! I'm guessing that we should follow suit. I can do this in this PR. What do you think?

dustinswales commented 3 months ago

@grantfirl We can move to ubuntu24.04 for the RT test, which has GNU 13. I can help with debugging the CI tests.

dustinswales commented 3 months ago

@grantfirl I'm still working through the CI issues. Don't let me hold this PR up. I will follow up with a CI PR once it's all working again.

grantfirl commented 3 months ago

> @grantfirl I'm still working through the CI issues. Don't let me hold this PR up. I will follow up with a CI PR once it's all working again.

I'm going to update the Hera modulefiles at least before merging.

grantfirl commented 3 months ago

@mkavulich We'll need to upload the single-precision artifact to the FTP server so that the single-precision RTs don't fail the RT CI script.

grantfirl commented 3 months ago

@mkavulich Could you double-check my changes to the Hera module files? I tried to make them compatible with https://github.com/ufs-community/ufs-weather-model/pull/2093. Was there a reason that we needed cmake 3.28.1 or to separately load miniconda? It looks like this is already included in spack-stack 1.6.0.

FYI, when using Hera GNU, I'm getting CMake warnings about policy CMP0074 for ccpp-framework and ccpp-physics. We may need to open an issue to resolve this warning in those repos.
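
For context, CMP0074 controls whether `find_package()` honors `<PackageName>_ROOT` variables; the warning typically disappears once the policy is set to NEW, either by requiring CMake >= 3.12 or by setting it explicitly. A generic sketch, not necessarily how those repos would choose to address it:

```cmake
# Option 1: require a CMake version (>= 3.12) whose policy defaults include CMP0074=NEW.
cmake_minimum_required(VERSION 3.12)

# Option 2: set the policy explicitly near the top of the CMakeLists.txt.
if(POLICY CMP0074)
  cmake_policy(SET CMP0074 NEW)
endif()
```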

mkavulich commented 3 months ago

@grantfirl cmake should be set to version 3.23.1; the newer version is the machine default, but we should load 3.23.1, which is the spack-stack version, for consistency across platforms. I don't think there was a need to also load miniconda; I might have copied that incorrectly from an older module file. If the tests are all working without it, then I'd leave it out.

Is there a way to get the artifacts from these failed tests? Right now it's failing with an error because it can't download the data. My thinking is that I could create a fake baseline file that's just a copy of the double-precision baselines for the test to download and compare against. That should give us a "failed" test, but it should complete without an error, which would get us a real single-precision artifact we can upload. Does that sound like a good plan?

grantfirl commented 3 months ago

> @grantfirl cmake should be set to version 3.23.1; the newer version is the machine default, but we should load 3.23.1, which is the spack-stack version, for consistency across platforms. I don't think there was a need to also load miniconda; I might have copied that incorrectly from an older module file. If the tests are all working without it, then I'd leave it out.
>
> Is there a way to get the artifacts from these failed tests? Right now it's failing with an error because it can't download the data. My thinking is that I could create a fake baseline file that's just a copy of the double-precision baselines for the test to download and compare against. That should give us a "failed" test, but it should complete without an error, which would get us a real single-precision artifact we can upload. Does that sound like a good plan?

I'm thinking that your proposal would work. I don't know how else to do it.
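
A rough shell sketch of that workaround; the artifact file names below are assumptions based on the `rt-baselines-${{matrix.build-type}}.zip` pattern in the workflow, not verified names:

```bash
# Re-label the existing double-precision (Release) baseline as the single-precision
# baseline so the SP job can download something to compare against. The comparison
# is expected to report failures, but the job should finish without erroring out,
# producing a real single-precision artifact that can then be uploaded.
cp rt-baselines-Release.zip rt-baselines-SinglePrecision.zip
# ...then stage rt-baselines-SinglePrecision.zip at the URL the workflow's wget step uses.
```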

mkavulich commented 3 months ago

Okay, the "fake" artifacts are in place (baseline and plots). Let me know if you need anything else.

mkavulich commented 3 months ago

@grantfirl Can you make a quick change before this PR is merged? I got an email from Lara Ziady suggesting I move our staged artifacts to a new location on the web server. It's a one-line change, and shouldn't need any additional testing assuming the tests still pass after this change (I already copied the artifacts to the new location):

diff --git a/.github/workflows/ci_run_scm_rts.yml b/.github/workflows/ci_run_scm_rts.yml
index 7ce607c..319576f 100644
--- a/.github/workflows/ci_run_scm_rts.yml
+++ b/.github/workflows/ci_run_scm_rts.yml
@@ -208,7 +208,7 @@ jobs:
     - name: Download SCM RT baselines
       run: |
         cd ${dir_bl}
-        wget https://dtcenter.ucar.edu/ccpp/users/rt/rt-baselines-${{matrix.build-type}}.zip
+        wget https://dtcenter.ucar.edu/ccpp/rt/rt-baselines-${{matrix.build-type}}.zip
         unzip rt-baselines-${{matrix.build-type}}.zip
dustinswales commented 3 months ago

> @grantfirl cmake should be set to version 3.23.1; the newer version is the machine default, but we should load 3.23.1, which is the spack-stack version, for consistency across platforms. I don't think there was a need to also load miniconda; I might have copied that incorrectly from an older module file. If the tests are all working without it, then I'd leave it out.
>
> Is there a way to get the artifacts from these failed tests? Right now it's failing with an error because it can't download the data. My thinking is that I could create a fake baseline file that's just a copy of the double-precision baselines for the test to download and compare against. That should give us a "failed" test, but it should complete without an error, which would get us a real single-precision artifact we can upload. Does that sound like a good plan?

@mkavulich When I first created the baseline artifacts for the CI, I ran the CI script with the comparison step commented out. Then I moved the "uncompared" artifact to the FTP server, re-enabled the comparison step, and reran the CI. For the SP tests you can do the same thing to create the baselines. Unfortunately, only a subset of SDFs is used for SP, so we can't use the same baselines for both SP and the Release/Debug tests. To get around this, one could subset the existing baseline files to include only the tests used by the SP runs and upload that to the FTP server.
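
A sketch of that subsetting step, assuming the baseline zip simply contains one directory per case/suite run; the run names and the SP artifact name here are purely illustrative:

```bash
# Unpack the full double-precision baseline, keep only the runs that the
# single-precision test matrix exercises, and repackage the result.
unzip rt-baselines-Release.zip -d baselines_full
mkdir baselines_sp
for run in arm_sgp_summer_1997_A_SCM_GFS_v16 twpice_SCM_GFS_v16; do   # illustrative names
  cp -r "baselines_full/${run}" baselines_sp/
done
(cd baselines_sp && zip -r ../rt-baselines-SinglePrecision.zip .)
```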

grantfirl commented 3 months ago

> @grantfirl Can you make a quick change before this PR is merged? I got an email from Lara Ziady suggesting I move our staged artifacts to a new location on the web server. It's a one-line change, and shouldn't need any additional testing assuming the tests still pass after this change (I already copied the artifacts to the new location):
>
> diff --git a/.github/workflows/ci_run_scm_rts.yml b/.github/workflows/ci_run_scm_rts.yml
> index 7ce607c..319576f 100644
> --- a/.github/workflows/ci_run_scm_rts.yml
> +++ b/.github/workflows/ci_run_scm_rts.yml
> @@ -208,7 +208,7 @@ jobs:
>      - name: Download SCM RT baselines
>        run: |
>          cd ${dir_bl}
> -        wget https://dtcenter.ucar.edu/ccpp/users/rt/rt-baselines-${{matrix.build-type}}.zip
> +        wget https://dtcenter.ucar.edu/ccpp/rt/rt-baselines-${{matrix.build-type}}.zip
>          unzip rt-baselines-${{matrix.build-type}}.zip

Done.

grantfirl commented 2 months ago

@dustinswales @mkavulich @scrasmussen Unfortunately, there are some runtime failures in the CI RTs for some cases/suites. I cannot replicate the failures locally, so I don't know how to debug. Any ideas?

dustinswales commented 2 months ago

@grantfirl Looking into the CI failures now.

dustinswales commented 2 months ago

@grantfirl I also cannot replicate this failure on Hera. Some observations:

dustinswales commented 2 months ago

> @grantfirl I also cannot replicate this failure on Hera. Some observations:

@grantfirl Some improvement going from GNU11 -> GNU12

I'm going to try GNU13 using Ubuntu24.04 and see what happens.

dustinswales commented 2 months ago

@grantfirl Same story with GNU13. RELEASE and SP pass. Errors in DEBUG mode, error code 136. I will look into this more later today.

dustinswales commented 2 months ago

@grantfirl For some reason unknown to me, if you apply this change, all the tests run w/o error.

grantfirl commented 2 months ago

> @grantfirl For some reason unknown to me, if you apply this change, all the tests run w/o error.

@dustinswales @mkavulich @scrasmussen This doesn't seem to be the case. With https://github.com/NCAR/ccpp-scm/pull/486/commits/93732dbe9633174decb7591b4aaff2049af2d9f3, I'm still seeing some status 136s and 139s, except that the output is more verbose. My hunch is that an MPI issue in the GitHub workflow is somehow causing this. I think we can debug this more effectively using containers after the release. I don't think this should hold anything up, since we can't replicate the failures on any other machine.
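
For reference, exit statuses above 128 encode a fatal signal (status = 128 + signal number), so these look like runtime crashes rather than baseline-comparison failures:

```bash
kill -l $((136 - 128))   # prints FPE  -> floating-point exception (often from DEBUG-mode traps)
kill -l $((139 - 128))   # prints SEGV -> segmentation fault
```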

grantfirl commented 2 months ago

@dustinswales I removed the extra verbosity flag for the RTs for now, since it was just adding length and making it harder to find failures.