NOAA-EMC / GSI-Monitor

GSI Monitoring Tools

Update RadMon plotting #47

Closed: EdwardSafford-NOAA closed this 1 year ago

EdwardSafford-NOAA commented 1 year ago

This PR includes the following changes:

Testing was done on wcoss2, orion, and hera using extracted RadMon data arranged in operational, rocoto workflow, and internal monitor directory structures. Stress testing was also done by plotting very large numbers of cycles. Note that problems remain with some of the very large instruments (iasi, cris) when the requested number of cycles exceeds roughly 375 (about 3x the default span). If this proves to be a problem for users, a different scheme will have to be adopted.

EdwardSafford-NOAA commented 1 year ago

Converted this to PR draft while I figure out why hdf5 now fails to build in the intel ci test.

EdwardSafford-NOAA commented 1 year ago

@aerorahul I hate to bug you but I need some guidance on the intel ci build failure here. The failures begin with building the hdf5 library (line 770 in the log). I can't see anything in the ci/spack.yaml and the .github/workflow/intel.yml files that might be causing this. Thoughts?

I can add a bit more detail. The intel ci test is trying to load hdf5-1.12.2. It doesn't find a binary for that and then attempts to build hdf5, which fails. My guess is that this is a version problem, since hpc-stack includes hdf5-1.10.6. I've compared the GSI-monitor/ci/spack.yaml and workflow/intel.yaml to the same files in GSI and don't see a difference. Neither one explicitly loads hdf5, but the GSI intel test is working.

Additional update: I created #49, a draft PR with a single, insignificant change, to run the ci tests. Curiously both the intel and gcc tests failed. The intel test failed in the same place, indicating that the cause of failure is not any of the changes here in #47. That's what I expected -- this PR contains only changes to scripts, not code. The gcc failure is confusing. I re-ran the gcc test here in #47 and it works. I don't know what to make of that.

EdwardSafford-NOAA commented 1 year ago

@DavidHuber-NOAA please take a look when you have a chance. I'm still not sure what's up with the ci intel test failure; the actual intel builds on wcoss2, orion, and hera are fine. Thanks.

aerorahul commented 1 year ago

@edwardhartnett @AlexanderRichert-NOAA This repository uses spack to build the dependencies. It has been failing for a while. Any assistance from the spack team is appreciated.

AlexanderRichert-NOAA commented 1 year ago

I'll take a look. Any idea when it started failing, or if there was a change that precipitated it?

EdwardSafford-NOAA commented 1 year ago

@AlexanderRichert-NOAA it first failed for me on 11/22. Intel test only. Today the gcc test failed when I ran a control test containing no changes to develop, though that same test ran to completion in this PR today. The last time I can for sure say both tests ran to completion was 11/3.

AlexanderRichert-NOAA commented 1 year ago

@EdwardSafford-NOAA Is this run a fair example of when it was working? https://github.com/NOAA-EMC/GSI-Monitor/actions/runs/3386726189/jobs/5626502735

It's a bit of a shot in the dark, but the OS version associated with ubuntu-latest has changed recently (I think it's been going back and forth some?). The failed runs all appear to be using Ubuntu 22, and the handful of successful ones I've looked at used Ubuntu 20. Just to narrow things down, maybe you could try specifying ubuntu-20.04. If that works, it's not a good long-term solution, but we could look at intel oneapi version numbers, any intel-specific logic in hdf5/spack, etc.

EdwardSafford-NOAA commented 1 year ago

@AlexanderRichert-NOAA that worked. I changed ubuntu-latest to ubuntu-20.04 and it ran to completion. Huzzah! Thanks, and please let me know if/when I should go back to using ubuntu-latest.
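
For anyone repeating this workaround, the change amounts to a one-line edit of the workflow's runs-on value. A minimal sketch follows, run against a throwaway file; the file contents and the pin_runner helper are made up for illustration and are not the repository's actual intel workflow.

```shell
# Hypothetical helper: replace the floating ubuntu-latest label with a pinned image.
# $1 = workflow file, $2 = runner label to pin (for example ubuntu-20.04).
pin_runner() {
    # Write to a temporary copy and move it into place (avoids non-portable sed -i).
    sed "s/runs-on: ubuntu-latest/runs-on: $2/" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Demo on a scratch file; this job layout is an assumption, not the real workflow.
workflow=$(mktemp)
cat > "$workflow" <<'EOF'
name: Intel
jobs:
  build:
    runs-on: ubuntu-latest
EOF

pin_runner "$workflow" ubuntu-20.04
grep 'runs-on' "$workflow"   # shows the pinned runs-on line
rm -f "$workflow"
```

Pinning the image only narrows down the failure; as noted in the thread, it is not meant as a long-term fix.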

AlexanderRichert-NOAA commented 1 year ago

@EdwardSafford-NOAA The short answer is that we want to switch back ASAP, so we should keep hunting for the root issue. I'll look at this some more today/tomorrow and see if I can figure out what might be going on.

EdwardSafford-NOAA commented 1 year ago

Hmmm. I think there's something flaky with the argument processing. If I copy and paste to run your command I get the same error. But if I reorder it slightly to ./RadMon_IG_glb.sh test_radmon -p 2020080500 -r gdas -t /scratch1/NESDIS/nesdis-rdo2/David.Huber/monitor -n 20, I get this:

OK to plot
span is start_date to pdate = 2020073100, 2020080500

I'll take a deeper look.
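
As a sanity check on that output: the reported span is consistent with -n 20 meaning 20 six-hourly cycles (120 hours) counted back from the plot date. The sketch below reproduces the arithmetic only. The span_start helper is hypothetical, it is not how RadMon_IG_glb.sh computes the span, and it assumes GNU date.

```shell
# Hypothetical helper: start of a plot span, assuming 6-hourly cycles and GNU date.
# $1 = pdate in YYYYMMDDHH form, $2 = number of cycles.
span_start() {
    local pdate=$1
    local num_cycles=$2
    local ymd hh epoch
    ymd=$(echo "$pdate" | cut -c1-8)
    hh=$(echo "$pdate" | cut -c9-10)
    # Convert the plot date to seconds since the epoch (UTC).
    epoch=$(date -u -d "${ymd} ${hh}:00:00" +%s)
    # Step back num_cycles * 6 hours and print in YYYYMMDDHH form.
    date -u -d "@$(( epoch - num_cycles * 6 * 3600 ))" +%Y%m%d%H
}

span_start 2020080500 20   # prints 2020073100, matching the span reported above
```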

EdwardSafford-NOAA commented 1 year ago

Oh I see the problem. I think you have the wrong suffix in your command -- I think you want test_radmon instead of test_monitor.

DavidHuber-NOAA commented 1 year ago

Roger that ID-10-T error. Thanks.

I am getting a new error now: Unable to set ieee_src, aborting plot

DavidHuber-NOAA commented 1 year ago

This may be another error on my front. I was only able to run 2 cycles and tried spoofing the rest of the data via copies and renames. But without modifying contents of the ctl files, I could be throwing the plotting software off.

EdwardSafford-NOAA commented 1 year ago

That could be the case. Which cycles have the real data? And if you don't mind, can you open the file permissions so I can try running with your data? I'd really like to rule out a bug.

DavidHuber-NOAA commented 1 year ago

The real data is in cycles 2020080100-2020080106. You should have read permissions for all of the data in /scratch1/NESDIS/nesdis-rdo2/David.Huber/monitor. Do you need more than that?

I did try running on just those two cycles and I'm now receiving an error from mk_bcoef_plots.sh:

All bcoef control files are missing from /scratch1/NESDIS/nesdis-rdo2/David.Huber/monitor/stats/test_radmon for requested date range.

DavidHuber-NOAA commented 1 year ago

Ah, I see that the issue is on line 28. Integer 1/4 is 0, so ndays=0.
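
To illustrate the failure mode: shell arithmetic truncates, so any cycle count below 4 divides down to zero days. The sketch below contrasts truncating division with a round-up variant. Both helper names and the 4-cycles-per-day constant are assumptions for illustration; this is not the code on line 28 and not necessarily the fix that was adopted.

```shell
# Hypothetical helpers; assumes 4 cycles per day (6-hourly cycles).
CYCLES_PER_DAY=4

# Truncating integer division: 1 cycle -> 0 days, the problem described above.
days_truncated() {
    echo $(( $1 / CYCLES_PER_DAY ))
}

# Ceiling division: any partial day rounds up, so 1 cycle -> 1 day.
days_rounded_up() {
    echo $(( ($1 + CYCLES_PER_DAY - 1) / CYCLES_PER_DAY ))
}

days_truncated 1     # prints 0
days_rounded_up 1    # prints 1
days_rounded_up 20   # prints 5
```

Rounding up is only one possible guard; clamping the result to a minimum of 1 would cover the small-data case equally well.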

EdwardSafford-NOAA commented 1 year ago

@DavidHuber-NOAA I implemented a fix for the ndays=0 case. Thanks for checking a minimal data situation -- I was totally focused on the maximum case for this release. Please take another look when you have a moment.