NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0
74 stars 167 forks source link

[NCO Bug] Wave post should not copy files to COM whose names might collide #297

Open WalterKolczynski-NOAA opened 3 years ago

WalterKolczynski-NOAA commented 3 years ago

Wave post is copying out_grd* files with the same name from different subdirectories to the rundata directory on COM. These are apparently not used later, so this isn't causing any issues, but is still confusing and can prevent directory reproducibility.

If these files are not actually needed after the job is over, they should not be copied to COM at all. Otherwise, they should be given unique names to avoid collisions.

Refs: Bugzilla #1220

HenryRWinterbottom commented 9 months ago

@WalterKolczynski-NOAA Has this been resolved? In ush/wave_grid_interp_sbs.sh the only copying of files to COM_WAVE_PREP is (line 178):

  cp "${DATA}/output_${ymdh}0000/out_grd.${grdID}" "${COM_WAVE_PREP}/${WAV_MOD_TAG}.out_grd.${grdID}.${PDY}${cyc}"
JessicaMeixner-NOAA commented 9 months ago

I don't think this has been resolved if that line @HenryWinterbottom-NOAA pointed to is still there

HenryRWinterbottom commented 9 months ago

@JessicaMeixner-NOAA OK, thank you. I will remove it. I am not familiar enough with the wave model to know whether this is a breaking change.

JessicaMeixner-NOAA commented 9 months ago

If the output is the same before and after you make this change when running with waves, you can confirm you did not break anything. Removing that line in theory should not break anything, but should be checked of course.

HenryRWinterbottom commented 9 months ago

@JessicaMeixner-NOAA This is my fork: https://github.com/HenryWinterbottom-NOAA/global-workflow/tree/bug/gfsv17_issue_297

Can you please verify that this is a non-breaking change? I have build on RDHPCS Hera so you should be able just to copy my executables.

JessicaMeixner-NOAA commented 9 months ago

@HenryWinterbottom-NOAA I do not currently have the bandwidth to test this right now. I'd be happy to look at output from your tests on develop compared with your branch if you'd like help confirming it does not break anything.

HenryRWinterbottom commented 9 months ago

OK, thanks. I will follow-up once I have something completed.

WalterKolczynski-NOAA commented 9 months ago

@HenryWinterbottom-NOAA For now, keep writing the file to COM but add the forecast hour so there are no longer collisions. We can determine whether the files need to be kept at all separately at a later date.

JessicaMeixner-NOAA commented 9 months ago

@WalterKolczynski-NOAA why should we not check to see if we can simply remove this file?

WalterKolczynski-NOAA commented 9 months ago

@JessicaMeixner-NOAA You would be the one that would be able to tell us that, but given everything works even with all the collisions, my guess is we can.

JessicaMeixner-NOAA commented 9 months ago

Agreed, which is why I suggested to just remove it first. If the gridded wave output looks the same before and after this line removal we can confidently remove it. If something breaks, then we should look into that. Afterall, if the file is needed, simply adding a forecast hour should also break the step that it's reliant. Therefore I think trying to remove it is the best option. If this does break, then I will work to find someone on the waves team who can address this issue as it likely will require SME. That being said I'm fairly confident this can just simply be removed.

WalterKolczynski-NOAA commented 9 months ago

@HenryWinterbottom-NOAA Keep the removal. Run a test in both your branch and develop and compare the gridded wave files. I think you will need to use test/diff_grib_files.py, as wave files have metadata that will not match in a straight diff.

HenryRWinterbottom commented 9 months ago

@JessicaMeixner-NOAA @WalterKolczynski-NOAA I am trying to finish up this issue.

I ran my control (e.g., develop) on RDHPCS Hera, here: /scratch1/NCEPDEV/da/Henry.Winterbottom/work/GLOBAL_WORKFLOW/EXPDIR/x001_gfsv17_issue_297

I am trying to clarify where any collision could occur. Is the "rundata directory on COM" the following?

/scratch1/NCEPDEV/da/Henry.Winterbottom/work/GLOBAL_WORKFLOW/COMROT/x001_gfsv17_issue_297/gfs.20210323/12/model_data/wave/history

JessicaMeixner-NOAA commented 9 months ago

@HenryWinterbottom-NOAA I did not think about this before, but the line you removed is in ush/wave_grid_interp_sbs.sh for the glo_500 grid that you are using. I think you're going to have to re-run this. You can stick with C48 for the atm but you'll have to run mx025 as the wave grid so that it'll invoke this part of the code.

HenryRWinterbottom commented 9 months ago

@JessicaMeixner-NOAA OK, thank you. Note that I haven't removed the line yet. I just ran an experiment to understand the files that I need to be aware of.

JessicaMeixner-NOAA commented 9 months ago

Right - okay so you'll need to use mx025 then I think you'll see the file that will over-write itself in the COM directory. I do think you're pointing to the right spot, it's just you're not invoking that line when you run with glo_500. FYI I'm headed off soon, but I'd be happy to help look again at anything in the new year.

HenryRWinterbottom commented 9 months ago

Sounds good. I will work on this over the break. Thank you for the assistance.

Happy holidays and we'll speak in 2024.

JessicaMeixner-NOAA commented 9 months ago

@HenryWinterbottom-NOAA just wanted to check in on this issue after the break to see if you needed anything from me.

HenryRWinterbottom commented 9 months ago

@JessicaMeixner-NOAA Thank you for following up, and Happy New Year.

I was having issues running the following experiment on RDHPCS Hera:

./setup_expt.py gfs forecast-only --idate 2021070100 --edate 2021070200 --app S2SW --resdet 384 --pslot x003_gfsv17_issue_1252 --expdir /scratch1/NCEPDEV/da/Henry.Winterbottom/work/GLOBAL_WORKFLOW/EXPDIR --comrot /scratch1/NCEPDEV/da/Henry.Winterbottom/work/GLOBAL_WORKFLOW/COMROT

The above configuration does not find the model input files for the forecast.

What configuration are you (or would you be) using to test this issue/PR if you were running on RDHPCS Hera?

JessicaMeixner-NOAA commented 9 months ago

@HenryWinterbottom-NOAA the IDATE that I know has ICs is 2016070100 for C384. However, to test the wave model you need to use mx025 which is the default for 768. C384 default wave model will also not test this configuration. You can run the wave model without an IC, so as long as you force the model past the errors in the ic_staging script and you have all the other ICs set up you could use whatever resolution, such as C48 for your testing + mx025 for the wave model.

HenryRWinterbottom commented 9 months ago

@JessicaMeixner-NOAA I am looking the paths beneath /scratch1/NCEPDEV/global/glopara/data/ICSDIR on RDHPCS Hera and I don't see the ICs.

JessicaMeixner-NOAA commented 9 months ago

Look under /scratch1/NCEPDEV/global/glopara/data/ICSDIR/prototype_ICs/ I ran a case successfully (not for mx025 for the wave grid though) before I went on leave. Output of that is here: /scratch1/NCEPDEV/climate/Jessica.Meixner/HR3/updatemodel/upt02

HenryRWinterbottom commented 9 months ago

@JessicaMeixner-NOAA OK, with some work I ran a baseline and experimental configuration.

This experiment I ran with develop (baseline): /scratch1/NCEPDEV/da/Henry.Winterbottom/work/GLOBAL_WORKFLOW/COMROT/x001_develop/gfs.20210323/12/model_data/wave/

The following experiment, /scratch1/NCEPDEV/da/Henry.Winterbottom/work/GLOBAL_WORKFLOW/COMROT/x005_gfsv17_issue_297/gfs.20210323/12/model_data/wave/ is where the following has been modified:

https://github.com/NOAA-EMC/global-workflow/compare/develop...HenryWinterbottom-NOAA:global-workflow:bug/gfsv17_issue_297?expand=1#diff-cd3125d2e4585d37b4587f53d4d000e1aacc82c54ade1289afde822d1c471a70

Is this the expected result?

JessicaMeixner-NOAA commented 9 months ago

@HenryWinterbottom-NOAA you have used the grid glo_500 not mx025 which is needed to see the effects of this code change. A new baseline and updated experiment is needed with the other wave grid.

aerorahul commented 9 months ago

@HenryWinterbottom-NOAA you probably need a lower resolution configuration to test and fix the problem in this issue. Debugging w/ C768 mx025 will be riddled with issues unrelated to those in this issue.

JessicaMeixner-NOAA commented 9 months ago

@aerorahul I agree, C768 is un-necessary, however mx025 will be required for the wave model given the current options as the glo_ grids that are used as the default options for lower C resolutions do not excersize the part of code that @HenryWinterbottom-NOAA is trying to test. There might not exist an IC for mx025 for the same dates as a C48 option, but the forecast can run without a wave IC.

JessicaMeixner-NOAA commented 3 months ago

Making a comment here that after PR https://github.com/NOAA-EMC/global-workflow/pull/2643 we should also investigate why we need COMOUT_WAVE_PREP in jobs/JGLOBAL_WAVE_POST_SBS these are likely only needed because of issues here and can likely be cleaned out when this is taken care of.

edited to remove comout_wave_grd which is the gridded output directory and needed.