COSIMA / libaccessom2

ACCESS-OM2 library

libaccessom2 doesn't deal with netcdf unpacking properly (we think...) in ERA-5 runs #78

Open · rmholmes opened this issue 2 years ago

rmholmes commented 2 years ago

As described on the ERA-5 forcing issue, I think libaccessom2 may have an issue handling netcdf unpacking across file boundaries. I'll summarize the problem here.

The problem occurs when transitioning between two months (the ERA-5 forcing is stored in monthly files), best demonstrated by plotting daily minimum wind stress at 92W, 0N from a 1deg_era5_iaf run spanning 1980-01-01 to 1980-05-01:

[Figure: daily minimum wind stress at 92W, 0N from the 1deg_era5_iaf run, 1980-01-01 to 1980-05-01, for the "raw" and "Altered packing" runs]

There is a large burst of negative wind stress on the first day of April in the "raw" run (this causes all sorts of crazy stuff...). The add_offset netcdf packing value in the ERA-5 10m zonal winds file is particularly anomalous for March of this year (listed below per month from the files in /g/data/rt52/era5/single-levels/reanalysis/10u/1980/):

                u10:add_offset = -3.54318567240813 ;
                u10:add_offset = 0.856332909292688 ;
                u10:add_offset = -32.1081480318141 ;
                u10:add_offset = -0.761652898754254 ;
                u10:add_offset = -0.10650583633675 ;
                u10:add_offset = -2.55211599669929 ;
...
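
For reference, the full set of per-month packing attributes can be listed with something along these lines (a sketch: the glob pattern and the use of decode_cf=False are assumptions, not from the original post):

import glob
import xarray as xr

# Print the raw int16 packing attributes of each monthly 10u file for 1980
for path in sorted(glob.glob('/g/data/rt52/era5/single-levels/reanalysis/10u/1980/*.nc')):
    attrs = xr.open_dataset(path, decode_cf=False)['u10'].attrs
    print(path.split('/')[-1], attrs['add_offset'], attrs['scale_factor'])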

If I change the netcdf packing values in the single March 1980 10m winds file (using the Python below) and rerun, the burst of wind stress is removed (the "Altered packing" run above). This confirms to me that it is a packing issue.

import xarray as xr

# Re-encode the March 1980 file with the April 1980 packing parameters
file_in = '/g/data/rt52/era5/single-levels/reanalysis/10u/1980/10u_era5_oper_sfc_19800301-19800331.nc'
file_out = '/g/data/e14/rmh561/access-om2/input/ERA-5/IAF/10u/1980/10u_era5_oper_sfc_19800301-19800331.nc'
DS = xr.open_dataset(file_in)
scale = 0.000966930321007164   # Apr 1980 value
offset = -0.761652898754254    # Apr 1980 value
encoding = {'u10': {'scale_factor': scale, 'add_offset': offset, 'dtype': 'int16'}}
DS.to_netcdf(file_out, encoding=encoding)
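
A quick sanity check that the re-encoding only changes the packing, not the physical field, might look like this (a sketch reusing file_in and file_out from above; differences should sit at the int16 quantisation level unless some March values fall outside the range representable with the April packing):

import numpy as np
import xarray as xr

orig = xr.open_dataset(file_in)['u10']   # decoded with the original March packing
new = xr.open_dataset(file_out)['u10']   # decoded with the April packing
print(float(np.abs(orig - new).max()))   # expect something of order scale_factor (~1e-3 m/s)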

Yes, the packing in the ERA-5 files is weird. But in any case, libaccessom2 should be able to deal with the variable packing. Xarray in Python can, as shown by this plot of the 10m zonal wind time series at the same point from the original file:

[Figure: time series of 10m zonal wind at the same point, read with xarray from the original file]
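
The difference between xarray's default behaviour and an undecoded read can be checked with something like this (a sketch; only the file path comes from the issue):

import xarray as xr

path = '/g/data/rt52/era5/single-levels/reanalysis/10u/1980/10u_era5_oper_sfc_19800301-19800331.nc'

decoded = xr.open_dataset(path)['u10']                    # scale_factor/add_offset applied on read
raw = xr.open_dataset(path, mask_and_scale=False)['u10']  # raw packed int16 values

print(decoded.dtype, raw.dtype)   # float vs int16
print(raw.attrs['add_offset'])    # -32.108... as listed above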

I've had a quick look through the code and am none the wiser. As @aekiss said, the netcdf unpacking seems to be handled by the netcdf library, so I don't understand how there can be a problem. Clearly it only affects the times between months when an interpolation has to be done. The rest of the month is fine.
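
For context on what "unpacking" involves: each monthly file stores the field as int16 with its own scale_factor and add_offset, and the physical value is packed * scale_factor + add_offset, so anything straddling a file boundary has to unpack each record with that file's own attributes before interpolating. A toy illustration of how wrong things go if attributes leak across the boundary (the March scale_factor is not listed above, so it is guessed here; both offsets are the values quoted earlier):

import numpy as np

scale_mar, offset_mar = 2.0e-3, -32.1081480318141                 # Mar 1980: offset from above, scale guessed
scale_apr, offset_apr = 0.000966930321007164, -0.761652898754254  # Apr 1980 values from above

u = 5.0  # the same physical wind (m/s) on either side of the boundary

packed_mar = np.int16(round((u - offset_mar) / scale_mar))
packed_apr = np.int16(round((u - offset_apr) / scale_apr))

print(packed_mar * scale_mar + offset_mar)  # ~5.0: March record unpacked with March attributes
print(packed_apr * scale_apr + offset_apr)  # ~5.0: April record unpacked with April attributes
print(packed_mar * scale_apr + offset_apr)  # ~17.2: March record unpacked with April attributes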

rmholmes commented 1 year ago

Yep. The code is quite different though, so it's not obvious what that test should be (until we understand better where the bug is coming from - in which case we probably wouldn't need the test anymore)!

russfiedler commented 1 year ago

@rmholmes Do you get a zero from this debugging log? https://github.com/COSIMA/libaccessom2/blob/242-era5-support/libforcing/src/forcing_field.F90#L165-L169

rmholmes commented 1 year ago

The last few lines of work/atmosphere/log/matm... (before the assert is triggered and the job stops) are:

{ "cur_exp-datetime" :  "1980-01-31T22:00:00" }
{ "cur_forcing-datetime" : "1980-01-31T22:00:00" }
cur_runtime_in_seconds    2671200
{ "forcing_field_update-file" : "INPUT/1980/msdwswrf_era5_oper_sfc_19800101-19800131.nc" }
{ "forcing_field_update-index" :        744 }

So looks reasonable.
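
(For orientation, arithmetic only: the ERA-5 forcing is hourly, so the January file holds 31 x 24 = 744 records, meaning the update index points at the last record in the file, assuming 1-based indexing, and the runtime in seconds matches 1980-01-31T22:00 from a 1980-01-01T00:00 start.)

print(31 * 24)         # 744: hourly records in January, i.e. the last index in the file
print(2671200 / 3600)  # 742.0 hours = 30 days + 22 hours after 1980-01-01T00:00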

russfiedler commented 1 year ago

I'm utterly baffled by what's happening here: https://github.com/COSIMA/libaccessom2/blob/242-era5-support/libcouple/src/accessom2.F90#L528-L530 versus https://github.com/COSIMA/libaccessom2/blob/242-era5-support/libcouple/src/accessom2.F90#L542-L544

Why would you set the forcing date back to the start but the run date to the end? And why the >= rather than > in the second (and maybe the first) block?

rmholmes commented 1 year ago

Is the first one there in order to deal with the RYF forcing (which loops over a specific period of forcing dates)? I agree that it would make more sense if this were > rather than >= (although it shouldn't make a difference).

The second one does not make sense to me. If the experiment date is later than the end date then the experiment should already have ended. However, I don't feel that this could be responsible for our error, given that we're nowhere near the end of the run when the issue with read_data occurs.

aekiss commented 1 year ago

Yes the first one makes it repeat the forcing dataset, however long it might be (RYF, RDF, IAF or whatever). I guess it assumes the first and final forcing times can be identified with one another, hence >= not >.

The second one is more mysterious. I guess it's possible to have self%exp_cur_date > self%run_end_date if the timestep isn't an integer fraction of a day. In most cases setting self%exp_cur_date = self%run_end_date will have no effect (the run terminates either way), but if the experiment date falls on a leap day and the forcing date does not, self%exp_cur_date will be decremented by a day, so the model runs for another day: https://github.com/COSIMA/libaccessom2/blame/17f27949fd3ee554b1a66eb343d1130d7f2632d8/libcouple/src/accessom2.F90#L562 (this was added to resolve https://github.com/COSIMA/access-om2/issues/149)

But I agree with Ryan, I don't think this would cause our error.
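
To spell out what the two blocks appear to do, here is a rough Python paraphrase based on the discussion above rather than on the Fortran itself (the forcing_* names are invented; exp_cur_date and run_end_date mirror the accessom2.F90 members):

from datetime import datetime

def wrap_calendars(forcing_cur_date: datetime, forcing_start_date: datetime,
                   forcing_end_date: datetime, exp_cur_date: datetime,
                   run_end_date: datetime):
    # First block: wrap the forcing calendar back to its start so the forcing
    # dataset repeats (RYF, RDF, IAF or whatever); >= identifies the final
    # forcing time with the first.
    if forcing_cur_date >= forcing_end_date:
        forcing_cur_date = forcing_start_date
    # Second block: clamp the experiment calendar at the end of the run
    # (the leap-day special case discussed above is omitted here).
    if exp_cur_date >= run_end_date:
        exp_cur_date = run_end_date
    return forcing_cur_date, exp_cur_date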

aekiss commented 1 year ago

@rmholmes is this the fix you used for your latest test run?

Can you provide a link to a commit with your fixed libaccessom2 code so we know what to merge once we're happy with it? Ta!

rmholmes commented 1 year ago

Yep, it's on this branch: https://github.com/rmholmes/libaccessom2/tree/78-ERA5-netcdf-packing, this specific commit: https://github.com/rmholmes/libaccessom2/commit/bfa2062c1c6004f6d04e39042168b39fb474013a

aekiss commented 1 year ago

Great, thanks. Do we understand this code well enough that we're sure this fix doesn't introduce other problems?

rmholmes commented 1 year ago

There is no evidence of any issues arising, but I can't say for sure, no. I don't understand the code well enough to know what is going wrong.

However, I think it's clear that what the fix does is prevent the scale_factor and add_offset being applied to the data twice, which was previously resulting in completely crazy values (for just one forcing time step). I speculate that the forcing data for that same forcing time step (in between the two months) may still be incorrect, i.e. it could be applying a copy of the forcing from the previous forcing time step again. However, it seems to me that the impact this kind of error could have on the simulation is very minor.
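
A rough illustration of what applying the packing twice does to an already-unpacked value (Python rather than the Fortran; the March scale_factor is guessed, the add_offset is the value listed at the top of the issue):

scale, offset = 2.0e-3, -32.1081480318141  # Mar 1980 10u: scale guessed, offset from the listing above

u = 5.0                    # a correctly unpacked zonal wind value (m/s)
print(u * scale + offset)  # ~ -32.1 m/s: roughly add_offset, regardless of the true wind

Because the unpacked winds are of order 1-10 m/s and scale_factor is of order 1e-3, a second application essentially replaces the field with add_offset, which for the March 1980 zonal wind file is about -32 m/s; that would be consistent with the burst of strongly negative wind stress at the start of April.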