Opened by rmholmes, 1 year ago
Note: It doesn't look like this error will have affected the JRA-55 forced runs, because these don't seem to use netcdf packing (no `add_offset` or `scale_factor` in the forcing files).
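For context, netCDF packing stores each value as a small integer and reconstructs it on read as `scale_factor * packed + add_offset`. A minimal Python sketch of the round trip (plain Python, illustrative numbers only):

```python
# netCDF short-integer packing, per the Unidata attribute conventions:
# pack:   nint((x - add_offset) / scale_factor)
# unpack: scale_factor * packed + add_offset

def pack(x, scale_factor, add_offset):
    return round((x - add_offset) / scale_factor)

def unpack(p, scale_factor, add_offset):
    return scale_factor * p + add_offset

scale_factor, add_offset = 0.01, 100.0
x = 103.57
p = pack(x, scale_factor, add_offset)
# round-trip error is bounded by half the scale_factor
assert abs(unpack(p, scale_factor, add_offset) - x) <= scale_factor / 2
```

If `scale_factor` or `add_offset` differ between files (as in the monthly ERA-5 files below), the same packed integer unpacks to a different physical value in each file.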
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
Where is your experiment directory located @rmholmes?
`/home/561/rmh561/access-om2/1deg_era5_iaf` for the run with the issue and `/home/561/rmh561/access-om2/1deg_era5_iaf_perror` for the altered packing run.
Note this comment about potentially dropping this project (https://github.com/COSIMA/access-om2/issues/242#issuecomment-1409563618), although I'd still love to know what the issue is!
Even when this is working for MOM6, that will likely be through CMEPS or some other NUOPC framework, so this problem will still need to be fixed.
For future reference, the code for this is in a separate branch of libaccessom2. The specific lines where the `scale_factor` and `offset` are applied are here:
https://github.com/COSIMA/libaccessom2/blob/242-era5-support/libutil/src/util.F90#L96-L103
I'd start with adding some logging to make sure those values are as expected.
It is strange that the first time it is read the values are bad, then they're ok.
The other place to look is the `ncvar_read_data` routine, which has the caching Nic added:
https://github.com/COSIMA/libaccessom2/blob/242-era5-support/libforcing/src/ncvar.F90#L303-L343
An off-by-one index error is a candidate, but I can't see it having quite the catastrophic effects seen in the model.
I've had a close look at subroutine `read_data`
https://github.com/COSIMA/libaccessom2/blob/70e2c46800fd9/libutil/src/util.F90#L46-L107
and `ncvar_read_data`
https://github.com/COSIMA/libaccessom2/blob/242-era5-support/libforcing/src/ncvar.F90#L303-L343
and didn't see anything amiss. But maybe somebody with sharper eyes would do better.
The order of operations in unpacking the data is also correct - see https://docs.unidata.ucar.edu/netcdf-c/current/attribute_conventions.html
I started adding some logging the other day but didn't get far. Hopefully I can find some time this week.
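The unpacking order mentioned above can be sanity-checked numerically; a quick Python sketch with a hypothetical packed value (the scale is applied before the offset, per the linked convention):

```python
# Per the netCDF attribute conventions, unpacking is
# scale_factor * packed + add_offset; adding the offset before
# scaling gives a very different (wrong) value.
scale_factor, add_offset = 0.001, 250.0
packed = 1234

correct = scale_factor * packed + add_offset   # scale, then offset
wrong = scale_factor * (packed + add_offset)   # offset, then scale
assert abs(correct - 251.234) < 1e-6
assert abs(wrong - 1.484) < 1e-6
```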
It looked good to me too. I think there may need to be some testing with artificial fields to see what's happening.
Please go ahead and do some testing if you can. I'm sure you can both make progress more quickly than I can.
Silly question - will these
https://github.com/COSIMA/libaccessom2/blob/70e2c46800fd949d44da7ef85c1866aa0d59e879/libutil/src/util.F90#L96
https://github.com/COSIMA/libaccessom2/blob/70e2c46800fd949d44da7ef85c1866aa0d59e879/libutil/src/util.F90#L100
return double precision? In C and f77 there are `nc_get_att_double` and `nf_get_att_double`, but AFAICS `nf90_get_att_double` does not exist, so I guess types are handled automatically by `nf90_get_att`...?
My worry is that there is a chance the compiler could perform the operations out of order, since the operations sit inside separate if/endif constructs and it might not understand the intent. The operation should be performed as `dataout = scale_factor * dataout + offset` in a single step anyway, not separately. It's also more efficient to do it in one go.
I also detest the fact that `scale_factor` and `offset` are never initialised to 1 and 0 respectively. Something like:

```fortran
scale_factor = 1
offset = 0
status1 = nf90_get_att(,,,scale_factor)
status2 = nf90_get_att(,,,offset)
if (status1 == nf90_noerr .or. status2 == nf90_noerr) dataout = scale_factor*dataout + offset
```
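The same initialise-to-identity idea, sketched in Python (a dict stands in for the netCDF attribute lookup; names are illustrative): when the packing attributes are absent, the defaults make the combined expression a no-op.

```python
# Sketch of the defaults-plus-single-expression pattern: initialise
# scale_factor/add_offset to identity values so that files without
# packing attributes pass through unchanged.

def unpack(data, attrs):
    scale_factor = attrs.get("scale_factor", 1.0)  # default: identity
    add_offset = attrs.get("add_offset", 0.0)
    return [scale_factor * x + add_offset for x in data]

packed = [100, 200, 300]
assert unpack(packed, {}) == [100, 200, 300]  # no attributes: unchanged
assert unpack(packed, {"scale_factor": 0.5, "add_offset": 10.0}) == [60.0, 110.0, 160.0]
```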
Ah yes that looks a lot safer and more efficient
I'll give it a go.
No luck unfortunately:
I'm putting in a PR anyway.
Any luck with the (latest) changed version? I wouldn't expect so but there's a chance some logic may be fixed.
I'll try it. I doubt there'll be any change.
Nope, no change.
ACCESS-NRI can spare someone to look at it in a couple of weeks if no-one else has time to check it out.
That would be great!
@rmholmes Try turning off the data caching for the forcing. Set `forcing_field_time_cache_size = 1` in `atm.nml` rather than 96. It will run slower (maybe) but may give us a clue.
Performance is extremely slow like this, but I should be able to get a result by the end of the day.
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
https://forum.access-hive.org.au/t/cosima-meeting-minutes-2023/407/3
No this has not changed things. The error is still there, suggesting that caching is not the issue (or at least the size of the cache).
Note that the performance was really bad - 5.5 hour wall time for a 4 month run, rather than 20 minutes.
@rmholmes Damn. I was about to suggest a cache size of 24 so there would be no wrapping across files or resizing of the cache, which might perform OK. No point now, I think.
Did the RYF ERA5 runs use a single file, which is why this bug wasn't found in that case?
> Did the RYF ERA5 runs use a single file, which is why this bug wasn't found in that case?
Correct.
Shouldn't this line be a `NINT` rather than `INT`? https://github.com/COSIMA/libaccessom2/blob/242-era5-support/libforcing/src/ncvar.F90#L151
Truncation could produce odd results.
Also, what if there is only one time slab in the file? You have a bounds error. This may be checked, but I can't find it.
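The truncation hazard is easy to demonstrate. A Python sketch with an artificial near-integral index (`math.trunc` plays the role of Fortran's `INT`, and `round` plays `NINT`):

```python
import math

# A floating-point time index often lands just below the intended
# integer; truncation (Fortran INT) then lands on the previous index,
# while round-to-nearest (Fortran NINT) recovers the right one.
computed_index = 24 * (1.0 - 1e-12)   # "should" be 24

assert math.trunc(computed_index) == 23  # INT-style truncation: off by one
assert round(computed_index) == 24       # NINT-style rounding: correct
```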
hm, Nic fixed previous rounding error bugs: https://github.com/COSIMA/libaccessom2/commit/97be20edca39ecc14f8d15bdbd6c8f065c01eed2 Maybe this is another one?
So, should I fix it and try again?
sure, just to see if it helps... not sure if it would fix this particular issue though
It didn't help. I made a PR anyway.
Ok, I've found a potential problem. In `ncvar_read_data`, if `self%data_cache` is unallocated (it isn't initialised) but `left_to_read = size(self%times) - indx` is zero, an attempt to read a zero-size `count` array is made. This is legal but silently does nothing, yet the `scale_factor` and `add_offset` are updated as expected, which could lead to the problem. Can you put a check in there to make sure it doesn't happen?
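To make that failure mode concrete, here is a hypothetical Python sketch (names and numbers invented): the zero-size read leaves the previous month's packed values in the buffer, and the new month's packing attributes are then applied to them.

```python
# Hypothetical sketch: a zero-size read is legal, silently copies
# nothing, and leaves stale packed values in the buffer; the new
# file's scale_factor/add_offset are then applied to stale data.

def read_slab(buffer, new_values):
    buffer[:len(new_values)] = new_values  # count == 0 -> no-op
    return buffer

cache = [1000, 1000]          # packed values left over from March
read_slab(cache, [])          # zero-size read at the file boundary

scale_factor, add_offset = 0.002, -50.0   # April's (different) attributes
unpacked = [scale_factor * p + add_offset for p in cache]
assert unpacked == [-48.0, -48.0]  # March data unpacked with April's attributes
```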
Thanks @russfiedler, that's a subtle one. I agree that if `read_data` is called with `self%cur_time_cache_size=0` here
https://github.com/COSIMA/libaccessom2/blob/17f27949fd/libforcing/src/ncvar.F90#L334-L335
then we might scale and offset `dataout` here
https://github.com/COSIMA/libaccessom2/blob/17f27949fd/libutil/src/util.F90#L96-L102
but is that a problem, given that `dataout` would be empty?
Oh hang on, if `self%cur_time_cache_size=0`, won't this return data from the previous cache, rather than no data at all?
https://github.com/COSIMA/libaccessom2/blob/17f27949fd/libforcing/src/ncvar.F90#L341
Maybe we need this here: `call assert(self%cur_time_cache_size > 0, 'Cache is empty')`
Happy to try this if you think it'll work. I think I agree that something is fishy here (shouldn't this code not even be called if `left_to_read=0`?).
It would be interesting to give it a whirl. This is just a safety mechanism to abort if `self%cur_time_cache_size` is zero.
If it doesn't abort, then we can rule out this issue as causing the problem and will need to look elsewhere.
If it aborts, we'd need to work out how to handle that case more gracefully, e.g. (as you suggest) not calling `ncvar_read_data` when `size(self%times) <= indx`.
Yes, that assert statement crashes the model. It does so at the end of Jan/start of Feb (the first time the forcing file swaps over). There isn't a big burst in the wind stress then in the normal run, but that's probably just because the changes in the `scale_factor` and `add_offset` aren't big. So looking promising!
So how do we fix it? As far as I can see, we shouldn't even be trying to read in data in this case. We could put an `if` around the `read_data` call, but then the function will just return the empty, unallocated `self%data_cache` in `dataout`?
Great! I need to look at the code more carefully, but this seems like an off-by-one kind of issue - i.e. does `size(self%times) <= indx` indicate that ~~the cache was emptied in the previous call and should have been filled with new data then?~~ we got to the end of the file in the previous call and need to begin reading the next one instead?
Maybe `ncvar_read_data` should return a logical flag `EOF`, and if this is true the calling code should update the field's file via the `update` method?
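A sketch of that suggested interface in Python (all names hypothetical): the read routine reports end-of-file instead of silently doing a zero-size read, and the caller responds by switching files.

```python
# Hypothetical sketch of the proposed interface: an
# ncvar_read_data-style routine returns an eof flag when indx is past
# the end of the current file, and the caller moves to the next file.

def read_cache(times, indx):
    if indx >= len(times):
        return [], True      # nothing left in this file: signal EOF
    return list(times[indx:]), False

march, april = [1, 2, 3], [4, 5, 6]

data, eof = read_cache(march, 3)      # one past the last March record
if eof:
    data, eof = read_cache(april, 0)  # caller "updates" to the next file
assert (data, eof) == ([4, 5, 6], False)
```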
To confirm this was the issue, I quickly just did:

```fortran
if (self%cur_time_cache_size > 0) then
    call read_data(self%ncid, self%varid, self%name, indx, &
        self%cur_time_cache_size, self%data_cache)
endif
```

in `ncvar_read_data` and it works, no spike in wind stress!
I don't actually think this is a bad solution. But I guess the function is returning an unallocated `dataout` (which might not actually be used). Let me know what you think would be the best solution.
Thanks @russfiedler for finding this!
@rmholmes The problem still remains as to why the bad indexing occurs. I'm pretty sure it has been affecting all runs where multiple files have been used for forcing and not just ERA5. The logic is quite convoluted and the code hard to follow.
I agree. I've been staring at it for a while today and haven't really been able to figure out how it is working.
Does this bug affect runs that don't use the libaccessom2 cache?
I don't think so, but I'll check.
So I ran another simulation with the `call assert(self%cur_time_cache_size > 0, 'Cache is empty')` present but with `forcing_field_time_cache_size = 1`. The `assert` was triggered at the end of the first month. So this suggests to me that it could still be a problem even if we're not using the libaccessom2 cache (i.e. the code is still trying to read in data with a time length of zero).
Thanks, that's helpful to know.
If I understand correctly, `forcing_field_time_cache_size = 1` still uses the new cache code, just with a pointlessly small cache. The caching code was created in the `242-era5-support` branch, which fortunately hasn't yet been merged into master, so none of the executables in `/g/data/ik11/inputs/access-om2/bin` use it.
So the question is, does this error exist in the master branch (non-caching) code? I'm hoping not, so all our existing runs are unaffected...
Ah right. I forgot all this code is new to the ERA-5 effort. As far as I can tell from the `master` branch code, you'll never get a `count=0` in the `read_data` call. `ncvar_read_data` is just a wrapper on `read_data`, and `read_data` always uses `count=1`:
https://github.com/COSIMA/libaccessom2/blob/d750b4bfdc58c59490985c682c1b4c56cc1016b1/libutil/src/util.F90#L61-L80
So that suggests that you can't read in rubbish data in the old runs. However, I guess it's still possible that the indexing could be off by one, shifting everything in time by one element.
Best practice would be to write a test for this that fails with the old code and passes when the fix is applied.
As described on the ERA-5 forcing issue I think libaccessom2 may have an issue dealing with netcdf unpacking across file boundaries. I'll summarize the problem here.
The problem occurs when transitioning between two months (the ERA-5 forcing is stored in monthly files), best demonstrated by plotting daily minimum wind stress at 92W, 0N from a `1deg_era5_iaf` run spanning 1980-01-01 to 1980-05-01:

[plot omitted]

There is a large burst of negative wind stress in the first day of April in the "raw" run (this causes all sorts of crazy stuff...). The `add_offset` netcdf packing value in the ERA-5 10m zonal winds file is particularly anomalous for March of this year (listed below per month of the files in `/g/data/rt52/era5/single-levels/reanalysis/10u/1980/`).
If I change the netcdf packing values in the single March 1980 10m winds file (using the below python) and rerun, then I remove the burst of wind stress ("Altered packing" run above). This confirms to me that it is a packing issue.
Yes, the packing in the ERA-5 files is weird. But in any case, `libaccessom2` should be able to deal with the variable packing. Xarray in python can, as shown by this plot of the time series of 10m zonal wind at the same point from the original file:

[plot omitted]

I've had a quick look through the code and am none the wiser. As @aekiss said, the netcdf unpacking seems to be handled by the netcdf library, so I don't understand how there can be a problem. Clearly it only affects the times between months when an interpolation has to be done. The rest of the month is fine.