Opened by rmholmes, 1 year ago
Note: It doesn't look like this error will have affected the JRA-55 forced runs, because these don't seem to use netcdf packing (no `add_offset` or `scale_factor` in the forcing files).
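For context, netCDF packing stores each value as a small integer and reconstructs it on read as `scale_factor * packed + add_offset`. A minimal Python sketch of the round trip (plain Python, illustrative numbers only):

```python
# netCDF short-integer packing, per the Unidata attribute conventions:
# pack:   nint((x - add_offset) / scale_factor)
# unpack: scale_factor * packed + add_offset

def pack(x, scale_factor, add_offset):
    return round((x - add_offset) / scale_factor)

def unpack(p, scale_factor, add_offset):
    return scale_factor * p + add_offset

scale_factor, add_offset = 0.01, 100.0
x = 103.57
p = pack(x, scale_factor, add_offset)
# round-trip error is bounded by half the scale_factor
assert abs(unpack(p, scale_factor, add_offset) - x) <= scale_factor / 2
```

If `scale_factor` or `add_offset` differ between files (as in the monthly ERA-5 files below), the same packed integer unpacks to a different physical value in each file.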
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
Where is your experiment directory located @rmholmes?
`/home/561/rmh561/access-om2/1deg_era5_iaf` for the run with the issue and `/home/561/rmh561/access-om2/1deg_era5_iaf_perror` for the altered packing run.
Note this comment about potentially dropping this project (https://github.com/COSIMA/access-om2/issues/242#issuecomment-1409563618), although I'd still love to know what the issue is!
Even when this is working for MOM6, that will likely be through CMEPS or some other NUOPC framework, so this problem will still need to be fixed.
For future reference, the code for this is in a separate branch of libaccessom2. The specific lines where the `scale_factor` and `offset` are applied are here:
https://github.com/COSIMA/libaccessom2/blob/242-era5-support/libutil/src/util.F90#L96-L103
I'd start with adding some logging to make sure those values are as expected.
It is strange that the first time it is read the values are bad, then they're ok.
The other place to look is the `ncvar_read_data` routine, which has the caching Nic added:
https://github.com/COSIMA/libaccessom2/blob/242-era5-support/libforcing/src/ncvar.F90#L303-L343
An off-by-one index error is a candidate, but I can't see it having quite the catastrophic effects seen in the model.
I've had a close look at subroutine `read_data`
https://github.com/COSIMA/libaccessom2/blob/70e2c46800fd9/libutil/src/util.F90#L46-L107
and `ncvar_read_data`
https://github.com/COSIMA/libaccessom2/blob/242-era5-support/libforcing/src/ncvar.F90#L303-L343
and didn't see anything amiss. But maybe somebody with sharper eyes would do better.
The order of operations in unpacking the data is also correct - see https://docs.unidata.ucar.edu/netcdf-c/current/attribute_conventions.html
I started adding some logging the other day but didn't get far. Hopefully I can find some time this week.
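The unpacking order mentioned above can be sanity-checked numerically; a quick Python sketch with a hypothetical packed value (the scale is applied before the offset, per the linked convention):

```python
# Per the netCDF attribute conventions, unpacking is
# scale_factor * packed + add_offset; adding the offset before
# scaling gives a very different (wrong) value.
scale_factor, add_offset = 0.001, 250.0
packed = 1234

correct = scale_factor * packed + add_offset   # scale, then offset
wrong = scale_factor * (packed + add_offset)   # offset, then scale
assert abs(correct - 251.234) < 1e-6
assert abs(wrong - 1.484) < 1e-6
```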
It looked good to me too. I think there may need to be some testing with artificial fields to see what's happening.
Please go ahead and do some testing if you can. I'm sure you can both make progress more quickly than I can.
Silly question - will these
https://github.com/COSIMA/libaccessom2/blob/70e2c46800fd949d44da7ef85c1866aa0d59e879/libutil/src/util.F90#L96
https://github.com/COSIMA/libaccessom2/blob/70e2c46800fd949d44da7ef85c1866aa0d59e879/libutil/src/util.F90#L100
return double precision? In C and f77 there are `nc_get_att_double` and `nf_get_att_double`, but AFAICS `nf90_get_att_double` does not exist, so I guess types are handled automatically by `nf90_get_att`...?
My worry is that there is a chance the compiler could perform the operations out of order, since the operations sit inside separate if/endif constructs and it might not understand the intent. The operation should be performed as `dataout = scale_factor * dataout + offset` in a single step anyway, not separately. It's also more efficient to do it in one go.
I also detest the fact that `scale_factor` and `offset` are never initialised to 1 and 0 respectively. Something like:

```fortran
scale_factor = 1
offset = 0
status1 = nf90_get_att(,,,scale_factor)
status2 = nf90_get_att(,,,offset)
if (status1 == nf90_noerr .or. status2 == nf90_noerr) dataout = scale_factor*dataout + offset
```
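The same initialise-to-identity idea, sketched in Python (a dict stands in for the netCDF attribute lookup; names are illustrative): when the packing attributes are absent, the defaults make the combined expression a no-op.

```python
# Sketch of the defaults-plus-single-expression pattern: initialise
# scale_factor/add_offset to identity values so that files without
# packing attributes pass through unchanged.

def unpack(data, attrs):
    scale_factor = attrs.get("scale_factor", 1.0)  # default: identity
    add_offset = attrs.get("add_offset", 0.0)
    return [scale_factor * x + add_offset for x in data]

packed = [100, 200, 300]
assert unpack(packed, {}) == [100, 200, 300]  # no attributes: unchanged
assert unpack(packed, {"scale_factor": 0.5, "add_offset": 10.0}) == [60.0, 110.0, 160.0]
```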
Ah yes that looks a lot safer and more efficient
I'll give it a go.
No luck unfortunately:
I'm putting in a PR anyway.
Any luck with the (latest) changed version? I wouldn't expect so but there's a chance some logic may be fixed.
I'll try it. I doubt there'll be any change.
Nope, no change.
ACCESS-NRI can spare someone to look at it in a couple of weeks if no-one else has time to check it out.
That would be great!
@rmholmes Try turning off the data caching for the forcing. Set `forcing_field_time_cache_size = 1` in `atm.nml` rather than 96. It will run slower (maybe) but may give us a clue.
Performance is extremely slow like this, but I should be able to get a result by the end of the day.
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
https://forum.access-hive.org.au/t/cosima-meeting-minutes-2023/407/3
No this has not changed things. The error is still there, suggesting that caching is not the issue (or at least the size of the cache).
Note that the performance was really bad - 5.5 hour wall time for a 4 month run, rather than 20 minutes.
@rmholmes Damn. I was about to suggest a cache size of 24 so there would be no wrapping across files or resizing of the cache, which might perform OK. No point now, I think.
Did the RYF ERA5 runs use a single file, which is why this bug wasn't found in that case?
> Did the RYF ERA5 runs use a single file, which is why this bug wasn't found in that case?
Correct.
Shouldn't this line be a `NINT` rather than `INT`? https://github.com/COSIMA/libaccessom2/blob/242-era5-support/libforcing/src/ncvar.F90#L151
Truncation could produce odd results.
Also, what if there is only one time slab in the file? You have a bounds error. This may be checked, but I can't find it.
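The truncation hazard is easy to demonstrate. A Python sketch with an artificial near-integral index (`math.trunc` plays the role of Fortran's `INT`, and `round` plays `NINT`):

```python
import math

# A floating-point time index often lands just below the intended
# integer; truncation (Fortran INT) then lands on the previous index,
# while round-to-nearest (Fortran NINT) recovers the right one.
computed_index = 24 * (1.0 - 1e-12)   # "should" be 24

assert math.trunc(computed_index) == 23  # INT-style truncation: off by one
assert round(computed_index) == 24       # NINT-style rounding: correct
```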
hm, Nic fixed previous rounding error bugs: https://github.com/COSIMA/libaccessom2/commit/97be20edca39ecc14f8d15bdbd6c8f065c01eed2 Maybe this is another one?
So, should I fix it and try again?
sure, just to see if it helps... not sure if it would fix this particular issue though
It didn't help. I made a PR anyway.
Ok, I've found a potential problem. In `ncvar_read_data`, if `self%data_cache` is unallocated (it isn't initialised) but `left_to_read = size(self%times) - indx` is zero, an attempt to read a zero-size `count` array is made. This is legal but silently does nothing, yet the `scale_factor` and `add_offset` are updated as expected, which could lead to the problem. Can you put a check in there to make sure it doesn't happen?
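To make that failure mode concrete, here is a hypothetical Python sketch (names and numbers invented): the zero-size read leaves the previous month's packed values in the buffer, and the new month's packing attributes are then applied to them.

```python
# Hypothetical sketch: a zero-size read is legal, silently copies
# nothing, and leaves stale packed values in the buffer; the new
# file's scale_factor/add_offset are then applied to stale data.

def read_slab(buffer, new_values):
    buffer[:len(new_values)] = new_values  # count == 0 -> no-op
    return buffer

cache = [1000, 1000]          # packed values left over from March
read_slab(cache, [])          # zero-size read at the file boundary

scale_factor, add_offset = 0.002, -50.0   # April's (different) attributes
unpacked = [scale_factor * p + add_offset for p in cache]
assert unpacked == [-48.0, -48.0]  # March data unpacked with April's attributes
```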
Thanks @russfiedler, that's a subtle one. I agree that if `read_data` is called with `self%cur_time_cache_size=0` here
https://github.com/COSIMA/libaccessom2/blob/17f27949fd/libforcing/src/ncvar.F90#L334-L335
then we might scale and offset `dataout` here
https://github.com/COSIMA/libaccessom2/blob/17f27949fd/libutil/src/util.F90#L96-L102
but is that a problem, given that `dataout` would be empty?
Oh hang on, if `self%cur_time_cache_size=0`, won't this return data from the previous cache, rather than no data at all?
https://github.com/COSIMA/libaccessom2/blob/17f27949fd/libforcing/src/ncvar.F90#L341
Maybe we need this here: `call assert(self%cur_time_cache_size > 0, 'Cache is empty')`
Happy to try this if you think it'll work. I think I agree that something is fishy here (shouldn't this code not even be called if `left_to_read=0`?).
It would be interesting to give it a whirl. This is just a safety mechanism to abort if `self%cur_time_cache_size` is zero.
If it doesn't abort, then we can rule out this issue as causing the problem and will need to look elsewhere.
If it aborts, we'd need to work out how to handle that case more gracefully, e.g. (as you suggest) not calling `ncvar_read_data` when `size(self%times) <= indx`.
Yes, that assert statement crashes the model. It does so at the end of Jan/start of Feb (the first time the forcing file swaps over). There isn't a big burst in the wind stress then in the normal run, but that's probably just because the changes in the `scale_factor` and `add_offset` aren't big. So looking promising!
So how do we fix it? As far as I can see, we shouldn't even be trying to read in data in this case. We could put an `if` around the `read_data` call, but then the function will just return the empty, unallocated `self%data_cache` in `dataout`?
Great! I need to look at the code more carefully, but this seems like an off-by-one kind of issue - i.e. does `size(self%times) <= indx` indicate that ~~the cache was emptied in the previous call and should have been filled with new data then?~~ we got to the end of the file in the previous call and need to begin reading the next one instead?
Maybe `ncvar_read_data` should return a logical flag `EOF`, and if this is true the calling code should update the field's file via the `update` method?
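A sketch of that suggested interface in Python (all names hypothetical): the read routine reports end-of-file instead of silently doing a zero-size read, and the caller responds by switching files.

```python
# Hypothetical sketch of the proposed interface: an
# ncvar_read_data-style routine returns an eof flag when indx is past
# the end of the current file, and the caller moves to the next file.

def read_cache(times, indx):
    if indx >= len(times):
        return [], True      # nothing left in this file: signal EOF
    return list(times[indx:]), False

march, april = [1, 2, 3], [4, 5, 6]

data, eof = read_cache(march, 3)      # one past the last March record
if eof:
    data, eof = read_cache(april, 0)  # caller "updates" to the next file
assert (data, eof) == ([4, 5, 6], False)
```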
To confirm this was the issue, I quickly just did:

```fortran
if (self%cur_time_cache_size > 0) then
    call read_data(self%ncid, self%varid, self%name, indx, &
        self%cur_time_cache_size, self%data_cache)
endif
```

in `ncvar_read_data` and it works, no spike in wind stress!
I don't actually think this is a bad solution. But I guess the function is returning an unallocated `dataout` (which might not actually be used). Let me know what you think would be the best solution.
Thanks @russfiedler for finding this!
@rmholmes The problem still remains as to why the bad indexing occurs. I'm pretty sure it has been affecting all runs where multiple files have been used for forcing and not just ERA5. The logic is quite convoluted and the code hard to follow.
I agree. I've been staring at it for a while today and haven't really been able to figure out how it is working.
Does this bug affect runs that don't use the libaccessom2 cache?
I don't think so, but I'll check.
So I ran another simulation with the `call assert(self%cur_time_cache_size > 0, 'Cache is empty')` present but with `forcing_field_time_cache_size = 1`. The `assert` was triggered at the end of the first month. So this suggests to me that it could still be a problem even if we're not using the libaccessom2 cache (i.e. the code is still trying to read in data with a time length of zero).
Thanks, that's helpful to know.
If I understand correctly, `forcing_field_time_cache_size = 1` still uses the new cache code, just with a pointlessly small cache. The caching code was created in the `242-era5-support` branch, which fortunately hasn't yet been merged into master, so none of the executables in `/g/data/ik11/inputs/access-om2/bin` use it.
So the question is, does this error exist in the master branch (non-caching) code? I'm hoping not, so all our existing runs are unaffected...
Ah right. I forgot all this code is new to the ERA-5 effort. As far as I can tell from the `master` branch code, you'll never get a `count=0` in the `read_data` call. `ncvar_read_data` is just a wrapper on `read_data`, and `read_data` always uses `count=1`:
https://github.com/COSIMA/libaccessom2/blob/d750b4bfdc58c59490985c682c1b4c56cc1016b1/libutil/src/util.F90#L61-L80
So that suggests that you can't read in rubbish data in the old runs. However, I guess it's still possible that the indexing could be off by one, shifting everything in time by one element.
Best practice would be to write a test for this that fails with the old code and passes when the fix is applied.
As described on the ERA-5 forcing issue I think libaccessom2 may have an issue dealing with netcdf unpacking across file boundaries. I'll summarize the problem here.
The problem occurs when transitioning between two months (the ERA-5 forcing is stored in monthly files), best demonstrated by plotting daily minimum wind stress at 92W, 0N from a `1deg_era5_iaf` run spanning 1980-01-01 to 1980-05-01:

[plot omitted]

There is a large burst of negative wind stress in the first day of April in the "raw" run (this causes all sorts of crazy stuff...). The `add_offset` netcdf packing value in the ERA-5 10m zonal winds file is particularly anomalous for March of this year (listed below per month of the files in `/g/data/rt52/era5/single-levels/reanalysis/10u/1980/`).
If I change the netcdf packing values in the single March 1980 10m winds file (using the below python) and rerun, then I remove the burst of wind stress ("Altered packing" run above). This confirms to me that it is a packing issue.
Yes, the packing in the ERA-5 files is weird. But in any case, `libaccessom2` should be able to deal with the variable packing. Xarray in python can, as shown by this plot of the time series of 10m zonal wind at the same point from the original file:

[plot omitted]

I've had a quick look through the code and am none the wiser. As @aekiss said, the netcdf unpacking seems to be handled by the netcdf library, so I don't understand how there can be a problem. Clearly it only affects the times between months when an interpolation has to be done. The rest of the month is fine.