Closed by @AndyHoggANU 7 years ago.
Thanks @AndyHoggANU. This one looks important, it does look like a bug in MATM. I'll get time to look into it by Thursday.
Good grief. The read_core routine opens (and closes) the forcing file for every read! That is a lot slower, but by how much I haven't tested.
And the rainfall field has to be the eighth field in the data table, as it is hard-coded to be treated differently from the other fields.
Nasty.
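For context, the problematic pattern looks roughly like this (a minimal sketch, not the actual MATM source; the subroutine and variable names are illustrative):

```fortran
! Sketch of the open-per-read pattern (illustrative, not the real MATM code).
subroutine get_forcing_field(fname, vname, field)
    use netcdf
    implicit none
    character(len=*), intent(in) :: fname, vname
    real, intent(out) :: field(:,:)
    integer :: ncid, varid, status

    ! The forcing file is opened and closed on every single read...
    status = nf90_open(trim(fname), NF90_NOWRITE, ncid)
    status = nf90_inq_varid(ncid, trim(vname), varid)
    status = nf90_get_var(ncid, varid, field)
    status = nf90_close(ncid)
    ! ...whereas caching ncid between calls (open once, close at the end
    ! of the run) would avoid the repeated open/close overhead.
end subroutine get_forcing_field
```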
@aidanheerdegen @nicjhan Looks like Nic made a start at cleaning up all the netCDF stuff, but there's still a nasty mixture of the F77 and F90 interfaces, among other things. Maybe chat about it next Tuesday. I was wondering if there could have been an integer rollover issue, since seconds are used in some places, but I would have expected that to occur only after 68 years (2^31-1 seconds ≈ 68 years).
@russfiedler The intention was always to rewrite MATM completely, possibly not in Fortran, but no decision was made on that front. Hence not much enthusiasm for fixing the existing MATM.
@nicjhan was going to submit a draft proposal for the requirements for a new MATM which the TWG and @AndyHoggANU would check and approve with changes if necessary. But I think Nic has been flat out like a lizard drinking and has not had the time to look into this.
Very weird that this is failing on a call to open(). This would indicate that it's not a timekeeping problem, which is what I suspected (and worried about). I wonder whether it could be related to this not very well documented problem: https://github.com/OceansAus/access-om2/issues/17
If so, it's very good to know that MATM is the culprit and not CICE or MOM.
Good catch.
Could it be this allocate? There is no corresponding deallocate.
That's the ERA-40 read, but the same occurs in the CORE read. Local arrays should be deallocated automatically on exiting the subroutine.
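A minimal standalone illustration of that point (not MATM code): an unsaved local allocatable is deallocated automatically when the subroutine returns, so a missing deallocate on its own is not a leak here.

```fortran
program auto_dealloc_demo
    implicit none
    integer :: i
    do i = 1, 100000
        call work()   ! no leak, despite the missing deallocate
    end do
contains
    subroutine work()
        real, allocatable :: tmp(:)
        allocate(tmp(1000))
        tmp = 0.0
        ! No explicit deallocate: a local allocatable without the SAVE
        ! attribute is deallocated automatically on exit (Fortran 95+).
    end subroutine work
end program auto_dealloc_demo
```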
Thanks Russ, yeah, I copied the wrong allocate statement. Guess that isn't the issue then.
I found two allocate() calls in read_core() that did not have a corresponding deallocate. I calculated that the memory leak is only 32 bytes per hour, or about 1.4 MB over 5 years, so it's unlikely to cause the model to crash.
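The arithmetic behind that estimate, as a trivial check (assuming one read per model hour):

```fortran
program leak_estimate
    implicit none
    ! 32 bytes leaked per hourly read, accumulated over 5 years:
    print *, 32 * 24 * 365 * 5, 'bytes'   ! prints 1401600, i.e. ~1.4 MB
end program leak_estimate
```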
OK, that's the same one that you mention.
Perhaps we need to run valgrind over this.
I found this: http://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2017/msg00144.html
The conditions look very similar to our problem. Also the same version of netcdf. Unfortunately there is no resolution.
I cleaned up read_core to use the Fortran 90 interface - that didn't make a difference. Then I tried the latest netcdf library on raijin (on Russ' suggestion) and that has done the trick.
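For anyone following along, the two interfaces differ roughly as below (a hedged sketch; the file name, variable name, and grid size are illustrative):

```fortran
! Old F77-style interface (include 'netcdf.inc', nf_* routines):
!   status = nf_open('forcing.nc', nf_nowrite, ncid)
!   status = nf_inq_varid(ncid, 'rain', varid)
!   status = nf_get_var_real(ncid, varid, field)
!   status = nf_close(ncid)

! Fortran 90 interface (netcdf module, overloaded nf90_* routines):
program f90_read_demo
    use netcdf
    implicit none
    integer :: ncid, varid, status
    real :: field(192, 94)   ! illustrative grid size

    status = nf90_open('forcing.nc', NF90_NOWRITE, ncid)
    status = nf90_inq_varid(ncid, 'rain', varid)
    status = nf90_get_var(ncid, varid, field)   ! type/shape resolved by overloading
    status = nf90_close(ncid)
end program f90_read_demo
```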
Great, how can I test on my current configuration? Presumably I need to update code and recompile?
I just tried a 160-year run and found that that's more seconds than can fit in a 4-byte integer, so the model crashes straight away. I'll need to fix that. For the time being we're limited to runs of up to 135 years.
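For reference, the overflow arithmetic (a standalone check, not MATM code); one fix would be to count seconds in an 8-byte integer from iso_fortran_env:

```fortran
program seconds_overflow
    use iso_fortran_env, only: int32, int64
    implicit none
    integer(int64) :: nsec

    nsec = 160_int64 * 365_int64 * 86400_int64
    print *, 'seconds in 160 years:', nsec          ! 5045760000
    print *, 'largest 4-byte int:  ', huge(1_int32) ! 2147483647
    ! 5.0e9 > 2.1e9, so a default 4-byte counter overflows well before
    ! 160 years; an int64 counter is good for ~2.9e11 years.
end program seconds_overflow
```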
@AndyHoggANU, yes, just recompile the latest matm and you should be good.
Awesome work @nicjhan. I had a feeling you couldn't let that one slide .. :)
I guess all the models need to be updated to use the latest netcdf at some stage.
Thanks @aidanheerdegen and @russfiedler. I've just figured out that OASIS and MCT have a 30-year limitation (less than 1e10 seconds). MCT doesn't define its own integer types, so we'll need to compile OASIS and MCT with -i8 ... I'm not sure where that will lead.
Alternatively, we go through OASIS and figure out which ints need to be 4 bytes and which need to be 8.
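To make the -i8 idea concrete: the flag changes the default integer kind to 8 bytes across the whole compilation, which is why it has to be applied consistently to OASIS and MCT together (and why it feels risky). A quick way to see its effect:

```fortran
program default_kind
    implicit none
    integer :: i
    ! Prints 4 by default; 8 when built with ifort -i8
    ! (or gfortran -fdefault-integer-8).
    print *, 'default integer size in bytes:', storage_size(i) / 8
end program default_kind
```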
@nicjhan I don't get this. I've run for 500 years in the past with MCT without problems (admittedly an older version) and so have others. I can't believe that they've overlooked this and there's a 135-year limit. Am I missing something in what you're saying? Using -i8 seems dangerous, to say the least.
@nicjhan You can run for longer if you use different units for the date argument to prism_put_proto etc. and are consistent with the namcouple.
– date [INTEGER; IN]: number of seconds (or any other time units as long as the same are used in all components and in the namcouple) at the time of the call (by convention at the beginning of the timestep)
Our convention has been to reference seconds from the start of each leg, so we are limited to 30-odd years per leg. There's no limit on the total run.
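A sketch of that approach, assuming the OASIS3 PSMILe interface and the (var_id, date, field, info) argument order quoted above; the subroutine and the hour-based units are illustrative:

```fortran
! Illustrative sketch only: pass the coupling date to OASIS in hours
! rather than seconds. The namcouple coupling periods must then be
! specified in hours as well.
subroutine couple_put(istep, dt, var_id, field)
    use mod_prism_put_proto   ! OASIS3 PSMILe put interface
    implicit none
    integer, intent(in) :: istep    ! timestep counter within this leg
    integer, intent(in) :: dt       ! timestep length in seconds
    integer, intent(in) :: var_id   ! coupling field id
    real, intent(in)    :: field(:,:)
    integer :: date_hours, ierror

    date_hours = istep * dt / 3600  ! hours since the start of the leg
    call prism_put_proto(var_id, date_hours, field, ierror)
    ! Counting hours, a 4-byte integer lasts ~245,000 years per leg,
    ! versus ~68 years counting seconds.
end subroutine couple_put
```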
OK, so I tried the new MATM, which I compiled on Friday, and I still crash with an HDF5 error just before the end of year 5. Perhaps I need to recompile the whole code? Or perhaps I didn't include it properly? Suggestions welcome.
Did you blow away the whole build directory before compiling?
Yes, I deleted the build_jra55 directory (there wasn't a make clean option).
The matm submodule in the access-om2 repository has not been updated to point at the latest matm version. You can update your matm version by going into the directory and typing
git checkout master
git pull
This will pull from the default remote; specify a remote to pull from somewhere different.
Unfortunately it doesn't seem to be compiling, I think it is an include path issue on raijin. I'll see if I can fix it.
Scratch that. It didn't compile for me because I was picking up netcdf/4.4.1.1 from ~access/modules. If the same thing happens for you, try
module unuse ~access/modules
@aidanheerdegen - is this netcdf problem likely to occur for other users? If so should a check be added to https://github.com/OceansAus/access-om2/blob/master/install.sh so the instructions at https://github.com/OceansAus/access-om2/blob/master/README.md will "just work"?
Hi @aidanheerdegen -- this seems to be working now after a few attempts, thanks. I agree with @aekiss that we might need to document this better in the near term. (I don't think I could unpick exactly what I did to get it to compile.)
I've created a new issue to track any discussion of the OASIS / model limitations to long runtimes. @russfiedler I've put your comments in there.
I have been running the 1° ACCESS-OM2 happily all weekend, with timesteps up to 3600 s and ~23 minutes of walltime per model year. So, from day 1 of model year 71, I decided to extend my 2-year simulations to 5 years, to minimise time spent in the queue. When I did that I had a curious error with MATM:
MATM istep1: 43463 idate: 751218 sec: 0
MATM: error - from NetCDF library Opening /g/data1/ua8/JRA55-do/RYF/v1-3/RYF.snow.1990_1991.nc NetCDF: HDF error
The InfiniBand retry count between two MPI processes has been exceeded. "Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38):
...
That is, it got within 13 days of 5 years and then couldn't read the input file. I checked, and got the same error on the same model day!
For now, I have dropped back to 2-year runs and am now beyond year 77. My main worry here is that this issue might be due to a bug in MATM that usually isn't fatal but may still be causing problems in other cases. Any ideas?
(If you want to see more error details, look at the jobs that failed on Nov 5 in /home/157/amh157/access-om2/control/1deg_jra55_ryf/archive/error_logs/).