Model crash in year 5

AndyHoggANU commented 7 years ago

I have been running the 1° ACCESS-OM2 happily all weekend, with timesteps up to 3600 s and ~23 minutes of walltime per model year. So, from day 1 of model year 71, I decided to extend my 2-year simulations to 5 years, to minimise time spent in the queue. When I did that I had a curious error with MATM:

MATM istep1: 43463 idate: 751218 sec: 0

MATM: error - from NetCDF library Opening /g/data1/ua8/JRA55-do/RYF/v1-3/RYF.snow.1990_1991.nc NetCDF: HDF error

The InfiniBand retry count between two MPI processes has been exceeded. "Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38):

The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.

...

That is, it got within 13 days of 5 years and then couldn't read the input file. I checked, and got the same error on the same model day!

For now, I have dropped back to 2-year runs and am now beyond year 77. My main worry here is that this issue might be due to a bug in MATM, which usually isn't fatal, but may still be causing problems in other cases. Any ideas?

(If you want to see more error details, look at the jobs that failed on Nov 5 in /home/157/amh157/access-om2/control/1deg_jra55_ryf/archive/error_logs/).

nichannah commented 7 years ago

Thanks @AndyHoggANU. This one looks important; it does look like a bug in MATM. I'll have time to look into it by Thursday.

aidanheerdegen commented 7 years ago

Good grief. The read_core routine opens (and closes) the forcing file for every read!

https://github.com/OceansAus/matm/blob/d54dfe98a049b08957b8f527f34f6e9920681370/source/atm_read.F90#L71

https://github.com/OceansAus/matm/blob/d54dfe98a049b08957b8f527f34f6e9920681370/source/atm_read.F90#L135

That is a lot slower, but by how much I haven't tested.

And the rainfall field has to be the eighth field in the data table, as it is hard coded to be treated differently to the other fields.

Nasty.
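As an illustration of the alternative to opening and closing the forcing file on every read, here is a minimal Python sketch of a handle cache. It is not MATM code: `HandleCache`, its `opener` argument and `open_count` are hypothetical names, and the opener stands in for whatever NetCDF open call the model uses.

```python
from typing import Any, Callable, Dict


class HandleCache:
    """Keep forcing files open between reads instead of reopening them."""

    def __init__(self, opener: Callable[[str], Any]) -> None:
        self._opener = opener              # hypothetical stand-in for a NetCDF open call
        self._handles: Dict[str, Any] = {}
        self.open_count = 0                # counts real opens, for illustration only

    def get(self, path: str) -> Any:
        # Open the file only the first time this path is requested.
        if path not in self._handles:
            self._handles[path] = self._opener(path)
            self.open_count += 1
        return self._handles[path]

    def close_all(self) -> None:
        # Close every cached handle that has a close() method, then forget them.
        for handle in self._handles.values():
            close = getattr(handle, "close", None)
            if callable(close):
                close()
        self._handles.clear()


if __name__ == "__main__":
    cache = HandleCache(lambda path: {"path": path})  # dummy opener, no real I/O
    for _ in range(1000):
        cache.get("RYF.snow.1990_1991.nc")
    print("opens for 1000 reads:", cache.open_count)
    cache.close_all()
```

With a cache like this, a file is opened once per run rather than once per read, at the cost of having to close the handles explicitly at shutdown.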

russfiedler commented 7 years ago

@aidanheerdegen @nicjhan Looks like Nic made a start at cleaning up all the netCDF stuff, but there's still a nasty mixture of the F77 and F90 interfaces, among other things. Maybe chat about it next Tuesday. I was wondering if there could have been an integer rollover issue, since seconds are used in some places, but I would have expected that to occur after 68 years.
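The 68-year figure can be checked with a short Python sketch. It assumes the seconds counter is a signed 4-byte integer and that years have 365 days; the helper names are made up for this example.

```python
INT32_MAX = 2**31 - 1             # largest signed 4-byte integer
SECONDS_PER_YEAR = 365 * 86400    # assumption: 365-day years, no leap days


def years_to_seconds(years: float) -> int:
    """Elapsed model time in whole seconds for a run of the given length."""
    return int(years * SECONDS_PER_YEAR)


def fits_int32(seconds: int) -> bool:
    """True if the second count can be stored in a signed 4-byte integer."""
    return 0 <= seconds <= INT32_MAX


rollover_years = INT32_MAX / SECONDS_PER_YEAR
print(f"signed 32-bit seconds counter rolls over after {rollover_years:.1f} years")
print("5-year leg fits:", fits_int32(years_to_seconds(5)))
print("70-year leg fits:", fits_int32(years_to_seconds(70)))
```

Under these assumptions a 5-year leg is far below the rollover point, which is consistent with a rollover being an unlikely cause of a crash near the end of year 5.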

aidanheerdegen commented 7 years ago

@russfiedler The intention was always to rewrite MATM completely, possibly not in Fortran, but there was no decision made on that front. Hence not much enthusiasm for fixing the existing MATM.

@nicjhan was going to submit a draft proposal for the requirements for a new MATM which the TWG and @AndyHoggANU would check and approve with changes if necessary. But I think Nic has been flat out like a lizard drinking and has not had the time to look into this.

nichannah commented 7 years ago

Very weird that this is failing on a call to open(). This would indicate that it's not a timekeeping problem, which is what I suspected (and worried about). I wonder whether it could be related to this not very well documented problem: https://github.com/OceansAus/access-om2/issues/17

If so, it's very good to know that MATM is the culprit and not CICE or MOM.

aidanheerdegen commented 7 years ago

Good catch.

aidanheerdegen commented 7 years ago

Could it be this allocate?

https://github.com/OceansAus/matm/blob/d54dfe98a049b08957b8f527f34f6e9920681370/source/atm_read.F90#L194

There is no corresponding deallocate

russfiedler commented 7 years ago

That's the era-40 read but the same occurs in the core read. Local arrays should be deallocated automatically on exiting the subroutine.

aidanheerdegen commented 7 years ago

Thanks Russ, yeah copied the wrong allocate statement. Guess that isn't the issue then.

nichannah commented 7 years ago

I found two allocate() calls that did not have a corresponding deallocate. This was in read_core(). I calculated that the memory leak is only 32 bytes per hour, or 1.4 MB over 5 years, so it is unlikely to cause the model to crash.

OK, that's the same one that you mention.

Perhaps we need to run valgrind over this.
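The leak estimate above can be reproduced in a few lines of Python. The 32 bytes per hour is the figure quoted in this comment; the 365-day year is an assumption, and `leaked_bytes` is a name made up for this example.

```python
LEAK_BYTES_PER_HOUR = 32      # figure quoted in the comment above
HOURS_PER_YEAR = 24 * 365     # assumption: 365-day years, no leap days


def leaked_bytes(years: int) -> int:
    """Total bytes leaked over a run of the given length in model years."""
    return LEAK_BYTES_PER_HOUR * HOURS_PER_YEAR * years


total = leaked_bytes(5)
print(f"{total} bytes leaked over 5 years, about {total / 1e6:.1f} MB")
```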

nichannah commented 7 years ago

I found this: http://www.unidata.ucar.edu/mailing_lists/archives/netcdfgroup/2017/msg00144.html

The conditions look very similar to our problem. Also the same version of netcdf. Unfortunately there is no resolution.

nichannah commented 7 years ago

I cleaned up read_core to use the Fortran 90 interface - that didn't make a difference. Then I tried the latest netcdf library on raijin (on Russ' suggestion) and that has done the trick.

AndyHoggANU commented 7 years ago

Great, how can I test on my current configuration? Presumably I need to update code and recompile?

nichannah commented 7 years ago

I just tried a 160 year run and found that that's more seconds than can fit in a 4 byte integer so the model crashes straight away. So I'll need to fix that. For the time being we're limited to runs up to 135 years.

@AndyHoggANU, yes just recompile the latest matm and you should be good.

aidanheerdegen commented 7 years ago

Awesome work @nicjhan. I had a feeling you couldn't let that one slide .. :)

I guess all the models need to be updated to use the latest netcdf at some stage.

nichannah commented 7 years ago

Thanks @aidanheerdegen and @russfiedler. I've just figured out that OASIS and MCT have a 30 year limitation (less than 1e9 seconds). MCT doesn't define its own integer types so we'll need to compile OASIS and MCT with -i8 ... I'm not sure where that will lead.

Alternatively, we go through OASIS and figure out which ints need to be 4 bytes and which 8.

russfiedler commented 7 years ago

@nicjhan I don't get this. I've run for 500 years in the past with MCT without problems (admittedly with an older version) and so have others. I can't believe that they've overlooked this and there's a 135 year limit. Am I missing something in what you're saying? Using -i8 seems dangerous to say the least.

russfiedler commented 7 years ago

@nicjhan You can run for longer if you use different units for the date argument to prism_put_proto etc. and you are consistent with namcouple.

– date [INTEGER; IN]: number of seconds (or any other time units as long as the same are used in all components and in the namcouple) at the time of the call (by convention at the beginning of the timestep)

Our convention has been to reference seconds from the start of each leg, so we are limited to 30-odd years per leg. There's no limit on the total run.
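The effect of the date units on the maximum leg length can be sketched in Python. This assumes the coupler accepts date values below roughly 1e9, which corresponds to the 30-odd years per leg mentioned above when counting seconds of 365-day years. That limit and the names `MAX_DATE`, `max_leg_years` and `to_coupling_date` are illustrative only and are not taken from OASIS or MCT.

```python
MAX_DATE = 10**9                  # assumption: largest usable date value
SECONDS_PER_YEAR = 365 * 86400    # assumption: 365-day years, no leap days


def max_leg_years(seconds_per_unit: int) -> float:
    """Longest leg, in years, whose elapsed time stays below MAX_DATE units."""
    return MAX_DATE * seconds_per_unit / SECONDS_PER_YEAR


def to_coupling_date(elapsed_seconds: int, seconds_per_unit: int = 1) -> int:
    """Convert seconds since the start of the leg into the chosen date unit."""
    if elapsed_seconds % seconds_per_unit != 0:
        raise ValueError("elapsed time is not a whole number of date units")
    date = elapsed_seconds // seconds_per_unit
    if date >= MAX_DATE:
        raise OverflowError("leg is too long for the chosen date unit")
    return date


for name, unit in (("seconds", 1), ("minutes", 60), ("hours", 3600)):
    print(f"date in {name}: legs of up to {max_leg_years(unit):.0f} years")
```

As the quoted documentation says, whichever unit is chosen has to be used by every component and in the namcouple file, so this is a configuration-wide choice rather than a local change.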

AndyHoggANU commented 7 years ago

OK, so, I tried the new MATM which I compiled on Friday, and I still crash with an HDF5 error just before the end of year 5. Perhaps I need to recompile the whole code? Or perhaps I didn't include it properly? Suggestions welcome.

aidanheerdegen commented 7 years ago

Did you blow away the whole build directory before compiling?

AndyHoggANU commented 7 years ago

Yes, I deleted the build_jra55 directory (there wasn't a make clean option).

aidanheerdegen commented 7 years ago

The matm submodule in the access-om2 repository has not been updated to point at the latest matm version. You can update your matm version by going into the matm directory and typing

git checkout master
git pull

This will pull from the default remote. To pull from somewhere else, specify the remote explicitly.

Unfortunately it doesn't seem to be compiling. I think it is an include path issue on raijin. I'll see if I can fix it.

aidanheerdegen commented 7 years ago

Scratch that. It didn't compile for me because I was picking up netcdf/4.4.1.1 from ~access/modules. If the same thing happens for you, try

module unuse ~access/modules

aekiss commented 7 years ago

@aidanheerdegen - is this netcdf problem likely to occur for other users? If so should a check be added to https://github.com/OceansAus/access-om2/blob/master/install.sh so the instructions at https://github.com/OceansAus/access-om2/blob/master/README.md will "just work"?

AndyHoggANU commented 7 years ago

Hi @aidanheerdegen -- this seems to be working now after a few attempts, thanks. I agree with @aekiss that we might need to document this better in the near term. (I don't think I could unpick exactly what I did to get it to compile.)

nichannah commented 6 years ago

I've created a new issue to track any discussion of the OASIS / model limitations to long runtimes. @russfiedler I've put your comments in there.

COSIMA / access-om2

Model crash in year 5 #50