Closed AndyHoggANU closed 4 years ago
Where are you running the script?
The wall time is 1 hr, so I don't understand why it is timing out after 30 minutes.
I can add logic to not overwrite a file unless specifically asked to do so
I don't think this is a wall-time issue - it was actually only 3 minutes before it crashed. It also doesn't seem to be memory, it seems it was just killed??
Depending on how it was killed, it might be memory and just not show up as that. This is the sort of thing we used to be able to diagnose by looking at the system logs, but we no longer have access to them.
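Without system logs, one rough way to tell a signal kill from a normal failure is to launch the command from a small wrapper and look at the exit status and peak memory afterwards. The following is only a sketch: `run_and_report` is a hypothetical helper, not part of splitvar, and it assumes a Unix system where the wrapper itself survives (if the scheduler kills the whole job, the wrapper dies too and reports nothing).

```python
import resource
import signal
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd, then report how it ended and the peak memory of child processes."""
    proc = subprocess.run(cmd)
    # ru_maxrss is the largest resident set size among all waited-for
    # children so far (kilobytes on Linux), not only the last command.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    killed_by = None
    if proc.returncode < 0:
        # A negative return code means the child was terminated by that signal.
        killed_by = signal.Signals(-proc.returncode).name
    return {"returncode": proc.returncode, "killed_by": killed_by, "peak_rss_kb": peak}


# Hypothetical usage, with the real splitvar arguments filled in:
#   print(run_and_report(["splitvar", "--verbose", ...]))
```

A `killed_by` of `SIGKILL` together with a peak RSS close to the job's memory request would point towards memory, though it is not proof.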
I have updated `splitvar` so that it won't overwrite files that already exist unless the `--overwrite` option is specified.
$ conda list splitvar
# packages in environment at /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07:
#
# Name Version Build Channel
splitvar 0.2.6 py36_0 coecms
[aph502@raijin3 runner]$ splitvar --help
usage: splitvar [-h] [--verbose] [-f FREQUENCY] [--aggregate AGGREGATE]
[-v VARIABLES] [-x SKIPVARS] [-d DELATTR] [-a ADD]
[-s SKIPVARS] [-t TITLE] [--simname SIMNAME]
[--model-type MODELTYPE] [--timeformat TIMEFORMAT]
[--timeshift [TIMESHIFT]] [--usebounds] [--datefrombounds]
[--calendar CALENDAR] [-o OUTPUTDIR] [--overwrite] [-cp]
[--engine ENGINE]
inputs [inputs ...]
Split multiple netCDF files by time and variable
positional arguments:
inputs netCDF files
optional arguments:
-h, --help show this help message and exit
--verbose Verbose output
-f FREQUENCY, --frequency FREQUENCY
Time period to group for output
--aggregate AGGREGATE
Apply mean in time, using pandas frequency notation
e.g Y, 6M, 2Y
-v VARIABLES, --variables VARIABLES
Only extract specified variables
-x SKIPVARS, --x-variables SKIPVARS
Exclude specified variables
-d DELATTR, --delattr DELATTR
Delete specified global attributes
-a ADD, --add ADD Read in additional variables from these files
-s SKIPVARS, --skipvars SKIPVARS
Do not extract these variables
-t TITLE, --title TITLE
Title of the simulation, included in metadata
--simname SIMNAME Simulation name to include in the filename
--model-type MODELTYPE
Model type to include in the filename
--timeformat TIMEFORMAT
strftime format string for date fields in filename
--timeshift [TIMESHIFT]
Shift time axis by specified amount (in whatever units
are used in the file). Default is to automatically
shift current start date to time origin
--usebounds Use mid point of time bounds for time axis
--datefrombounds Use time bounds for filename datestamp
--calendar CALENDAR Specify calendar: will replace value of calendar
attribute whereever it is found
-o OUTPUTDIR, --outputdir OUTPUTDIR
Output directory in which to store the data
--overwrite Overwrite output file if it already exists
-cp, --copytimeunits Copy time units from time variable to bounds
--engine ENGINE Back-end used to write output files (options are
netcdf4 and h5netcdf)
In the case of the tenth it'll still spawn a job, but it will pretty quickly finish as the files should already exist.
Does that mean it will even try to read the relevant files in?
Yeah it will, as the test is inside the loop where it splits by time/variable.
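To illustrate why the inputs still get read: a minimal sketch of an existence check that sits inside the split loop. This is not splitvar's actual code; `split_all`, the `write` callable, and the filename pattern are all made up for illustration.

```python
import os
import tempfile


def split_all(groups, outputdir, overwrite=False, write=None):
    """Loop over (timestamp, variable) groups, skipping outputs that exist.

    groups: (timestamp, variable) pairs, assumed to come from input data
            that has already been opened and read.
    write:  hypothetical callable that writes one output file.
    """
    written, skipped = [], []
    for stamp, var in groups:
        # Hypothetical filename pattern, not splitvar's real one.
        path = os.path.join(outputdir, f"{var}_{stamp}.nc")
        # The test happens here, inside the loop, so the cost of reading
        # the inputs has already been paid by the time a file is skipped.
        if os.path.exists(path) and not overwrite:
            skipped.append(path)
            continue
        if write is not None:
            write(path)
        written.append(path)
    return written, skipped


def touch(path):
    """Stand-in writer that just creates an empty file."""
    open(path, "w").close()


with tempfile.TemporaryDirectory() as outdir:
    groups = [("2014-01", "temp"), ("2014-07", "temp")]
    first = split_all(groups, outdir, write=touch)   # writes both files
    second = split_all(groups, outdir, write=touch)  # skips both files
```

The second call returns quickly because nothing is written, but only after `groups` has been built from the inputs.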
This code now works fine up to 2014. For some reason, we strike a memory issue once we hit 2015.
Just returning to this -- testing whether we still have this memory issue with the last 3 years of output.
I have confirmed that the memory issues with output1[8-9]* still exist for these 3D cases. @aidanheerdegen - any thoughts on how we can diagnose what is happening here?
What is the command you're running to generate this error? Do you recall that the same variables were not present in all the files? Specifically, `aiso_bih` is only present from `output186` onwards.
So, the command was this one:
splitvar --verbose -f $FREQUENCY -cp -d title -d grid_type -d grid_tile -a ocean_grid.nc -o ${OUTPATH} --model-type ${SUBMODEL} --simname ${MODEL} --calendar proleptic_gregorian -v ${var} ${COSIMADIR}/${MODEL}/${EXPT}/output1[8-9]?/${SUBMODEL}/ocean.nc
It was applied to these vars:
```
vars=(temp salt age_global u v pot_rho_0 pot_rho_2 tx_trans ty_trans ty_trans_submeso tx_trans_rho ty_trans_rho ty_trans_nrho_submeso temp_xflux_adv temp_yflux_adv diff_cbt_t vert_pv)
```
so I have indeed removed `aiso_bih` for the reason you suggest.
Some 3D cases also produced output. However, they took about half an hour per 6-month file, so these ones timed out. That is OK, we can extend the time limit, but we still may have to chop things up. The difficulty comes if we have to do it in pieces -- can we get the code to skip over files that have already been created?
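If it does come to chopping things up, one option is to give each job a smaller batch of input files per variable. A rough sketch, assuming it is acceptable to split the input list at batch boundaries (an output period that spans two batches might not come out the same): `batches` and `build_commands` are hypothetical helpers, the paths and the `FREQ` value are placeholders, and only flags that appear in the command above are used.

```python
def batches(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_commands(variables, inputs, outputdir, frequency, batch_size):
    """Return one splitvar argument list per (variable, batch of inputs)."""
    commands = []
    for var in variables:
        for chunk in batches(inputs, batch_size):
            commands.append(
                ["splitvar", "--verbose", "-f", frequency,
                 "-o", outputdir, "-v", var] + chunk
            )
    return commands


# Placeholder inputs: 20 hypothetical output directories, 5 files per job.
inputs = [f"output{n}/ocean/ocean.nc" for n in range(180, 200)]
cmds = build_commands(["temp", "salt"], inputs, "outdir", "FREQ", 5)
```

Each entry in `cmds` could then be submitted as its own job. Since splitvar now skips existing files unless `--overwrite` is given, rerunning a batch that timed out should only redo the missing outputs.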
The real issue is that some variables died with this error:
which I don't understand. In this case, the output file said: