aidanheerdegen / publish_cosima_data

3D cases also progressing #17

Closed AndyHoggANU closed 4 years ago

AndyHoggANU commented 5 years ago

Some 3D cases also produced output. However, they took about half an hour per 6-month file, so these ones timed out. That is OK, we can extend the time limit, but we still may have to chop things up. The difficulty comes if we have to do it in pieces -- can we get the code to skip over files that have already been created?

The real issue is that some variables died with this error:

/jobfs/local/pbs/mom_priv/jobs/1360257.r-man2.SC: line 6: 11504 Killed                  splitvar -f 6MS -cp -d title -d grid_type -d grid_tile -a ocean_grid.nc -o /g/data1a/ua8/cosima-tmp/publish --model-type ocean --simname access-om2-01 --calendar proleptic_gregorian -v age_global /g/data3/hh5/tmp/cosima/access-om2-01/01deg_jra55v13_iaf/output0[0-2]?/ocean/ocean.nc

which I don't understand. In this case, the output file said:

======================================================================================
                  Resource Usage on 2019-08-20 21:10:40:
   Job Id:             1360257.r-man2
   Project:            v45
   Exit Status:        137 (Linux Signal 9 SIGKILL Kill, unblockable)
   Service Units:      0.07
   NCPUs Requested:    1                      NCPUs Used: 1
                                           CPU Time Used: 00:02:46
   Memory Requested:   24.0GB                Memory Used: 10.64GB
   Walltime requested: 01:00:00            Walltime Used: 00:03:59
   JobFS requested:    100.0MB                JobFS used: 0B
======================================================================================

aidanheerdegen commented 5 years ago

Where are you running the script?

The wall time is 1 hr, so I don't understand why it is timing out after 30 minutes.

I can add logic to not overwrite a file unless specifically asked to do so.

AndyHoggANU commented 5 years ago

I don't think this is a wall-time issue - it was actually only 3 minutes before it crashed. It also doesn't seem to be memory, it seems it was just killed??

aidanheerdegen commented 5 years ago

Depending on how it was killed, it might be a memory problem that just doesn't show up as one. This is the sort of thing we used to be able to diagnose by looking at the system logs, but we no longer have access to them.
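
For reference, the exit status of 137 in the resource usage report follows the usual shell convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL). SIGKILL cannot be caught, so the process gets no chance to log a reason, and the Linux out-of-memory killer also terminates processes with SIGKILL, which is consistent with a memory kill that is not reflected in the reported memory usage. A minimal sketch of the decoding, where `describe_exit_status` is a hypothetical helper name and not part of splitvar or PBS:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status.

    By shell convention, a status above 128 means the process was
    terminated by signal number (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited normally with code {status}"


# The exit status reported for job 1360257.r-man2
print(describe_exit_status(137))
```

This only identifies the signal; it cannot say who sent it or why.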

aidanheerdegen commented 5 years ago

I have updated splitvar so that it won't overwrite files that already exist unless the --overwrite option is specified.

$ conda list splitvar
# packages in environment at /g/data3/hh5/public/apps/miniconda3/envs/analysis3-19.07:
#
# Name                    Version                   Build  Channel
splitvar                  0.2.6                    py36_0    coecms
[aph502@raijin3 runner]$ splitvar --help
usage: splitvar [-h] [--verbose] [-f FREQUENCY] [--aggregate AGGREGATE]
                [-v VARIABLES] [-x SKIPVARS] [-d DELATTR] [-a ADD]
                [-s SKIPVARS] [-t TITLE] [--simname SIMNAME]
                [--model-type MODELTYPE] [--timeformat TIMEFORMAT]
                [--timeshift [TIMESHIFT]] [--usebounds] [--datefrombounds]
                [--calendar CALENDAR] [-o OUTPUTDIR] [--overwrite] [-cp]
                [--engine ENGINE]
                inputs [inputs ...]

Split multiple netCDF files by time and variable

positional arguments:
  inputs                netCDF files

optional arguments:
  -h, --help            show this help message and exit
  --verbose             Verbose output
  -f FREQUENCY, --frequency FREQUENCY
                        Time period to group for output
  --aggregate AGGREGATE
                        Apply mean in time, using pandas frequency notation
                        e.g Y, 6M, 2Y
  -v VARIABLES, --variables VARIABLES
                        Only extract specified variables
  -x SKIPVARS, --x-variables SKIPVARS
                        Exclude specified variables
  -d DELATTR, --delattr DELATTR
                        Delete specified global attributes
  -a ADD, --add ADD     Read in additional variables from these files
  -s SKIPVARS, --skipvars SKIPVARS
                        Do not extract these variables
  -t TITLE, --title TITLE
                        Title of the simulation, included in metadata
  --simname SIMNAME     Simulation name to include in the filename
  --model-type MODELTYPE
                        Model type to include in the filename
  --timeformat TIMEFORMAT
                        strftime format string for date fields in filename
  --timeshift [TIMESHIFT]
                        Shift time axis by specified amount (in whatever units
                        are used in the file). Default is to automatically
                        shift current start date to time origin
  --usebounds           Use mid point of time bounds for time axis
  --datefrombounds      Use time bounds for filename datestamp
  --calendar CALENDAR   Specify calendar: will replace value of calendar
                        attribute whereever it is found
  -o OUTPUTDIR, --outputdir OUTPUTDIR
                        Output directory in which to store the data
  --overwrite           Overwrite output file if it already exists
  -cp, --copytimeunits  Copy time units from time variable to bounds
  --engine ENGINE       Back-end used to write output files (options are
                        netcdf4 and h5netcdf)

In the case of the tenth it'll still spawn a job, but it will pretty quickly finish as the files should already exist.

AndyHoggANU commented 5 years ago

Does that mean it will even try to read the relevant files in?

aidanheerdegen commented 5 years ago

Yeah it will, as the test is inside the loop where it splits by time/variable.
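
As a rough illustration of why the inputs are still read, here is a minimal sketch of an existence check placed inside the write loop. It is not splitvar's actual code: `write_pieces`, `pieces` and `writer` are hypothetical names, and each `writer` stands in for whatever writes one time/variable chunk. By the time the loop runs, the input files have already been opened to work out what the pieces are; only the writing is skipped.

```python
from pathlib import Path


def write_pieces(pieces, outputdir, overwrite=False):
    """Write each (filename, writer) pair unless the target already exists.

    pieces    : iterable of (filename, writer); writer(path) writes one file
    outputdir : directory in which the output files are stored
    overwrite : if True, existing files are written again

    Returns two lists of paths: (written, skipped).
    """
    outputdir = Path(outputdir)
    written, skipped = [], []
    for filename, writer in pieces:
        target = outputdir / filename
        # The test sits inside the loop, so it runs once per output file
        if target.exists() and not overwrite:
            skipped.append(target)
            continue
        writer(target)
        written.append(target)
    return written, skipped
```

Avoiding the read altogether would mean working out the expected output filenames before opening any input, which this sketch does not attempt.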

AndyHoggANU commented 5 years ago

This code now works fine up to 2014. For some reason, we strike a memory issue once we hit 2015.

AndyHoggANU commented 4 years ago

Just returning to this -- testing whether we still have this memory issue with the last 3 years of output.

AndyHoggANU commented 4 years ago

I have confirmed that the memory issues with output1[8-9]* still exist for these 3D cases. @aidanheerdegen - any thoughts on how we can diagnose what is happening here?

aidanheerdegen commented 4 years ago

What is the command you're running to generate this error? Do you recall that the same variables were not present in all the files? Specifically aiso_bih is only present from output186 onwards.
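
One way to check this kind of mismatch is to list which variables are absent from at least one of the input files. The sketch below is only an illustration: `missing_from_some` is a hypothetical helper, and `list_variables` is a stand-in for however the variable names are read from a file (for example, with the netCDF4 package it could be something like `lambda p: netCDF4.Dataset(p).variables.keys()`, which is an assumption about the setup rather than what splitvar does).

```python
def missing_from_some(files, list_variables):
    """Find variables that are not present in every file.

    files          : list of file paths (or any hashable identifiers)
    list_variables : callable returning the variable names in one file
                     (hypothetical stand-in for a real netCDF reader)

    Returns a dict mapping each such variable to the files lacking it.
    """
    files = list(files)
    contents = {path: set(list_variables(path)) for path in files}
    all_names = set()
    for names in contents.values():
        all_names |= names
    result = {}
    for name in sorted(all_names):
        lacking = [path for path in files if name not in contents[path]]
        if lacking:
            result[name] = lacking
    return result
```

Any variable that appears in the result would need to be dropped from the list, or the file glob narrowed to the outputs that contain it.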

AndyHoggANU commented 4 years ago

So, the command was this one:

splitvar --verbose -f $FREQUENCY -cp -d title -d grid_type -d grid_tile -a ocean_grid.nc -o ${OUTPATH} --model-type ${SUBMODEL} --simname ${MODEL} --calendar proleptic_gregorian -v ${var} ${COSIMADIR}/${MODEL}/${EXPT}/output1[8-9]?/${SUBMODEL}/ocean.nc

It was applied to these vars:

vars=(temp salt age_global u v pot_rho_0 pot_rho_2 tx_trans ty_trans ty_trans_submeso tx_trans_rho ty_trans_rho ty_trans_nrho_submeso temp_xflux_adv temp_yflux_adv diff_cbt_t vert_pv)

so I have indeed removed `aiso_bih` for the reason you suggest.