Thomas-Moore-Creative / NCI-ACCESS-S2-ARD

progress towards analysis ready data (ARD) for the ACCESS-S2 collection at NCI
GNU General Public License v3.0

Write code to regularly update the "ETBF" variables from new ACCESS-S2 RA data on NCI #8

Open · Thomas-Moore-Creative opened this issue 1 year ago

Thomas-Moore-Creative commented 1 year ago

Have to move the workflow from OOD to ARE; a key difference is how to request cluster resources:

OOD SLURM format

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# one single-process worker per job: 2 cores, 23 GB, 2-hour walltime
cluster = SLURMCluster(cores=2, processes=1, memory='23GB', walltime='02:00:00')
client = Client(cluster)

# scale out until the cluster holds 24 cores in total
cluster.scale(cores=24)

This needs to change to the ARE PBS format, like:

from dask.distributed import Client
from dask_jobqueue import PBSCluster

walltime = '01:00:00'
cores = 48
memory = '192GB'

# processes=cores gives one single-threaded worker process per core;
# header_skip drops the 'select' directive, which Gadi's PBS rejects
cluster = PBSCluster(walltime=str(walltime), cores=cores, memory=str(memory), processes=cores,
                     job_extra=['-q normal', '-P xv83', '-l ncpus=' + str(cores), '-l mem=' + str(memory),
                                '-l storage=gdata/xv83+gdata/v14+gdata/ux62+scratch/xv83+gdata/rt52+gdata/ik11+gdata/cj50+gdata/jk72+gdata/hh5'],
                     local_directory='$TMPDIR',
                     header_skip=["select"])
client = Client(cluster)

# two PBS jobs of 48 cores each
cluster.scale(jobs=2)
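(With processes=cores, each Dask worker here is a single-threaded process, and cluster.scale(jobs=2) submits two 48-core PBS jobs, i.e. 96 workers in total.)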
Thomas-Moore-Creative commented 1 year ago

How current is the S2 data?

-rw-rw----+ 1 gay548 ux62 948153768 Mar 5 14:23 /g/data/ux62/access-s2/reanalysis/ocean/temp/mo_temp_2023.nc

time_counter  (time_counter)  datetime64[ns]
array(['2023-01-01T12:00:00.000000000', '2023-02-01T12:00:00.000000000'],
      dtype='datetime64[ns]')

As of March 5th, 2023, the S2 RA data appears to be updated through February 2023.

The metadata includes a warning: "This file will be updated monthly, so take care when using this dataset".
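For reference, a minimal sketch of checking currency programmatically rather than by ls (same path pattern as above):

import glob
import xarray as xr

# grab the most recent yearly file and report its last timestamp
latest = sorted(glob.glob('/g/data/ux62/access-s2/reanalysis/ocean/temp/mo_temp_*.nc'))[-1]
ds = xr.open_dataset(latest)
print(latest, '->', ds.time_counter.values[-1])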

Thomas-Moore-Creative commented 1 year ago

WARNING - there is an inconsistency in the name of the corrected SSH variable across different year files that requires a preprocess hack:

def fix_SSHname(ds):
    """Normalise the corrected-SSH variable name across year files."""
    if 'ssh' in ds.data_vars:
        ds = ds.rename({'ssh': 'ssh_corrected'})
    return ds
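A minimal sketch of applying the hack via the preprocess hook; the SSH path here is an assumption, following the pattern of the other variables:

import xarray as xr

# preprocess runs on each file before concatenation, so every year
# arrives with the same variable name
ds_SSH = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/ssh_corrected/mo_ssh*.nc',
                           preprocess=fix_SSHname, parallel=True)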
Thomas-Moore-Creative commented 1 year ago

Started running the workflow on updated data > https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/commit/e92f84facab06399a101d6246886f052124d2943

Thomas-Moore-Creative commented 1 year ago

Trying to progress, but a single variable that used to take less than 20 seconds to open is now taking a huge amount of time?

ds_SST = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/sst/mo_sst_*.nc', parallel=True)

Also noting the NCI advisories on "Network Connectivity Issues", which aren't due to be fixed until May 2nd?!?

Thomas-Moore-Creative commented 1 year ago

ValueError: cannot reindex or align along dimension 'time_counter' because the index has duplicate values

Hand-checked each variable, and it turns out that SSH has duplicate and missing timestamps. [screenshot: 2023-04-21 at 3:44 pm]
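A sketch of how such index problems can be confirmed, given a per-variable dataset ds as above; assumes the monthly stamps sit at 12:00 on the 1st, as the repr earlier shows:

import pandas as pd

# time_counter as a pandas DatetimeIndex
idx = ds.indexes['time_counter']
print('duplicates:', idx[idx.duplicated()].tolist())

# expected monthly stamps at 12:00 on the 1st; anything absent is a gap
expected = pd.date_range(idx.min().normalize(), idx.max(), freq='MS') + pd.Timedelta('12h')
print('missing:', expected.difference(idx).tolist())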

Thomas-Moore-Creative commented 1 year ago

ARE failure - Dask server error in the screen UI. (NB: folks are reporting lots of ARE issues since the changes to nodes.)

======================================================================================
                  Resource Usage on 2023-04-27 12:07:54:
   Job Id:             81812983.gadi-pbs
   Project:            xv83
   Exit Status:        271 (Linux Signal 15 SIGTERM Termination)
   Service Units:      0.85
   NCPUs Requested:    14                     NCPUs Used: 14              
                                           CPU Time Used: 00:00:45        
   Memory Requested:   63.0GB                Memory Used: 1.91GB          
   Walltime requested: 05:00:00            Walltime Used: 00:02:54        
   JobFS requested:    100.0MB                JobFS used: 174.33MB        
======================================================================================
Thomas-Moore-Creative commented 1 year ago

[screenshot: CleanShot 2023-04-27 at 14:17:20]

Thomas-Moore-Creative commented 1 year ago

Have managed to wait for a 2-node cluster, but still got a failure that looked memory-related:

KilledWorker: ("('open_dataset-concatenate-concatenate-655507e155bb31ca72f94c49a16638a9', 25, 0, 0, 0)", <WorkerState 'tcp://10.6.77.71:34619', name: PBSCluster-1-15, status: closed, memory: 0, processing: 1>)
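One mitigation worth trying (a sketch, not necessarily the eventual fix): request smaller chunks at open time so no single worker holds a whole concatenation in memory:

import xarray as xr

# smaller per-task chunks lower peak worker memory, at the cost of more tasks
ds_SST = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/sst/mo_sst_*.nc',
                           chunks={'time_counter': 1}, parallel=True)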

Thomas-Moore-Creative commented 1 year ago

Success on Tgrid yesterday:

/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/accessS2.RA.ocean.nativeTgrid.zarr has been written.

Failure on UVgrid today.

Using a HugeMem LocalCluster starts well but then just hangs mid-write? NOTE: recent problems and changes at NCI have caused similar issues for others.

Thomas-Moore-Creative commented 1 year ago

Success writing out the U & V Zarr.

It was successful because of the jobfs=400GB setting, and working on a Saturday meant I could get an X-Large HugeMem node with 700+GB RAM. Used the rechunker approach with a LocalCluster.
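For the record, a minimal sketch of that rechunker-plus-LocalCluster pattern; the input path, variable and dimension names, chunk targets, and store paths are all placeholders, not the exact values used:

import xarray as xr
from dask.distributed import Client, LocalCluster
from rechunker import rechunk

# LocalCluster on the HugeMem node; workers spill to node-local jobfs
client = Client(LocalCluster(local_directory='$TMPDIR'))

ds_UV = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/u/mo_u_*.nc', parallel=True)

# rechunker stages data through an intermediate store, keeping worker memory bounded
plan = rechunk(ds_UV,
               target_chunks={'u': {'time_counter': 12, 'depthu': 1, 'y': 300, 'x': 360}},
               max_mem='50GB',
               target_store='accessS2.RA.ocean.nativeUVgrid.zarr',
               temp_store='rechunk_temp.zarr')
plan.execute()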

Thomas-Moore-Creative commented 1 year ago

Have also added code to write out the full-depth T zarr.

It can run without rechunker given 700+GB of memory. https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/commit/38d541724fd0112552d4890e661bcc05d7602494
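A sketch of the no-rechunker path, viable when the node's 700+GB RAM covers the intermediate; chunk size and output path are placeholders:

import xarray as xr

# chunk once in Dask, then stream straight to Zarr -- no temp store needed
ds_T = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/temp/mo_temp_*.nc', parallel=True)
ds_T.chunk({'time_counter': 12}).to_zarr('accessS2.RA.ocean.fulldepthTgrid.zarr', consolidated=True)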

Thomas-Moore-Creative commented 1 year ago

DEADLINE - June 30th

requirements

files for January through December 2022 only

update: talking with Laura - 21 June

Thomas-Moore-Creative commented 1 year ago

Started limited code to run only an incremental update from the updated zarr files (see the sketch below). https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/commit/0e78d48a790a8fdc82bfea176b7e76077f36cf02
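The core of the incremental pattern is Zarr's append support. A sketch, assuming ds_all is the freshly opened multi-file dataset; the store path is the Tgrid zarr written earlier:

import xarray as xr

store = '/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/accessS2.RA.ocean.nativeTgrid.zarr'
existing = xr.open_zarr(store)
last = existing.time_counter.values[-1]

# keep only timestamps newer than the last one already stored, then append
ds_new = ds_all.sel(time_counter=ds_all.time_counter > last)
ds_new.to_zarr(store, mode='a', append_dim='time_counter')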

Thomas-Moore-Creative commented 1 year ago

Reminding myself of the approach that the BoM uses for regridding: https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/issues/4#issuecomment-1048381442

The steps BoM uses are the same as my plan here:

  1. crop with padding
  2. bilinear regridding
  3. crop off padding
# add coordinate attributes so cdo can locate the curvilinear lon/lat
ncatted -a coordinates,"temp",c,c,"nav_lon nav_lat" tmp_1.nc

# 1. crop with padding
cdo -s -L -sellonlatbox,100,200,-50,10 -selname,"temp" tmp_1.nc tmp_2.nc  && mv tmp_{2,1}.nc
# 2. bilinear regridding (missing values first filled by nearest neighbour)
cdo -s -L remapbil,r1440x720 -selname,"temp" -setmisstonn tmp_1.nc tmp_2.nc   && mv tmp_{2,1}.nc
# 3. crop off padding
cdo -s -L -sellonlatbox,110,190,-45,5 -selname,"temp" tmp_1.nc tmp_2.nc  && mv tmp_{2,1}.nc
# compress and write the final file
cdo -s -f nc4 -z zip copy tmp_1.nc latest_forecast_rg.nc

update: Grant Smith (BoM) has clarified that -setmisstonn means "set missing values by nearest-neighbour extrapolation". We can follow this approach with xESMF.
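In xESMF terms, that maps onto the extrap_method argument. A sketch, assuming ds_in already exposes lat/lon coordinates (nav_lat/nav_lon may need renaming first):

import xesmf as xe

# target 0.25-degree global grid, matching cdo's r1440x720
ds_out = xe.util.grid_global(0.25, 0.25)

# bilinear remap, with nearest-neighbour extrapolation standing in for -setmisstonn
regridder = xe.Regridder(ds_in, ds_out, method='bilinear', extrap_method='nearest_s2d')
ds_rg = regridder(ds_in)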

Thomas-Moore-Creative commented 1 year ago

Oh No . . .

There is some new bug with the xESMF installation?!? And I'm hitting it: https://github.com/pangeo-data/xESMF/issues/269

Trying to work through it via https://github.com/pangeo-data/xESMF/issues/246

Thomas-Moore-Creative commented 1 year ago

OK, have fixed the above problem and exported a very quick-and-dirty netCDF file for others to test:

test = xr.open_mfdataset('/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/variables_native_grid/S2RA_update_2022_data_masked_crop_025grid.nc', parallel=True)