Thomas-Moore-Creative opened this issue 1 year ago
Have to move the workflow from OOD to ARE. A key difference is how to call up cluster resources:
```python
# OOD workflow: SLURM-backed cluster
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=2, processes=1, memory="23GB", walltime="02:00:00")
client = Client(cluster)
cluster.scale(cores=24)  # scale up to 24 cores worth of workers
```
```python
# ARE workflow: PBS-backed cluster on Gadi
import dask.config
from dask.distributed import Client, LocalCluster
from dask_jobqueue import PBSCluster

walltime = '01:00:00'
cores = 48
memory = '192GB'

cluster = PBSCluster(
    walltime=str(walltime),
    cores=cores,
    memory=str(memory),
    processes=cores,
    job_extra=['-q normal', '-P xv83',
               '-l ncpus=' + str(cores),
               '-l mem=' + str(memory),
               '-l storage=gdata/xv83+gdata/v14+gdata/ux62+scratch/xv83+gdata/rt52+gdata/ik11+gdata/cj50+gdata/jk72+gdata/hh5'],
    local_directory='$TMPDIR',
    header_skip=["select"],
)
cluster.scale(jobs=2)  # request two PBS jobs
```
```
-rw-rw----+ 1 gay548 ux62 948153768 Mar 5 14:23 /g/data/ux62/access-s2/reanalysis/ocean/temp/mo_temp_2023.nc
```

The `time_counter` coordinate in this 2023 file currently holds only two monthly timestamps:

```
array(['2023-01-01T12:00:00.000000000', '2023-02-01T12:00:00.000000000'],
      dtype='datetime64[ns]')
```
The metadata includes a warning: "This file will be updated monthly, so take care when using this dataset".
The corrected SSH variable is named inconsistently across the different year files, which requires a preprocess hack (see the usage sketch below):

```python
def fix_SSHname(ds):
    # rename the corrected SSH variable so all years share one name
    if 'ssh' in ds.data_vars:
        ds = ds.rename({'ssh': 'ssh_corrected'})
    return ds
```
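A minimal sketch of how the hack would be applied; the path pattern here is an assumption, not the actual file layout:

```python
import xarray as xr

# hypothetical path pattern for the corrected-SSH files
ds_ssh = xr.open_mfdataset(
    '/g/data/ux62/access-s2/reanalysis/ocean/ssh_corrected/mo_ssh_corrected_*.nc',
    preprocess=fix_SSHname,  # applied to each file before concatenation
    parallel=True,
)
```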
Started running the workflow on the updated data > https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/commit/e92f84facab06399a101d6246886f052124d2943
Trying to progress, but a single variable that used to open in less than 20 seconds is now taking a huge amount of time?

```python
ds_SST = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/sst/mo_sst_*.nc', parallel=True)
```
Noting the NCI advisories on "Network Connectivity Issues" that aren't due to be fixed until May 2nd?!?
Hand-checked each variable and it turns out that SSH has duplicate & missing timestamps.
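A minimal sketch of the kind of check involved; the SSH path pattern and coordinate handling are assumptions:

```python
import pandas as pd
import xarray as xr

# hypothetical path pattern for the SSH files
ds_ssh = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/ssh/mo_ssh_*.nc',
                           parallel=True)
times = ds_ssh.indexes['time_counter']

# duplicated monthly timestamps
print(times[times.duplicated()])

# months missing from an otherwise complete monthly sequence
full = pd.period_range(times.min(), times.max(), freq='M')
print(full.difference(times.to_period('M')))
```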
Dask server error (NB: folks are reporting lots of ARE issues since the changes to nodes):

```
======================================================================================
Resource Usage on 2023-04-27 12:07:54:
Job Id: 81812983.gadi-pbs
Project: xv83
Exit Status: 271 (Linux Signal 15 SIGTERM Termination)
Service Units: 0.85
NCPUs Requested: 14                     NCPUs Used: 14
CPU Time Used: 00:00:45
Memory Requested: 63.0GB                Memory Used: 1.91GB
Walltime requested: 05:00:00            Walltime Used: 00:02:54
JobFS requested: 100.0MB                JobFS used: 174.33MB
======================================================================================
```
Managed to wait for a 2-node cluster, but still got a failure that looked memory-related:

```
KilledWorker: ("('open_dataset-concatenate-concatenate-655507e155bb31ca72f94c49a16638a9', 25, 0, 0, 0)", <WorkerState 'tcp://10.6.77.71:34619', name: PBSCluster-1-15, status: closed, memory: 0, processing: 1>)
```
`/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/accessS2.RA.ocean.nativeTgrid.zarr` has been written.
Using a HugeMem LocalCluster starts well but just hangs mid-write? NOTE: recent problems and changes at NCI have caused similar issues for others.
The Zarr write was successful because of the jobfs=400GB setting. And working on a Saturday meant I could get an X-Large HugeMem node with 700+GB RAM. Used the rechunker approach with a LocalCluster.
The zarr write can also run without rechunker given 700+GB of memory (see the sketch below).
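A minimal sketch of the "no rechunker" path on a single HugeMem LocalCluster; worker counts, chunk sizes and the variable set are illustrative assumptions, not the exact notebook settings:

```python
import xarray as xr
from dask.distributed import Client, LocalCluster

# LocalCluster on one X-Large HugeMem node (700+ GB RAM)
cluster = LocalCluster(n_workers=7, threads_per_worker=1, memory_limit='100GB')
client = Client(cluster)

ds = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/temp/mo_temp_*.nc',
                       parallel=True)

# enforce uniform chunks before writing, then write the collection in one pass
ds = ds.chunk({'time_counter': 12})
ds.to_zarr(
    '/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/accessS2.RA.ocean.nativeTgrid.zarr',
    mode='w',
    consolidated=True,
)
```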
https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/commit/38d541724fd0112552d4890e661bcc05d7602494
Requirements: January - December 2022 files only.
Update after talking with Laura (21 June): started off a limited code path to only run an incremental update of the updated zarr files (see the sketch after the commit link below).
https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/commit/0e78d48a790a8fdc82bfea176b7e76077f36cf02
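A minimal sketch of what an incremental zarr update could look like, assuming the store uses `time_counter` as the record dimension; the 2022 file pattern and the single-variable focus are assumptions:

```python
import xarray as xr

zarr_store = '/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/accessS2.RA.ocean.nativeTgrid.zarr'

existing = xr.open_zarr(zarr_store)

# hypothetical pattern for the newly updated 2022 source file
new = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/temp/mo_temp_2022.nc')

# keep only timestamps not already in the store, then append along time
new = new.sel(time_counter=~new.time_counter.isin(existing.time_counter.values))
new.to_zarr(zarr_store, append_dim='time_counter')
```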
Reminding myself of the approach that the BoM uses for regridding: https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/issues/4#issuecomment-1048381442
The steps the BoM uses are the same as my plan here:
```bash
ncatted -a coordinates,"temp",c,c,"nav_lon nav_lat" tmp_1.nc
cdo -s -L -sellonlatbox,100,200,-50,10 -selname,"temp" tmp_1.nc tmp_2.nc && mv tmp_{2,1}.nc
cdo -s -L remapbil,r1440x720 -selname,"temp" -setmisstonn tmp_1.nc tmp_2.nc && mv tmp_{2,1}.nc
cdo -s -L -sellonlatbox,110,190,-45,5 -selname,"temp" tmp_1.nc tmp_2.nc && mv tmp_{2,1}.nc
cdo -s -f nc4 -z zip copy tmp_1.nc latest_forecast_rg.nc
```
Update: Grant Smith (BoM) has clarified that -setmisstonn means "set missing values to nearest neighbour" extrapolation. We can follow this approach with xESMF.
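A minimal sketch of an xESMF equivalent of the cdo recipe above; the coordinate renames, the target-grid construction, and the use of extrap_method to approximate -setmisstonn are assumptions rather than the final implementation:

```python
import numpy as np
import xarray as xr
import xesmf as xe

ds = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/temp/mo_temp_*.nc',
                       parallel=True)
ds = ds.rename({'nav_lon': 'lon', 'nav_lat': 'lat'})  # xESMF expects lon/lat names

# 0.25-degree global target grid (the cdo r1440x720 grid)
target = xr.Dataset({
    'lat': (['lat'], np.arange(-89.875, 90, 0.25)),
    'lon': (['lon'], np.arange(0.125, 360, 0.25)),
})

regridder = xe.Regridder(
    ds, target, method='bilinear',
    extrap_method='nearest_s2d',  # nearest-neighbour fill, approximating -setmisstonn
    periodic=True,                # global source grid
)
temp_025 = regridder(ds['temp'])

# crop to the region used in the final cdo -sellonlatbox step
temp_aus = temp_025.sel(lon=slice(110, 190), lat=slice(-45, 5))
```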
There is some new bug with the xESMF installation?!? And I'm getting it: https://github.com/pangeo-data/xESMF/issues/269
Trying to work this through: https://github.com/pangeo-data/xESMF/issues/246
OK, have fixed the above problem and exported a very quick-and-dirty netcdf file for others to test:

```python
test = xr.open_mfdataset('/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/variables_native_grid/S2RA_update_2022_data_masked_crop_025grid.nc', parallel=True)
```
zarr collections for U, V, & T from ux62 at NCI.
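For reference, a minimal sketch of opening one of the written zarr collections for checking; the assumption is that consolidated metadata was written:

```python
import xarray as xr

ds = xr.open_zarr(
    '/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/accessS2.RA.ocean.nativeTgrid.zarr',
    consolidated=True,
)
print(ds)
```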