Thomas-Moore-Creative opened this issue 1 year ago
Have to move the workflow from OOD to ARE. A key difference is how to call up cluster resources:
```python
# OOD workflow: SLURM-backed cluster
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=2, processes=1, memory="23GB", walltime="02:00:00")
client = Client(cluster)
cluster.scale(cores=24)  # scale up to 24 cores worth of workers
```
```python
# ARE workflow: PBS-backed cluster on Gadi
import dask.config
from dask.distributed import Client, LocalCluster
from dask_jobqueue import PBSCluster

walltime = '01:00:00'
cores = 48
memory = '192GB'

cluster = PBSCluster(
    walltime=str(walltime),
    cores=cores,
    memory=str(memory),
    processes=cores,
    job_extra=['-q normal', '-P xv83',
               '-l ncpus=' + str(cores),
               '-l mem=' + str(memory),
               '-l storage=gdata/xv83+gdata/v14+gdata/ux62+scratch/xv83+gdata/rt52+gdata/ik11+gdata/cj50+gdata/jk72+gdata/hh5'],
    local_directory='$TMPDIR',
    header_skip=["select"],
)
cluster.scale(jobs=2)  # request two PBS jobs
```
```
-rw-rw----+ 1 gay548 ux62 948153768 Mar 5 14:23 /g/data/ux62/access-s2/reanalysis/ocean/temp/mo_temp_2023.nc
```

The `time_counter` coordinate in this 2023 file currently holds only two monthly timestamps:

```
array(['2023-01-01T12:00:00.000000000', '2023-02-01T12:00:00.000000000'],
      dtype='datetime64[ns]')
```
The metadata includes a warning: "This file will be updated monthly, so take care when using this dataset".
The corrected SSH variable is named inconsistently across the different year files, which requires a preprocess hack (see the usage sketch below):

```python
def fix_SSHname(ds):
    # rename the corrected SSH variable so all years share one name
    if 'ssh' in ds.data_vars:
        ds = ds.rename({'ssh': 'ssh_corrected'})
    return ds
```
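A minimal sketch of how the hack would be applied; the path pattern here is an assumption, not the actual file layout:

```python
import xarray as xr

# hypothetical path pattern for the corrected-SSH files
ds_ssh = xr.open_mfdataset(
    '/g/data/ux62/access-s2/reanalysis/ocean/ssh_corrected/mo_ssh_corrected_*.nc',
    preprocess=fix_SSHname,  # applied to each file before concatenation
    parallel=True,
)
```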
Started running the workflow on the updated data > https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/commit/e92f84facab06399a101d6246886f052124d2943
Trying to progress, but a single variable that used to open in less than 20 seconds is now taking a huge amount of time?

```python
ds_SST = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/sst/mo_sst_*.nc', parallel=True)
```
Noting the NCI advisories on "Network Connectivity Issues" that aren't due to be fixed until May 2nd?!?
Hand-checked each variable and it turns out that SSH has duplicate & missing timestamps.
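A minimal sketch of the kind of check involved; the SSH path pattern and coordinate handling are assumptions:

```python
import pandas as pd
import xarray as xr

# hypothetical path pattern for the SSH files
ds_ssh = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/ssh/mo_ssh_*.nc',
                           parallel=True)
times = ds_ssh.indexes['time_counter']

# duplicated monthly timestamps
print(times[times.duplicated()])

# months missing from an otherwise complete monthly sequence
full = pd.period_range(times.min(), times.max(), freq='M')
print(full.difference(times.to_period('M')))
```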
Dask server error (NB: folks are reporting lots of ARE issues since the changes to nodes):

```
======================================================================================
Resource Usage on 2023-04-27 12:07:54:
Job Id: 81812983.gadi-pbs
Project: xv83
Exit Status: 271 (Linux Signal 15 SIGTERM Termination)
Service Units: 0.85
NCPUs Requested: 14                     NCPUs Used: 14
CPU Time Used: 00:00:45
Memory Requested: 63.0GB                Memory Used: 1.91GB
Walltime requested: 05:00:00            Walltime Used: 00:02:54
JobFS requested: 100.0MB                JobFS used: 174.33MB
======================================================================================
```
Managed to wait for a 2-node cluster, but still got a failure that looked memory-related:

```
KilledWorker: ("('open_dataset-concatenate-concatenate-655507e155bb31ca72f94c49a16638a9', 25, 0, 0, 0)", <WorkerState 'tcp://10.6.77.71:34619', name: PBSCluster-1-15, status: closed, memory: 0, processing: 1>)
```
`/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/accessS2.RA.ocean.nativeTgrid.zarr` has been written.
Using a HugeMem LocalCluster starts well but just hangs mid-write? NOTE: recent problems and changes at NCI have caused similar issues for others.
The Zarr write was successful because of the jobfs=400GB setting. And working on a Saturday meant I could get an X-Large HugeMem node with 700+GB RAM. Used the rechunker approach with a LocalCluster.
The zarr write can also run without rechunker given 700+GB of memory (see the sketch below).
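A minimal sketch of the "no rechunker" path on a single HugeMem LocalCluster; worker counts, chunk sizes and the variable set are illustrative assumptions, not the exact notebook settings:

```python
import xarray as xr
from dask.distributed import Client, LocalCluster

# LocalCluster on one X-Large HugeMem node (700+ GB RAM)
cluster = LocalCluster(n_workers=7, threads_per_worker=1, memory_limit='100GB')
client = Client(cluster)

ds = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/temp/mo_temp_*.nc',
                       parallel=True)

# enforce uniform chunks before writing, then write the collection in one pass
ds = ds.chunk({'time_counter': 12})
ds.to_zarr(
    '/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/accessS2.RA.ocean.nativeTgrid.zarr',
    mode='w',
    consolidated=True,
)
```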
https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/commit/38d541724fd0112552d4890e661bcc05d7602494
Requirements: January - December 2022 files only.
Update after talking with Laura (21 June): started off a limited code path to only run an incremental update of the updated zarr files (see the sketch after the commit link below).
https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/commit/0e78d48a790a8fdc82bfea176b7e76077f36cf02
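A minimal sketch of what an incremental zarr update could look like, assuming the store uses `time_counter` as the record dimension; the 2022 file pattern and the single-variable focus are assumptions:

```python
import xarray as xr

zarr_store = '/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/accessS2.RA.ocean.nativeTgrid.zarr'

existing = xr.open_zarr(zarr_store)

# hypothetical pattern for the newly updated 2022 source file
new = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/temp/mo_temp_2022.nc')

# keep only timestamps not already in the store, then append along time
new = new.sel(time_counter=~new.time_counter.isin(existing.time_counter.values))
new.to_zarr(zarr_store, append_dim='time_counter')
```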
Reminding myself of the approach that the BoM uses for regridding: https://github.com/Thomas-Moore-Creative/NCI-ACCESS-S2-ARD/issues/4#issuecomment-1048381442
The steps the BoM uses are the same as my plan here:
```bash
ncatted -a coordinates,"temp",c,c,"nav_lon nav_lat" tmp_1.nc
cdo -s -L -sellonlatbox,100,200,-50,10 -selname,"temp" tmp_1.nc tmp_2.nc && mv tmp_{2,1}.nc
cdo -s -L remapbil,r1440x720 -selname,"temp" -setmisstonn tmp_1.nc tmp_2.nc && mv tmp_{2,1}.nc
cdo -s -L -sellonlatbox,110,190,-45,5 -selname,"temp" tmp_1.nc tmp_2.nc && mv tmp_{2,1}.nc
cdo -s -f nc4 -z zip copy tmp_1.nc latest_forecast_rg.nc
```
Update: Grant Smith (BoM) has clarified that -setmisstonn means "set missing values to nearest neighbour" extrapolation. We can follow this approach with xESMF.
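A minimal sketch of an xESMF equivalent of the cdo recipe above; the coordinate renames, the target-grid construction, and the use of extrap_method to approximate -setmisstonn are assumptions rather than the final implementation:

```python
import numpy as np
import xarray as xr
import xesmf as xe

ds = xr.open_mfdataset('/g/data/ux62/access-s2/reanalysis/ocean/temp/mo_temp_*.nc',
                       parallel=True)
ds = ds.rename({'nav_lon': 'lon', 'nav_lat': 'lat'})  # xESMF expects lon/lat names

# 0.25-degree global target grid (the cdo r1440x720 grid)
target = xr.Dataset({
    'lat': (['lat'], np.arange(-89.875, 90, 0.25)),
    'lon': (['lon'], np.arange(0.125, 360, 0.25)),
})

regridder = xe.Regridder(
    ds, target, method='bilinear',
    extrap_method='nearest_s2d',  # nearest-neighbour fill, approximating -setmisstonn
    periodic=True,                # global source grid
)
temp_025 = regridder(ds['temp'])

# crop to the region used in the final cdo -sellonlatbox step
temp_aus = temp_025.sel(lon=slice(110, 190), lat=slice(-45, 5))
```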
There is some new bug with the xESMF installation?!? And I'm getting it: https://github.com/pangeo-data/xESMF/issues/269
Trying to work this through: https://github.com/pangeo-data/xESMF/issues/246
OK, have fixed the above problem and exported a very quick-and-dirty netcdf file for others to test:

```python
test = xr.open_mfdataset('/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/variables_native_grid/S2RA_update_2022_data_masked_crop_025grid.nc', parallel=True)
```
zarr collections for U, V, & T from ux62 at NCI.
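For reference, a minimal sketch of opening one of the written zarr collections for checking; the assumption is that consolidated metadata was written:

```python
import xarray as xr

ds = xr.open_zarr(
    '/g/data/xv83/users/tm4888/data/ACCESS-S2/2023_accessS2_update/accessS2.RA.ocean.nativeTgrid.zarr',
    consolidated=True,
)
print(ds)
```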