OceanParcels / Parcels

Main code for Parcels (Probably A Really Computationally Efficient Lagrangian Simulator)
https://www.oceanparcels.org
MIT License

Multiple errors when writing .zarr files #1322

Closed · DEHewitt closed this issue 1 year ago

DEHewitt commented 1 year ago

Hi,

cc: @sandra-neubert

I am running some simulations on an HPC (Katana at UNSW) and getting multiple, but I think related, errors when Parcels tries to write the output as a .zarr file. I have contacted IT support at our university, but there was a recent upgrade to the HPC, so they are a little slow replying to tickets at the moment.

I haven't run any simulations since the latest release, in which .zarr output was implemented. I have copied the errors and script below, but will provide a brief summary of my approach first: we are aiming to release 1 particle per 1x1 degree square in a near-global model (OFES). This covers a period of ~70 years, so to keep computation times low, the simulations are run as an array of jobs in which we set the runtime differently for each job:

First, we take all possible combinations of the two vectors:

```python
years = np.repeat(years, len(months))
months = np.repeat(months, 70)
```

and then define the `runtime` based on the `start_time` and `end_time`:
```python
start_time = datetime(years[index], months[index], 15) #year, month, day,
end_time = datetime(years[index], months[index], 18) # 3 days so all particles can 'die'

runtime = end_time-start_time + delta(days = 5) 

```

Full script:

```python
import os
import zarr # version 2.14.1
import numpy as np
import xarray as xr
#import geopandas as gp
import pandas as pd

from parcels import FieldSet, Field, ParticleSet, Variable, JITParticle # version 2.4.0
from parcels import AdvectionRK4, ErrorCode

import math
from datetime import timedelta as delta
from datetime import datetime 
from operator import attrgetter

# import the index variable from the .pbs script
index = int(os.environ['PBS_ARRAY_INDEX']) # a value between 0-827 (69 years * 12 months = 828 jobs)

# set up the array of years and months
years = [year for year in range(1950, 2020)]
months = [month for month in range(1, 13)]

# all possible combinations of the two vectors
years = np.repeat(years, len(months))
months = np.repeat(months, 70)

#dataPath = "C:\\Users\\sandr\\Documents\\Github\ThesisSandra\\Analysis\\Movement\\TracerDataAndOutput\\OFES\\"
dataPath = "/srv/scratch/z9902002/OFES/" # Jase's scratch
#dataPath = "C:/Users/Dan/Downloads/"
#ufiles = dataPath + "OfESncep01globalmmeanu20152019MS.nc"
ufiles = dataPath + str(years[index]) + "_uvel.nc"
#vfiles = dataPath + "OfESncep01globalmmeanv20152019MS.nc"
vfiles = dataPath + str(years[index]) + "_vvel.nc"

filenames = {'U': ufiles,
             'V': vfiles}

variables = {'U': 'uvel', # had to change 'u' here to 'uvel' which is the name of the variable in the netcdf
             'V': 'vvel'}

dimensions = {'lat': 'latitude',
              'lon': 'longitude',
              'time': 'time'}

#StartLocations = pd.read_csv('C:\\Users\\sandr\\Documents\\Github\\ThesisSandra\\Analysis\\Movement\\Data\\dfOFESStartLocationsGlobal2.csv')
StartLocations = pd.read_csv("/srv/scratch/z5278054/particle-tracking-sandra/Data/dfOFESStartLocationsGlobal2.csv")
#StartLocations = pd.read_csv("C:/Users/Dan/OneDrive - UNSW/Documents/PhD/Dispersal/github/ThesisSandra/Analysis/Movement/Data/dfOFESStartLocationsGlobal2.csv")
StartLocations = StartLocations[['lon','lat']]

fieldset = FieldSet.from_netcdf(filenames, variables, dimensions, deferred_load = False) # deferred load = False for on the fly transformation of OFES
fieldset.add_constant('maxage', 3.*86400) #get rid of particles after 3 days
fieldset.add_periodic_halo(zonal=True) #to not get artifacts around prime meridian (linked to kernel further down)

fieldset.U.data = fieldset.U.data/100
fieldset.V.data = fieldset.V.data/100

#fieldset.add_constant('maxage', 2.*86400) #get rid of particles after 3 days

lon_array = StartLocations.lon
lat_array = StartLocations.lat

npart = 1 #how many particles are released at each location (every time)
lon = np.repeat(lon_array, npart)
lat = np.repeat(lat_array, npart)

# How often to release the particles; 
#Problem: if I release particles over a long period of time, setting the repeatdt to 30 days leads to particles being released on a different day each month, and it gets worse with time
#if I set repeatdt to 30.4375, release dates stay around the same
#repeatdt = delta(days = 30.4375) # release from the same set of locations every month

start_time = datetime(years[index], months[index], 15) #year, month, day,
end_time = datetime(years[index], months[index], 18) # 3 days so all particles can 'die'

runtime = end_time-start_time + delta(days = 5) #add some days at the end to make sure tracking can be done for 5 days from the last start location onwards if release date is not exactly on 15th

time = 0 #np.arange(0, npart) * delta(days = 30.4375).total_seconds() 

class SampleParticle(JITParticle):         # Define a new particle class
        sampled = Variable('sampled', dtype = np.float32, initial = 0, to_write=False)
        age = Variable('age', dtype=np.float32, initial=0.) # initialise age
        distance = Variable('distance', initial=0., dtype=np.float32)  # the distance travelled
        prev_lon = Variable('prev_lon', dtype=np.float32, to_write=False,
                            initial=0)  # the previous longitude
        prev_lat = Variable('prev_lat', dtype=np.float32, to_write=False,
                            initial=0)  # the previous latitude
        #beached = Variable('beached', dtype = np.float32, initial = 0)

def DeleteParticle(particle, fieldset, time): #needed to avoid error message of particle out of bounds
    particle.delete()

# Define all the sampling kernels
def SampleDistance(particle, fieldset, time):
    # Calculate the distance in latitudinal direction (using 1.11e2 kilometer per degree latitude)
    lat_dist = (particle.lat - particle.prev_lat) * 1.11e2
    # Calculate the distance in longitudinal direction, using cosine(latitude) - spherical earth
    lon_dist = (particle.lon - particle.prev_lon) * 1.11e2 * math.cos(particle.lat * math.pi / 180)
    # Calculate the total Euclidean distance travelled by the particle
    particle.distance += math.sqrt(math.pow(lon_dist, 2) + math.pow(lat_dist, 2))
    particle.prev_lon = particle.lon  # Set the stored values for next iteration.
    particle.prev_lat = particle.lat

def SampleAge(particle, fieldset, time):
    particle.age = particle.age + math.fabs(particle.dt)
    if particle.age >= fieldset.maxage: #if not >= : get one more particle tracking point after maxage
           particle.delete()

def periodicBC(particle, fieldset, time):
    if particle.lon < 0:
        particle.lon += 360 - 0
    elif particle.lon > 359.9:
        particle.lon -= 360 - 0

# def Unbeaching(particle, fieldset, time):
# #     if particle.age == 0 and particle.u_vel == 0 and particle.v_vel == 0: # velocity = 0 means particle is on land so nudge it eastward
# #         particle.lon += random.uniform(0.5, 1) #dont need this because I know my particles dont start on land?
#     if particle.u_vel == 0 and particle.v_vel == 0: # if a particle is advected on to land so mark it as beached (=1)
#         particle.beached = 1

def SampleInitial(particle, fieldset, time): # do we have to add particle.age and particle.ageRise
        if particle.sampled == 0:
            particle.distance = particle.distance
            particle.prev_lon = particle.lon
            particle.prev_lat = particle.lat
            #particle.beached = particle.beached
            particle.sampled = 1

pset = ParticleSet.from_list(fieldset, 
                             pclass=SampleParticle, 
                             time=time, # should this be start_time?
                             lon=lon, 
                             lat=lat)#,
                            # repeatdt=repeatdt)

kernels = SampleInitial + pset.Kernel(AdvectionRK4) + periodicBC + SampleAge + SampleDistance

# where to save the data on the HPC
localPath = "/srv/scratch/z5278054/particle-tracking-sandra/Output/"

output_nc_dist = localPath + str(years[index]) + '-' + str(months[index]) + 'NearGlobalParticleTrackingOFES.zarr'

try:
    os.remove(output_nc_dist)
except OSError:
    pass

file_dist = pset.ParticleFile(name=output_nc_dist, 
                                outputdt=delta(hours=6)) #save location every 6 hours

pset.execute(kernels,  
             runtime=runtime,
             dt=delta(minutes=10), #to reduce computational load
             output_file=file_dist,
             recovery={ErrorCode.ErrorOutOfBounds: DeleteParticle})

parcels_dist = xr.open_dataset(output_nc_dist)

dfParcels = parcels_dist.to_dataframe()
dfParcels.to_csv(localPath + str(years[index]) + '-' + str(months[index]) +  '-dfParcelsGlobal.csv') #local path for HPC 

#for i in range(1, 13):
 #   monthlyData = parcels_dist.where(
  #  parcels_dist['time.month'] == i, drop=True)
   # monthlyData.to_netcdf(localPath + 'ParcelsGlobalYear' + str(YEAR) + 'Month' + str(i) + '.nc') #ADD year variable
    #print(i)

parcels_dist.to_netcdf(localPath + str(years[index]) + '-' + str(months[index]) + '-ParcelsOutput.nc')
```

Error 1

When I navigate to `/srv/scratch/z5278054/particle-tracking-sandra/Output/1950-1NearGlobalParticleTrackingOFES.zarr/time/`, I can see that the `.zarray` file does exist.

```
WARNING: The zonal halo is located at the east and west of current grid, with a dx = lon[1]-lon[0] between the last nodes of the original grid and the first ones of the halo. In your grid, lon[1]-lon[0] != lon[-1]-lon[-2]. Is the halo computed as you expect?
INFO: Compiled ArraySampleParticleSampleInitialAdvectionRK4periodicBCSampleAgeSampleDistance ==> /scratch/pbs.4103443[6].kman.restech.unsw.edu.au/parcels-15278054/libce4463ed67805910bab367c4eddbc342_0.so
INFO: Output files are stored in /srv/scratch/z5278054/particle-tracking-sandra/Output/1950-1NearGlobalParticleTrackingOFES.zarr.

  0%|          | 0/691200.0 [00:00<?, ?it/s]
  6%|▋         | 43200.0/691200.0 [01:26<21:40, 498.16it/s]
  6%|▋         | 43200.0/691200.0 [01:39<21:40, 498.16it/s]
  9%|▉         | 64800.0/691200.0 [02:56<30:20, 344.16it/s]
  9%|▉         | 64800.0/691200.0 [03:09<30:20, 344.16it/s]Traceback (most recent call last):
  File "/home/z5278054/HPCParticleTrackingScript.py", line 158, in <module>
    pset.execute(kernels,  
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/particleset/baseparticleset.py", line 493, in execute
    output_file.write(self, time)
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/particlefile/baseparticlefile.py", line 284, in write
    Z[varout].vindex[ids, obs] = pset.collection.getvardata(var, indices_to_write)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/hierarchy.py", line 438, in __getitem__
    return Array(self._store, read_only=self._read_only, path=path,
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 217, in __init__
    self._load_metadata()
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 234, in _load_metadata
    self._load_metadata_nosync()
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 243, in _load_metadata_nosync
    meta_bytes = self._store[mkey]
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/storage.py", line 1085, in __getitem__
    return self._fromfile(filepath)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/storage.py", line 1059, in _fromfile
    with open(fn, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/srv/scratch/z5278054/particle-tracking-sandra/Output/1950-1NearGlobalParticleTrackingOFES.zarr/time/.zarray'

  9%|▉         | 64800.0/691200.0 [04:22<42:15, 247.06it/s]
```

Error 2

```
WARNING: The zonal halo is located at the east and west of current grid, with a dx = lon[1]-lon[0] between the last nodes of the original grid and the first ones of the halo. In your grid, lon[1]-lon[0] != lon[-1]-lon[-2]. Is the halo computed as you expect?
INFO: Compiled ArraySampleParticleSampleInitialAdvectionRK4periodicBCSampleAgeSampleDistance ==> /scratch/pbs.4103443[11].kman.restech.unsw.edu.au/parcels-15278054/libac209b106ba70324e8b248d28ae66843_0.so
INFO: Output files are stored in /srv/scratch/z5278054/particle-tracking-sandra/Output/1950-1NearGlobalParticleTrackingOFES.zarr.

  0%|          | 0/691200.0 [00:00<?, ?it/s]
  6%|▋         | 43200.0/691200.0 [01:07<16:47, 643.15it/s]
  6%|▋         | 43200.0/691200.0 [01:19<16:47, 643.15it/s]
  9%|▉         | 64800.0/691200.0 [02:15<23:11, 450.21it/s]
  9%|▉         | 64800.0/691200.0 [02:29<23:11, 450.21it/s]Traceback (most recent call last):
  File "/home/z5278054/HPCParticleTrackingScript.py", line 158, in <module>
    pset.execute(kernels,  
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/particleset/baseparticleset.py", line 450, in execute
    self.kernel.execute(self, endtime=next_time, dt=dt, recovery=recovery, output_file=output_file,
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/kernel/kernelsoa.py", line 235, in execute
    self.remove_deleted(pset, output_file=output_file, endtime=endtime)   # Generalizable version!
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/kernel/kernelsoa.py", line 179, in remove_deleted
    output_file.write(pset, endtime, deleted_only=bool_indices)
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/particlefile/baseparticlefile.py", line 276, in write
    if self.maxids > Z[varout].shape[0]:
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/hierarchy.py", line 458, in __getitem__
    raise KeyError(item)
KeyError: 'distance'

  9%|▉         | 64800.0/691200.0 [03:19<32:05, 325.39it/s]
```

Error 3

```
WARNING: The zonal halo is located at the east and west of current grid, with a dx = lon[1]-lon[0] between the last nodes of the original grid and the first ones of the halo. In your grid, lon[1]-lon[0] != lon[-1]-lon[-2]. Is the halo computed as you expect?
INFO: Compiled ArraySampleParticleSampleInitialAdvectionRK4periodicBCSampleAgeSampleDistance ==> /scratch/pbs.4103443[76].kman.restech.unsw.edu.au/parcels-15278054/lib55e9f4223600a4830414fcc9902cdcc0_0.so
INFO: Output files are stored in /srv/scratch/z5278054/particle-tracking-sandra/Output/1956-2NearGlobalParticleTrackingOFES.zarr.

  0%|          | 0/691200.0 [00:00<?, ?it/s]
  6%|▋         | 43200.0/691200.0 [01:27<21:45, 496.51it/s]
  6%|▋         | 43200.0/691200.0 [01:39<21:45, 496.51it/s]Traceback (most recent call last):
  File "/home/z5278054/HPCParticleTrackingScript.py", line 158, in <module>
    pset.execute(kernels,  
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/particleset/baseparticleset.py", line 450, in execute
    self.kernel.execute(self, endtime=next_time, dt=dt, recovery=recovery, output_file=output_file,
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/kernel/kernelsoa.py", line 235, in execute
    self.remove_deleted(pset, output_file=output_file, endtime=endtime)   # Generalizable version!
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/kernel/kernelsoa.py", line 179, in remove_deleted
    output_file.write(pset, endtime, deleted_only=bool_indices)
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/particlefile/baseparticlefile.py", line 284, in write
    Z[varout].vindex[ids, obs] = pset.collection.getvardata(var, indices_to_write)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/indexing.py", line 837, in __setitem__
    self.array.set_coordinate_selection(selection, value, fields=fields)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 1658, in set_coordinate_selection
    self._set_selection(indexer, value, fields=fields)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 1842, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 2137, in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 2148, in _chunk_setitem_nosync
    self.chunk_store[ckey] = self._encode_chunk(cdata)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/storage.py", line 1123, in __setitem__
    retry_call(os.replace, (temp_path, file_path), exceptions=(PermissionError,))
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/util.py", line 685, in retry_call
    return callabl(*args, **kwargs)
FileExistsError: [Errno 17] File exists: '/srv/scratch/z5278054/particle-tracking-sandra/Output/1956-2NearGlobalParticleTrackingOFES.zarr/lat/0.3.81c2154ddc264cbc9822d6a22974f5d4.partial' -> '/srv/scratch/z5278054/particle-tracking-sandra/Output/1956-2NearGlobalParticleTrackingOFES.zarr/lat/0.3'

  6%|▋         | 43200.0/691200.0 [02:50<42:44, 252.69it/s]
```

Error 4

```
WARNING: The zonal halo is located at the east and west of current grid, with a dx = lon[1]-lon[0] between the last nodes of the original grid and the first ones of the halo. In your grid, lon[1]-lon[0] != lon[-1]-lon[-2]. Is the halo computed as you expect?
INFO: Compiled ArraySampleParticleSampleInitialAdvectionRK4periodicBCSampleAgeSampleDistance ==> /scratch/pbs.4103443[55].kman.restech.unsw.edu.au/parcels-15278054/libb9fe3e8202518d1f140ada2cb138c1c2_0.so
INFO: Output files are stored in /srv/scratch/z5278054/particle-tracking-sandra/Output/1954-1NearGlobalParticleTrackingOFES.zarr.

  0%|          | 0/691200.0 [00:00<?, ?it/s]
  6%|▋         | 43200.0/691200.0 [01:07<16:48, 642.45it/s]
  6%|▋         | 43200.0/691200.0 [01:19<16:48, 642.45it/s]
  9%|▉         | 64800.0/691200.0 [02:15<23:13, 449.56it/s]
  9%|▉         | 64800.0/691200.0 [02:29<23:13, 449.56it/s]
 12%|█▎        | 86400.0/691200.0 [03:23<25:55, 388.69it/s]
 12%|█▎        | 86400.0/691200.0 [03:40<25:55, 388.69it/s]Traceback (most recent call last):
  File "/home/z5278054/HPCParticleTrackingScript.py", line 158, in <module>
    pset.execute(kernels,  
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/particleset/baseparticleset.py", line 450, in execute
    self.kernel.execute(self, endtime=next_time, dt=dt, recovery=recovery, output_file=output_file,
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/kernel/kernelsoa.py", line 235, in execute
    self.remove_deleted(pset, output_file=output_file, endtime=endtime)   # Generalizable version!
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/kernel/kernelsoa.py", line 179, in remove_deleted
    output_file.write(pset, endtime, deleted_only=bool_indices)
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/particlefile/baseparticlefile.py", line 284, in write
    Z[varout].vindex[ids, obs] = pset.collection.getvardata(var, indices_to_write)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/indexing.py", line 837, in __setitem__
    self.array.set_coordinate_selection(selection, value, fields=fields)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 1658, in set_coordinate_selection
    self._set_selection(indexer, value, fields=fields)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 1842, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 2137, in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 2142, in _chunk_setitem_nosync
    cdata = self._process_for_setitem(ckey, chunk_selection, value, fields=fields)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/core.py", line 2176, in _process_for_setitem
    cdata = self.chunk_store[ckey]
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/storage.py", line 1085, in __getitem__
    return self._fromfile(filepath)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/storage.py", line 1060, in _fromfile
    return f.read()
OSError: [Errno 116] Stale file handle

 12%|█▎        | 86400.0/691200.0 [04:27<31:14, 322.63it/s]
```

Error 5

```
WARNING: The zonal halo is located at the east and west of current grid, with a dx = lon[1]-lon[0] between the last nodes of the original grid and the first ones of the halo. In your grid, lon[1]-lon[0] != lon[-1]-lon[-2]. Is the halo computed as you expect?
INFO: Compiled ArraySampleParticleSampleInitialAdvectionRK4periodicBCSampleAgeSampleDistance ==> /scratch/pbs.4103443[77].kman.restech.unsw.edu.au/parcels-15278054/lib45d7423247a2f18964bac9c3a5cf9173_0.so
Traceback (most recent call last):
  File "/home/z5278054/HPCParticleTrackingScript.py", line 158, in <module>
    pset.execute(kernels,  
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/particleset/baseparticleset.py", line 408, in execute
    output_file.write(self, _starttime)
  File "/home/z5278054/.local/lib/python3.10/site-packages/parcels/particlefile/baseparticlefile.py", line 268, in write
    ds.to_zarr(self.fname, mode='w')
  File "/home/z5278054/.local/lib/python3.10/site-packages/xarray/core/dataset.py", line 2098, in to_zarr
    return to_zarr(  # type: ignore
  File "/home/z5278054/.local/lib/python3.10/site-packages/xarray/backends/api.py", line 1614, in to_zarr
    zstore = backends.ZarrStore.open_group(
  File "/home/z5278054/.local/lib/python3.10/site-packages/xarray/backends/zarr.py", line 430, in open_group
    zarr_group = zarr.open_group(store, **open_kwargs)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/hierarchy.py", line 1446, in open_group
    init_group(store, overwrite=True, path=path, chunk_store=chunk_store)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/storage.py", line 648, in init_group
    _init_group_metadata(store=store, overwrite=overwrite, path=path,
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/storage.py", line 672, in _init_group_metadata
    rmdir(store, path)
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/storage.py", line 192, in rmdir
    store.rmdir(path)  # type: ignore
  File "/home/z5278054/.local/lib/python3.10/site-packages/zarr/storage.py", line 1233, in rmdir
    shutil.rmtree(dir_path)
  File "/apps/z_install_tree/linux-rocky8-ivybridge/gcc-12.2.0/python-3.10.8-pmtwsrrmcmrs6olvgx5xhepgh7gl5vro/lib/python3.10/shutil.py", line 724, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/apps/z_install_tree/linux-rocky8-ivybridge/gcc-12.2.0/python-3.10.8-pmtwsrrmcmrs6olvgx5xhepgh7gl5vro/lib/python3.10/shutil.py", line 657, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/apps/z_install_tree/linux-rocky8-ivybridge/gcc-12.2.0/python-3.10.8-pmtwsrrmcmrs6olvgx5xhepgh7gl5vro/lib/python3.10/shutil.py", line 680, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/apps/z_install_tree/linux-rocky8-ivybridge/gcc-12.2.0/python-3.10.8-pmtwsrrmcmrs6olvgx5xhepgh7gl5vro/lib/python3.10/shutil.py", line 678, in _rmtree_safe_fd
    os.unlink(entry.name, dir_fd=topfd)
FileNotFoundError: [Errno 2] No such file or directory: '.zattrs'
```

I am a little confused that with some of these errors the progress bar is printed after the error message too. Does this imply that the job continued running?

Any help you can offer would be greatly appreciated!

Kind regards,

erikvansebille commented 1 year ago

Hi @DEHewitt, thanks for reporting this. This looks like an issue with the filesystem on your Katana HPC. Not something we can fix on our end.

One workaround might be to use the feature that @willirath implemented in #1303 (now part of v2.4.1): instead of writing directly to a file, you can write to a `zarr.storage.Store` object, which you can then write out to a file after the execute has finished.
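
For reference, a minimal sketch of that workaround, assuming `ParticleFile` accepts a zarr store for its `name` argument as introduced in #1303 (`pset`, `kernels`, `runtime`, and `output_nc_dist` are the names from the script above):

```python
import xarray as xr
from zarr.storage import MemoryStore

# Accumulate all output in memory while the simulation runs,
# so the cluster filesystem is never touched during execute().
memory_store = MemoryStore()
file_dist = pset.ParticleFile(name=memory_store, outputdt=delta(hours=6))

pset.execute(kernels,
             runtime=runtime,
             dt=delta(minutes=10),
             output_file=file_dist,
             recovery={ErrorCode.ErrorOutOfBounds: DeleteParticle})

# Write the in-memory store to disk in one go, after execute() has finished.
ds = xr.open_zarr(memory_store)
ds.to_zarr(output_nc_dist)
ds.close()
```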

Let us know if this works!

willirath commented 1 year ago

Hi @DEHewitt, I noticed that in

```python
try:
    os.remove(output_nc_dist)
except OSError:
    pass
```

you try to remove the Zarr store. But unlike netCDF files, a Zarr store is a whole directory, so `os.remove(<store>)` won't work and the store won't be removed. This means that if a previous unsuccessful experiment wrote to the same store, there may already be inconsistent data in the store.
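
A minimal sketch of removing the whole store instead, using `shutil.rmtree` (reusing `output_nc_dist` from the script above):

```python
import os
import shutil

# A Zarr store is a directory tree, so remove it recursively;
# os.remove() only works on single files (e.g. an old netCDF output).
if os.path.isdir(output_nc_dist):
    shutil.rmtree(output_nc_dist)
elif os.path.exists(output_nc_dist):
    os.remove(output_nc_dist)
```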

I'm not sure this is the root of the problem you see, but it might be worth checking.

willirath commented 1 year ago

Another test that might help pin down the problem, without the Parcels framework on top of it, would be to just write a Zarr store from a plain Python process, along the lines of (untested code):

```python
import zarr
from pathlib import Path

localPath = Path("/srv/scratch/z5278054/particle-tracking-sandra/Output/")

test_store = localPath / "test_001.zarr"

z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
zarr.save(test_store, z)
```

This will create a Zarr store that is a little simpler than those created by Parcels, so let's also try a more complicated multi-variable structure:

```python
import xarray as xr
from pathlib import Path
from dask import array as darr

dataset = xr.Dataset(
    {
        "x": xr.DataArray(darr.random.uniform(size=(1_000, 1_000), chunks=(100, 100)), dims=("i", "j")),
        "y": xr.DataArray(darr.random.uniform(size=(1_000, 1_000), chunks=(100, 100)), dims=("i", "j")),
    },
)

localPath = Path("/srv/scratch/z5278054/particle-tracking-sandra/Output/")

test_store = localPath / "test_002.zarr"

dataset.to_zarr(test_store)
```

DEHewitt commented 1 year ago

Thanks so much for your help @erikvansebille and @willirath! The solution in #1303 seems to have done the trick. Also, both tests posted by @willirath worked. Thanks again :)