Open · sb4233 opened this issue 1 month ago
thanks @sb4233 - do you mind if we edit the issue title so that it's more focused and descriptive?
Also: can you provide details on the specific ACCESS-OM2 data variables you are trying to calculate against? Are they 2D or 3D? What frequency?
Thanks!
Btw, @sb4233 note that the cosima_cookbook Python package is deprecated so there won't be any method added to it.
I think the issue is that the data is chunked in time based on how the files are saved as netCDF files (e.g., every 3 months for 0.1 degree model output). So if one needs to do a time-series analysis at every point, you need to rechunk in time. I've bumped into this before and I didn't find a better solution, but perhaps I was just naïve!
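To be concrete, a minimal sketch of what I mean by rechunking in time (assuming the variable is already open lazily with xarray as a dask-backed DataArray with dims (time, lat, lon); the chunk sizes are only illustrative):

```python
# u: dask-backed DataArray with dims (time, lat, lon), currently chunked in time
# following the 3-monthly file layout. Make 'time' a single chunk (-1 = whole axis)
# and split space instead, so per-point time-series work touches only one chunk.
u_rechunked = u.chunk({"time": -1, "lat": 50, "lon": 50})
```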
Btw, you might wanna have a look at the xrft package? Sorry if I misunderstood and this is not something useful.
@sb4233 would you be able to add some code snippets so we can see what you're trying to do?
> do you mind if we edit the issue title? ... can you provide details on the specific ACCESS-OM2 data variables you are trying to calculate against?
Yeah sure, please go ahead and edit the title. As for the details of my use case: u, v and SST from ACCESS-OM2-01, which is at daily frequency with dimensions (time, lat, lon). u and v are at a particular level, and each array is (time: 18250, lat: 356, lon: 500).
> Btw, you might wanna have a look at the xrft package?
Thanks for the suggestion - it seems like xrft can be useful, as it utilizes the dask API.
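For example, a rough, untested sketch of a per-grid-point spectrum along time with xrft (assuming u is a dask-backed DataArray with dims (time, lat, lon) and a single chunk along time, and that the window/detrend options follow recent xrft versions):

```python
import xrft

# u: dask-backed DataArray (time, lat, lon), rechunked so 'time' is one chunk.
# Returns a lazy power spectrum along a new 'freq_time' dimension at every grid point.
spectrum = xrft.power_spectrum(u, dim="time", detrend="linear", window="hann")

# Reduce or subset before computing, e.g. a domain-mean spectrum:
mean_spectrum = spectrum.mean(["lat", "lon"]).compute()
```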
> would you be able to add some code snippets so we can see what you're trying to do?
Nothing special, essentially just using the function below (a wrapper around scipy.signal.coherence()) to calculate the magnitude-squared coherence between two data arrays at every grid point (i, j), and I use joblib.Parallel to loop over i, j in parallel -
from scipy.signal import coherence, get_window

def compute_coherence_slice(data_slice1, data_slice2, i, j, fs, nperseg, window, noverlap, nfft):
    # Magnitude-squared coherence between the two 1-D time series at grid point (i, j)
    f, Cxy = coherence(data_slice1, data_slice2, fs=fs, nperseg=nperseg, window=get_window(window, nperseg), noverlap=noverlap, nfft=nfft)
    return Cxy, f
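For comparison, the same per-point loop can be handed to dask instead of joblib via xarray.apply_ufunc. A rough, untested sketch, assuming u and sst are dask-backed DataArrays with dims (time, lat, lon), rechunked so that time is a single chunk, and with placeholder Welch parameters:

```python
import numpy as np
import xarray as xr
from scipy.signal import coherence, get_window

# Placeholder Welch parameters for daily data (fs in cycles per day).
fs, nperseg, noverlap, nfft, window = 1.0, 365, 182, 512, "hann"

def coherence_1d(x, y):
    # Magnitude-squared coherence for the two 1-D time series at one grid point.
    _, Cxy = coherence(x, y, fs=fs, nperseg=nperseg,
                       window=get_window(window, nperseg),
                       noverlap=noverlap, nfft=nfft)
    return Cxy

Cxy = xr.apply_ufunc(
    coherence_1d, u, sst,
    input_core_dims=[["time"], ["time"]],
    output_core_dims=[["freq"]],
    vectorize=True,                 # numpy loops over (lat, lon) inside each chunk
    dask="parallelized",            # dask runs one task per spatial chunk
    dask_gufunc_kwargs={"output_sizes": {"freq": nfft // 2 + 1}},
    output_dtypes=[np.float64],
)
result = Cxy.compute()
```

The vectorize=True path still loops in Python per grid point, but dask spreads the spatial chunks across workers, so no hand-rolled joblib loop is needed.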
Hey @sb4233, hopefully that new title is representative of your use case (one shared by others).
Next steps might be to access daily ACCESS-OM2-01 via the intake catalog, including helpful xarray kwargs, followed by writing temporary ARD Zarr collections for u, v, and SST to scratch/vn19? I'll have a go at this in my spare time tonight or tomorrow - but you or others might get there too.
Look forward to documenting better practice for these specific use cases with you and others.
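A rough sketch of that first (intake) step, assuming the ACCESS-NRI intake catalog is available as intake.cat.access_nri and a recent intake-esm; the experiment key and search terms below are placeholders:

```python
import intake

# Assumed: the ACCESS-NRI catalog in a Gadi/ARE analysis environment.
catalog = intake.cat.access_nri
experiment = catalog["01deg_jra55v140_iaf_cycle4"]   # hypothetical experiment key

# Lazily open daily u as a dask-backed dataset; xarray kwargs pass through to open_dataset.
ds = experiment.search(variable="u", frequency="1day").to_dask(
    xarray_open_kwargs={"chunks": {"time": "auto"}}
)
```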
@sb4233 - a very useful ref from @dougiesquire et al.
and for storage of any temporary intermediate ARD collections on vn19, let's please use: /scratch/vn19/ard/ACCESS-OM2-01
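Staging a temporary ARD copy there could then look something like this (a sketch only; the horizontal dimension names are MOM5 defaults and the store name is a placeholder, so adjust to the variable):

```python
import xarray as xr

# Rechunk contiguous-in-time with smaller horizontal chunks, then write an
# analysis-ready Zarr collection under the agreed scratch location.
u = ds["u"].chunk({"time": -1, "yu_ocean": 50, "xu_ocean": 50})
u.to_dataset(name="u").to_zarr("/scratch/vn19/ard/ACCESS-OM2-01/u_daily.zarr", mode="w")

# Downstream analysis then reads the time-contiguous copy directly:
u_ard = xr.open_zarr("/scratch/vn19/ard/ACCESS-OM2-01/u_daily.zarr")["u"]
```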
@sb4233 et al. - here's the kind of overall workflow I'm suggesting each of these specific heuristics could contribute to:
You can see and download our full poster from OMO2024 here: https://go.csiro.au/FwLink/climate_ARD
Hi, I have been trying to do some spectral analysis using variables from ACCESS-OM2 output. Because the data is large and chunked, any kind of analysis is very slow. For example, I am calculating the coherence between two variables (using scipy.signal.coherence) at every grid point of a specific domain (356x500). The actual calculation takes only about 3-4 minutes once the data is in memory (non-chunked), but on the chunked data it takes forever (as the data is being loaded into memory).
As a cheap alternative I found that saving the data as early as possible in my calculation works (for example, saving the data just after selecting the variable for the region of interest), i.e. reducing the number of operations I need to do while the data is in a chunked state. But even in that case it takes several hours per variable to save it to a netCDF file.
I wanted to know if there is a better way to effectively chunk large datasets so that processing time can be reduced as much as possible. Maybe adding a method to cosima_cookbook which can dynamically chunk large datasets based on the operation that is being performed on it? I am new to this kind of programming, so any help would be much appreciated :)
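The cheap-alternative step above is roughly this (file pattern, variable and dimension names are placeholders, just to show the shape of the workaround):

```python
import xarray as xr

# Lazily open the multi-file output (placeholder file pattern).
ds = xr.open_mfdataset("output*/ocean/ocean_daily.nc", combine="by_coords", parallel=True)

# Subset the region of interest as early as possible, then write that reduced
# variable out once so later analysis works from a small local file.
u_region = ds["u"].sel(lat=slice(-60, -30), lon=slice(-280, -230))
u_region.to_netcdf("u_region.nc")
```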