YAXArrays seems to download too much data

JuliaDataCubes / YAXArrays.jl

Yet Another XArray-like Julia package

https://juliadatacubes.github.io/YAXArrays.jl/

YAXArrays seems to download too much data #358

Open SimonDanisch opened 6 months ago

SimonDanisch commented 6 months ago

I'm trying the example from the docs:

using Zarr, YAXArrays, Dates, DimensionalData

store = "gs://cmip6/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp585/r1i1p1f1/3hr/tas/gn/v20190710/"
g = open_dataset(zopen(store, consolidated=true))
c = g["tas"]
ct = c[Ti=At(Date("2018-08-01"):Day(10):Date("2050-08-01"))]

in_memory = ct.data[:, :, :]

This takes really long and fills up all my RAM (32 GB). Some details:

[Screenshot: the selected slice]

[Screenshot: download speed of the Julia process]

I was expecting it to download only the 328 MB, but judging from the download speed and RAM usage, I suspect it is downloading much more data, which makes it almost impossible to download this part of the dataset. Am I missing something, or is this a bug or just a limitation of the package?

Balinus commented 6 months ago

One thought after reading the example, though I might be wrong.

Depending on how the Zarr store on Google Cloud is chunked, requesting this specific slice may still require downloading the whole dataset between 2018 and 2050, and probably a little more at the 2018 and 2050 edges, because Zarr reads whole chunks rather than individual time steps. The whole dataset between 2018 and 2050 is 3.21 GB. Is that closer to what you measured?

c = g["tas"]
ct = c[Ti=At(Date("2018-08-01"):Date("2050-08-01"))]
384×192×11689 YAXArray{Float32,3} with dimensions: 
  Dim{:lon} Sampled{Float64} 0.0:0.9375:359.0625 ForwardOrdered Regular Points,
  Dim{:lat} Sampled{Float64} Float64[-89.28422753251364, -88.35700351866494, …, 88.35700351866494, 89.28422753251364] ForwardOrdered Irregular Points,
  Ti Sampled{DateTime} DateTime[2018-08-01T00:00:00, …, 2050-08-01T00:00:00] ForwardOrdered Irregular Points
units: K
name: tas
Total size: 3.21 GB
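
The chunking argument above can be put into rough numbers. The following is a minimal sketch, not a measurement of the actual store: it assumes each chunk spans the full lon/lat grid, that the "3hr" in the store path means eight time steps per day, and it uses hypothetical chunk lengths along time (`chunk_len`), since the real chunk layout is not shown in this thread. Sizes are uncompressed, so the bytes actually transferred would differ.

```python
from datetime import date

# Grid shape and element size, taken from the printed 384×192 Float32 array.
NLON, NLAT, ITEMSIZE = 384, 192, 4
# Assumption: "3hr" in the store path means 8 time steps per day.
STEPS_PER_DAY = 8

def selected_bytes(n_steps):
    """Uncompressed size of n_steps full lon/lat fields."""
    return n_steps * NLON * NLAT * ITEMSIZE

def chunks_touched(time_indices, chunk_len):
    """Number of distinct time chunks that contain at least one requested index."""
    return len({i // chunk_len for i in time_indices})

def fetched_bytes(time_indices, chunk_len):
    """Upper bound on uncompressed bytes read when whole chunks are fetched."""
    return chunks_touched(time_indices, chunk_len) * selected_bytes(chunk_len)

n_days = (date(2050, 8, 1) - date(2018, 8, 1)).days
# Time indices of every 10th day at 00:00, counted from 2018-08-01.
wanted = [day * STEPS_PER_DAY for day in range(0, n_days + 1, 10)]

print(f"requested: {len(wanted)} steps, {selected_bytes(len(wanted)) / 2**20:.0f} MiB")
for chunk_len in (1, 80, 600):  # hypothetical chunk lengths, not read from the store
    gib = fetched_bytes(wanted, chunk_len) / 2**30
    print(f"chunk_len={chunk_len:>3}: {chunks_touched(wanted, chunk_len)} chunks, up to {gib:.1f} GiB")
```

Under these assumptions, only a chunk length of one time step would keep the transfer at the requested size. Once a chunk holds more steps than the stride between requested dates, every chunk in the range is touched, and the strided selection saves nothing over reading the full range.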

Balinus commented 6 months ago

Note that I tried the same approach in Python, and it seems to behave similarly.

(In Python, I selected the whole time series between 2018 and 2050 for simplicity.)

import xarray as xr
import zarr

file = 'gs://cmip6/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp585/r1i1p1f1/3hr/tas/gn/v20190710/'
ds = xr.open_dataset(file, engine='zarr')

c = ds.tas
ct = c.sel(time=slice("2018-08-01", "2050-08-01"))
%time ct.values

CPU times: user 3min 19s, sys: 1min 29s, total: 4min 49s
Wall time: 21min 58s
Out[12]:
array([[[216.41226, 216.48257, 216.44742, ..., 216.32828, 216.38297,
         216.40054],