JuliaDataCubes / EarthDataLab.jl

Julia interface for Reading from the Earth System Datacube
http://earthsystemdatacube.net

Help getting MWE for mapCube on externally created Zarr #258

Closed alex-s-gardner closed 2 years ago

alex-s-gardner commented 2 years ago

I've been playing around with ESDL and it looks like a fantastic tool with a lot of potential. We are certainly interested in using it for our projects. For our current project we are producing large, public, cloud-hosted Zarr files and we would like to learn how we can use mapCube to process the cubes. After some faffing about I think I've set up a simple mapCube call to calculate the average along the Z dimension (dim = "mid_date"):

path = ["s3://its-live-data/datacubes/v02/N60W040/ITS_LIVE_vel_EPSG3413_G0120_X-150000_Y-2250000.zarr"];
ds = open_dataset(path[1])

indims = InDims(getAxis("mid_date",ds.v))
outdims = OutDims(getAxis("mid_date",ds.v))
mapCube(mean, ds.v, indims=indims, outdims=outdims)

The call emits "cache misses" and "compressed caches misses" warnings, which I tried to google but don't quite understand, and then exits after throwing a "LoadError: TaskFailedException" error. Any guidance would be greatly appreciated!

felixcremer commented 2 years ago

Hi Alex, sorry for the late response. The cache miss and compressed cache miss warnings are thrown because you are not accessing the data in a chunk-efficient way. The chunks of the ds.v cube are (100,100,250), so we need to load about 100 chunks at the same time, which is roughly 500 MB for this computation. It will still work, but, especially with an externally hosted file, there may be a speed penalty. With a chunking of (10,10,25730) this might be faster and you wouldn't have cache misses.
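As a rough check of the figures above, the arithmetic can be sketched in plain Julia. The chunk shape and the time-axis length (25730) come from the comment; the element size of 2 bytes is an assumption that is not stated in the thread, so treat the result as an order-of-magnitude estimate only.

```julia
# Chunk shape and mid_date length as quoted in the comment above.
chunk_shape = (100, 100, 250)
n_time = 25730

# Assumption: 2 bytes per element (e.g. Int16); not stated in the thread.
bytes_per_element = 2

# Number of chunks along mid_date that one full time series touches.
n_chunks = cld(n_time, chunk_shape[3])

# Uncompressed size of those chunks, in megabytes.
bytes_per_chunk = prod(chunk_shape) * bytes_per_element
total_mb = n_chunks * bytes_per_chunk / 1e6

println("chunks along mid_date: ", n_chunks)
println("approx. MB held in memory: ", total_mb)
```

Under these assumptions every full mid_date series spans about a hundred chunks, which is why the chunk cache overflows.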

The error that is thrown at the end comes from using the mean function directly. The mapCube function expects an inner function that takes at least two arguments, where the first argument is the output and the second is the input. This is also described here: https://esa-esdl.github.io/ESDL.jl/latest/analysis/#A-minimal-example-1.

For your mean example it should be like this:

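using Statistics

# Inner function: write the mean of the input slice into the output slot
function mymean(xout, xin)
    xout .= mean(xin)
end

indims = InDims("mid_date")  # each call receives one full mid_date series
outdims = OutDims()          # no output dimensions: one value per series
mapCube(mymean, ds.v; indims, outdims)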

Here the OutDims is empty, because this is a reduction along the 'mid_date' dimension and we will get a cube with one less dimension back.
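To illustrate the (output, input) convention without any cube library, here is a minimal sketch on a plain array. The array `data`, its sizes, and the hand-written loop are hypothetical stand-ins; this is not how mapCube is implemented, only a picture of what "write into xout, read from xin" means for a reduction over the last dimension.

```julia
using Statistics

# Inner function in the (output, input) form described above.
function mymean(xout, xin)
    xout .= mean(xin)
end

# Hypothetical stand-in for a cube with dimensions (x, y, mid_date).
data = reshape(collect(1.0:24.0), 2, 3, 4)

# The result has one dimension fewer because mid_date is reduced away.
result = zeros(2, 3)

# Hand-written driver loop, for illustration only.
for j in axes(data, 2), i in axes(data, 1)
    # A 0-dimensional view is the output slot; a 1-D view is the time series.
    mymean(view(result, i, j), view(data, i, j, :))
end
```

Because the inner function only fills a preallocated slot, the caller decides where and how the output is stored.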

alex-s-gardner commented 2 years ago

@felixcremer thanks a ton.

Chunking: What a thorn in our side this has been. We access the data by reading in full columns of mid_date so your suggestion of [10,10,:] chunking makes complete sense but our dataset is living and we need to continually append data along the mid_date axis as new data is acquired... so we're trying to balance our ability to efficiently append data as x,y slices along mid_date while also maintaining efficient access along mid_date... we're still trying to sort out a better solution so if you've solved this problem we would be very keen to know what you've learned.

"mapCube function expects an inner function that takes at least two arguments where the first argument is the output and second is the input" Thanks for the explanation and for pointing to the example. A function that expects the output as an input seems counterintuitive to me. Is this common in Julia? If not, I wonder if it would be more intuitive to have the user supply the function with inputs only, f(in) ... then f(out,in) could be defined within mapCube, making it hidden from the user.

felixcremer commented 2 years ago

There is the 'mapslices' function, which lets you use functions with inputs only, and you could also use the 'inplace=false' keyword in the mapCube function when you don't need the output in the inner function. Sometimes it is necessary to also have the output available in the inner function, for example when you add another dimension and need to specify how the output is saved, or when you have multiple output cubes as described here: https://esa-esdl.github.io/ESDL.jl/latest/analysis/#Calculations-on-multiple-cubes-1
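The relationship between the two calling styles can be sketched in plain Julia. `wrap_inplace` is a hypothetical helper written for this illustration, not part of ESDL or YAXArrays.jl, and the `mapslices` call below is Julia's built-in Base function for ordinary arrays, used here only as an analogue of the cube-level function mentioned above.

```julia
using Statistics

# Hypothetical helper: lift an inputs-only function f(xin) into the
# (xout, xin) form by broadcasting its result into the output.
wrap_inplace(f) = (xout, xin) -> (xout .= f(xin); nothing)

mymean = wrap_inplace(mean)

xin = [1.0, 2.0, 3.0, 6.0]
xout = zeros(1)
mymean(xout, xin)   # xout now holds the mean of xin

# Base.mapslices on a plain array also takes an inputs-only function.
data = reshape(collect(1.0:24.0), 2, 3, 4)
means = mapslices(mean, data; dims=3)   # reduced dimension is kept with length 1
```

The wrapper only works when the function's result fits the preallocated output, which is one reason an explicit output argument is useful for multiple outputs or added dimensions.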

alex-s-gardner commented 2 years ago

@felixcremer and @meggart, is ESDL being deprecated in favor of YAXArrays.jl?

MartinuzziFrancesco commented 2 years ago

Hi @alex-s-gardner, YAXArrays.jl is being used as a backend to ESDL.jl, so it's not being deprecated. The ESDL.jl docs previously referenced some methods that have since been moved to YAXArrays.jl, but we are addressing some docs changes in #266 as detailed in #261. As soon as we deal with #271 we should also be able to host the old docs in the esa-esdl org, so the transition to the new ones will be easier for the user.