JuliaDataCubes / YAXArrays.jl

Yet Another XArray-like Julia package
https://juliadatacubes.github.io/YAXArrays.jl/
Other
101 stars 17 forks source link

Saving a cube (~4 GB) takes 12 days #237

Closed dpabon closed 1 year ago

dpabon commented 1 year ago

Hi there,

I'm computing the median per month of the Land Surface Product in the earth system data cube, but when I want to save the result in zarr the progress bar says 12 days. Am I doing something wrong? Below a minimal working example. Thanks in advance for your help.

versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 80 × Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
  Threads: 1 on 80 virtual cores
Environment:
  JULIA_DEPOT_PATH = /Net/Groups/BGI/people/dpabon/bin/julia_packages:/Net/Groups/BGI/people/dpabon/bin/julia_packages:

st

[3c3547ce] DiskArrays v0.3.8 `../../../../../../User/homes/dpabon/DiskArrays.jl#main`
[0a941bbe] Zarr v0.7.3
[90b8fcef] YAXArrayBase v0.6.1
[c21b50f5] YAXArrays v0.4.4

using Pkg

Pkg.activate("/Net/Groups/BGI/people/dpabon/nfdi4earth_oemc")

using YAXArrays
using Zarr
using Statistics
using LinearAlgebra
using BenchmarkTools
using ProgressMeter
using Dates
using DiskArrayTools

function median_by_index(xout, xin; index_list = time_to_index)
    #@show size(xin)
    #@show typeof(xin)
    xout .= NaN
    if !all(isnan, xin)
        for i in eachindex(index_list)
            if !all(isnan, xin[index_list[i]])
                xout[i] = median(filter(!isnan, xin[index_list[i]]))
            end
        end
    end    
end 

function dates_builder(x)
    out = DateTime[]
    for i in eachindex(x)
        push!(out, DateTime(x[i][1], x[i][2]))
    end

    return out
end

earth_dataset = open_dataset("/Net/Groups/BGI/work_1/scratch/s3/esdl-esdc-v2.1.1/esdc-8d-0.083deg-184x270x270-2.1.1.zarr/")

# another way to load the esdc for external people will be using

#=
earth_dataset = esdd()
=#

earth_dataset.land_surface_temperature.properties

lst = earth_dataset.land_surface_temperature[time = (Date("2002-05-21"),Date("2011-12-31"))]

time_to_index = getAxis("time", lst)

time_index = yearmonth.(time_to_index)

new_dates = unique(time_index)

index_in_cube = [findall(==(i), time_index) for i in new_dates]

new_dates

Indims = InDims("Time")

outdims = OutDims(RangeAxis("time", dates_builder(new_dates)))

lst_monthly_high = mapCube(median_by_index, lst, indims = Indims, outdims = outdims; index_list = index_in_cube, showprog = true)

savecube(lst_monthly_high, "/Net/Groups/BGI/scratch/dpabon/lst_monthly_high.zarr", overwrite = true)

mapcube is really fast! It does the computation in ~5 min using a single thread. image

But savecube... following the progress bar, it will take 12 days. image

meggart commented 1 year ago

Definitely a bug. Can you try to write at the same time as you are computing? I.e.

outdims = OutDims(RangeAxis("time", dates_builder(new_dates)), path="my_output.zarr",backend=:zarr)
dpabon commented 1 year ago

Thanks for the prompt reply Fabian, I ran as suggested

outdims = OutDims(RangeAxis("time", dates_builder(new_dates)), path="/Net/Groups/BGI/scratch/dpabon/lst_monthly_high.zarr",backend=:zarr, overwrite=true)

It works like a charm, but I got a new error when using getAxis in the saved cube. I will report it in a new issue.

dpabon commented 1 year ago

This issue was solved by https://github.com/JuliaDataCubes/YAXArrays.jl/commit/5aac674d93bcecf4aa6543306ae11f9933df3cf3

You can use the code below while the new version of YAXArrays is released.

using Pkg

Pkg.add("YAXArrays#main")