JuliaIO / Zarr.jl

missing chunks to be filled with fill values (when server returns HTTP error 403) #131

Closed Alexander-Barth closed 9 months ago

Alexander-Barth commented 11 months ago

When I try to load the following dataset with Zarr.jl, I unfortunately get an error:

using Zarr

ds = Zarr.zopen("https://s3.waw3-1.cloudferro.com/mdl-arco-time/arco/MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-d_202012/timeChunked.zarr")

ds["uo"][:,:,1,1]
# full error below

Yet the data can be read with python-zarr:

import zarr

z = zarr.open("https://s3.waw3-1.cloudferro.com/mdl-arco-time/arco/MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-d_202012/timeChunked.zarr")

gz = z["uo"]
data = gz[0,0,:,:]

Note that this chunk is entirely filled with the fill value (1e20). According to the OGC spec, it seems to be OK for some chunks to be absent:

There is no need for all chunks to be present within an array store. If a chunk is not present then
it is considered to be in an uninitialized state. An uninitialized chunk MUST be treated as if it was
uniformly filled with the value of the “fill_value” field in the array metadata. If the “fill_value” field
is null then the contents of the chunk are undefined.
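To make the quoted rule concrete, here is a minimal Python sketch of a reader that substitutes the fill value for an absent chunk. It is not how Zarr.jl or python-zarr is implemented: `load_chunk`, the dict-based `store`, and the uncompressed little-endian float32 chunk layout are all assumptions chosen for illustration.

```python
import struct
from typing import Dict, List, Optional


def load_chunk(store: Dict[str, bytes], key: str, n_items: int,
               fill_value: Optional[float]) -> List[float]:
    """Return the values of one chunk, honouring the fill-value rule.

    Hypothetical helper: `store` maps chunk keys to raw bytes and chunks
    are assumed to be uncompressed little-endian float32.
    """
    raw = store.get(key)
    if raw is None:
        # Chunk absent: treat it as uniformly filled with fill_value.
        # With a null fill_value the contents are undefined, so zeros
        # are an arbitrary choice made by this sketch.
        value = 0.0 if fill_value is None else fill_value
        return [value] * n_items
    # Chunk present: decode n_items float32 values.
    return list(struct.unpack(f"<{n_items}f", raw))


store = {"0.0": struct.pack("<2f", 1.0, 2.0)}
present = load_chunk(store, "0.0", 2, 1e20)
missing = load_chunk(store, "0.1", 2, 1e20)
```

In this sketch `present` holds the decoded values of the stored chunk, while `missing` holds two copies of the fill value because the key `"0.1"` is not in the store.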

Can Zarr.jl handle this case too? Would you accept a PR for this issue?

I am using Zarr v0.9.1.

Thanks for this great package, by the way :-)

Full error from Zarr.jl:

ERROR: TaskFailedException
Stacktrace:
  [1] try_yieldto(undo::typeof(Base.ensure_rescheduled))
    @ Base ./task.jl:920
  [2] wait()
    @ Base ./task.jl:984
  [3] wait(c::Base.GenericCondition{ReentrantLock}; first::Bool)
    @ Base ./condition.jl:130
  [4] wait
    @ ./condition.jl:125 [inlined]
  [5] take_buffered(c::Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}})
    @ Base ./channels.jl:456
  [6] take!
    @ ./channels.jl:450 [inlined]
  [7] readblock!(aout::Array{Float32, 4}, z::ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, r::CartesianIndices{4, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}, UnitRange{Int64}, UnitRange{Int64}}})   
    @ Zarr ~/.julia/dev/Zarr/src/ZArray.jl:172
  [8] readblock!(::ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, ::Array{Float32, 4}, ::Base.OneTo{Int64}, ::Vararg{AbstractUnitRange})
    @ Zarr ~/.julia/dev/Zarr/src/ZArray.jl:247
  [9] getindex_disk(::ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, ::Function, ::Vararg{Any})
    @ DiskArrays ~/.julia/dev/DiskArrays/src/diskarray.jl:44
 [10] getindex(::ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, ::Function, ::Function, ::Int64, ::Int64)
    @ DiskArrays ~/.julia/dev/DiskArrays/src/diskarray.jl:215
 [11] top-level scope
    @ REPL[229]:1
 [12] top-level scope
    @ ~/.julia/packages/CUDA/35NC6/src/initialization.jl:190

    nested task error: Error connecting to https://s3.waw3-1.cloudferro.com/mdl-arco-time/arco/MEDSEA_MULTIYEAR_PHY_006_004/med-cmcc-cur-rean-d_202012/timeChunked.zarr :<?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code><BucketName>mdl-arco-time</BucketName><RequestId>tx0000000000000003c6076-00656d923c-9f064368-default</RequestId><HostId>9f064368-default-waw3-1</HostId></Error>
    Stacktrace:
     [1] error(::String, ::String)
       @ Base ./error.jl:44
     [2] getindex(s::Zarr.HTTPStore, k::String)
       @ Zarr ~/.julia/dev/Zarr/src/Storage/http.jl:24
     [3] getindex
       @ ~/.julia/dev/Zarr/src/Storage/consolidated.jl:27 [inlined]
     [4] getindex
       @ ~/.julia/dev/Zarr/src/Storage/Storage.jl:55 [inlined]
     [5] getindex
       @ ~/.julia/dev/Zarr/src/Storage/Storage.jl:54 [inlined]
     [6] (::Zarr.var"#10#11"{Zarr.ConsolidatedStore{Zarr.HTTPStore}, Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}}, String})(ii::CartesianIndex{4})
       @ Zarr ~/.julia/dev/Zarr/src/Storage/Storage.jl:121
     [7] (::Base.var"#978#983"{Zarr.var"#10#11"{Zarr.ConsolidatedStore{Zarr.HTTPStore}, Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}}, String}})(r::Base.RefValue{Any}, args::Tuple{CartesianIndex{4}})                                     
       @ Base ./asyncmap.jl:100
     [8] macro expansion
       @ ./asyncmap.jl:234 [inlined]
     [9] (::Base.var"#994#995"{Base.var"#978#983"{Zarr.var"#10#11"{Zarr.ConsolidatedStore{Zarr.HTTPStore}, Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}}, String}}, Channel{Any}, Nothing})()
       @ Base ./task.jl:514
    Stacktrace:
      [1] (::Base.var"#988#990")(x::Task)
        @ Base ./asyncmap.jl:177
      [2] foreach(f::Base.var"#988#990", itr::Vector{Any})
        @ Base ./abstractarray.jl:3073
      [3] maptwice(wrapped_f::Function, chnl::Channel{Any}, worker_tasks::Vector{Any}, c::CartesianIndices{4, NTuple{4, UnitRange{Int64}}})
        @ Base ./asyncmap.jl:177
      [4] wrap_n_exec_twice
        @ ./asyncmap.jl:153 [inlined]
      [5] async_usemap(f::Zarr.var"#10#11"{Zarr.ConsolidatedStore{Zarr.HTTPStore}, Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}}, String}, c::CartesianIndices{4, NTuple{4, UnitRange{Int64}}}; ntasks::Int64, batch_size::Nothing)
        @ Base ./asyncmap.jl:103
      [6] async_usemap
        @ ./asyncmap.jl:84 [inlined]
      [7] #asyncmap#972
        @ ./asyncmap.jl:81 [inlined]
      [8] asyncmap
        @ ./asyncmap.jl:80 [inlined]
      [9] read_items!
        @ ~/.julia/dev/Zarr/src/Storage/Storage.jl:119 [inlined]
     [10] read_items!
        @ ~/.julia/dev/Zarr/src/Storage/Storage.jl:109 [inlined]
     [11] macro expansion
        @ ~/.julia/dev/Zarr/src/ZArray.jl:165 [inlined]
     [12] (::Zarr.var"#63#66"{Channel{Pair{CartesianIndex{4}, Union{Nothing, Vector{UInt8}}}}, ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, ZArray{Float32, 4, Zarr.BloscCompressor, Zarr.ConsolidatedStore{Zarr.HTTPStore}}, CartesianIndices{4, NTuple{4, UnitRange{Int64}}}})()
        @ Zarr ./task.jl:514
Alexander-Barth commented 9 months ago

It turns out that the server returns the error 403 for missing chunks, while Zarr.jl only looks for 404:

https://github.com/JuliaIO/Zarr.jl/blob/cbcaeadf9d93ec174b850d90ee6db438d962f140/src/Storage/http.jl#L20

In python-zarr, any error is ignored and leads to a chunk filled with fill values:

https://github.com/zarr-developers/zarr-python/blob/a81db0782535ba04c32c277102a6457d118a73e8/zarr/storage.py#L1417

Maybe we should do the same, at least for all HTTP errors between 400 and 499 (excluding server errors, i.e. 500 and above).
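As a rough illustration of that proposal, the status handling could look like the Python sketch below. This is not the code from either library, nor the eventual fix; `classify_chunk_response` is a hypothetical helper that takes an already-received status code and body, and returning `None` stands for "chunk missing, use the fill value".

```python
from typing import Optional


def classify_chunk_response(status: int, body: bytes) -> Optional[bytes]:
    """Map an HTTP response for a chunk request to chunk bytes or None.

    Hypothetical helper: None means the chunk counts as missing and
    should be replaced by fill values by the caller.
    """
    if 200 <= status <= 299:
        # Success: the body holds the (possibly compressed) chunk.
        return body
    if 400 <= status <= 499:
        # Client errors such as 403 or 404: treat the chunk as missing.
        return None
    # Everything else (for example 5xx) is raised, so a failing server
    # does not silently turn into fill values.
    raise RuntimeError(f"unexpected HTTP status {status} while reading chunk")
```

With this split, a 403 from the server in this issue would behave like a 404, while a 500 would still raise.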

meggart commented 9 months ago

Fixed by #134