JuliaDataCubes / YAXArrays.jl

Yet Another XArray-like Julia package
https://juliadatacubes.github.io/YAXArrays.jl/
Other
101 stars 18 forks source link

Selection of a variable based on values of another variable that shares one dimension #404

Closed aasdelat closed 2 weeks ago

aasdelat commented 3 months ago

Hello: I have to select data within a geographical window given as longitudes east and west and latitudes north and south. But the spatial coordinate of my data is "values", and I have the variables "msl" (mean sea level pressure), "longitude" and "latitude". I think that the name of the spatial dimension, "values", is a bit confusing, because you can think that they are values of a meteorological variable, but they are not. "values" is a spatial dimension taking values from 1 to the total number of grid points. The dimensions of the msl variable are: "values" and "Ti". The longitude and latitude variables, both have only one dimension: "values". So, each datum in msl can be assigned a unique longitude and latitude via the common dimension: "values". So, how can accomplish this?

felixcremer commented 3 months ago

It seems as if your dimensions and the values of your array are flipped. Could you copy paste the output of printing the YAXArray in the REPL here?

Could you provide an example file that I could look at?

aasdelat commented 3 months ago

Ok. The file is too large, so I send it by we transfer: https://we.tl/t-AC7DBQp7vE

Here is the output of my datset, "ds":

julia> ds
YAXArray Dataset
Shared Axes: 
↓ values Sampled{Int64} 1:338880 ForwardOrdered Regular Points
Variables: 
latitude, 
msl
  ↓ Ti Sampled{DateTime} [1989-01-01T00:00:00, …, 1989-01-31T23:00:00] ForwardOrdered Irregular Points
longitude, 
Properties: Dict{String, Any}("edition" => "1", "source" => "/home/antonio/Documentos/datos/ERA5/ERA5_N320_1h_sl_sfc_NH15S_1978_2023_msl/ERA5_N320_1h_sl_sfc_NH15S_1989-01-01_1989-01-31_msl.grib", "centreDescription" => "European Centre for Medium-Range Weather Forecasts", "centre" => "ecmf", "subCentre" => "0", "Conventions" => "CF-1.7")

As a side note, I have noted that it does not print the name of the last variable ("longitude") in a new line, but at the end of the previous line (I have introduced a carriage return in the text after pasting it here). Perhaps it is an issue for DimensionalData.

felixcremer commented 3 months ago

I had a look at the data and it seems that you have converted the dataset from GRIB to netcdf? How did you do that and might the problem be related to this conversion?

Have you tried opening the original GRIB data with Rasters and the GRIBDatasets package? If that works you could then convert the Raster to a YAXArray to continue working with YAXArrays.

aasdelat commented 3 months ago

Yes, they are converted from Grib. I can open the gribs with GRIBDatasets with no problem. In fact, I converted the grib files to netcdf by opeing the gribs in julia with NCDatasets and saving them with NCDatasets. The structure of the data is correct, perhaps it is not the most common one, but it is correct, and YAXArrays reads and interprets it correctly. The only thing, is that I need a means for selecting a variable (msl) by the values of another variable (latitude and/or longitude) with which it shares one dimension (values). I tested Raster, but it does not correctly interpret the data structure.

aasdelat commented 3 months ago

I have to correct myself: Raters correctly interprets the structure. I tested it a long time ago and it seem not to do it, but now it does!. So, I will give it a try, and convert the raster to YAXArrays, if necessary.

aasdelat commented 3 months ago

I have the answer to my question. In order to select the variable msl depending on the values of latitude (which is not a dimension, but another variable that shares the dimension "values" with the variable msl), I have to do (supposing I want to select data at longitudes below 40º): ds.msl[values=Where(v->ds.longitude[v] < 40)] But it seems to last forever ... but it finally ends with the error:

julia> ds.msl[values=Where(v->ds.longitude[v] < 40)]
ERROR: ArgumentError: Unable to determine chunksize of non-range views.
Stacktrace:
  [1] eachchunk_view(::DiskArrays.Chunked{…}, vv::SubArray{…})
    @ DiskArrays ~/.julia/packages/DiskArrays/MpOpv/src/subarray.jl:29
  [2] eachchunk
    @ ~/.julia/packages/DiskArrays/MpOpv/src/subarray.jl:25 [inlined]
  [3] YAXArray
    @ ~/.julia/packages/YAXArrays/jjg2q/src/Cubes/Cubes.jl:136 [inlined]
  [4] rebuild(A::YAXArray{…}, data::DiskArrays.SubDiskArray{…}, dims::Tuple{…}, refdims::Tuple{}, name::DimensionalData.NoName, metadata::Dict{…})
    @ YAXArrays.Cubes ~/.julia/packages/YAXArrays/jjg2q/src/Cubes/Cubes.jl:200
  [5] rebuild
    @ ~/.julia/packages/DimensionalData/BoJag/src/array/array.jl:85 [inlined]
  [6] rebuildsliced
    @ ~/.julia/packages/DimensionalData/BoJag/src/array/array.jl:100 [inlined]
  [7] rebuildsliced
    @ ~/.julia/packages/DimensionalData/BoJag/src/array/array.jl:99 [inlined]
  [8] view
    @ ~/.julia/packages/DimensionalData/BoJag/src/array/indexing.jl:125 [inlined]
  [9] _dim_view
    @ ~/.julia/packages/DimensionalData/BoJag/src/array/indexing.jl:110 [inlined]
 [10] #view#110
    @ ~/.julia/packages/DimensionalData/BoJag/src/array/indexing.jl:81 [inlined]
 [11] getindex(::YAXArray{Float64, 2, YAXArrayBase.NetCDFVariable{…}, Tuple{…}, Dict{…}}; kwargs::@Kwargs{values::Where{…}})
    @ YAXArrays.Cubes ~/.julia/packages/YAXArrays/jjg2q/src/Cubes/Cubes.jl:487
 [12] top-level scope
    @ REPL[18]:1
Some type information was truncated. Use `show(err)` to see complete types.

However, if I read the data with Rasters, it returns the result instantly and correctly.

felixcremer commented 3 months ago

I guess that you opened the data in Rasters with lazy=false which is the default. And with YAXArrays we hit a very inefficient access pattern which we should fix in YAXArrays. Thanks for the report. You could use YAXArrays.readcubedata(arr) to first load the whole data into memory to have a faster access pattern.

aasdelat commented 3 months ago

Wow! It runs now instantaneously. But, due to my particular data structure, I have to read the variables msl and longitude in different variables:

ds_m = readcubedata(ds.msl)
ds_l = readcubedata(ds.longitude)
# And the selection:
ds_m[values=Where(v->ds_l[v] < 40)]

because, if not:

julia> readcubedata(ds)
ERROR: MethodError: no method matching ndims(::Dataset)

Closest candidates are:
  ndims(::Type{Union{}}, Any...)
   @ Base abstractarray.jl:276
  ndims(::Type{<:Ref})
   @ Base refpointer.jl:96
  ndims(::Type{<:Base.Broadcast.Broadcasted{<:Base.Broadcast.AbstractArrayStyle{N}, Nothing}}) where N<:Integer
   @ Base broadcast.jl:278
  ...

Stacktrace:
 [1] dimnames(x::Dataset)
   @ YAXArrayBase ~/.julia/packages/YAXArrayBase/R6Frw/src/axisarrays/axisinterface.jl:46
 [2] caxes(x::Dataset)
   @ YAXArrays.Cubes ~/.julia/packages/YAXArrays/jjg2q/src/Cubes/Cubes.jl:209
 [3] readcubedata(x::Dataset)
   @ YAXArrays.Cubes ~/.julia/packages/YAXArrays/jjg2q/src/Cubes/Cubes.jl:230
 [4] top-level scope
   @ REPL[26]:1
felixcremer commented 3 months ago

readcubedata was designed to be used on single arrays you have a dataset so it does not fit well but I think we should be able to make it work on Datasets as well.

aasdelat commented 3 months ago

But, is there a way to load a dataset into memory, other then readcubedata, or only readcubedata can load into memory?

aasdelat commented 3 months ago

I will summarize the answer here: The selection of one variable based on the values of another variable with which it shares a dimension, can be done via:

# Read data structure
ds = open_dataset(input_file_name)

# The structure of my data is:
julia> ds
YAXArray Dataset
Shared Axes: 
↓ values Sampled{Int64} 1:338880 ForwardOrdered Regular Points
Variables: 
latitude, 
msl
  ↓ Ti Sampled{DateTime} [1989-01-01T00:00:00, …, 1989-01-31T23:00:00] ForwardOrdered Irregular Points
longitude, 
Properties: Dict{String, Any}("edition" => "1", "source" => "/home/antonio/Documentos/datos/ERA5/ERA5_N320_1h_sl_sfc_NH15S_1978_2023_msl/ERA5_N320_1h_sl_sfc_NH15S_1989-01-01_1989-01-31_msl.grib", "centreDescription" => "European Centre for Medium-Range Weather Forecasts", "centre" => "ecmf", "subCentre" => "0", "Conventions" => "CF-1.7")

# Read data into memory
ds_m = readcubedata(ds.msl)
ds_l = readcubedata(ds.longitude)

# And the selection:
ds_m[values=Where(v->ds_l[v] < 40)]

Where msl is the main variable in which values I am interested, and longitude is the variable with which I have to make the selection. It is necessary to read both variables into memory by means of readcubedata. If they are loaded just by open_dataset, you will get an error and this will not work. On the other hand, Cube cannot be used in place of open_dataset in this case, because of the structure of my data (see the difference between a dataset and a cube).

felixcremer commented 3 months ago

Thanks for the write up. Do you want to open a pull request to add this info to the docs? This would fit well into the how to section.

aasdelat commented 2 months ago

Ok, but I am learning github right now. So I am going to take a while...

felixcremer commented 2 months ago

That is no problem, feel free to ask questions if you feel stuck in the process of opening a PR.

aasdelat commented 2 months ago

Ok, I am doing it.

aasdelat commented 2 months ago

Pull request done! I have proposed a large rearrange of the help, so I think it has to be studied.

aasdelat commented 2 months ago

Hi, @felixcremer , I have opened a pull request. Could you tell me if I have to do anything else or If I have made the pull request correctly?

lazarusA commented 2 weeks ago

what PR was it? should we close this?

aasdelat commented 2 weeks ago

The pull request is https://github.com/JuliaDataCubes/YAXArrays.jl/pull/414 and it has been already accepted and closed. So, yes, this issue can be closed too.