rafaqz opened this issue 1 year ago

It would be good if DiskArrays.jl chunking, broadcasting, and iteration capabilities could propagate through any `AbstractVariable` here. They don't have to be `AbstractDiskArray`, because there are macros in DiskArrays.jl that can be applied to the types.
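For instance (a sketch only: `MyVariable` is a hypothetical type, and this assumes DiskArrays.jl's `@implement_diskarray` convenience macro; the component macros could be applied individually instead):

```julia
using DiskArrays

# Illustrative wrapper around a parent array that may live on disk.
# `MyVariable` is a made-up name standing in for a variable type here.
struct MyVariable{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A
end

Base.size(v::MyVariable) = size(v.parent)

# Forward block reads/writes and chunk queries to the parent array.
DiskArrays.readblock!(v::MyVariable, out, r::AbstractUnitRange...) =
    out .= view(v.parent, r...)
DiskArrays.writeblock!(v::MyVariable, x, r::AbstractUnitRange...) =
    view(v.parent, r...) .= x
DiskArrays.haschunks(v::MyVariable) = DiskArrays.haschunks(v.parent)
DiskArrays.eachchunk(v::MyVariable) = DiskArrays.eachchunk(v.parent)

# Apply the whole DiskArrays interface (getindex/setindex!, broadcast,
# iteration, reductions, ...) without subtyping AbstractDiskArray.
DiskArrays.@implement_diskarray MyVariable
```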
Something like that happens in GRIBDatasets. The `values` property of `Variable` (see here) can either be a normal array (usually when it is a coordinate variable) or an `AbstractDiskArray` (usually when it is a layer). Indexing on the variable ends up indexing on whatever this array is. Shouldn't it be the responsibility of the implementing package to follow the DiskArrays approach or not?
But say we want to use the properties of the `Variable`, like the CF standards it applies. Does broadcasting still work as if it's a DiskArray? Or do we have to choose one or the other?
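As a minimal sketch of the pattern being described (the type and field names are illustrative, not GRIBDatasets' actual definitions):

```julia
# Hypothetical sketch: `values` may be a plain Array (e.g. a coordinate
# variable) or an AbstractDiskArray (e.g. a layer read lazily from disk).
struct Variable{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    values::A
    attrib::Dict{String,Any}  # CF attributes etc.
end

Base.size(v::Variable) = size(v.values)
# Scalar indexing forwards to whatever `values` is...
Base.getindex(v::Variable, I...) = v.values[I...]
# ...but this alone does not forward DiskArrays' broadcast and iteration
# machinery, which is exactly the concern raised in the replies below.
```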
(Note that the lack of DiskArrays compat in NCDatasets.jl will at some stage mean Rasters.jl switches to using NetCDF.jl as a backend - I already get issues about this.)
We are planning on sharing more code between Rasters.jl and YAXArrays.jl - DiskArrays.jl and DiskArrayEngine.jl compat is going to be central to that. It would be best if all of these things could work together going forward, or we will have to work on the alternate CF standards implementation in DiskArrayTools.jl and wrap everything with that.
Some variables in CommonDataModel.jl can actually exist only in memory (like the coordinate variables of GRIBDatasets) or be a virtual array (whose elements are computed on the fly, using e.g. a coordinate transformation). I don't think that we should assume that all variables are disk-based. But I agree that it will be the most common case and that it should work nicely with DiskArrays.
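As an illustration of such a virtual array, a computed coordinate could be expressed as a DiskArray whose blocks are generated on the fly (a sketch with made-up names, not the API of any of these packages):

```julia
using DiskArrays

# A "virtual" 1-D coordinate: element i is computed as start + (i-1)*step,
# so nothing is stored on disk or in memory.
struct VirtualCoordinate <: AbstractDiskArray{Float64,1}
    start::Float64
    step::Float64
    len::Int
end

Base.size(c::VirtualCoordinate) = (c.len,)
DiskArrays.haschunks(::VirtualCoordinate) = DiskArrays.Unchunked()
function DiskArrays.readblock!(c::VirtualCoordinate, out, r::AbstractUnitRange)
    # Compute the requested block instead of reading it from anywhere.
    out .= c.start .+ (r .- 1) .* c.step
end

lon = VirtualCoordinate(-180.0, 0.25, 1440)
lon[1:4]   # computed on demand
```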
I still plan to implement DiskArrays in NCDatasets.jl; I just want to know from @meggart if the current behavior (related to unlimited dimensions) can be relied upon, or if it will change as you suggested in the issue https://github.com/meggart/DiskArrays.jl/issues/53 (but you know that issue :-)).
I think that the approach in GRIBDatasets is good, because only GRIBDatasets knows (and needs to know?) if an array is on disk, in memory, or virtually defined by a function.
For example, GRIBDatasets can also attach new methods to `CommonDataModel.CFVariable`, which should not interfere with other packages:
```julia
using GRIBDatasets
using CommonDataModel

function foobar(v::CommonDataModel.CFVariable{T,N,TV}) where {T,N,TV<:GRIBDatasets.AbstractGRIBVariable}
    return 42
end

filename = joinpath(dirname(pathof(GRIBDatasets)), "..", "test", "sample-data", "era5-levels-members.grib")
ds = GRIBDatasets.Dataset(filename)
foobar(ds["z"])
# 42
```
But maybe this leads to a lot of boilerplate code that would be better centralized?
I am wondering how DiskArrays works with arrays that are actually in memory. I would think that `eachchunk(A)` (where `A` is an in-memory `Array`) would fail, but that a high-level function like `sum(A)` would automatically do the right thing thanks to dispatch. Can you confirm?
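For a plain `Array`, `sum(A)` simply dispatches to Base's method, so DiskArrays is never involved and nothing fails; the chunk-aware code paths apply only to types that opt in. A minimal sketch of an in-memory type opting in (all names here are made up for illustration):

```julia
using DiskArrays

# In-memory data exposed through the DiskArrays interface; since there is
# no natural chunking, we declare it Unchunked.
struct ToyDiskArray{T,N} <: AbstractDiskArray{T,N}
    data::Array{T,N}
end

Base.size(A::ToyDiskArray) = size(A.data)
DiskArrays.haschunks(::ToyDiskArray) = DiskArrays.Unchunked()
DiskArrays.readblock!(A::ToyDiskArray, out, r::AbstractUnitRange...) =
    out .= view(A.data, r...)
DiskArrays.writeblock!(A::ToyDiskArray, x, r::AbstractUnitRange...) =
    view(A.data, r...) .= x

A = ToyDiskArray(rand(100, 100))
sum(A)    # dispatches to DiskArrays' chunk-aware reduction
A .+ 1    # broadcast also goes through the DiskArrays machinery
```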
> macros in DiskArrays.jl
Can you point me to the macros in question? Would this result in a dependency on DiskArrays in CommonDataModel? Is the DiskArrayEngine.jl code public? I could not find it: Google gave me just a single result, DiskArrays.jl, but maybe DiskArrayEngine.jl is part of DiskArrays.jl?
Currently I don't think GRIBDatasets.jl's DiskArrays support will work on variables (just from reading the code; I may be wrong).
The key problem is forwarding the iteration and broadcast machinery to the parent DiskArray. Otherwise variables will use Julia's fallbacks of calling `getindex` and `setindex!` for every pixel, with the expected incredibly slow performance from millions of disk IO operations.
I also saw that you were looking to share a `Variable` wrapper object here. The problem (and why I opened this issue) is that if it doesn't also use DiskArrays' broadcasting and chunking, then it doesn't really matter if it wraps a DiskArray; it still won't work as a disk array anywhere - it's the outer wrapper that controls broadcast.
In Rasters.jl/DimensionalData.jl there is extensive machinery to defer broadcasts to the parent object, so the wrapper doesn't have to be a DiskArray to work. That's another option here, but much more code.
For the other problem of a `Variable` not actually holding a disk array: that can still work too; we can handle that behaviour in methods on the `Variable` object by checking the parent array type where needed.
As for NCDatasets.jl and DiskArrays.jl, saying we can never do a bounds check on `setindex!` in a disk array is a very big ask in Julia, because the array interface actually assumes they happen by default.
But we could totally guarantee that you can always override `setindex!` bounds checks with `@inbounds`, as that's how the Julia array interface is defined already.
Is that acceptable?
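A minimal sketch of that contract with a hypothetical type (not NCDatasets code): the `@boundscheck` block is elided when a caller opts out with `@inbounds`, which is what would allow writes past the end of an unlimited dimension:

```julia
# Hypothetical growable vector standing in for a variable with an
# unlimited dimension.
struct UnlimitedVector{T} <: AbstractVector{T}
    data::Vector{T}
end
Base.size(v::UnlimitedVector) = size(v.data)
Base.getindex(v::UnlimitedVector, i::Int) = v.data[i]

function Base.setindex!(v::UnlimitedVector, x, i::Int)
    @boundscheck checkbounds(v, i)           # skipped under @inbounds
    i > length(v.data) && resize!(v.data, i) # grow the unlimited dimension
    v.data[i] = x
end

# @inbounds must be inside a function body to take effect.
growwrite!(v, x, i) = @inbounds v[i] = x

v = UnlimitedVector([1, 2, 3])
# v[5] = 9         # would throw BoundsError: checks are on by default
growwrite!(v, 9, 5)  # skips the check and grows the vector to length 5
```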
> As for NCDatasets.jl and DiskArrays.jl, saying we can never do a bounds check on `setindex!` in a disk array is a very big ask in Julia, because the array interface actually assumes they happen by default. But we could totally guarantee that you can always override `setindex!` bounds checks with `@inbounds`, as that's how the Julia array interface is defined already.
Yes, I think that should be OK. After all, NetCDF.jl handles unlimited dimensions too. The bounds check will just be deactivated for unlimited dimensions.
As I am not too familiar with DiskArrays.jl, I think it would be best to first finish the integration with NCDatasets.jl to gain a bit of familiarity with it.
I'm trying to catch up on the previous discussions about implementing DiskArrays in NCDatasets, but I still need to see the big picture. In the meantime, I tried to make `AbstractVariable` a DiskArray to make GRIBDatasets readable by Rasters.jl (#9), but it breaks NCDatasets.
Ok great, I'm pretty excited to have it working everywhere.
Feel free to make issues or ping me here or on Slack if you have any questions at all, as it's a pretty complicated interface (and note that the main docs are not building currently, but the dev docs work).
There are component macros to apply subsets of the interface or the whole thing, or you can inherit from `AbstractDiskArray` to have that done for you, if inheritance is possible.
I think we have been talking past each other to some extent, and it would be good to solidify our shared understanding of the requirements. So I want to attempt to clarify a few outstanding questions that maybe I didn't answer clearly before:
In https://github.com/JuliaGeo/CommonDataModel.jl/issues/8#issuecomment-1508254354:

> Some variables in CommonDataModel.jl can actually exist only in memory

That's actually fine - disk arrays don't have to be on disk, they just can be. In this case you would just set the chunking to `Unchunked`.
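For instance, an in-memory variable type could opt out of chunking with a single method (a sketch; `InMemoryVariable` is a made-up name):

```julia
using DiskArrays

struct InMemoryVariable{T,N} <: AbstractDiskArray{T,N}
    data::Array{T,N}
end

# No natural chunk structure: treat the whole array as a single chunk.
DiskArrays.haschunks(::InMemoryVariable) = DiskArrays.Unchunked()
```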
> I still plan to implement DiskArrays in NCDatasets.jl; I just want to know from @meggart if the current behavior (related to unlimited dimensions) can be relied upon, or if it will change as you suggested in the issue
Any object that inherits from `AbstractDiskArray` can always override `setindex!` manually, to get whatever behavior it likes. You don't lose control of anything; you just get fallback methods that work for disk-based arrays. For growing arrays, a custom `setindex!` should include updating the chunk size manually, as in the sketch below. We can add helper methods in DiskArrays.jl to standardise this.
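A rough sketch of what such a custom `setindex!` could look like (hypothetical names, not an existing DiskArrays.jl helper); because the chunk description for an `Unchunked` array is derived from its current `size`, growing the storage keeps `eachchunk` consistent:

```julia
using DiskArrays

# Variable with one unlimited dimension, backed here by an in-memory
# vector for simplicity.
struct GrowableVariable{T} <: AbstractDiskArray{T,1}
    data::Vector{T}
end

Base.size(v::GrowableVariable) = size(v.data)
DiskArrays.haschunks(::GrowableVariable) = DiskArrays.Unchunked()
DiskArrays.readblock!(v::GrowableVariable, out, r::AbstractUnitRange) =
    out .= view(v.data, r)

# Custom setindex!: grow along the unlimited dimension instead of erroring.
# Since the chunk description is computed from size(v), it stays in sync.
function Base.setindex!(v::GrowableVariable, x, i::Int)
    i > length(v.data) && resize!(v.data, i)
    v.data[i] = x
end

v = GrowableVariable(zeros(3))
v[5] = 1.0               # grows the variable to length 5
DiskArrays.eachchunk(v)  # chunk description now reflects the new size
```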
From https://github.com/JuliaGeo/CommonDataModel.jl/issues/8#issuecomment-1508482661
> After all, NetCDF.jl handles unlimited dimensions too.
It actually doesn't handle them; it just allows writes - the chunks are broken afterwards because they are not updated anywhere: https://github.com/JuliaGeo/NetCDF.jl/blob/cad45277eee4897c2d4e7325300a741f61ca98c1/src/NetCDF.jl#L225-L228
So it also needs a custom `setindex!` method or a `resize!` method to handle this correctly.
Having broken chunks means broadcasting, iteration and other things can do strange things, because we assume the chunks along an axis match the size of the array.
Any hope we can make this happen? The DiskArrays compatibility in the lower-level GRIBDatasets.jl and NCDatasets.jl objects is not available to most users currently, as it won't propagate through the variables defined here.
Is there any progress on the issues mentioned here? https://github.com/meggart/DiskArrays.jl/issues/131
Another thing that makes me wonder what the right approach could be is the comment from @meggart that DiskArrays.jl is tailored with Zarr in mind, and that DiskArrays.jl makes some choices that are not good when dealing with NetCDF files, like loading a complete chunk when only a single element is requested. I guess we would run into the same issue when a variable is just a virtual array (whose elements are computed on the fly, like in TIFFDatasets.jl) or in memory. I am wondering if it would make sense to separate the interface (like an AbstractDiskArray.jl) from the implementation, which could be tailored to Zarr files.
Ok I will have a look at the PR.
Your comments about Zarr are not true as far as I know. ArchGDAL.jl and NetCDF.jl have used DiskArrays.jl for years now, and HDF5.jl will likely also use it soon: https://github.com/JuliaIO/HDF5.jl/issues/930
If there are specific things we should change to optimise NetCDF, please make an issue. We should definitely fix those small data loads to take advantage of NetCDF caching.
But also be aware that you can already add any methods you want for your own types to change the implementation however you like, so I'm not sure that separation into a new package is needed.
Also note that there will always be some things that are worse in DiskArrays.jl than in your original implementation; it's hard to make a perfect generalisation. But broadcasts, reductions and iteration will actually work - there are some huge wins from having that.