JuliaGeo / meta

For discussion centered around the JuliaGeo organization

Common interface for raster data/DimensionalData.jl #6

Open meggart opened 4 years ago

meggart commented 4 years ago

Hello all,

it looks like there are currently several packages trying to implement xarray- or R raster-like functionality in Julia, and I don't think I have a complete overview of what is actually out there. The ones I know about are:

I think every one of these approaches has its own focus, but it would be nice if we could either merge/re-use some functionality or try to use a common interface so that e.g. plotting applications can work on different representations of gridded data.

Maybe package authors could comment here on where they see their package fit into the ecosystem.

meggart commented 4 years ago

So let me start with https://github.com/esa-esdl/ESDL.jl . This package was originally created as a toolkit for a single dataset, so it started out quite limited in the datasets it can handle. Its focus is on an efficient implementation of mapslices and broadcasted mapslices on chunked out-of-core datasets (including automatic parallelization). It is optimized to work on Zarr datasets, and the data model is completely compatible with xarray, which means that you can directly read data that has been written by xarray's to_zarr method.

The package introduces an ad-hoc AxisArray-like type as well that has named axes and values along these axes. However, I would be very happy to exchange this backend with whatever we might come up with as a common interface. My main requirement would be that it is possible for this array to hold the data lazily, since all my datasets are larger than memory. The package also comes with some rudimentary plotting functionality (see docs), but I would really like to interface with some other solution, since maintenance is quite time-demanding.

For a quick feeling how the package works, have a look at e.g. these notebooks https://github.com/esa-esdl/ESDLPaperCode.jl/tree/master/notebooks .

(Pinging @Alexander-Barth as well, who might be interested.)

visr commented 4 years ago

For some discussion on a raster interface see also https://github.com/JuliaGeo/GeoInterface.jl/issues/16.

But I think this is more about labelled arrays in Julia, which is also a good discussion to have. At its basis, I think this should not be geo-specific, since it is of use to a much larger audience. Hence also the split between DimensionalData and GeoData.

Another option is IndexedDims and NamedDims as mentioned here: https://discourse.julialang.org/t/status-of-axisarrays-jl/28682/5. Perhaps it is best to participate in that thread as well, for wider visibility.

rafaqz commented 4 years ago

Thanks for starting this @meggart. I'm also keen to move towards integrating these efforts.

ESDL.jl looks interesting; I think we've done a lot of similar work. I'll have a thorough look at it during the week. To me there are two key interfaces to agree upon. As @visr is suggesting, there is general named axis/dimension data handling for use across many Julia domains, but there are also container types specific to the geospatial domain. The reason DimensionalData and GeoData are separate is mostly to separate these discussions.

Hopefully we can treat DimensionalData.jl as a replaceable building block for building geospatial tools, which shouldn't have to worry about it much besides defining dimensions during object construction.

Agreeing on common spatial objects and interfaces is probably more specific to this space. I've put forward the array/stack/series components, based on my needs for modelling packages (see the GeoData.jl readme). But we can certainly restructure that to fit the cube concept. I'm also focused on making everything potentially disk-based and lazy, with the same syntax as the eager versions.

meggart commented 4 years ago

Thanks to both of you for your replies. You are right, I am mixing two separate issues here, and since our packages touch both of these issues I thought I would mention both. I will focus on the more geo-related one for now.

Agreeing on common spatial objects and interfaces is probably more specific to this space. I've put forward the array/stack/series components, based on my needs for modelling packages (see the GeoData.jl readme). But we can certainly restructure that to fit the cube concept. I'm also focused on making everything potentially disk-based and lazy, with the same syntax as the eager versions.

I have looked at GeoData.jl before and I really like the interface. The only fundamental design decision that causes me headaches is that AbstractGeoArray subtypes AbstractArray. As you know, Julia does a lot of magic by implementing many fallback methods for AbstractArrays, which makes it so easy to define your own array types. However, if your array is mapped to a compressed on-disk array with non-standard chunking, these fallbacks will be very inefficient. Just accidentally calling the fallback show method can freeze your terminal when you have opened a remote dataset on S3, because the method assumes fast random access to your data. The same is true for the default implementations of broadcasting, map, etc.

So I have decided not to subtype AbstractArray for now. However, I would really like to experiment with returning AbstractGeoArrays when getindex is called on one of my cubes, or to represent the data as DimensionalData inside a mapslices operation, as is currently supported for AxisArrays.

BTW, I think it might be a good idea to discuss in person, so maybe we could have some kind of JuliaGeo raster teleconference open for interested people to show and discuss the different approaches and move forward with a common interface.

rafaqz commented 4 years ago

That is a good point. I've written DimensionalData.jl with this in mind, so that you don't really need to inherit from AbstractDimensionalArray; you just need dims and rebuild methods (refdims also helps). AbstractGeoStack uses dim indexing but it's not <: AbstractArray.

A teleconference is a good idea; @juliohm and a few others were also interested in that over at the VerySimpleRasters thread. I'm on Australian Eastern Standard Time, but 7am-2am my time is a tolerable window most days.

Edit: I'm also using disk-based backends, so avoiding these default access methods will be really important, but I haven't really thought it through yet or hit any practical problems. But it's relevant to the choice made in DD to rely on dimensions for functionality and dispatch when possible instead of the array types, something we are simultaneously discussing in this thread: https://github.com/JuliaCollections/AxisArraysFuture/issues/1#issuecomment-531338896

juliohm commented 4 years ago

Thank you @rafaqz for pinging. I am in BRT time, and can adjust my agenda accordingly after you decide a meeting hour. We definitely need to set a video call to address this challenge. It is too big to be discussed on GitHub threads only.

meggart commented 4 years ago

Oh dear, time-wise I am in the middle (CET). Maybe we could try to do 8 am BRT, which would correspond to 9 pm Australian Eastern, and it would be around noon for Europeans. Are there any suggestions for a date? I would be available on normal weekdays.

In order to collect this information in a single place, I have started a Google doc to collect availabilities and possible topics, which might serve as an agenda in the end: https://docs.google.com/document/d/1ccaSltPDb5n-bLNgWyrotsWXz2n7ckMggOIPejy_GvQ/edit?usp=sharing Please feel free to edit as you like and share with interested people. Let me know if the Google doc does not work for some reason (no Google account and you don't want one, etc.) and suggest where to do this instead.

meggart commented 4 years ago

That is a good point. I've written DimensionalData.jl with this in mind, so that you don't really need to inherit from AbstractDimensionalArray; you just need dims and rebuild methods (refdims also helps). AbstractGeoStack uses dim indexing but it's not <: AbstractArray.

This is great and it looks like your approach is indeed very composable. I will look into the AbstractGeoStack implementation, thanks for the pointer.

Alexander-Barth commented 4 years ago

Thanks a lot for your great initiative, Fabian!

In my domain, grids are not necessarily aligned in the longitude and latitude directions (for example a satellite swath, https://docserver.gesdisc.eosdis.nasa.gov/public/project/Images/OMNO2_003.png or a model whose vertical grid depends on time). Should such raster data also be supported?

In the most general case, if a dataset has e.g. 4 dimensions, the arrays representing longitude, latitude, depth (or elevation) and time can also have 4 dimensions.

I think this could still be represented efficiently by having a special array type which is virtually a 4D array but stores only a vector of the data and defines the getindex function appropriately. For example, if the longitude varies only along the 1st dimension, one could have:

Base.getindex(x::Virtual4DArray,i,j,k,n) = x.small_vector[i] 
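
To make that idea concrete, here is a minimal self-contained sketch (the Virtual4DArray type and its fields are purely illustrative, not from any existing package):

    # Illustrative sketch: longitude varies only along the first dimension, so
    # only a vector is stored while the type still indexes like a full 4D array.
    struct Virtual4DArray{T} <: AbstractArray{T,4}
        small_vector::Vector{T}   # values along dimension 1
        dims::NTuple{4,Int}       # virtual size of the full array
    end

    Base.size(x::Virtual4DArray) = x.dims
    Base.getindex(x::Virtual4DArray, i, j, k, n) = x.small_vector[i]

    lon = Virtual4DArray(collect(0.0:0.5:10.0), (21, 30, 5, 12))
    lon[3, 17, 2, 8]   # -> 1.0, without ever materialising the 4D array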

Similarly, the value could just be computed on the fly for a specific projection using the corresponding formulas (e.g. http://mathworld.wolfram.com/MercatorProjection.html), so that the storage would actually just be a handful of parameters.

A georeferenced dataset would just be a collection including the actual data and arrays representing lon, lat, elevation and time (or a subset of these). Some data sets might have even more dimensions (e.g. frequency for hyperspectral satellite data, ensemble member for Monte Carlo simulation).

A possible interface could be:

value =  georeferenced_dataset.data[2,3,4] # value at the indices 2,3,4
lon = georeferenced_dataset.lon[2,3,4] # longitude of the value at the indices 2,3,4
lat = georeferenced_dataset.lat[2,3,4] # latitude of the value at the indices 2,3,4
frequency = georeferenced_dataset.frequency[2,3,4] # implemented using getproperty

Simply indexing such a data set would return a georeferenced subset.

slice_of_georeferenced_dataset =  georeferenced_dataset[:,3,4]
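
A rough sketch of how such a container could be wired up, just to make the proposal concrete (the GeoreferencedDataset type and its fields are hypothetical):

    # Hypothetical container: the data array plus coordinate arrays, exposed
    # through getproperty as proposed above; indexing slices data and
    # coordinates together, so the result is again georeferenced.
    struct GeoreferencedDataset{A<:AbstractArray,D<:AbstractDict}
        data::A
        coords::D    # :lon, :lat, :time, :frequency, ... => coordinate arrays
    end

    Base.getproperty(d::GeoreferencedDataset, s::Symbol) =
        s in (:data, :coords) ? getfield(d, s) : getfield(d, :coords)[s]

    Base.getindex(d::GeoreferencedDataset, I...) =
        GeoreferencedDataset(d.data[I...],
            Dict(k => v[I...] for (k, v) in d.coords))

    lonvals = 0.0:2.5:10.0
    ds = GeoreferencedDataset(rand(5, 4, 3),
        Dict(:lon => [l for l in lonvals, _ in 1:4, _ in 1:3]))
    ds.lon[2, 3, 1]     # longitude of the value at indices (2, 3, 1)
    sub = ds[:, 3, 1]   # a georeferenced slice, as in the proposal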

Multiple dispatch in Julia allows us to implement optimized functions for some common special cases, for example extracting all data within a geographic bounding box if the dataset is nicely aligned along longitude and latitude (or in a given projection where the indices can be directly computed).

I am also in CET time zone.

Alexander-Barth commented 4 years ago

I just came across the OceanGrids.jl package from @briochemc which looks quite interesting.

The tests give an idea about the interface: https://github.com/briochemc/OceanGrids.jl/blob/master/test/runtests.jl

rafaqz commented 4 years ago

@Alexander-Barth I was planning to eventually build that kind of indexing into the DimensionalData.jl dimensions - the n-dimensional arrays would be stored in the Lat/Lon/Vert/Time etc. structs rather than as fields on the array. A trait would then control which dimensions need multiple indices, i.e. Lat/Lon might be matrices while Time could still be a vector if it doesn't affect the lat/lon coordinates. It should be flexible enough to represent any configuration over any dimensions.

GeoData.jl already does geo-referenced subsetting: you can easily extract an area between some lat/lon etc. (using Between()) and get another GeoArray result. But it will of course be a bit more work when the dims are holding 4D arrays.

The NCDatasets implementation works for simple (axes as vectors) projections: https://github.com/rafaqz/GeoData.jl/blob/master/src/sources/ncdatasets.jl

Alexander-Barth commented 4 years ago

@rafaqz This sounds quite interesting! Can you give an example of what the code would look like from a user's perspective (once it is implemented, if it is not already) to get, for example, the lon/lat/time of a given index of the georeferenced raster, assuming that the 1st and 2nd dimensions are not aligned with the longitude and latitude directions? Would it be the same as I proposed above?

rafaqz commented 4 years ago

You can already load matrices into dims instead of a vector, but there is no method for getting the lat at a particular index, so you would have to do something like val(dims(a, Lat))[x, y, z].

But we can add a method that does that, like somemethod(a, Lat, x, y, z) if you have an idea what to call it. x, y, z could be indices 1, 7, 22 etc or Y(7), X(1), Time(At(DateTime(2005,1,10))).

Edit: It seems a little more complicated than this now I've written it out. Would the dims be called X, Y when they are not exactly Lat/Lon aligned, or still Lat/Lon? We may need to store a second list of dims to map to. I actually never use these formats, so I'm not really across how they are used.

rafaqz commented 4 years ago

@meggart 9pm is fine by me. Early next week (21st/22nd AEST is good), I will be travelling for a week or so after that but might be able to find time.

visr commented 4 years ago

Would the dims be called X, Y when they are not exactly Lat/Lon aligned, or still Lat/Lon? We may need to store a second list of dims to map to.

Yeah, to give you an example of a NetCDF where this is the case:

dimensions:
    x = 424 ;
    y = 412 ;
    time = UNLIMITED ; // (444 currently)
variables:
    double lon(y, x) ;
        lon:standard_name = "longitude" ;
        lon:long_name = "longitude" ;
        lon:units = "degrees_east" ;
        lon:_CoordinateAxisType = "Lon" ;
    double lat(y, x) ;
        lat:standard_name = "latitude" ;
        lat:long_name = "latitude" ;
        lat:units = "degrees_north" ;
        lat:_CoordinateAxisType = "Lat" ;
    double time(time) ;
        time:standard_name = "time" ;
        time:long_name = "time" ;
        time:bounds = "time_bnds" ;
        time:units = "day as %Y%m%d.%f" ;
        time:calendar = "standard" ;

So x, y, time are the orthogonal dimensions. And lon and lat are variables that each vary with both x and y.

Alexander-Barth commented 4 years ago

Thanks @visr for sharing this example. The names of the dimensions (x and y) are indeed arbitrary. The important part is that the data variable re-uses the dimensions names, for example:

    double temperature(time, y, x) ;

In NCDatasets.jl I use the recently added function NCDatasets.coord to find the coordinate by its standard name (or by units). It is required that the dimensions of the coordinate variable (lon, lat, ...) are a subset of the dimensions of the data variable (e.g. temperature), as a single NetCDF file can have multiple variables representing longitude, latitude or time (which is the case for some model output where not all variables are defined at the same location; I can share an example to make this point more concrete if you want).
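
A rough usage sketch of that lookup (the file and variable names are only examples, and the exact call signature of coord may differ slightly):

    # Look up coordinate variables by CF standard name, independent of how the
    # dimensions happen to be named in the file (x/y, xi_rho/eta_rho, ...).
    using NCDatasets

    ds = NCDataset("example_model_output.nc")   # hypothetical file name
    temp = ds["temperature"]
    lon = NCDatasets.coord(temp, "longitude")   # e.g. the lon(y, x) variable
    lat = NCDatasets.coord(temp, "latitude")    # e.g. the lat(y, x) variable
    close(ds)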

rafaqz commented 4 years ago

Thanks that example really helps. Do either of you have some demo files you can link to for testing? I think GeoData.jl will break if you try to load something like that currently. But it shouldn't be too much work to handle it.

I added a few points about formats and projections to the discussion agenda.

Alexander-Barth commented 4 years ago

Sure, here is an example from the ROMS model. The lon/lat for the variable e.g. zeta are lon_rho and lat_rho while e.g. for ubar, lon_u and lat_u. Such grid staggering is used by almost all ocean and meteorological models nowadays. I just released a new version of NCDatasets if you want to reuse the coord function.

Alexander-Barth commented 4 years ago

But we can add a method that does that, like somemethod(a, Lat, x, y, z) if you have an idea what to call it. x, y, z could be indices 1, 7, 22 etc or Y(7), X(1), Time(At(DateTime(2005,1,10))).

While rereading your comment: are X and Y types defined by GeoData.jl, or types that the user has to define?

rafaqz commented 4 years ago

Thanks for that, I'll look at the coord function too. Will there be a way of standardising this across other datasets, say GDAL and custom HDF5 formats? Or is this kind of detail mostly only in netcdf?

DimensionalData.jl defines X, Y, Z and Time; GeoData.jl extends this with Lat, Lon, and Vert. Dim{:x} is the verbose fallback.

Package extenders can add more with @dim MyDim "My great dimension" but users shouldn't have to very often.

meggart commented 4 years ago

@meggart 9pm is fine by me. Early next week (21st/22nd AEST is good), I will be travelling for a week or so after that but might be able to find time.

Regarding the call time I would suggest next Monday Sep 23 at 11am UTC (which is 1pm CEST, 8am Sao Paulo, 9pm Melbourne). I hope this fits everyone.

juliohm commented 4 years ago

I support the proposed time for the meeting. How can we organize the agenda to optimize the time?

rafaqz commented 4 years ago

It works for me too. I've just been editing the agenda doc a little with ideas.

https://docs.google.com/document/d/1ccaSltPDb5n-bLNgWyrotsWXz2n7ckMggOIPejy_GvQ/edit?usp=sharing

I'm also wondering if @evetion is aware of this, seeing as he has also written a raster package and has good ideas for using complex projections, and @mkborregaard, as his name was misspelled above.

Alexander-Barth commented 4 years ago

Unfortunately, I have a lecture to give Monday 1pm (introduction to Julia btw ;-) )

meggart commented 4 years ago

Unfortunately, I have a lecture to give Monday 1pm (introduction to Julia btw ;-) )

I'm also wondering if @evetion is aware of this, seeing as he has also written a raster package and has good ideas for using complex projections, and @mkborregaard, as his name was misspelled above.

Given the misspelling and that @Alexander-Barth is not available, I would suggest moving to a later day; it would be important to give @evetion and @mkborregaard, who might not have seen this thread yet, a chance to participate. Do we need a doodle?

Regarding optimizing the agenda, I am not sure. Would anyone be willing to moderate the meeting and try to keep things focused? Maybe @visr or @juliohm, who are not directly involved in writing raster packages?

juliohm commented 4 years ago

I think it is a good idea to reschedule the meeting to include everyone. I can also help moderate as suggested. My raster type in GeoStatsBase.jl is called RegularGridData, but it is definitely not general enough to handle the features discussed here.

I also have some interesting requirements from a GeoStats perspective that I would like to share.

meggart commented 4 years ago

During the last weeks I spent some time browsing through the different approaches to named arrays and dimensions and did some experiments and tinkering. As I mentioned before, the goal and focus of each package, as well as the implementation details, are quite different, while the concept of named dimensions that carry values is the recurring theme among them.

The result of this tinkering is a more or less concrete proposal for an interface that would let the different approaches talk to each other. I tried to quickly draft this in this gist. The basic idea is that, no matter whether your grid is regular or not, we want to be able to assign coordinates to every entry in a multidimensional array.

The gist defines a set of THTT (Tim Holy Trait Trick) traits for hasgrid which play nicely with DimensionalData but should also be general enough to deal with irregular grids as described by @Alexander-Barth. For my own data types it would be very easy to support this interface as well, and I would be very interested in a showcase where one of my operations can run on a multifile NCDataset through this interface.
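
To make this concrete, here is a minimal sketch of what such a trait-based coordinate interface could look like (all names here are illustrative placeholders, not the ones used in the gist):

    # Illustrative trait-based sketch: a raster type opts in by defining
    # `hasgrid` for its type and a `coordinates` method; consumers then
    # dispatch on the trait rather than on the concrete array type.
    abstract type GridTrait end
    struct HasGrid <: GridTrait end   # coordinates can be queried per index
    struct NoGrid <: GridTrait end    # plain array, no georeferencing

    hasgrid(::Type) = NoGrid()        # fallback

    # A simple regular lon/lat raster wrapping a plain matrix
    struct SimpleRaster{T}
        data::Matrix{T}
        lons::Vector{Float64}
        lats::Vector{Float64}
    end
    hasgrid(::Type{<:SimpleRaster}) = HasGrid()

    # Generic entry point: dispatch on the trait of the wrapped type
    coordinates(A, I...) = coordinates(hasgrid(typeof(A)), A, I...)
    coordinates(::HasGrid, A::SimpleRaster, i, j) = (A.lons[i], A.lats[j])

    r = SimpleRaster(rand(3, 2), [0.0, 0.5, 1.0], [40.0, 41.0])
    coordinates(r, 2, 1)   # -> (0.5, 40.0)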

I think it would be nice to try to factor out a few additional array traits that might be important in our domain and try to export them somewhere. Here are a few suggestions:

  1. One aspect that is important to me is to allow different interpretations of grids. In the climate modelling world it is quite common that a value of a grid cell (like air temperature) corresponds to the value exactly at the center of the grid cell. However, most (aggregated) remote sensing products should be interpreted as an average over an area defined by the cell. The same goes for the time dimension: is my value an average/aggregate over a time span, or a measurement taken at the given time stamp? IMO this should become a trait of a Dimension and might affect how we handle indexing functions like Nearest, Between and At, as well as regridding/aggregations.
  2. Another trait might be whether random access into an array is fast or not, so that Base mapslices and other simple arithmetic are only applied to small in-memory arrays.
  3. For the out-of-core arrays a trait for chunks as proposed in https://github.com/meggart/ChunkedArrayBase.jl would be essential to me
  4. Finally, to make the bridge to GeoData, a method like is_geo_x and is_geo_y on an AbstractDimension to extract the dimensions that should be interpreted as lon and lat, as well as an interface to check CRSes as discussed in other threads, might be added to this set of interfaces as well.

I would be happy to work towards such a set of traits to have something more concrete to discuss during the upcoming teleconference. Shall we actually try to set a new date for this?

rafaqz commented 4 years ago

Those are great proposals. I was just thinking about chunking today in relation to Dask/Dagger and how to generalise it over multiple source file types. ChunkedArrayBase seems like a good solution.

The nature of grid cells is also a good point. Another trait marking the nature of the dimension is needed. Near and Between currently use very simple methods just to prove the idea is reasonable, but if the coordinate/time etc. represents the middle or edge of a cell their results should be different. Combining the grid regularity/irregularity with the cell type also makes sense.

But instead of traits of the array, I think RegularGrid{Center} etc. should be traits of the dimensions. Forward and Reverse traits currently mark whether a dimension is in reverse order and allow different dispatch in the selectors (i.e. to use rev=true in searchsortedfirst), and we could add another field for the grid type that works in a similar way. It is often the case that the Lat/Lon grid is irregular while time or elevation is regular, and the traits required to describe all the combinations will get out of hand if they are on the array.
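
A minimal sketch of that layout, with the traits attached per dimension (type names are illustrative only, not the eventual DimensionalData.jl API):

    # Each dimension carries its own order and grid trait, so Lat/Lon can be
    # irregular while Time on the same array stays regular.
    abstract type Order end
    struct Forward <: Order end
    struct Reverse <: Order end

    abstract type Locus end        # what a coordinate value refers to
    struct Center <: Locus end     # value at the cell centre
    struct Start <: Locus end      # value at the cell edge/start

    abstract type Grid end
    struct RegularGrid{L<:Locus,O<:Order} <: Grid end
    struct IrregularGrid{L<:Locus,O<:Order} <: Grid end

    struct ExampleDim{T,G<:Grid}
        val::T
        grid::G
    end

    time = ExampleDim(1:12, RegularGrid{Start,Forward}())
    lat  = ExampleDim(rand(424, 412), IrregularGrid{Center,Forward}())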

Traits for load cost could be useful as well, and these could be on the array.

Let's set another date for this soon. I can be free any weekday evening next week to fit the same schedule agreed for the last attempt.

juliohm commented 4 years ago

The basic idea is that, no matter whether your grid is regular or not, we want to be able to assign coordinates to every entry in a multidimensional array.

Exactly. I would also try to follow classical naming for the types, a la VTK. For example, irregular spatial types can be further broken down into structured (when we can still index with i,j,k), which is the case handled here, and unstructured (general meshes for which the notion of up-down-left-right doesn't make sense).

In the climate modelling world it is quite common that a value of a grid cell (like air temperature) corresponds to the value exactly at the center of the grid cell. However, most (aggregated) remote sensing products should be interpreted as an average over an area defined by the cell.

This is one of the requirements I had in mind from a geostatistics perspective. Ideally we should have a trait system that can express the notion of the volume of each element in the grid (rectangle elements): https://github.com/juliohm/GeoStats.jl/issues/40 The volume of the i-th element, volume(grid, i), is a general trait that could be used for non-regular spatial data types. My plan in GeoStats.jl is to add this notion of volume in a next minor release; in the case of a regular grid, we can just save a single constant value for all elements in the grid type or, to make the code cleaner, use FillArrays.jl. At the moment the plot recipes in GeoStatsBase.jl are misleading and plot regular grids as if they were corner-based. We should move away from this idea because the volumes of the elements can be used to improve the estimates spatially (Kriging methods).
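
A small sketch of that idea (the RegularGrid3D type and the elementvolumes/volume functions are hypothetical, not yet part of GeoStats.jl):

    # For a regular grid every cell has the same volume, so a lazy
    # FillArrays.Fill can stand in for a per-element vector while keeping the
    # generic volume(grid, i) interface.
    using FillArrays

    struct RegularGrid3D
        spacing::NTuple{3,Float64}
        dims::NTuple{3,Int}
    end

    elementvolumes(g::RegularGrid3D) = Fill(prod(g.spacing), prod(g.dims))
    volume(g::RegularGrid3D, i::Int) = elementvolumes(g)[i]

    g = RegularGrid3D((0.5, 0.5, 2.0), (100, 100, 10))
    volume(g, 42)   # -> 0.5 * 0.5 * 2.0 = 0.5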

Finally, to make the bridge to GeoData, a method like is_geo_x and is_geo_y on an AbstractDimension to extract the dimensions that should be interpreted as lon and lat as well as an interface to check

I would like to generalize this to N-dimensional spaces. Just keep in mind that the majority of use cases is 2D in a map projection, with lat/lon values and so on, but these are not all use cases for a regular grid. In our field, modeling the subsurface is quite common, and we need 3D grids with properties of the rock.

Shall we actually try to set a new date for this?

For sure! I am in! We need to set a date and time. I am traveling over the weekend with limited access to the internet, but the same time we attempted last time works for me on weekdays.

mkborregaard commented 4 years ago

I'm tagging @RichardReeve who is also very interested in these designs.

Alexander-Barth commented 4 years ago

Dear Fabian, thanks a lot for your thoughtful write-up! Concerning point 1, I agree that this can be important, especially for the time dimension, where the model output is either averaged over e.g. a day or the instantaneous fields are saved. The models I work with (finite volume) are actually also averaged over a model grid cell. In the usual case these are simply rectangles, but some researchers got pretty wild introducing new kinds of grids, as @juliohm also described. Having a call like volume(grid, i) return a special type appropriate for the geometry (rectangle, hexagon, ...) seems like a good idea.

Do I understand correctly that you foresee the user having to work with two types: the actual data array and the data type representing the grid? So if the user slices the data array, he also has to slice the grid correspondingly? It would be nice if the user only had to deal with one object.

briochemc commented 4 years ago

I just came across the OceanGrids.jl package from @briochemc which looks quite interesting.

The tests give an idea about the interface: https://github.com/briochemc/OceanGrids.jl/blob/master/test/runtests.jl

Sorry, I was on holidays for the past month and missed this whole conversation. I guess I missed the teleconference, but I would love to hear what came out of it. FYI, OceanGrids.jl was tailored to my needs for the AIBECS.jl package for modelling ocean tracers. All this conversation may be a little over my head, but I would love to educate myself and discuss this more with all of you, so please include me in your next discussion(s). (I am also on Australian Eastern Standard Time.)

yeesian commented 4 years ago

Really excited to see the conversation here. From my very limited experience, given the difficulty of scheduling such meetings, it will be good to have a document in advance that hashes out most of the considerations.

It will be more meaningful to spend that time figuring out whether the given proposal works for them and what needs to be done for it to be possible, and that will require some work from all interested parties upfront. It looks like it will likely be based on Fabian's comment, so I suggest turning that into a wiki page or shared document that people can comment on/annotate/etc.

rafaqz commented 4 years ago

Those are good points @yeesian

We have this document already, feel free to add to it (and edit it!): https://docs.google.com/document/d/1ccaSltPDb5n-bLNgWyrotsWXz2n7ckMggOIPejy_GvQ/edit?usp=sharing

We probably need a doodle poll or similar to organise this as well now.

meggart commented 4 years ago

Do I understand correctly that you foresee the user having to work with two types: the actual data array and the data type representing the grid? So if the user slices the data array, he also has to slice the grid correspondingly? It would be nice if the user only had to deal with one object.

No, this is not the plan. I just wanted to suggest an interface that different raster implementations can implement so that a user can query the coordinates of any raster data type using the same function. The slicing would of course happen on the level of the raster type, and calling coordinates on the sliced array would then only return the sliced coordinates.

I think once we have a common interface for the variety of grid types out there, it will be much simpler to compare the different approaches and for users to switch from one to the other.

meggart commented 4 years ago

@rafaqz I agree with your comments. Yes, Center should be a property of the dimension, not the array; very good point.

juliohm commented 4 years ago

Do I understand correctly that you foresee the user having to work with two types: the actual data array and the data type representing the grid? So if the user slices the data array, he also has to slice the grid correspondingly? It would be nice if the user only had to deal with one object.

What I have done in GeoStatsBase.jl with all the spatial data types is to have a DataView type that just stores the indices into the original data:

julia> using  GeoStats

julia> props = Dict(:temperature => rand(10,10))

julia> d = RegularGridData{Float64}(props)
10×10 RegularGridData{Float64,2}
  origin:  (0.0, 0.0)
  spacing: (1.0, 1.0)
  variables
    └─temperature (Float64)

julia> view(d, [1,5,6])
3 DataView{Float64,2}
  variables
    └─temperature (Float64)

What we can also do with regular grid data is define slices which preserve the "regularity" of the grid. This way algorithms that use the slice of a regular grid can be optimized. So far I only use views in my algorithms, but we should definitely consider slices for regular grids.

EDIT: where a slice is just a call view(d, x1range, x2range, ..., xnrange) with n ranges (or integers) in an n-dimensional grid.
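
A small sketch of why such slices can stay regular (the RegularGrid2D/gridslice names are hypothetical, not the GeoStatsBase.jl API):

    # Slicing a regular grid with ranges yields another regular grid, so only
    # the origin and spacing need to be recomputed - no explicit coordinates.
    struct RegularGrid2D
        origin::NTuple{2,Float64}
        spacing::NTuple{2,Float64}
        dims::NTuple{2,Int}
    end

    function gridslice(g::RegularGrid2D, xr::StepRange, yr::StepRange)
        origin  = (g.origin[1] + (first(xr) - 1) * g.spacing[1],
                   g.origin[2] + (first(yr) - 1) * g.spacing[2])
        spacing = (g.spacing[1] * step(xr), g.spacing[2] * step(yr))
        RegularGrid2D(origin, spacing, (length(xr), length(yr)))
    end

    g = RegularGrid2D((0.0, 0.0), (1.0, 1.0), (10, 10))
    gridslice(g, 2:2:10, 1:1:5)   # still regular, just coarser in x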

Alexander-Barth commented 4 years ago

No, this is not the plan. I just wanted to suggest an interface that different raster implementations can implement so that a user can query the coordinates of any raster data type using the same function. The slicing would of course happen on the level of the raster type, and calling coordinates on the sliced array would then only return the sliced coordinates.

I think once we have a common interface for the variety of grid types out there, it will be much simpler to compare the different approaches and for users to switch from one to the other.

OK, thanks. I was not sure if this is the interface that the user needs to work with or the package developer. In the case of a rotated grid, dimension names are rather identifiers (e.g. xi_rho, eta_rho) which are not directly related to a coordinate like longitude or latitude. In your proposed interface, how would one find e.g. the longitude? Would it always be, by convention, the first dimension? Or would the dimension name be equal to "Longitude" (independently of the dimension name in e.g. a NetCDF file)?

rafaqz commented 4 years ago

@Alexander-Barth meggart's proposal is extending the existing DimensionalData.jl / GeoData.jl.

Edit: apologies, it intends to extend these and NamedDims or AxisArrays etc. But it doesn't specify dimension order; that occurs in the dimension packages, and isn't fixed.

In GeoData, dimensions currently get the types Lat, Lon, Vert and Time if they are named something like lat, latitude, lon, long, longitude etc. in the dimension names of the file. If they have another name it will be Dim{:othername}. But this behaviour is definitely up for debate.

There is no set dimension order in any of these packages: they all hold a tuple of dimensions in the order of the underlying data.

In GeoData.jl the main task of implementations that load data sources into an AbstractGeoArray is getting the order and coords/time of the dimensions right, and, with meggart's additions, specifying the grid type and cell type as well. DimensionalData.jl then lets you index the dimensions in any order and sorts out the reordering.

Hopefully the user rarely has to think about any of this and things just work. But they can if they want to and it should be fairly intuitive.

Alexander-Barth commented 4 years ago

So in the case of a rotated grid, the dimension would be e.g. Dim{:xi_rho}, right (for the example ocean_his.nc)? How would the user/library find the longitude of a given grid point using the proposed interface? I do not want to push "rotated grids"; if you think they are out of scope then that is fine for me. But if the interface should support them, then it is not so clear to me how they are handled.

rafaqz commented 4 years ago

Yes that could be the dim name.

I think the idea is that we put the trait system in place and can fill it out later to match all of these different grid types, so talking about it now is good to make sure they can be covered, but they might not get implemented until someone needs them enough to do it.

But once we have the grid trait, say it's RotatedGrid <: IrregularGrid, we can dispatch to methods that handle rotated grids; in DimensionalData it might be sel2indices(::RotatedGrid, dim, selector) = do_something, where selector is At(x), Near(x) etc.

At can hold anything; normally it's one lat or lon lookup value, but it could be 2 or more for matrix lookups.

The RotatedGrid type could have a field holding an AffineMap from CoordinateTransformations, which transforms the coordinates. I'm not sure exactly how to handle coordinate transformations shared by multiple dimensions but not by all (e.g. not by time). We may need to use the grid traits earlier in dispatch when we have the whole tuple of dims, and just push the RegularGrid dims like time into the sel2indices(::Regular..) methods, and work on the others together in RotatedGrid methods.
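
As a very rough sketch of that combination (the names and layout here are speculative, not the DimensionalData implementation):

    # Speculative sketch: a RotatedGrid trait carries an AffineMap from
    # CoordinateTransformations; its inverse turns lon/lat selector values back
    # into (fractional) array indices, which are then rounded.
    using CoordinateTransformations, StaticArrays

    struct At{T}               # stand-in for the At selector
        val::T
    end

    abstract type GridTrait end
    struct RotatedGrid{M} <: GridTrait
        lonlat_from_ij::M      # maps (i, j) indices to (lon, lat)
    end

    function sel2indices(grid::RotatedGrid, lon_sel::At, lat_sel::At)
        ij = inv(grid.lonlat_from_ij)(SVector(lon_sel.val, lat_sel.val))
        Tuple(round.(Int, ij))
    end

    θ = deg2rad(15)            # grid rotated 15 degrees from east/north
    rot = SMatrix{2,2}(cos(θ), sin(θ), -sin(θ), cos(θ))
    grid = RotatedGrid(AffineMap(0.25 * rot, SVector(140.0, -35.0)))
    sel2indices(grid, At(141.0), At(-34.6))   # -> nearest (i, j) indices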

That could happen without too much difficulty in DimensionalData once it's dispatching on the RegularGrid trait, it's all recursive so you can intervene anywhere fairly easily. It starts with a tuple of the existing dimensions and a tuple of the indexing dimension, and works over them recursively until you have regular indices to pass to the parent array.

So the user might just do:

A[Dim{:xi_rho}(At(whatever_it_needs)), Dim{:eta_rho}(At(xxx)), Time(At(DateTime360Day(2015, 2, 2)))]

Or something. It might be more complicated than that, but we can probably use dispatch to resolve it.

Edit: thinking about this more - if we limit an array to a maximum of one irregular grid type on N dims, with M regular grids on the remaining dims, which is probably realistic (is it?), we can do:

A[Irreg(At(144.3), At(-37.5)), Time(At(DateTime360Day(2015, 2, 2)))]

or, since order doesn't matter:

A[Irreg(Lat(At(-37.5)), Lon(At(144.3))), Time(At(DateTime360Day(2015, 2, 2)))]

meggart commented 4 years ago

I think the idea is that we put the trait system in place and can fill it out later to match all of these different grid types, so talking about it now is good to make sure they can be covered, but they might not get implemented until someone needs them enough to do it.

Thanks @rafaqz, this was exactly the plan: try to flesh out something which is generic enough to fulfill future use cases and simple enough to be integrated into existing schemes in a few lines of code. One question would be where an interface like this could live. Since it is ideally not geo-specific, a new small package like DimensionalArrayTraits or similar might make sense. I would be OK with adding the traits to DimensionalData as well, but maybe uptake of the interface by other packages will be larger if we keep the dependency minimal.

Regarding a new telecon date, I am currently travelling, but would be available again from next week. I will create a doodle once I am back if nobody beats me to it.

rafaqz commented 4 years ago

I've started implementing the types in DD for simplicity, but I agree we should move them out afterwards - maybe DimensionTraits.jl instead? It will contain things that could be used by non-array data, and it's a shorter name.

Edit: One question is how to deal with the span of a grid cell. I.e. it might be a Month, not an exact amount of time. The cell span is also useful for bounds() to calculate the actual edges of the dataset if the coordinates specify the start or center of cells. IrregularGrid and CategoricalGrid won't have a cell span.

I've written it up like this:

struct Center{T} <: CoordType{T} 
    span::T
end

In the branch of DD exploring grid traits: https://github.com/rafaqz/DimensionalData.jl/blob/grid/src/grid.jl

mkborregaard commented 4 years ago

Hi guys, thanks for including me in this conversation and sorry for being late to the party. So to answer the original question, my goal for VerySimpleRasters.jl was to make a really light pure-julia package for doing most of the stuff I need to do with rasters.

So, I'm not a geographer, I'm an ecologist, and 99% of all my uses of rasters involve accessing larger-than-memory rasters, extracting values at certain points, aggregating/resampling, cropping/masking, window operations, cost-distance-landscape calculations etc. Time series too. These are all things you can easily do with generic (e.g. image) operations on Julia arrays. So the package reads and writes mmapped arrays and allows operations on the matrix with a simple transformation from coordinates to getindex. It also has essentially zero deps. I thought of it as a bit of an antithesis to most GIS software, which has big binary dependencies in order to read and write many raster formats and passes a lot of C pointers around.
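
For illustration, the coordinate-to-getindex transformation for a regular raster can be as simple as this sketch (names are hypothetical, not the VerySimpleRasters.jl API):

    # An affine georeference for a regular raster: map (x, y) world coordinates
    # onto integer array indices.
    struct Georef
        x0::Float64; y0::Float64   # coordinates of the corner of cell (1, 1)
        dx::Float64; dy::Float64   # cell sizes
    end

    coord2index(g::Georef, x, y) =
        (floor(Int, (x - g.x0) / g.dx) + 1, floor(Int, (y - g.y0) / g.dy) + 1)

    g = Georef(100.0, -40.0, 0.25, 0.25)
    coord2index(g, 101.1, -39.3)   # -> (5, 3), i.e. raster[5, 3] covers that point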

My stakes are 1) this functionality is crucial for transferring my workflow to Julia, and 2) I'd prefer to use something better than VerySimpleRasters. So I'm really happy to merge/subsume/drop any VerySimpleRasters functionality, and also to contribute, within the limits of me actually not knowing very much about existing geospatial data types.

I'm really enthusiastic about the developments here - happy to join a call if I can be useful.

Alexander-Barth commented 4 years ago

For Fabian's interface, I think it would be necessary to add a function like gridcoordinatesname (or any shorter variant :-) ) which would help identify that the first dimension is actually "longitude" (or latitude, elevation, time, ...).

For what it is worth, here is an example of an API which uses the arbitrary grid as the general case: for e.g. a 3D data array, the coordinates are all 3D arrays too, but a special type RepeatedArray is used to store them efficiently for the common case where the dimensions are aligned with longitude/latitude/elevation/time.

https://gist.github.com/Alexander-Barth/41f70c292ec9cdac2543e6c6a01ef612

The advantage is that dimension names (which are not standardized) are not exposed to the user, and one can more easily write generic code which works for different datasets. Optimization for the special and common cases (aligned coordinates) can be handled using Julia's dispatch (but I have not tested this yet).

EDIT: The disadvantage is that users are nudged/forced to think about the arbitrary case, even if the dataset they are using has a quite simple structure.

mkborregaard commented 4 years ago

For arbitrary grids, don't we just need 1) an array, and 2) a transformation function of coordinates onto array indices? Like GeoRasters has?

Alexander-Barth commented 4 years ago

Yes, it boils down to a transformation function, which is (in the context of NetCDF files) usually defined as 2D arrays of longitude and latitude (representing the coordinates of every grid point). I am not familiar enough with GeoRasters to say whether such cases can be handled.

rafaqz commented 4 years ago

I think it would be necessary to add a function like gridcoordinatesname (or any shorter variant :-)

@Alexander-Barth I was just thinking this too. In DD the IrregularGrid trait would have its own dims method (which is also much shorter ;) ) that returns a tuple of the dimension names after the transformation.

Then we could just index with A[Lat(x), Lon(y), Time(At(t))] and not even think about the transformation from Dim{:xi_rho} etc. that happens underneath.

Edit: If it wasn't clear, the array's dims field would itself hold the tuple (Dim{:xi_rho}, Dim{:eta_rho}, Time). An IrregularGrid trait holding the transform function or matrix lookup, and the transformed dimension names tuple (Lat, Lon), would be attached to the first two dims. Time would just have a RegularGrid trait.

That would be general in terms of dimension number, order and names for both the array and transformation.

rafaqz commented 4 years ago

The pull request implementing something like what @meggart proposed is here:

https://github.com/rafaqz/DimensionalData.jl/pull/11/files

It's still a while off being finished but it seems fairly doable.

juliohm commented 4 years ago

Dear all,

Did we set a date for our e-meeting? I am looking forward to brainstorming the issues we raised.

Best,