Interop/testing - Githubissues

rafaqz commented 3 years ago

If you want another test case I'd like to implement this interface in DimensionalData.jl, and eventually use it in DynamicGrids.jl to support more kinds of spatial/temporal arrays.

Is it ready for that? I could start with a branch in DD.

Tokazama commented 3 years ago

That would be great! Probably the easiest first step is getting ArrayInterface.dimnames to return a Tuple{Vararg{StaticSymbol}}.

rafaqz commented 3 years ago

That should be a one-liner ;)

Tokazama commented 3 years ago

That should be a one-liner ;)

That's largely the goal with interoperability here. There will always be more that package authors can do to take advantage of ArrayInterface.jl, but there's no reason that use of the basic traits here should be any more complicated that getting dimension names.

rafaqz commented 3 years ago

Started here: https://github.com/rafaqz/DimensionalData.jl/blob/array_interface/src/array_interface.jl

rafaqz commented 3 years ago

Ok, getting my head around the traits here. Most of these seem useful and possible to integrate in DimensionalData.jl. But I have some thoughts and questions!

It seems the interface is designed to work on the array type (as opposed to the object)? Is that correct? This should be fine, but will need some extra methods in DD.
There are methods like pixel_spacing but the contents of a DimArray and other array types is only sometimes pixels. Should we generalise the name, something like axis_step maybe?
The spacing/time-step in DD may be fixed or variable along the axis. Should we just error if there is no fixed spacing? But generally it can't be assumed that a spatial or temporal dimension has a fixed step size. This may need a trait.
DD has the concept of Intervals as well as Points, which may complicate methods like spatial_offset, duration and time_end as e.g time_end might be interpreted as the last time in the axis for points, but the end of the interval that starts there for intervals.
What happens to non-spatial/temporal dims? what if one dim is categorical?
should onset be time_start for symmetry with time_end? it took me ages to find onset and realise it was time_start!
what kind of object is time_axis and spatial_axes returning?
what are times and spatial_keys returning? if time_axis is a range they will be just the indices for the range?

Tokazama commented 3 years ago

It seems the interface is designed to work on the array type (as opposed to the object)? Is that correct? This should be fine, but will need some extra methods in DD.

That's that basic idea, but the focus is really on the traits here. I've definitely considered other traits that are less focused on cartesian grids, but I figured we should start with the obvious. What other objects did you have in mind?

There are methods like pixel_spacing but the contents of a DimArray and other array types is only sometimes pixels. Should we generalise the name, something like axis_step maybe?

That's a good point. What pixel_spacing does right now could also be done with something like map(step, map(keys, axes(x, is_spatial))). So what we really want is the step size in the keys of spatial indices. I think it makes sense to have a specific name for the spatial and temporal version of these functions so that they could be used outside of named dimensions (e.g., geometric objects). I'm not sure what the best name is for that. Maybe spatial_units?

The spacing/time-step in DD may be fixed or variable along the axis. Should we just error if there is no fixed spacing? But generally it can't be assumed that a spatial or temporal dimension has a fixed step size. This may need a trait.

I've gone back and forth on this one a lot. Even if you have a fixed step size, subsets of that array can easily disrupt this. I think the default is to just return 1, but that has problems too. I'd be interested in what @timholy thinks about this because it can affect JuliaImages.

DD has the concept of Intervals as well as Points, which may complicate methods like spatial_offset, duration and time_end as e.g time_end might be interpreted as the last time in the axis for points, but the end of the interval that starts there for intervals.

Could you give a more concrete example of this. I think I understand but I want to make sure that I actually address this before I start spouting off ideas.

What happens to non-spatial/temporal dims? what if one dim is categorical?

The design is very similar to ImageCore.jl and ImageAxes.jl in its current form. Basically everything is considered "spatial" except for time dimensions. The difference here is that we have this line and you can explicitly say a dimension name is not spatial. I had a macro where you could do @not_spatial name at one point, but I feel like there needs to be a discussion before implementing something like that that could potentially create invalidations.

should onset be time_start for symmetry with time_end? it took me ages to find onset and realise it was time_start!

I'd be open to that. I've also considered first_time and last_time.

what kind of object is time_axis and spatial_axes returning?

what are times and spatial_keys returning? if time_axis is a range they will be just the indices for the range?

time_axis and times where supposed to be similar to axes(x, :time) and keys(axes(x, :time)), respectively. I think it might make more sense to just get rid of these and just call these directly under the hood, because you should be able to get a type stable return of these using axes(x, is_time).

rafaqz commented 3 years ago

What other objects did you have in mind?

I just meant dispatch on the array as constructed object ::SomeArray rather than ::Type{SomeArray}. But it's not important, type based traits are ok.

Maybe spatial_units

I think the spatial/temporal difference is fine, but I would hesitate from using _units in the method for the step size. Units and the step size are often separate fields, and units may not be applied in practice (e.g. Unitful.jl conversion does not yet exist for a lot of string format units). Also falling back to 1 will lead to mistakes with unevenly spaced axis index, I think an error is better?

Could you give a more concrete example of this. (Intervals/Points difference)

Spatial data such as geotiff can distinguish between the pixels representing the mean over an interval or the value at a single point. One consequence of this is a DimensionalData.jl Selector like Between can give different answers for points or intervals. Points are seen to be between two values if they simply fall between them. Intervals are between two values when they also start and end between the values. If this distinction isn't made, the answer for intervals may be different depending on whether the start or center of the interval is stored, even though the interval they represent is meant to be the same. Both options are used in various formats.

Another example for both the last points is that netcdf even has a format where the start and end of every interval is specified explicitly in a 2 row matrix, that may or may not be regularly spaced and contiguous.

Tokazama commented 3 years ago

It's pretty common to have the same issue in brain imaging where an index could refer to any of the 8 corners, the center, or the entirety of a voxel. This isn't an issue if you have a common coordinate system though. Maybe coordinate space traits would be helpful? I'm not really sure if there's much you can do for irregularly spacing, except through an error when trying to get something like the step size.

rafaqz commented 3 years ago

Interesting you have the same issue. DD has traits to handle that - Locus of Start, Center and End for each axis takes care of any combination.

This isn't an issue if you have a common coordinate system though

Can you explain this a little more? How would this solve the Between problem? You mean converting the index first?

I've tried not to convert to common formats eagerly in DD as users may want to save exact copies of e.g tiff or netcdf without conversion error of the coordinates, and also so that everything is lazy as it may not even be used.

Tokazama commented 3 years ago

I'm going to use ( | | ) to represent a single key to an index where the real world reference are positions 1-9

start:     ( 1 |   |   ), ( 4 |   |   ), ( 7 |   |   )
center:    (   | 2 |   ), (   | 5 |   ), (   | 8 |   )
end:       (   |   | 3 ), (   |   | 6 ), (   |   | 9 )

The step size and length are the same for each of these but the initial offsets are 0, 1, and 2 (respectively).

If you want the data at point 2 you can only really get that for the centered position. If you want the data where the key is <=(5) then you would get the indices 1:2, 1:1, and 1:1.

And if the had the same keys but different pixel/voxel locations such as ...

start:     ( 1 |   |   ), ( 2 |   |   ), ( 3 |   |   )
center:    (   | 1 |   ), (   | 2 |   ), (   | 3 |   )
end:       (   |   | 1 ), (   |   | 2 ), (   |   | 3 )

The locations represented would actually be 1:5, 0:4, and -1:3.

If I understand the problem you're describing correctly, then the issue is that if you wanted to average all positions at the real world location 1 then you would have to account for these different reference systems.

In neuroscience we're at the mercy of whoever has created the machine that collects the data and the technician that operates it. Before we do anything where spatiotemporal data from different modalities, hardware, or acquisition protocols interacts in any way we have a template registration step. Encoding whether each position represents any of the eight corners or center of a voxel doesn't actually help us at all because everybody's anatomy is different. Every brain image or electrode array needs to be registered to a common template space before they interact in any way. If we want to look at population averaged information in an individuals brain space, we have to register that population average to the individual's brain. Once that's done most issues are taken care of. The only situation registration doesn't resolve this problem is when constructing a mesh with values (e.g., doing marching cubes), but I'm pretty sure the actual values would have to be interpolated no matter the key position in the voxel.

rafaqz commented 3 years ago

Yep, pretty much. In spatial data those points are fixed e.g. lat/lon on the earths surface, as defined in some coordinate system. As it's always the same earth, not different brains, they need to be accurately comparable to a file in a different format that may use a different start/center/end position. This might amount to e.g. many kilometers of error, and may lead to selecting array subsets of different sizes/spatial extents with the same command, depending on the file format.

Another thing I hadn't got my head around is your examples imply that axes returns the objects that hold the axis keys? DD doesn't do that, it forwards base methods like axes to the parent array as it often makes sense to work with them as regular arrays. dims gets the tuple of Dimension objects that hold the axis keys. The interface probably needs another method that could fall back to axes.

Tokazama commented 3 years ago

Sorry for the severe delay in response.

This might amount to e.g. many kilometers of error, and may lead to selecting array subsets of different sizes/spatial extents with the same command, depending on the file format.

To be clear I don't think this isn't an issue. I'm just explaining why my intuition is telling me this sort of thing isn't a dimension-wise trait and is more of a global coordinate space specification. I still need to think about that sort of thing though. I'm open to different solutions if there is a clear way to integrate it here though.

Another thing I hadn't got my head around is your examples imply that axes returns the objects that hold the axis keys?

This actually originated from a conversation with @mcabbott. I don't have access to that conversation anymore, but keys seemed like the best name for what we were referring to. I can go into more detail about this if you'd like. If you have a very specific set of types that are optimized for your workflow then it might make more sense to overload some of the methods for those types. If what we have here isn't generic enough to easily facilitate it then we do need to change the code here.

rafaqz commented 3 years ago

Ok no problem, we can keep this conversation going as points/intervals questions come up.

And I agree keys is a good method. Maybe I'll move to using it in DD. I was mostly referring to the fact that axes in DD just returns the regular array axes and wont work in your examples, while dims actually gets the dimension objects that keys can be applied to.

AxisArrays.jl has its own axes method, but I find that pretty confusing given what Base.axes means.

mcabbott commented 3 years ago

I haven't read this closely, but on terminology: I think AxisArrays.axes(A) predates Base's use of the term (IIRC indices(A) used to be this tuple of OneTos) but it's now pretty confusing. I think I'm to blame for calling the other things (distances, or whatever) "keys", forgetting that Base also uses that as a synonym for indices, more or less. Overloading Base.keys (on an AbstractArray) leads to bad things happening. A less confusing term might be "axis labels".

rafaqz commented 3 years ago

Ok interesting. I've avoided overloading base methods in DimensionalData for roughly these reasons. Its good if spatial/named arrays behave for most general purposes as generic AbstractArrays. And our additional behaviours just use new functions. (Edit: But I'm doing a lot more with getindex than it should, although base behavious dont change. I just really like those square brackets ;)

axis_labels is good, although it could be confused with e.g :X and :Y. Unless we had axis_names for that? Other potential names for keys are axis_values/axisvalues, axis_lookup/axislookup and similar.

mcabbott commented 3 years ago

axis_labels is good, although it could be confused with e.g :X and :Y.

I would argue here that NamedDims has taken the lead, and terms those "dimension names". Base uses dims = 2 to indicate which of 1:ndims(A), and NamedTuple for a thing with Symbols in its type as names, so that seems pretty consistent.

axis_values/axisvalues

My opinion is that value wants to be what's really in the array, like setindex!(array, value, index...). Although I'm not certain Base really has this in code anywhere.

But I'm doing a lot more with getindex than it should, although base behaviours don't change

It seems less confusing to me to abuse syntax than words. Nobody actually calls getindex(A, X="cat"), they write A[X="cat"] so the fact that "cat" is not an index (meaning Int or CartesianIndex) doesn't jar, IMO.

Tokazama commented 3 years ago

Thanks for chiming in @mcabbott. keys is not used on the array but on the individual axes, so it's not really an issue. The issue with naming/syntax here is specifically related to what the "keys" of the spatial axes should be called. If we end up needing to use something that is very general and applies to all arrays, then it should probably go into ArrayInterface.jl instead of this package.

JuliaArrays / SpatioTemporalTraits.jl

Interop/testing #2