gammapy / gammapy

A Python package for gamma-ray astronomy
https://gammapy.org
BSD 3-Clause "New" or "Revised" License
233 stars 198 forks source link

Axis ordering in gammapy.maps #1180

Open woodmd opened 7 years ago

woodmd commented 7 years ago

In gammapy.maps we currently use a different axis ordering convention for the data member and accessor methods. Following the convention of FITS the data member has a column-major ordering where the dimensions of the image come last to first. An advantage of this ordering is that it matches the axis ordering one gets when reading an image from a FITS file, e.g. with astropy.io.fits. On the other hand the accessor methods use row-major ordering with map dimensions ordered from first to last. This is the convention used in astropy.wcs. To me row-major feels like the more natural convention since it preserves the index of an axis when working with maps of different dimension, e.g. longitude is always mapped to index 0 in the row-major scheme vs. index N for column-major where N depends on the dimension of the map.

I wanted to ask here if we should consider switching to a row-major order everywhere in gammapy.maps by making the data member row-major. The current scheme entails transposing the data array in every accessor method call. I don't think there's a big overhead to this (transpose is just making a view on the original data) but it does add some extra complexity and is potentially confusing for users. One possible disadvantage of switching to row-major order is that operations on image slices will be slightly less efficient than with column-major order. However this is probably a small effect that will be highly dependent on the specifics of the CPU architecture. To the extent we find that this is important we always have the option of using the order property of np.ndarray to change the internal memory representation.

A related question is whether we want to allow additional freedom in the axis ordering. Currently gammapy.maps fixes longitude and latitude as the first and second dimensions of the map followed by a set of non-spatial dimensions with an order specified by the user. We could consider allowing the user to specify the desired order of the axes (e.g. swapping latitude and longitude). I don't think there would be much to be gained with this flexibility but I thought I would bring it up in case anyone felt this would be a valuable feature.

cdeil commented 7 years ago

@woodmd - I'll think about it and look around what others do. I'll get back to you.

Mainly I want to look at http://xarray.pydata.org/ , but also e.g. https://spectral-cube.readthedocs.io . My first though is that labeled axes and always working via MapCoords would be a good thing, as it would support arbitrary axes orders, and we have full flexibility to support any data as it comes in, or re-arrange axes for optimum performance for analysis, and just accessing axes by name is more readable code than by index in plain tuples.

My understanding is that xarray is very well done by numpy experts, has seen many use cases (comes from meteorology and they have many maps with different extra axes and files formats and petabytes of data, but now also has been used in other domains) and probably we should just do things as much as possible like they did, but make it work for our applications. Did you look at xarray already? What do they do concerning axis ordering and the questions you pose here?

cdeil commented 7 years ago

I looked a bit at xarray and spectral-cube. So my understanding is that xarray is completely flexible with axis order, and spectral-cube is completely fixed.

It's a bit cryptic what they do exactly here, i.e. if / when the transpose or re-order data arrays if they come in with an axis order they don't like: https://spectral-cube.readthedocs.io/en/latest/accessing.html#data-values I didn't look at their code yet.

The key thing I like about xarray is that they have a string name for each dimension in the data_array.dims attribute. See http://xarray.pydata.org/en/stable/data-structures.html .

This means that one can get / set by axis name, which makes calling code independent of the axis order of the underlying data, and just more explicit and readable. Note that we have something similar of already for NDData where evaluate calls out axes by name: https://nbviewer.jupyter.org/github/gammapy/gammapy-extra/blob/master/notebooks/nddata_demo.ipynb and also SkyCube.lookup which calls out axes by name: http://docs.gammapy.org/en/latest/api/gammapy.cube.SkyCube.html

I think this would be my main suggestion for the axes / accessors / interpolation methods in gammapy.maps: that they access just by tuple: http://docs.gammapy.org/en/latest/maps/index.html#get-set-and-fill-methods If they accessed axes by name (either using keyword arguments, or dicts with axis names as keys like dict(energy=..., lon=..., lat=...)), then I think calling code would be clearer to read / maintain, and even more general because it would work with any axis order or underlying numpy array data order, and I think there would be no impact on performance for the important case where one has many positions at once, like evaluating on many lon / lat. I presume this could be implemented by using the existing MapCoords class, by just passing the input to the accessors always to make a MapCoords object, and by internally only working with such objects.

From what I see, this would be a significant change, but not a huge one. E.g. after looking at xarray, I'm pretty sure we don't need or want that as a dependency and to use their data classes. A major thing they do is to use the pandas Index machinery for their coords, and I don't think we need that really, our use cases for slicing and dicing and aligning axes are simple enough to just do it with Numpy (like what you have now already).

@woodmd - Note that astropy.wcs.WCS also supports in many parts now axis access by name, i.e. there special axes "longitude" and "latitude" and the "spatial" part of the WCS can be accessed by name, without having to know which dimension it is. I don't remember the specifics, but I think this is used throughout in the spectral-cube package, so that would contain examples how to do that. (and I didn't look what you do now, maybe you're already doing that in gammapy.maps).

Another thing that I think is really important is to have a good __str__ and repr` that shows which dimensions there are, how many pixels they have, what the axis order is, what the data and axis units are. Examples:

With xarray it's the same: whenever they load a DataArray, the first thing they do is print it to show what dimensions they have (and in their case also "coords", which I think is something we don't need, but maybe also want to introduce to have cached arrays of lon / lat coords that can then be used e.g. for model evaluation): http://xarray.pydata.org/en/stable/examples/multidimensional-coords.html

@woodmd - I'd like to help in the coming days / weeks to really flesh out and polish gammapy.maps. For now mostly by starting to use it and changing examples / part of Gammapy to use it, hopefully soon I'll be familiar with it and can help with implementation in gammapy.maps.

I don't have an opinion on your direct question above yet, whether we should transform the underlying data numpy array on I/O or maps creation, there whatever you think best is fine with me. But what do you think about this direction of named axes / accessors in the API, and having the underlying data be flexible / mostly a class-internal thing that callers don't rely on, i.e. can have any axis order?

woodmd commented 7 years ago

Did you look at xarray already? What do they do concerning axis ordering and the questions you pose here?

I haven't had a chance to look carefully at these yet. I think your suggestion for named accessor methods would partially reduce the need for a consistent indexing scheme so maybe we should focus on this first. As we had discussed before it was always my intention to add support for this but I just haven't had time to think about it as I've been mostly focused on the internals up to now.

I presume this could be implemented by using the existing MapCoords class, by just passing the input to the accessors always to make a MapCoords object, and by internally only working with such objects.

I think this is the right idea. Right now MapCoords enforces that same axis ordering as the maps classes so to implement this we would need to change the internal data member from a tuple to a dict. Then I think in the maps class methods we could use a construction like the following:

def someMethod(self, coords):
    # convert to MapCoords
    coords = MapCoords.create(coords)
    # Convert to coordinate tuple
    coords = coords.to_tuple(geom)

where the geometry object would carry the information about the axis ordering. The call to MapCoords.create could also accept dictionary input so that your suggested access pattern could be supported:

m.get_by_coords(dict(lon=..., lat...))

One question might be whether we need separate classes for carrying pixel coordinates and indices or if we could use MapCoords for both.

Another thing that I think is really important is to have a good str and repr`

I fully agree. As a start I guess we should just dump the index and name of each axis. However keep in mind that giving a concise/useful description of the pixelation in the spatial axes can be a little tricky when using HEALPix or multi-resolution geometries.

I'd like to help in the coming days / weeks to really flesh out and polish gammapy.maps.

Ok feel free to ping me on slack if you want discuss any of this further. I'm happy to help with any of your proposals so I think it's mostly a matter of prioritizing what you want to tackle first and organizing the development so that we're not stepping on each others changes.

adonath commented 5 years ago

Changing the axis ordering for Map is off the table for v0.15. I would choose convention over configuration here and stick with what we have, even for v1.0. The axis ordering is e.g. documented and explained here: https://docs.gammapy.org/dev/notebooks/maps.html#Accessing-Map-Data-Values

We now also have a nicer __repr__, which shows the axis order of the API:

WcsNDMap

    geom  : WcsGeom 
    axes  : ['lon', 'lat', 'energy']
    shape : (100, 100, 12)
    ndim  : 3
    unit  : 
    dtype : float32

My proposed long-term solution is probably to get rid of the tuple() based coordinate definition and always require to pass axis names via dicts or use a MapCoord object.

cdeil commented 5 years ago

Long-term solution: suggest to look at http://xarray.pydata.org/ or https://docs.astropy.org/en/stable/wcs/wcsapi.html and consider to either use directly, or mirror what they have.