boutproject / xBOUT

Collects BOUT++ data from parallelized simulations into xarray.
https://xbout.readthedocs.io/en/latest/
Apache License 2.0
21 stars 10 forks source link

Regions don't work with sliced Datasets #115

Open johnomotani opened 4 years ago

johnomotani commented 4 years ago

The regions introduced in #107 are based on slicing the Dataset with global indices. The indices stored in Region objects will not be correct if the Dataset is sliced for some reason, e.g. ds = ds.isel(x=slice(20, None).

One way to resolve this might be to add global index coordinates, e.g. for a dataset with dimensions {'x', 'theta', 'zeta'} have some coordinates {'x_ind', 'theta_ind', 'zeta_ind'} that give global integer indices. Then we could use this in Region.getSlices() (which might need an extra Dataset or DataArray argument to get the coordinates from) to get offset. E.g. if the passed-in DataArray has x_ind that does not start from 0, can subtract the da[x_ind][0] from Region.xinner_ind, and check that Region.xinner_ind is within the bounds of x_ind.

TomNicholas commented 4 years ago

If the regions are specified along the dims 'x' and 'y', can we not just promote 'x' and 'y' to be 1D coordinates? Then their values would be preserved upon slicing.

The only disadvantage I can think of that approach would be if you wanted x=0 to be within the domain, and x=-1 to be in the left-hand boundary cells, then you would end up with an x coord which had a value -1 where it's corresponding dim was index 0...

johnomotani commented 4 years ago

Yes, naming the coordinates x, y and z would make sense. I think I was worried that that would get confusing when the dimensions are sometimes renamed and sometimes not, e.g. if you rename dimension y to theta, what happens to the coordinate y? Maybe it just needs to be created after the dimensions are renamed...

It would be nice to have the indices count just 'real' grid cells, but in practice I would start the coordinates from 0 in the boundary cells. Otherwise it's implied that the coordinate y should not count any boundary cells, which would make it double-valued in the upper target boundary cells (if there are any).

TomNicholas commented 4 years ago

I think I was worried that that would get confusing when the dimensions are sometimes renamed

I think this is an argument for not renaming dimensions at all. There isn't a penalty to carrying around multiple coordinates, so I think theta should be added as a new coordinate, without replacing y. (I realise that might be different from what I've advised before but if there's a clear need for x and y to be coordinates then I think we should always keep them).

There might be one or two functions in xarray which should accept 1D non-dimension coordinates but currently would refuse to, but the neatest solution to that would just be to generalise upstream (I've submitted at least one PR doing that before).

johnomotani commented 4 years ago

I guess the biggest one would be: can you isel(theta=42) when theta is a coordinate but not a dimension?

TomNicholas commented 4 years ago

I guess the biggest one would be: can you isel(theta=42) when theta is a coordinate but not a dimension?

No you can't. That makes some sense though - theta doesn't have a value of 42 anywhere, it only takes values between 0 and 2pi in any tokamak. To select the 42nd index you conceptually should be selecting along a dimension index, not a coordinate, with isel(y=42). (You could imagine changing .sel() so that it dispatched to .isel() if passed a dimension, but I expect that would lead to lots of hard-to-debug mistakes in user code).

You should be able to do sel(theta=42), but unfortunately you currently can't. This was something I've been meaning to add because it seems pretty crucial!

johnomotani commented 4 years ago

dimensions with the same names as coordinates seem a bit confusing in general... especially since BOUT++ uses x/y/z both as coordinates (with spacing dx/dy/dz) and often (in the code) as indices. dx, dy and dz are available in the Dataset, so setting y to be a grid-cell-number coordinate is arguably just as confusing as isel(theta=42) is at the moment... Not sure what a good naming convention is though... (i,j,k) is a bit likely to have name clashes. (ix, iy, iz), (jx, jy, jz), (ix, jy, kz) are possible but a bit obscure. If we put some naming suggestions as individual comments, we can vote on them with reactions. Feel free to add more if you think of them! ...

johnomotani commented 4 years ago

Question: what should we name the dimensions and grid-cell-index coordinates in xBOUT?

johnomotani commented 4 years ago

psi, theta, zeta

johnomotani commented 4 years ago

x, y, z

johnomotani commented 4 years ago

psi_ind, theta_ind, zeta_ind

johnomotani commented 4 years ago

x_ind, y_ind, z_ind

johnomotani commented 4 years ago

i, j, k

johnomotani commented 4 years ago

ix, iy, iz

johnomotani commented 4 years ago

jx, jy, jz

johnomotani commented 4 years ago

ix, jy, kz

ZedThree commented 4 years ago

I'd be happy with any variation on x/y/z_ind, i/j/k, ix/iy/iz or jz/jy/jz as long as it's documented up front :) I prefer those options as they're a little bit clearer that they're index-space, or at least likely not real space.

Can I veto any variation that uses psi/theta/zeta though? That heavily implies a particular coordinate system.

kz makes me think "wavenumber", so -1 for that option :smile:

johnomotani commented 4 years ago

@ZedThree sorry, was writing the above a bit too late... The 'psi/theta/zeta' options would depend on the geometry chosen by the user. At the moment if you pass geometry='toroidal' to open_boutdataset(), when it calls apply_geometry() it goes through to add_toroidal_geometry_coords(), which renames the dimensions as x/theta/zeta (x because psi_poloidal is not necessarily a 1d coordinate due to possibly different dx in PFR core). Other geometry implementations can do the same (e.g. s-alpha uses 'r' for the radial coordinate). So psi/theta/zeta should really be 'geometry-choice-dependent, and possibly user-defined, for example psi/theta/zeta', and similarly 'psi_ind/theta_ind/zeta_ind'.

ZedThree commented 4 years ago

Ah! In that case, definitely the psi_ind version!

johnomotani commented 3 years ago

Just came back to look at this issue again. It's what prevents plotting selections of the poloidal plane from working.

The problems discussed above I think are now mostly resolved, since we added bout_tdim, bout_xdim, bout_ydim and bout_zdim to the metadata to keep track of the dimension names.

Fixing this now needs a bit of extra work to:

  1. add global grid-index coordinates (which would be more convenient to use with ixseps1, jyseps1_2, etc.). For example if ds.metadata["bout_ydim"] == "theta" we could add a coordinate called theta_global_ind.
  2. update the methods in region.py to use those global coordinates i.e. to replace current things like (this isn't actual code, just to illustrate changing isel to sel)
    xcoord = ds.metadata["bout_xdim"]
    result = ds.isel({xcoord: slice(lower, upper)})

    to something like

    xcoord = f"{ds.metadata['bout_xdim']}_global_ind"
    result = ds.sel({xcoord=slice(lower, upper)})

    [and we'll have to be careful because isel excludes the end-point in the normal Python way, but sel includes the end-point (I guess because it works with arbitrary labels, not just integers and if you had a coordinate that was ("yellow", "green", "blue") it's weird to need to do .sel(color=slice("yellow", "blue") to get just "yellow" and "green".]

  3. Might also need some special handling for the case when a region has no points left in it after a Dataset is sliced.