NOAA-ORR-ERD / gridded

A single API for accessing / working with gridded model results on multiple grid types
https://noaa-orr-erd.github.io/gridded/index.html
The Unlicense

Add algorithm for dual mesh creation for edges #61

Open Chilipp opened 3 years ago

Chilipp commented 3 years ago

Hey @ChrisBarker-NOAA and @jay-hennen,

here is the next part of my contribution concerning the dual mesh creation mentioned in #59. It's an algorithm to create the dual mesh for the edges (see the third figure in https://github.com/NOAA-ORR-ERD/gridded/issues/59#issue-756511388). It works for both masked arrays (when you have mixed shapes, for instance; see the test_create_dual_edge_mesh_na unit test) and regular meshes (e.g. triangular; see the test_create_dual_edge_mesh unit test).

I also fixed a bug in one of the methods I introduced in #60 with 2d77a5f and added a method to get the orientation of an edge in the face. This method is necessary to make sure that the nodes in the face_node_connectivity of the dual mesh are ordered in the correct anticlockwise order.

The algorithm does not read very nicely, as it's mostly numpy indexing, but I tried to include enough comments to make it reasonably understandable.

Concerning efficiency: I had to introduce some Cython code that basically inverts the face_edge_connectivity into an edge_face_connectivity. I couldn't come up with a better solution than that. I tried cKDTree, but it broke for a larger mesh with about 100,000 faces. Hopefully that's okay for you. The implemented solution is quite efficient: creating the dual mesh for one million faces took about 3.8 seconds.

The last step concerning #59 would then be to create an equivalent method for the nodes.

Huite commented 3 years ago

(@ChrisBarker-NOAA Excuse my radio silence, I've had things come up unexpectedly last year, this year should be better...)

For the record, I was looking at some vectorized approaches, here's a first attempt:

import numpy as np


def build_edge_face_connectivity(
    face_edge_connectivity: np.ndarray,
    n_edge: int,
    fill_value: int,
) -> np.ndarray:
    n_face, nmax_edge = face_edge_connectivity.shape
    # Get rid of the fill_value, create a 1:1 mapping between faces and edges
    is_valid = (face_edge_connectivity != fill_value).ravel()
    face_index = np.repeat(np.arange(n_face), nmax_edge)[is_valid]
    edge_index = face_edge_connectivity.ravel()[is_valid]

    # We know that every edge will have either one or two associated faces
    isface = np.empty((n_edge, 2), dtype=bool)
    isface[:, 0] = True
    isface[:, 1] = np.bincount(edge_index, minlength=n_edge) == 2

    # Allocate the output array
    edge_face_connectivity = np.full((n_edge, 2), fill_value, dtype=np.int64)
    # Invert the face_index, and use the boolean array to place the faces
    # appropriately; a stable sort keeps the per-edge face order deterministic
    edge_face_connectivity.ravel()[isface.ravel()] = face_index[
        np.argsort(edge_index, kind="stable")
    ]
    return edge_face_connectivity

I ran a quick test locally; it seems to work okay. @Chilipp, it might be worthwhile to run a quick benchmark? The key "trick" is using boolean indexing on ravelled arrays to deal with those fill values.
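
To make the intended call concrete, here is a minimal usage sketch on a toy two-triangle mesh; the connectivity values and fill_value below are made up purely for illustration and are not part of the PR.

face_edge_connectivity = np.array([
    [0, 1, 2],   # face 0 uses edges 0, 1, 2
    [2, 3, 4],   # face 1 uses edges 2, 3, 4 (edge 2 is shared)
])

result = build_edge_face_connectivity(face_edge_connectivity, n_edge=5, fill_value=-1)
# Expected output: edge 2 maps to faces [0, 1]; every other edge maps to
# [face, -1] because it borders only a single face.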

Typical downsides of a vectorized approach are on full display here, of course:

I've also got a few other vectorized approaches (edge_node_connectivity, face_edge_connectivity, face_face_connectivity), but it might be worth considering whether Cython/numba isn't better (performance, readability; my vote is for numba over Cython, since people can install it more easily than a C compiler, especially on Windows).

Chilipp commented 3 years ago

Hey @Huite! Awesome! This algorithm works much better indeed; it runs faster and reads much better, too! Thanks a lot for the input. I'll update the PR and implement your method. I made you a collaborator on the fork, so if you want to contribute more, go for it =)

Chilipp commented 3 years ago

@ChrisBarker-NOAA and @jay-hennen, from my perspective this is ready to be reviewed.

Chilipp commented 2 years ago

Hey @ChrisBarker-NOAA and @jay-hennen! It's been a while since I submitted this PR. Is there any interest in merging this functionality (and the one from #62) into the gridded package? If so, I can resolve the merge conflicts; if not, I will implement this in a separate package.

ChrisBarker-NOAA commented 2 years ago

Sorry for the radio silence.

Yes, let's merge it -- I don't have the time to properly review it, but it looks like it won't break anything existing, so let's put it in.

A couple thoughts from a quick glance:

1) masked arrays -- in the UGRID netCDF standard, we tend to use -1 for indexes that are non-existent -- I'm not sure to what extent we should translate those to masked arrays in the Python data model -- but we should be consistent in the whole package.

2) Cython: I'm a fan of Cython, but it does introduce an additional burden for building and distributing the package. Could you look into making the Cython parts optional?

3) edge_orientation: is it necessary to keep this in the Python model? Or could we define a single standard and then translate on I/O? (e.g. what we do for Fortran-style indexing)

ChrisBarker-NOAA commented 2 years ago

Oh, is #62 still needed? or is there just this one now?

Chilipp commented 2 years ago

Hey @ChrisBarker-NOAA! Thanks for the immediate feedback :smiley:

Oh, is #62 still needed? or is there just this one now?

Well, actually the question is whether this PR is needed anymore, as #62 implements the commits of this PR as well. But at the time I thought it might be useful to separate the two things, as one needs Cython (#62) and the other one doesn't.

  1. masked arrays -- in the UGRID netCDF standard, we tend to use -1 for indexes that are non-existent -- I'm not sure to what extent we should translate those to masked arrays in the Python data model -- but we should be consistent in the whole package.

I am fine with that. But we should make sure that the netCDF I/O function converts the netCDF data according to your -1 policy. However, I must admit that personally I feel uncomfortable with the -1 policy. Subsequent software that uses the gridded package should account for the fact that there might be missing values. The safest way would be to use something like n+1 instead of -1, to intentionally produce errors. Using -1 will not produce errors in Python, although the coordinates are wrong. But anyway, if that is the policy, I am fine with that.
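
A small illustration of the concern (not part of the PR, just numpy behavior): indexing with -1 silently wraps to the last element, whereas an out-of-range sentinel fails loudly.

import numpy as np

node_x = np.array([0.0, 1.0, 2.0, 3.0])
face_node_connectivity = np.array([[0, 1, 2],
                                   [1, 2, -1]])   # -1 marks a missing node

# -1 silently returns the last node -- no error, but wrong coordinates:
print(node_x[face_node_connectivity[1]])   # [1. 2. 3.]

# An out-of-range sentinel such as len(node_x) would raise an IndexError instead:
# node_x[np.array([1, 2, 4])]   # IndexError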

Shall I change the netCDF I/O-method?

  2. Cython: I'm a fan of Cython, but it does introduce an additional burden for building and distributing the package. Could you look into making the Cython parts optional?

I can't see a way to make Cython optional. Of course, you can implement it so that Cython does not need to be installed for the installation, but I guess you will then have to use wheels and/or conda to make the installation work seamlessly on all platforms.

The only way that I see to make the Cython completely optional is to move the code to a separate package.

  3. edge_orientation: is it necessary to keep this in the Python model? Or could we define a single standard and then translate on I/O? (e.g. what we do for Fortran-style indexing)

It's been quite a while since I implemented this, sorry :sweat_smile: I'll get back to you about this tomorrow.

ChrisBarker-NOAA commented 2 years ago

Well, actually the question is whether this PR is needed anymore, as #62 implements the commits of this PR as well. But at the time I thought it might be useful to separate the two things, as one needs Cython (#62) and the other one doesn't.

OK, if it's clean and easy to keep them separate then let's do so, but if it's easier to do it all at once, that's fine too.

As for the Cython dependency -- the way to make it optional is to have a non-Cython (presumably slow) code path, and put the actual Cython code in its own module. Then at run time, you do a:

try:
    from the_cython_module import the_function_you_need
except ImportError:
    pass

Then the slow one will get overridden by the fast one if it's there, and not otherwise.

At build time, we look for Cython, and if it's there, build the Cython module; if not, don't build it, and the pure-Python version gets used. Maybe print a warning for the user that an optional accelerator was not built.
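
A minimal sketch of that build-time logic, assuming a hypothetical module name and path (not the actual gridded layout) -- build the extension only when Cython can be imported, otherwise fall back to the pure-Python path:

from setuptools import setup, Extension

ext_modules = []
try:
    from Cython.Build import cythonize
    ext_modules = cythonize([
        Extension("gridded.pyugrid._fast_connectivity",
                  ["gridded/pyugrid/_fast_connectivity.pyx"]),
    ])
except ImportError:
    print("Cython not found -- skipping the optional accelerator module.")

setup(
    name="gridded",
    ext_modules=ext_modules,
    # ... the rest of the usual setup() arguments ...
)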

For the -1 indexes -- good point, sometimes it's tricky that -1 is valid as an index in Python.

So maybe take the opposite route and make them always masked in gridded Python code. In any case, it should be consistent in all of gridded, and any adjusting for the netCDF (or other) specs should be done on I/O.
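
One way such an I/O-time translation could look -- a sketch only, not the current gridded behavior, assuming -1 is the on-disk fill index and hypothetical helper names:

import numpy as np

def connectivity_from_netcdf(raw, fill_value=-1):
    # On read: mask the fill indices so in-memory code sees a masked integer array
    return np.ma.masked_equal(raw, fill_value)

def connectivity_to_netcdf(conn, fill_value=-1):
    # On write: fill masked entries with the on-disk sentinel again
    return np.ma.filled(conn, fill_value)

raw = np.array([[0, 1, 2, -1],
                [1, 2, 3, 4]])
print(connectivity_from_netcdf(raw))
# [[0 1 2 --]
#  [1 2 3 4]]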

Honestly, I don't remember exactly what gridded is doing right now :-)

A challenge here: sometimes we want lazy-loaded arrays -- so that inside gridded an array may be a netCDF variable, not a numpy array, and thus not be masked. I think that that's only the case for data arrays, and the grid parameters are always fully realized numpy arrays, but again, I'm not totally sure about that.

Chilipp commented 2 years ago

Hey @ChrisBarker-NOAA! My apologies for the silence.

As for the Cython dependency -- the way to make it optional is to have a non-Cython (presumably slow) code path, and put the actual Cython code in its own module.

Fine with me. I do not see the benefit, as the package on PyPI etc. should contain the compiled Cython code anyway, but OK. Or are there no intentions to publish the gridded package on pypi.org and conda-forge?

So maybe take the opposite route and make them always masked in gridded Python code. In any case, it should be consistent in all of gridded, and any adjusting for the netCDF (or other) specs should be done on I/O.

I am fine with anything :) To my knowledge, no software has so far found a good solution for NaN in coordinates. xarray just converts to a float array and fills it with np.nan. But I think this would not be a good approach for the UGRID conventions.
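
For context, a tiny numpy illustration of that float conversion (not gridded or xarray code, just showing why NaN forces integer connectivity into a float dtype):

import numpy as np

conn = np.array([[0, 1, 2, -1],
                 [1, 2, 3, 4]])
as_float = np.where(conn == -1, np.nan, conn)   # NaN has no integer representation,
print(as_float.dtype)                           # so the whole array becomes float64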

Honestly, I don't remember exactly what gridded is doing right now :-)

Me neither :sweat_smile: But I'll have a look and come back to this if you like.

A challenge here: sometimes we want lazy-loaded arrays -- so that inside gridded an array may be a netCDF variable, not a numpy array, and thus not be masked. I think that that's only the case for data arrays, and the grid parameters are always fully realized numpy arrays, but again, I'm not totally sure about that.

I think it's totally OK to load the coordinate information directly into memory. That's what xarray does as well when you open a dataset.

ChrisBarker-NOAA commented 2 years ago

As for the Cython dependency -- the way to make it optional is to have a non-Cython (presumably slow) code path, and put the actual Cython code in its own module.

Fine with me. I do not see the benefit, as the package on PyPI etc. should contain the compiled Cython code anyway, but OK. Or are there no intentions to publish the gridded package on pypi.org and conda-forge?

Well, if you build wheels and conda packages, then yes, it's not an issue. And we do intend to support conda-forge for conda; I'm not sure about wheels on PyPI, as the dependencies are ugly in a way that PyPI does not support well.

And yes, we can (and probably should) ship the generated C code. But not everyone can compile Python extensions.

Maybe in the world of binary wheels and conda packages it's no longer necessary, but let's try it that way for now.

And it's sometimes handy to have a pure-Python version for testing, etc., as well.

As for NaN vs. masked vs. -1: give something a try and see what works, ideally without changing much in how the current code does it :-)

ChrisBarker-NOAA commented 1 year ago

@Chilipp: This seems to have gotten dropped by both of us :-(

We are taking a fresh look at gridded, and may be doing some refactoring -- so I'm pinging you to see if you are still interested in getting this done.

Chilipp commented 1 year ago

Hi @ChrisBarker-NOAA! Good to hear from you. Glad to hear that you are thinking about the refactoring. Yes, I am still interested, although my time is very limited at the moment. My hope is that I can work on this topic in April or May.