UXARRAY / uxarray

Xarray extension for unstructured climate and global weather data analysis and visualization.
https://uxarray.readthedocs.io/
Apache License 2.0
150 stars 31 forks source link

Project Structure Design for the Arbitrary Precision Mode #339

Open hongyuchen1030 opened 1 year ago

hongyuchen1030 commented 1 year ago

Describe the feature you'd like added to this project Figure out how to design the APIs for using the arbitrary precision mode.

Proposed Design

Grid Object Initialization

By default, our UXArray project uses the python float for all calculations. But if users want to achieve the arbitrary-precision, they can set the multi-precision flag as True and the desired precision when initializing the Grid object. Then all operations in the UXarray will be done with the desired precision.


# The default floating point Grid object
float_grid = ux.Grid(verts) 

# The multiplrecision Grid object that's set to the 100 precision
multi_precision_grid = ux.Grid(verts, multiprecision=True, precision=100)

When mixing the multiprecision grid object and python float grid object:

we can give users two options: make both arbitrary-precision(catch up to the current arbitrary precision) or round both down to the floating points

Helper functions overloads

Currently, on my implementation branch, all grid functions and the related helper functions that needs to deal with the calculation based on the coordinates, are overloaded with the gmpy2.mpfr datatype, for example

def node_lonlat_rad_to_xyz(node_coord):
    """Helper function to Convert the node coordinate from 2D
    longitude/latitude to normalized 3D xyz.

    Parameters
    ----------
    node: list, python `float` or `gmpy2.mpfr`
        2D coordinates[longitude, latitude] in radiance

    Returns
    ----------
    node_cartesian: list, python `float` or `gmpy2.mpfr`
        the result array of the unit 3D coordinates [x, y, z] vector where :math:`x^2 + y^2 + z^2 = 1`

    Raises
    ----------
    RuntimeError
        The input array doesn't have the size of 3.
    """
    if len(node_coord) != 2:
        raise RuntimeError(
            "Input array should have a length of 2: [longitude, latitude]")
    if np.any(
            np.vectorize(lambda x: isinstance(x, (gmpy2.mpfr, gmpy2.mpz)))(
                node_coord)):
        lon = node_coord[0]
        lat = node_coord[1]
        return [
            gmpy2.cos(lon) * gmpy2.cos(lat),
            gmpy2.sin(lon) * gmpy2.cos(lat),
            gmpy2.sin(lat)
        ]
    else:
        lon = node_coord[0]
        lat = node_coord[1]
        return [
            np.cos(lon) * np.cos(lat),
            np.sin(lon) * np.cos(lat),
            np.sin(lat)
        ]
philipc2 commented 1 year ago

Hi @hongyuchen1030

I've given the design some thought, and I'm leaning towards having a separate class for arbitrary-precision arithmetic.

From a design and readability standpoint, having both fixed and arbitrary precision implementations separated by branches in our functions is not ideal.

I'd consider the following design:

  1. ux.core.Grid is our fixed-precision representation of a grid
    • Remains the default choice for users
    • Prioritized for new implementing new features
  2. ux.multiprecision.Grid or similar will be our fixed-precision representation of a grid
    • Development will be separate from the default Grid class
    • Can consider inheritance with ux.core.Grid, but would require some additional thought and considerations
    • Effort into porting Grid methods to support arbitrary-precision (similar to the current PR's that are up)
    • If applicable, new features can be specifically written for arbitrary-precision
  3. Conversion methods between fixed and arbitrary precision grids
    • ux.core.Grid.to_multi_precision and ux.multiprecision.Grid.to_fixed_precision methods

With this design, you could treat ux.multiprecision.Grid as a canvas for your implementations and to showcase your work for your thesis/dissertation. Additionally, it wouldn't limit you to what's already implemented in the current Grid class and allow for a faster turn-around with PR's. Given appropriate methods as described in bullet point 4, this also should keep your implementations relevant and applicable, even if the input dataset isn't in arbitrary-precision.

Let me know what you think. With the redesign finished and almost merged + released, I can spend some time with you setting up this design and working on any modifications. @aaronzedwick let me know what you think of this design, since you've also been working with Hongyu.

philipc2 commented 1 year ago

I've also created a multiprecision label that can be used to identify which issues/PR's are related to this development.

hongyuchen1030 commented 1 year ago

Hi @hongyuchen1030

I've given the design some thought, and I'm leaning towards having a separate class for arbitrary-precision arithmetic.

From a design and readability standpoint, having both fixed and arbitrary precision implementations separated by branches in our functions is not ideal.

I'd consider the following design:

  1. ux.core.Grid is our fixed-precision representation of a grid

    • Remains the default choice for users
    • Prioritized for new implementing new features
  2. ux.multiprecision.Grid or similar will be our fixed-precision representation of a grid

    • Development will be separate from the default Grid class
    • Can consider inheritance with ux.core.Grid, but would require some additional thought and considerations
    • Effort into porting Grid methods to support arbitrary-precision (similar to the current PR's that are up)
    • If applicable, new features can be specifically written for arbitrary-precision
  3. Conversion methods between fixed and arbitrary precision grids

    • ux.core.Grid.to_multi_precision and ux.multiprecision.Grid.to_fixed_precision methods

With this design, you could treat ux.multiprecision.Grid as a canvas for your implementations and to showcase your work for your thesis/dissertation. Additionally, it wouldn't limit you to what's already implemented in the current Grid class and allow for a faster turn-around with PR's. Given appropriate methods as described in bullet point 4, this also should keep your implementations relevant and applicable, even if the input dataset isn't in arbitrary-precision.

Let me know what you think. With the redesign finished and almost merged + released, I can spend some time with you setting up this design and working on any modifications. @aaronzedwick let me know what you think of this design, since you've also been working with Hongyu.

Thank you for your feedback and suggestions. So what you mean is two different Grid objects for different precision modes? I'm open to any good development design that can improve the functionality.

And also there's will be several things:

  1. I completely understand that the default float precision is preferred by most of the geoscience community, and I have taken that into consideration. However, during the implementation of the algorithms, I realized that precision issues can have a significant impact, especially when calculating intersection points. I've put in a lot of effort to handle these issues in functions like latlonbox and intersection point calculations, as well as the non-conservative zonal average. But when you start working with these functions, please keep in mind the importance of precision.

  2. As I mentioned before, all of my functionalities from now on will be overloaded with multi-precision usage. This will greatly benefit my thesis research progress, and I'll ensure that the docstrings are updated accordingly. This way, it will be easier for you to convert them into deliverable APIs.

  3. Currently, my main focus is on algorithm development, as I'm lagging behind a bit in that aspect. I would greatly appreciate your and @aaronzedwick's help in transforming my code into deliverable functions. Once the re-design is complete, it should be much easier for you to work on the API release.

philipc2 commented 1 year ago

I realized that precision issues can have a significant impact, especially when calculating intersection points. I've put in a lot of effort to handle these issues in functions like latlonbox and intersection point calculations, as well as the non-conservative zonal average. But when you start working with these functions, please keep in mind the importance of precision.

Can you point me towards any specific examples of this becoming an issue or when arbitrary-precision preferred, specifically in the scenario where a user is working climate / geoscience data? Are there any mesh files that you've worked with or noticed that benefit from using arbitrary precision or what types of meshes (finer resolution, mixed resolution, etc) would benefit would using arbitrary precision?

One way to rephrase the above statement is as follows: Given that our typical user is working with an unstructured grid (mesh + data) from some sort of model output and wanting to analyze it, are there any documented cases of functionality breaking due to fixed-precision, and at what point does it become necessary to use arbitrary-precision.

As I mentioned before, all of my functionalities from now on will be overloaded with multi-precision usage.

When you mention overloading the functions, do you mean it in an OOP sense of having separate methods for arbitrary and fixed precision depending on the input parameters, or branching inside the existing ones depending on whether we are in fixed or arbitrary land?

Instead of including both implementations under the same function, keeping them in separate classes (or at the minimum separate methods) would make our code significantly more readable and easier to maintain.

Currently, my main focus is on algorithm development, as I'm lagging behind a bit in that aspect. I would greatly appreciate your and @aaronzedwick's help in transforming my code into deliverable functions. Once the re-design is complete, it should be much easier for you to work on the API release.

This is understandable. You've been doing a great job on all of this and once the redesign is in I can start giving it some more attention.

philipc2 commented 1 year ago

Additionally, I'd like to keep the multi-precision implementations in a separate location, since spinning new people up and maintain it could become an issue, however I need to give your PR's a deeper look before sticking to this statement, as there do seem to be directly parallels between certain fixed and arbitrary precision functions.

This will greatly benefit my thesis research progress, and I'll ensure that the docstrings are updated accordingly

Would having a separate class be a hindrance to this?

hongyuchen1030 commented 1 year ago

Can you point me towards any specific examples of this becoming an issue or when arbitrary-precision preferred, specifically in the scenario where a user is working climate / geoscience data? Are there any mesh files that you've worked with or noticed that benefit from using arbitrary precision or what types of meshes (finer resolution, mixed resolution, etc) would benefit would using arbitrary precision?

One way to rephrase the above statement is as follows: Given that our typical user is working with an unstructured grid (mesh + data) from some sort of model output and wanting to analyze it, are there any documented cases of functionality breaking due to fixed-precision, and at what point does it become necessary to use arbitrary-precision.

  1. According to Paul, the unique coordinates detection has been an issue for the fixed-point precision implementation for a while, even for the TempestMap software. An simple example is: closely positioned points could degenerate

  2. How can you ensure the intersection point is precisely on both arcs? A point might seem to be correct at first (We can even assume it's within the 1.0e-15 precision so that the Python float will show the difference as exactly 0.0) But after some operations or iterations, the error accumulates (Since python cannot keep track of the number after the 53bits), And suddenly it no longer fits into both GCRs

  3. Are you sure your latlonbox perfectly encompasses the polygon? When it's actually 1.0e-16 smaller than the actual face boundary.

Similar things will happen along the pathway. As I mentioned before, I am not saying we must incorporate the multi-precision now into our release, and I have already utilized the error tolerance to solve this issue. Just keep in mind these issues still exist and the error tolerance is not always a clean solution.

hongyuchen1030 commented 1 year ago

When you mention overloading the functions, do you mean it in an OOP sense of having separate methods for arbitrary and fixed precision depending on the input parameters, or branching inside the existing ones depending on whether we are in fixed or arbitrary land?

Instead of including both implementations under the same function, keeping them in separate classes (or at the minimum separate methods) would make our code significantly more readable and easier to maintain.

To ensure faster progress, I will initially implement all functionalities using the multi-precision format. If feasible, I will also attempt to implement a float-type solution. However, you have the freedom to modify the code into any other format that is more readable or easier to maintain. And I will maintain the docstring for your reference. Both Paul and I agree that prioritizing the implementation of algorithms is crucial. My involvement in the software development process might be limited to providing functional code that accurately represents the algorithms.

hongyuchen1030 commented 1 year ago

Additionally, I'd like to keep the multi-precision implementations in a separate location, since spinning new people up and maintain it could become an issue, however I need to give your PR's a deeper look before sticking to this statement, as there do seem to be directly parallels between certain fixed and arbitrary precision functions.

Thanks for your great help. Sure it works for me and like I said, no rush on deciding the multi-precision design currently. And you can do any modification you feel more appropriate later.

Would having a separate class be a hindrance to this?

As for right now, I don't have much time to make it a separate class on my local. But I don't think that will not be a problem when we turn into the deliverable

aaronzedwick commented 1 year ago

Splitting it up in the final deliverable does make sense to me since most users aren't likely to use it anyway keeping it separate would make things cleaner. @philipc2 Would we have to make two functions every time a new feature is implemented? One for default precision and one for multi-precision? That might get a little annoying. Let me know if something comes up that I can help with, I would be glad to.

philipc2 commented 1 year ago

To ensure faster progress, I will initially implement all functionalities using the multi-precision format. If feasible, I will also attempt to implement a float-type solution.

By "all functionalities", do you mean any new features that you are working on?

Both Paul and I agree that prioritizing the implementation of algorithms is crucial.

I completely agree. However with your previous statement, do you mean that you'll be implementing all algorithms first in arbitrary-precision and then port them over to fixed?

philipc2 commented 1 year ago

Would we have to make two functions every time a new feature is implemented?

Most likely yes. Instead of branching within our Grid depending on whether we are in fixed or arbitrary precision world, we would have separate functions (and ideally a separate class) to handle the arbitrary-precision case.

While it may seem annoying, combining both into the same class and extending existing methods to branch off for arbitrary-precision would clutter up our implementations, make it less readable, and harder to debug.

I'm leaning towards subclassing our Grid object into a MultiPrecisionGrid, where by default all Grid methods would stay the same, however custom implementations can be written for methods that need to be changed.

Example

Any thoughts?

hongyuchen1030 commented 1 year ago

By "all functionalities", do you mean any new features that you are working on?

yes, any functionalities. latlonbox, conservtive/non-conservative zonal average, remapping and all related helper functions. But it will be very easy for your to rewrite the implementation into float operation.

hongyuchen1030 commented 1 year ago

By "all functionalities", do you mean any new features that you are working on?

yes, any functionalities. latlonbox, conservtive/non-conservative zonal average, remapping and all related helper functions. But it will be very easy for your to rewrite the implementation into float operation.

I completely agree. However with your previous statement, do you mean that you'll be implementing all algorithms first in arbitrary-precision and then port them over to fixed?

The arbitrary-precision one will be my priority. And I will try to port them over to fixed. But even if I couldn't, it will be very easy to convert them back to the float type.

hongyuchen1030 commented 1 year ago

By "all functionalities", do you mean any new features that you are working on?

yes, any functionalities. latlonbox, conservtive/non-conservative zonal average, remapping and all related helper functions. But it will be very easy for your to rewrite the implementation into float operation.

I completely agree. However with your previous statement, do you mean that you'll be implementing all algorithms first in arbitrary-precision and then port them over to fixed?

The arbitrary-precision one will be my priority. And I will try to port them over to fixed. But even if I couldn't, it will be very easy to convert them back to the float type.

Would we have to make two functions every time a new feature is implemented?

Most likely yes. Instead of branching within our Grid depending on whether we are in fixed or arbitrary precision world, we would have separate functions (and ideally a separate class) to handle the arbitrary-precision case.

While it may seem annoying, combining both into the same class and extending existing methods to branch off for arbitrary-precision would clutter up our implementations, make it less readable, and harder to debug.

I'm leaning towards subclassing our Grid object into a MultiPrecisionGrid, where by default all Grid methods would stay the same, however custom implementations can be written for methods that need to be changed.

Example

  • Let's say we subclass our Grid into a MultiPrecisionGrid
  • As mentioned above, by default all methods, attributes, and signatures would be idtentical
  • Instead of branching off inside of __init__, __from_verts__, __from_ds__, etc, a custom implemention can be written.
  • Inherited functions & attributes that don't necessary care whether our data is fixed or arbitrary (think our variable name dictionary and in the redesign the properties for accessing dimensions, connectivity variables, etc) could be kept the same.

Any thoughts?

That sounds good to me. But just to remind when we actually started to implement those algorithms, there will be lots of helper functions for calculation. And it might worth more thoughts about whether keep them overloaded with the multi-precision or write two versions of them

philipc2 commented 1 year ago

yes, any functionalities. latlonbox, conservtive/non-conservative zonal average, remapping and all related helper functions. But it will be very easy for your to rewrite the implementation into float operation.

I'd still strongly consider starting off with a floating point implementation so we can ship these features to the user, and then converting them to arbitrary-precision.

These types of functions and operators have been requested for a while as discussed in #46, so our priority should first be getting them implemented and tested using standard precision to guarantee they are operating as expected, and then as you mentioned if coverting between fixed and arbitrary is easy, you should be able get a multi-precision implemention in shortly after.

This would also allow you to use our standard, fixed-precision Grid, as a comparison for your testing, evaluation, and analysis of your work.

@erogluorhan @paullric

I'd like to hear your opinions on this and we can discuss this durring our meeting on Tuesday.

philipc2 commented 1 year ago

Additionally, implementing this features in fixed-precision first would be better from a review-standpoint, since not everyone here will be spun-up on everything related to arbitrary-precision.

Once we can confirm the functionality, design, and expected behavior in fixed-precision, converting and making it work in arbitrary-precision can be done according to your experience, design, and standards.

hongyuchen1030 commented 1 year ago

yes, any functionalities. latlonbox, conservtive/non-conservative zonal average, remapping and all related helper functions. But it will be very easy for your to rewrite the implementation into float operation.

I'd still strongly consider starting off with a floating point implementation so we can ship these features to the user, and then converting them to arbitrary-precision.

These types of functions and operators have been requested for a while as discussed in #46, so our priority should first be getting them implemented and tested using standard precision to guarantee they are operating as expected, and then as you mentioned if coverting between fixed and arbitrary is easy, you should be able get a multi-precision implemention in shortly after.

This would also allow you to use our standard, fixed-precision Grid, as a comparison for your testing, evaluation, and analysis of your work.

@erogluorhan @paullric

I'd like to hear your opinions on this and we can discuss this durring our meeting on Tuesday.

I think we're on the same page about the software delivery: We don't need to rush about the multi-precision.

And I have implemented these functionalities in floating points for a while (And I am recleaning it while waiting for the redesign). However, unfortunately, the float type implementation is not enough for my thesis requirement, and I have to reimplement it with the multi precision. In other words, for my thesis, only the multi-precision one counts.

And it will be really time-consuming for me to separate each of the functions into different PRs. Like I mentioned, I am lagging behind from my thesis for the developing work. And I am afraid I have to limit my involvement in the software development process to providing functional code that accurately represents the algorithms.

We can also discuss it in more detail during our Tuesday meeting. For now, I think the separate protected branch will be good enough for me as you mentioned in #321

philipc2 commented 1 year ago

separate protected branch will be good enough for me as you mentioned

I've made the branch protected to where it still requires two reviews, however we can flexible with it for the duration of your thesis work and Aaron's internship.

1 review should be either from Hongyu or Aaron, depending on who was the major contributor.

The second review should be from Rajeev or I as a second set of eyes, even though we aren't directly involved with the multi-precision work.

hongyuchen1030 commented 1 year ago

separate protected branch will be good enough for me as you mentioned

I've made the branch protected to where it still requires two reviews, however we can flexible with it for the duration of your thesis work and Aaron's internship.

1 review should be either from Hongyu or Aaron, depending on who was the major contributor.

The second review should be from Rajeev or I as a second set of eyes, even though we aren't directly involved with the multi-precision work.

thank you very much!