UXARRAY / uxarray

Xarray extension for unstructured climate and global weather data analysis and visualization.
https://uxarray.readthedocs.io/
Apache License 2.0

The compatibility between Numba and Python vectorization #202

Closed hongyuchen1030 closed 3 weeks ago

hongyuchen1030 commented 1 year ago

I noticed that the helper functions in uxarray/helpers.py use Numba, which, according to its documentation, prefers code written in a "non-Python" style (such as explicit loops) and lacks support for some features (such as nested arrays).

> Numba generates optimized machine code from pure Python code using the [LLVM compiler infrastructure](http://llvm.org/). With a few simple annotations, array-oriented and math-heavy Python code can be just-in-time optimized to performance similar as C, C++ and Fortran, without having to switch languages or Python interpreters.

However, in our upper-level modules such as uxarray/grid.py, we rely on Python vectorization to improve performance (map functions, ndarray manipulation, and so on), which seems to go in the opposite direction from Numba.

I wonder whether these two approaches are compatible with each other. In other words, will mixing the two styles slow down performance?

erogluorhan commented 1 year ago

This is a great question. Let us investigate this. Pinging @anissa111 @rajeeja

dcherian commented 1 year ago

In general, if you vectorize numpy, you end up running most things in the C layer and numba doesn't give you a computation speedup. However, numba will remove a lot of temporary memory copies, and reduce intermediate memory allocations. This brings a giant speedup usually.

Since you've already chosen numba, I would optimize for readability and easy development first, and then use numba on the hot paths.
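To make the point about temporaries concrete, here is a minimal sketch, not code from uxarray. The function names `polyline_length_vectorized` and `polyline_length_loop` are hypothetical, and the `njit` fallback is only there so the snippet still runs where Numba is not installed. The vectorized version allocates several intermediate arrays; the loop version computes the same sum with none, which is the case where Numba tends to help.

```python
import numpy as np

try:
    from numba import njit  # optional: compiles the loop below to machine code
except ImportError:
    def njit(func=None, **kwargs):
        """No-op stand-in (assumption: lets the sketch run without Numba)."""
        if func is None:
            return lambda f: f
        return func


def polyline_length_vectorized(x, y, z):
    """Total length of a 3D polyline using whole-array NumPy operations."""
    # np.diff, each **2, each + and np.sqrt all allocate a temporary array.
    dx, dy, dz = np.diff(x), np.diff(y), np.diff(z)
    return np.sqrt(dx**2 + dy**2 + dz**2).sum()


@njit
def polyline_length_loop(x, y, z):
    """Same result with an explicit loop and no intermediate arrays."""
    total = 0.0
    for i in range(x.shape[0] - 1):
        dx = x[i + 1] - x[i]
        dy = y[i + 1] - y[i]
        dz = z[i + 1] - z[i]
        total += np.sqrt(dx * dx + dy * dy + dz * dz)
    return total
```

Both functions take three 1-D float arrays of equal length. Which one is faster depends on array size and on whether the loop is actually compiled, so it is worth timing both on realistic grid sizes before committing to either style.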

rajeeja commented 1 year ago

> In general, if you vectorize numpy, you end up running most things in the C layer and numba doesn't give you a computation speedup. However, numba will remove a lot of temporary memory copies, and reduce intermediate memory allocations. This brings a giant speedup usually.
>
> Since you've already chosen numba, I would optimize for readability and easy development first, and then use numba on the hot paths.

@dcherian I mostly agree with your comments and would like to keep numba as I have seen good improvements in the integrate functionality.

There were a few versioning and compatibility issues with numba that caused a bit of pain.

@hongyuchen1030 do you think we can match the optimization that numba provides with the changes you propose?

hongyuchen1030 commented 1 year ago

> > In general, if you vectorize numpy, you end up running most things in the C layer and numba doesn't give you a computation speedup. However, numba will remove a lot of temporary memory copies, and reduce intermediate memory allocations. This brings a giant speedup usually.
> >
> > Since you've already chosen numba, I would optimize for readability and easy development first, and then use numba on the hot paths.
>
> @dcherian I mostly agree with your comments and would like to keep numba as I have seen good improvements in the integrate functionality.
>
> There were a few versioning and compatibility issues with numba that caused a bit of pain.
>
> @hongyuchen1030 do you think we can match the optimization that numba provides with the changes you propose?

Since I haven't used Python much and haven't done much data analysis either (I usually work in C/C++ on algorithm implementations), I am not very sure about the details of numba and "numpy-style vectorization".

But according to my observations and the numba documentation, numba is ideal for loop-based numerical code: basically, if we want to make iterative linear algebra function calls, numba is a good tool.

However, the downside is that numba is built for numerical calculation, so it is not compatible with some complicated data structures (like nested arrays), and we can only use the numpy functions that numba supports (a limited selection). From my knowledge and experience, numba doesn't support "np.dot()" and some other np functions we are going to use.

My current algorithm implementation is entirely loop-based (good news for numba), but it also uses a lot of functionality that numba might not support. Although it's possible to vectorize some of it, there are still two things we need to be careful about:

  1. What's the actual speedup of numba vs. numpy vectorization in our algorithm?

  2. We're dealing with geometry, and all the algorithms are based on index-based looping (we need to look into each face to do the analysis). How feasible and how robust would numpy vectorization be here?
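As a rough illustration of the second question, the sketch below computes one per-face quantity both ways. Everything in it is an assumption for illustration: the names `face_centers_loop` and `face_centers_vectorized`, the `FILL` value of -1, and the layout (a padded 2-D connectivity array rather than nested lists, with no face left empty). The "center" is just the plain mean of a face's node coordinates, not a true spherical centroid.

```python
import numpy as np

FILL = -1  # hypothetical marker for unused slots in a padded connectivity row


def face_centers_loop(node_xyz, face_nodes):
    """Index-based version: visit each face, then each of its nodes."""
    n_faces, max_nodes = face_nodes.shape
    centers = np.zeros((n_faces, 3))
    for f in range(n_faces):
        count = 0
        for k in range(max_nodes):
            n = face_nodes[f, k]
            if n == FILL:
                continue  # padded slot, this face has fewer nodes
            centers[f] += node_xyz[n]
            count += 1
        centers[f] /= count  # assumes every face has at least one node
    return centers


def face_centers_vectorized(node_xyz, face_nodes):
    """Whole-array version: a validity mask replaces the inner branch."""
    valid = face_nodes != FILL                 # (n_faces, max_nodes)
    safe = np.where(valid, face_nodes, 0)      # swap fill for a legal index
    gathered = node_xyz[safe]                  # (n_faces, max_nodes, 3)
    gathered = gathered * valid[..., None]     # zero out the padded slots
    return gathered.sum(axis=1) / valid.sum(axis=1)[:, None]
```

The vectorized form stays readable here because every face does the same arithmetic. For algorithms that branch differently per face or stop early, the loop form is usually easier to get right, and since it only touches flat arrays and scalars it is the kind of code Numba is designed to compile.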

rajeeja commented 4 months ago

We have removed a bunch of Numba; the only places it remains are in grid/ geometry, neighbors and connectivity, all mostly non-nested functions.