ACEsuit / Polynomials4ML.jl

Polynomials for ML: fast evaluation, batching, differentiation
MIT License

On-the-fly Spherical Harmonics #66

Closed cortner closed 9 months ago

cortner commented 11 months ago

Is there a potential advantage in evaluating spherical harmonics on-the-fly from analytic expressions instead of recursively? It would mean zero memory usage. Most likely not for large lmax, but we never go that high, and for ENN architectures one sticks with extremely low degrees. In general, there could be a huge advantage for multi-threading.

An optimized basis code could be code-generated using StaticPolynomials.jl

https://en.wikipedia.org/wiki/Table_of_spherical_harmonics

CC @ilyes319
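To make the idea concrete: the linked table gives every Ylm up to low degree as a short closed-form polynomial in the Cartesian components, so they can be hardcoded with no recursion and no work arrays. A minimal sketch (in Python for illustration only; the package itself is Julia, and the function name `ylm_l2` is made up here):

```python
import math

# Real spherical harmonics up to l = 2, hardcoded from the standard table
# (see the Wikipedia link above).  (x, y, z) is assumed to be a unit vector.
# Ordering: (0,0), (1,-1), (1,0), (1,1), (2,-2), (2,-1), (2,0), (2,1), (2,2).
C00 = 0.5 * math.sqrt(1.0 / math.pi)
C1  = math.sqrt(3.0 / (4.0 * math.pi))
C2A = 0.5 * math.sqrt(15.0 / math.pi)
C2B = 0.25 * math.sqrt(5.0 / math.pi)
C2C = 0.25 * math.sqrt(15.0 / math.pi)

def ylm_l2(x, y, z):
    # No recursion and no storage: each basis function is a short
    # closed-form polynomial in (x, y, z), so a compiler can inline
    # and unroll the whole evaluation.
    return (
        C00,
        C1 * y, C1 * z, C1 * x,
        C2A * x * y,
        C2A * y * z,
        C2B * (3.0 * z * z - 1.0),
        C2A * x * z,
        C2C * (x * x - y * y),
    )
```

A quick correctness check is Unsöld's sum rule: for each l, the sum of Ylm² over m equals (2l + 1)/(4π) at any unit vector.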

cortner commented 11 months ago

My very first test with just L = 3:

# symbolic R-Ylm and using only static arrays
3 ns
# in-place preallocated ... `evaluate!`
107.422 ns

I know this sounds completely ridiculous (constant propagation??), but I tested the result by evaluating it for 1000 randomly generated inputs, and I get almost exactly 3 µs. I'm still suspicious, but I'm not sure what to test next.

No storage or anything else is required; the basis functions could simply be evaluated on the fly (which I haven't really tested yet).
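The batching check described above can be sketched generically: if a single-call timing looks "too good" (constant propagation?), time a batch of random inputs and divide by the batch size. A rough illustration of the methodology (Python's `timeit` standing in for the Julia benchmark; `y1` is a made-up l = 1 example):

```python
import math
import random
import timeit

C1 = math.sqrt(3.0 / (4.0 * math.pi))

def y1(x, y, z):
    # closed-form l = 1 real spherical harmonics (illustrative only)
    return (C1 * y, C1 * z, C1 * x)

# Generate 1000 random unit vectors, as in the sanity check above.
random.seed(0)
pts = []
for _ in range(1000):
    x, y, z = (random.gauss(0.0, 1.0) for _ in range(3))
    r = math.sqrt(x * x + y * y + z * z)
    pts.append((x / r, y / r, z / r))

# Time the whole batch; the per-call cost is the batch time divided by
# the batch size, which a compiler cannot constant-fold away.
t = timeit.timeit(lambda: [y1(*p) for p in pts], number=100)
print(f"{1e9 * t / (100 * len(pts)):.1f} ns per evaluation")
```

The absolute numbers here mean nothing for Julia, of course; the point is only that the batched figure is the trustworthy one.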

ilyes319 commented 11 months ago

That is very impressive!

How many edges do you have? Do you expect it would get even better as the number of edges increases?

cortner commented 11 months ago

This is just a raw micro-benchmark of the Ylm basis; it's hard to say how it would affect the performance of the entire architecture.

cortner commented 11 months ago

@CheukHinHoJerry this might interest you as well...

CheukHinHoJerry commented 11 months ago

The speed-up is impressive, though I am not certain about this: my guess is that it might not change much, since in my previous experience the bottleneck usually comes from the correlation of the features.

Would be great if @ilyes319 could let me know about this: do we still have the same bottleneck under standard hyperparameters in MACE / general graph architectures? Or does the most costly part become the embedding, after applying tensor contractions and keeping the dimension low?

cortner commented 11 months ago

Hi Jerry - the bottleneck is different for different models. For some small ACE models that use low correlation order but comparatively large L the Ylms do end up being the bottleneck.

There is also a second aspect here. If we can manage to make the evaluation this fast but on-the-fly, without storage (non-trivial!), then it might reduce the memory footprint of a GNN in helpful ways.
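The memory point can be illustrated with a toy scatter-sum over graph edges. Nothing below is the package's API; `y1` is just the l = 1 shell, and the two function names are invented for the sketch:

```python
import math

C1 = math.sqrt(3.0 / (4.0 * math.pi))

def y1(x, y, z):
    # closed-form l = 1 real spherical harmonics (illustrative only)
    return (C1 * y, C1 * z, C1 * x)

def scatter_materialized(edges):
    # Usual pipeline: store the full (n_edges x n_basis) table of basis
    # values, then reduce over edges -- O(n_edges * n_basis) memory.
    table = [y1(*e) for e in edges]
    return [sum(col) for col in zip(*table)]

def scatter_fused(edges):
    # On-the-fly variant: evaluate each Ylm and accumulate immediately,
    # so only the O(n_basis) accumulator is ever stored.
    acc = [0.0, 0.0, 0.0]
    for e in edges:
        v = y1(*e)
        for i in range(3):
            acc[i] += v[i]
    return acc
```

Both produce the same reduction; only the fused version avoids materializing the per-edge basis, which is where the memory saving for a GNN would come from.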

cortner commented 11 months ago

I suggest not rushing into this now. It is not urgent for us, but Ilyes and I briefly discussed it, so I got curious and wanted to try it out.

cortner commented 9 months ago

This is superseded by #74.