CumbyLab / gridrdf

Code for calculating the grouped representation of interatomic distances (GRID) from crystal structures, and applying it in machine learning models.
MIT License

Consider using `dist-matrix` #1

Open sgbaird opened 2 years ago

sgbaird commented 2 years ago

Fast Numba-enabled CPU and GPU computations of Earth Mover's (scipy.stats.wasserstein_distance) and Euclidean distances.

https://github.com/sparks-baird/dist-matrix
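
For context, a pairwise Earth Mover's distance matrix can be built directly with `scipy.stats.wasserstein_distance`. The sketch below is a plain, unoptimised baseline and does not use the `dist-matrix` API; the function name `pairwise_emd` and the toy histograms are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def pairwise_emd(histograms, bin_centres):
    """Symmetric matrix of 1D Earth Mover's distances between histograms.

    `histograms` has shape (n_samples, n_bins); each row holds the weights
    at the shared positions in `bin_centres`.
    """
    n = len(histograms)
    dm = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = wasserstein_distance(
                bin_centres, bin_centres,
                u_weights=histograms[i], v_weights=histograms[j],
            )
            dm[i, j] = dm[j, i] = d  # the distance is symmetric
    return dm


# Toy data (hypothetical): three histograms with all weight in one bin each.
bin_centres = np.array([0.0, 1.0, 2.0])
histograms = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
dm = pairwise_emd(histograms, bin_centres)
```

The double loop scales quadratically with the number of samples, which is the cost that compiled or GPU implementations aim to reduce.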

jcumby commented 2 years ago

Looks really useful, thanks Sterling! I'll definitely add it to the development list, but it might take a while to get there...

sgbaird commented 1 year ago

Computing the distance matrix for the Matbench log K_VRH bulk modulus dataset on a single core (estimated at ~18 days):

similarity matrix rows:   1%| | 109/10987 [4:24:07<437:24:36]

Related: #4
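
The ~18 day figure follows from the remaining-time field of the progress bar above. A small sketch of the arithmetic, where the helper name `hms_to_seconds` is an assumption for illustration:

```python
def hms_to_seconds(text):
    """Convert an 'H:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


elapsed = hms_to_seconds("4:24:07")       # time spent on the first 109 rows
remaining = hms_to_seconds("437:24:36")   # remaining time reported by the bar

remaining_days = remaining / 86400        # 86400 seconds per day
seconds_per_row = elapsed / 109           # average cost per row so far
```

Here `remaining_days` comes out at roughly 18.2, and the average cost so far is a little under two and a half minutes per row.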

jcumby commented 1 year ago

I've implemented a simple speedup in c39e350 based on treating the 2D EMD problem as a larger 1D problem with a modified distance metric. This gives a fairly decent speedup (~5-fold on a 100-shell test set), and can be further improved slightly by storing the GRID arrays as scipy.sparse.coo_matrix objects.

I hope to incorporate dist_matrix in the future, but I think the code will need some serious rearrangement to make it work efficiently - I'll keep you posted!
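
As a rough illustration of the sparse-storage idea mentioned above, the sketch below converts a mostly empty 2D array into a `scipy.sparse.coo_matrix`. The array shape (100 shells by 200 bins), the values, and the helper name `to_sparse_grid` are assumptions for illustration, not taken from the gridrdf code.

```python
import numpy as np
from scipy.sparse import coo_matrix


def to_sparse_grid(grid):
    """Store a dense 2D array in COO format, keeping only non-zero entries."""
    return coo_matrix(np.asarray(grid, dtype=float))


# Hypothetical GRID-like array: 100 shells x 200 distance bins, mostly zeros.
dense = np.zeros((100, 200))
dense[0, 10] = 2.0
dense[1, 25] = 1.0
dense[99, 150] = 4.0

sparse = to_sparse_grid(dense)

# Row index, column index and value of each non-zero entry.
entries = list(zip(sparse.row.tolist(), sparse.col.tolist(), sparse.data.tolist()))
```

Only the three occupied bins are stored, so any later step that iterates over `sparse.row`, `sparse.col` and `sparse.data` skips the empty bins entirely.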

sgbaird commented 1 year ago

I came across a scholar link for v0.3.0. Nice job on the speed-up!

Major updates to speed of GRID computation and EMD distance calculations, particularly incorporating Numba optimisations of EMD calculation. This reduces the calculation time significantly. For example, a 12k x 12k EMD matrix (based on 100 GRID shells) previously took ca. 30 days, but is now reduced to 30 minutes on a 6-core desktop machine. There have also been improvements to the overall code structure; please see the CHANGELOG and commit history for more details.
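
For two histograms defined on the same equally spaced bins, the 1D Earth Mover's distance reduces to the summed absolute difference of their cumulative distributions multiplied by the bin width. The sketch below shows that closed form in plain NumPy; it is not the gridrdf implementation, and the function name `emd_1d_cdf` is an assumption for illustration. A simple array routine of this kind is the sort of code that JIT compilers such as Numba typically handle well.

```python
import numpy as np


def emd_1d_cdf(u, v, bin_width=1.0):
    """1D Earth Mover's distance between two histograms on identical,
    equally spaced bins, computed from their cumulative distributions."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cdf_u = np.cumsum(u / u.sum())  # normalise so each histogram sums to 1
    cdf_v = np.cumsum(v / v.sum())
    return float(np.sum(np.abs(cdf_u - cdf_v)) * bin_width)


# Moving all weight from the first bin to the third costs two bin widths.
d_far = emd_1d_cdf([1, 0, 0], [0, 0, 1])
# Shifting a two-bin histogram by one bin costs one bin width.
d_shift = emd_1d_cdf([1, 1, 0], [0, 1, 1])
```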