Open sgbaird opened 2 years ago
Looks really useful, thanks Sterling! I'll definitely add it to the development list, but it might take a while to get there...
Computing distance matrix for Matbench log K_VRH bulk modulus on a single core (estimated ~18 days):
similarity matrix rows: 1%| | 109/10987 [4:24:07<437:24:36]
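As a rough check on the ~18-day figure, the numbers in the progress bar can be extrapolated linearly. This is a back-of-the-envelope sketch that assumes a constant time per row, which is not necessarily how the progress bar derives its own remaining-time estimate:

```python
from datetime import timedelta

# Values read off the progress bar above.
rows_done, rows_total = 109, 10987
elapsed = timedelta(hours=4, minutes=24, seconds=7)

# Assumption: every remaining row takes the average time seen so far.
seconds_per_row = elapsed.total_seconds() / rows_done
remaining = timedelta(seconds=seconds_per_row * (rows_total - rows_done))

# Total single-core wall time in days (elapsed + projected remaining).
total_days = (elapsed + remaining) / timedelta(days=1)
print(f"{seconds_per_row:.0f} s per row, about {total_days:.1f} days in total")
```

The linear projection lands between 18 and 19 days, in line with the estimate quoted above.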
Related: #4
I've implemented a simple speedup in c39e350 that treats the 2D EMD problem as a larger 1D problem with a modified distance metric. This gives a fairly decent speedup (~5-fold on a 100-shell test set), which can be improved slightly further by storing the GRID arrays as scipy.sparse.coo_matrix objects.
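For readers unfamiliar with the idea, here is a minimal sketch of one way a 2D EMD can be posed as a 1D problem: flatten the 2D bins and move the 2D geometry into the ground-distance matrix. It is not the code in c39e350. The function name `emd_flattened`, the Euclidean ground distance between bin indices, and the use of a generic linear-programming solver are all assumptions made for illustration:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.sparse import coo_matrix


def emd_flattened(a, b):
    """EMD between two equal-mass 2D histograms, solved over flattened bins.

    Hypothetical helper for illustration only.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = a.size

    # 2D coordinates of every bin, listed in flattened (row-major) order.
    rows, cols = np.unravel_index(np.arange(n), a.shape)
    coords = np.column_stack([rows, cols]).astype(float)

    # "Modified" metric: the distance between two flattened bins is the
    # Euclidean distance between their original 2D positions (an assumption).
    cost = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

    # Transport plan f[i, j] >= 0, flattened row-major to length n * n.
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0  # mass leaving source bin i
        A_eq[n + i, i::n] = 1.0           # mass arriving at target bin i
    b_eq = np.concatenate([a.ravel(), b.ravel()])

    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun


# Mostly-empty arrays can be stored sparsely and densified only when needed.
a_sparse = coo_matrix(([1.0], ([0], [0])), shape=(2, 2))  # all mass at (0, 0)
b_sparse = coo_matrix(([1.0], ([1], [1])), shape=(2, 2))  # all mass at (1, 1)

d = emd_flattened(a_sparse.toarray(), b_sparse.toarray())
print(d)  # approximately 1.414, i.e. sqrt(2): one unit of mass moved diagonally
```

The dense constraint matrix grows as the fourth power of the grid side length, so this only illustrates the reformulation on tiny inputs; it says nothing about how the actual speedup was achieved.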
I hope to incorporate dist_matrix in the future, but I think the code will need some serious rearrangement to make it work efficiently - I'll keep you posted!
I came across a scholar link for v0.3.0. Nice job on the speed-up!
Major updates to the speed of GRID computation and EMD distance calculations, in particular Numba optimisations of the EMD calculation. This reduces the calculation time significantly. For example, a 12k x 12k EMD matrix (based on 100 GRID shells) previously took ca. 30 days, but now takes 30 minutes on a 6-core desktop machine. There have also been improvements to the overall code structure; please see the CHANGELOG and commit history for more details.
https://github.com/sparks-baird/dist-matrix