SINGROUP / dscribe

DScribe is a python package for creating machine learning descriptors for atomistic systems.
https://singroup.github.io/dscribe/
Apache License 2.0
400 stars 89 forks source link

Numpy operations on sparsed derivatives #124

Open VondrakMar opened 1 year ago

VondrakMar commented 1 year ago

May I have a question about math operation on the sparse output of the descriptor, please? I would like to do numpy operations on the sparse output. E.g., I want to do np.moveaxis, and einstein summation (attached code is creating derivatives of normalized soap).

As far as I know, this is not possible with the sparse output, and I would have to create dense array with .todense()? Or is there a way how to do a math operations?

At the end I am creating a neighbors list and want to do a matrix multiplication w.r.t. my pair list, so I would really like to keep a sparse representation until this moment, since my memory is exploding with a dense one.

Thank you for your answer Best, Martin

der, desc = soap.derivatives(molecule,attach=True) 
norms = np.linalg.norm(desc,axis=-1)
norm_desc = desc/norms[:,None]
norms2 = norms**2
deriv_final = np.moveaxis(der,[0,1],[2,0])
# normal_soap_vector = np.array(normal_soap_vector)
hold = (np.einsum('ij,arkj->arik',desc,deriv_final,optimize="greedy"))
vd = np.einsum('akii->aki',hold,optimize="greedy")
r1 = np.einsum('akl,lj,l->aklj',vd,desc,1/norms,optimize="greedy")
f1 = np.einsum('aklj,l->aklj',deriv_final,norms,optimize="greedy")
normDer = np.einsum('aklj,l->aklj',f1-r1,1/norms2,optimize="greedy")
lauri-codes commented 1 year ago

Hi @VondrakMar!

We use the sparse library for the multi-dimensional sparse arrays. You can take a look at their API to see what kind of operations are supported: I do see at least einsum and moveaxis there, but I have not tried myself.

Another option is to convert to other formats for the operations. Converting to numpy arrays may only be possible for parts of the data, but if you can work with 2D slices, then you can also use scipy.sparse to perform certain tasks.

There are some additional details in our docs.

VondrakMar commented 1 year ago

Hey @lauri-codes, really sorry for a late reply.

I rewrote my code with a sparse library. However, are derivatives transformed to the sparse output in one go AFTER all of them are computed?

I mean, if my derivatives of 1 system are filling my memory even before finishing their computation, will they fill the memory the same way even with sparse = True?

(I went very briefly through the code, and I think this is what is happening? But I am really not sure...)

Thank you again for your answer, Best, Martin

lauri-codes commented 1 year ago

Hi @VondrakMar:

Good question: The implementation is such that at most we keep the dense form for one system at a time per process. So when using multiple systems, each system gets translated to the sparse form when stored to the final result before moving on to the next system.

It would be possible to use the sparse form throughout the process, but this would be quite tedious to implement, as the sparse arrays are not a simple drop-in replacement on the C++ side. I guess you are hitting the limits of your system memory: how big systems are you dealing with and how many n_jobs are you using?

VondrakMar commented 1 year ago

Hey @lauri-codes I have ~200 atoms. I am using 144 GB or 240 GB of RAM (both are too small). I am calculating only 1 system, so n_job = 1.

However, I am pretty sure that it should be possible to fit sparsed derivatives into the memory.

Now I think that derivatives_single is what I should use.

Just calculate neighbor list with cutoff extended by get_cutoff_padding and then use this method for calculating only non-zero entries...

However, thank you for your answer, Best, Martin

VondrakMar commented 1 year ago

Hey @lauri-codes, I am not sure, if I should open a new issue, or just write it here, but...

I did change my code to use derivatives_single, however for some reason that I don't understand, attach = True is not working as it should. I am attaching a code.

import numpy as np
from ase.build import bulk,make_supercell
from dscribe.descriptors import SOAP

a2 = bulk('Cu', 'fcc', a=3.6, orthorhombic=True)
newMol = make_supercell(a2,[[5,0,0],[0,5,0],[0,0,5]])
soap = SOAP(
    n_max = 4,
    l_max = 3,
    r_cut = 2,
    sigma = 1,
    periodic=True,
    species=["Cu"]
derW,descW = soap.derivatives(newMol,attach=True)
der,desc = soap.derivatives_single(newMol,centers=[5],indices=[4],attach=True)
print(der[0][0][0]/derW[5][4][0]) # 1s
der,desc = soap.derivatives_single(newMol,centers=[5],indices=[5],attach=True)
print(der[0][0][0]/derW[5][5][0]) # not 1s
derW,descW = soap.derivatives(newMol,attach=False)
der,desc = soap.derivatives_single(newMol,centers=[5],indices=[5],attach=True)
print(der[0][0][0]/derW[5][5][0]) # 1s again

derW,descW = soap.derivatives(newMol,attach=True)
der,desc = soap.derivatives_single(newMol,centers=[0],indices=[0],attach=True)
print(der[0][0][0]/derW[0][0][0]) # 1s again

If indices and centers are the same (and are not 0s), it is calculated as if attach = False. Do you have a idea how it can be repair, or what I am doing wrong? Also, I think this is happening only for periodic systems. Thank you, Best, Martin