atomistic-machine-learning / schnetpack

SchNetPack - Deep Neural Networks for Atomistic Systems

gaussian_smearing out of memory #162

Closed BioFreak95 closed 5 years ago

BioFreak95 commented 5 years ago

Hey guys, I know that a similar problem was mentioned in issue #99. I am trying to run some calculations on proteins. I use a cutoff of 5 Å with the PyTorch neighbor list. On average I get 1000 atoms with a maximum of 60-70 neighbors, so the neighbor matrix is roughly 1000x65.

Well, in the Gaussian smearing (acsf.py, line 197) a diff matrix is created. With a batch size of 128 and 64 Gaussians, this tensor has a shape of 128x1000x65x64, i.e. over 500 million entries, which amounts to 2-3 GB of memory for this tensor alone. Then the gauss matrix is created with the same shape, and these two matrices together fill our GTX 1080 with 8 GB of memory.
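For reference, a quick back-of-the-envelope check of those numbers (a sketch assuming float32, i.e. 4 bytes per entry):

    # Rough size estimate of the diff/gauss tensor described above (float32 assumed).
    batch_size, n_atoms, n_neighbors, n_gaussians = 128, 1000, 65, 64

    entries = batch_size * n_atoms * n_neighbors * n_gaussians
    gigabytes = entries * 4 / 1024**3

    print(f"{entries:,} entries ~ {gigabytes:.1f} GB")  # 532,480,000 entries ~ 2.0 GB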

Maybe you have an idea how this could be done with less memory, for example not storing the diff matrix and creating the gauss matrix directly.

Even for small networks, this restricts the batch size to <= 32.

The networks I tested have 20k-100k parameters (with coupled_interactions), so not really big. I built a CNN for a similar task with over 2 million parameters and had less than 2 GB of memory consumption.

Maybe you have some ideas which we can discuss. This could also help other people.

By the way, I have no problem making the changes and opening a PR, I just want to discuss this first.

ktschuett commented 5 years ago

We cannot calculate the Gaussians from the distances in-place, since the computation graph is needed for calculating the gradients. As far as I can see, you can only reduce the number of Gaussians (for a 5 Å cutoff, 25 Gaussians should be fine) or the batch size (small batch sizes may not be too bad). You could also use the DataParallel module from PyTorch to parallelize over several GPUs.

A more advanced idea could be to find a better, smaller basis set for the distances occurring in your molecules and use that instead of the Gaussians.
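A minimal sketch of the DataParallel option mentioned above, with a toy stand-in model and assuming at least one CUDA device is available (not SchNetPack-specific code):

    import torch
    import torch.nn as nn

    # Toy stand-in for the real model; the point is only the DataParallel wrapper.
    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))
    model = nn.DataParallel(model)   # each forward splits the batch across all visible GPUs
    model = model.cuda()

    x = torch.randn(128, 64).cuda()  # dummy batch
    y = model(x)                     # sub-batches run on each GPU, results gathered on GPU 0
    print(y.shape)                   # torch.Size([128, 1])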

BioFreak95 commented 5 years ago
if not centered:
    # coeff = -0.5 / torch.pow(widths, 2)
    # Use advanced indexing to compute the individual components
    # diff = distances[:, :, :, None] - offset[None, None, None, :]
    gauss = torch.exp(
        (-0.5 / torch.pow(widths, 2))
        * torch.pow(distances[:, :, :, None] - offset[None, None, None, :], 2)
    )
else:
    # coeff = -0.5 / torch.pow(offset, 2)
    # diff = distances[:, :, :, None]
    gauss = torch.exp(
        (-0.5 / torch.pow(offset, 2)) * torch.pow(distances[:, :, :, None], 2)
    )
# gauss = torch.exp(coeff * torch.pow(diff, 2))
return gauss

Would this be a problem? With this, one would only have one ~3 GB tensor instead of two.

ktschuett commented 5 years ago

This would not make a difference to the memory requirements. You still compute distances[:, :, :, None] - offset[None, None, None, :], so PyTorch still allocates that intermediate tensor and keeps it in the computation graph for the backward pass; whether you assign it to a Python variable or not does not matter for memory consumption.

Beyond that, you will have tensors of size 128x1000x65xn_features in every SchNet layer. So the Gaussians only consume a fraction of the memory that the other activations do.
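To illustrate that comparison, a rough sketch (assuming float32 and, purely for illustration, n_features = 128 for a single neighbor-wise activation):

    import math

    # Rough comparison, assuming float32 (4 bytes per entry) and the shapes from this thread.
    batch, atoms, neighbors = 128, 1000, 65

    def gigabytes(shape):
        return math.prod(shape) * 4 / 1024**3

    print(f"Gaussian expansion, 64 Gaussians:  {gigabytes((batch, atoms, neighbors, 64)):.1f} GB")
    # A single neighbor-wise activation with n_features = 128 (illustrative value):
    print(f"Neighbor activation, 128 features: {gigabytes((batch, atoms, neighbors, 128)):.1f} GB")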

BioFreak95 commented 5 years ago

Ah sure, you are right. I do not know why I did not see this... Logically it is the same. Sorry... :laughing: Well, in that case it seems like I have to stick with my gradient accumulation :/

How can I find out the number of features? I thought the number of features is max_z in the embedding, so 100 by default. But maybe I misunderstood something!

ktschuett commented 5 years ago

> Ah sure, you are right. I do not know why I did not see this... Logically it is the same. Sorry... Well, in that case it seems like I have to stick with my gradient accumulation :/

Why do you need to accumulate? Does SGD with small batches not work?

> How can I find out the number of features? I thought the number of features is max_z in the embedding, so 100 by default. But maybe I misunderstood something!

I mean the n_atom_basis you pass to the SchNet class.

BioFreak95 commented 5 years ago

I am using ADAM, and with batch sizes bigger than 4 I get out of memory. The loss fluctuates very strongly, so I think the batch size is too small; after switching to gradient accumulation the loss is much more stable. I have 128 filters and 128 atom_basis, a cutoff of 5.0 in the environment provider and the same cutoff in SchNet with a cosine cutoff function, 4 interaction layers and 50 Gaussians. So I thought this is a medium-sized network :P

Actually I have two problems. The first is that I sometimes get NaNs in the loss after some epochs. Have you seen something similar in the past? The second is that a higher number of parameters leads to overfitting, while a lower number leads to high train and validation errors... so lowering atom_basis or the filter number does not seem like a good idea.

Ah, I see. Thank you for the hint. But yeah, then these matrices are even bigger.
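For readers following along, a minimal sketch of the gradient accumulation mentioned in this thread (plain PyTorch with a toy model and an assumed accumulation factor, not SchNetPack-specific code):

    import torch
    import torch.nn as nn

    # Toy setup; model, data and accumulation factor are placeholders for illustration.
    model = nn.Linear(10, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    accum_steps = 8  # effective batch size = micro-batch size * accum_steps

    optimizer.zero_grad()
    for step in range(32):
        x, y = torch.randn(4, 10), torch.randn(4, 1)   # micro-batch of size 4
        loss = nn.functional.mse_loss(model(x), y)
        (loss / accum_steps).backward()                # gradients add up across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()                           # one update per accumulated "large" batch
            optimizer.zero_grad()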

ktschuett commented 5 years ago

I would start with much smaller networks and go larger from there. Try setting atom_basis & filters to 50 or so and perhaps the number of interaction layers to 2 or even 1, just until it trains properly.
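As a rough sketch of such a reduced configuration (the keyword names below are assumed from an older SchNetPack API and may differ between versions):

    import schnetpack as spk

    # Small SchNet representation along the lines suggested above; parameter names
    # are an assumption based on an older SchNetPack release and may need adjusting.
    representation = spk.representation.SchNet(
        n_atom_basis=50,      # reduced feature size
        n_filters=50,         # reduced filter size
        n_interactions=2,     # fewer interaction layers
        cutoff=5.0,           # matches the 5 Å environment cutoff
        n_gaussians=25,       # fewer Gaussians, as suggested earlier in the thread
    )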

Loss fluctuation is a problem you tend to have with energies. I think it is because the assignment of energy contributions to atoms needs to be learned, so if you have a lot of atoms this gets even more complicated. It will be better if you use forces during training, since they contain local information.

Also, you could try using vanilla SGD if ADAM has memory issues. To this day I am not completely sure how much ADAM really helps, especially in the later stages of training...
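On the memory side of that suggestion: ADAM keeps two running-average state tensors per parameter, while vanilla SGD without momentum keeps none, so the swap is a one-liner (toy model for illustration):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)  # toy model for illustration

    # ADAM stores two state tensors per parameter (exp_avg, exp_avg_sq),
    # roughly tripling the memory needed for the parameters themselves.
    adam = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Vanilla SGD without momentum keeps no per-parameter optimizer state.
    sgd = torch.optim.SGD(model.parameters(), lr=1e-2)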

BioFreak95 commented 5 years ago

OK, I will try both of your tips. Thank you very much for your help and your time, and sorry that this issue was not really necessary.

I am not learning energies; I am trying to learn chemical properties of proteins, so it is a completely different problem. I actually do not know if this will work at all, but being rotationally invariant should theoretically be an advantage over CNNs.