coarse-graining / cgnet

learning coarse-grained force fields
BSD 3-Clause "New" or "Revised" License

activation setting error #63

Closed euhruska closed 5 years ago

euhruska commented 5 years ago

I'm setting the last layer's activation to Tanh, just for testing purposes, instead of None, like LinearLayer(width_layer, 1, activation=nn.Tanh()), and get an error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-100-bbcb6eb855db> in <module>
     19         optimizer.zero_grad()
---> 20         pred_energy, pred_force= chln_net.forward(train_dict['traj'])

/scratch1/eh22/cgnet/nnet.py in forward(self, coord)
    295         force = torch.autograd.grad(-torch.sum(energy),
    296                                     coord,
--> 297                                     create_graph=True, retain_graph=True)
    298         return energy, force[0], energy_raw
    299 

/scratch1/eh22/conda3/envs/py31/lib/python3.6/site-packages/torch/autograd/__init__.py in grad(outputs, inputs, grad_outputs, retain_graph, create_graph, only_inputs, allow_unused)
    147     return Variable._execution_engine.run_backward(
    148         outputs, grad_outputs, retain_graph, create_graph,
--> 149         inputs, allow_unused)
    150 
    151 

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1000, 1]], which is output 0 of TanhBackward, is at version 3; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Setting torch.autograd.set_detect_anomaly(True) gives the trace in tanh_error.txt
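
For reference, the forward/force pattern in the traceback boils down to something like the following minimal sketch (plain PyTorch with illustrative shapes and widths, not the actual cgnet layers):

import torch
import torch.nn as nn

# Hypothetical coordinates; requires_grad so forces can be taken w.r.t. them.
coords = torch.randn(1000, 3, requires_grad=True)
# Stand-in network whose terminal activation is Tanh, as in the report above.
net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1), nn.Tanh())

energy = net(coords)  # per-example energies, shape (1000, 1)
# Forces are the negative gradient of the total energy w.r.t. the coordinates.
force = torch.autograd.grad(-torch.sum(energy), coords,
                            create_graph=True, retain_graph=True)[0]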

nec4 commented 5 years ago

Thanks for the error trace! Will look into it.

nec4 commented 5 years ago

I have also reproduced this error with the training example notebook on master. It is likely something in CGnet.forward(); will continue to look.

nec4 commented 5 years ago

When using nn.ReLU() everything runs fine. Looking at the docs, the difference between nn.ReLU() and other activation modules is that it is implemented out-of-place by default:

@weak_module
class ReLU(Module):
    r"""Applies the rectified linear unit function element-wise:

    :math:`\text{ReLU}(x)= \max(0, x)`

    Args:
        inplace: can optionally do the operation in-place. Default: ``False``

    Shape:
        - Input: :math:`(N, *)` where `*` means, any number of additional
          dimensions
        - Output: :math:`(N, *)`, same shape as the input

    .. image:: scripts/activation_images/ReLU.png

    Examples::

        >>> m = nn.ReLU()
        >>> input = torch.randn(2)
        >>> output = m(input)

      An implementation of CReLU - https://arxiv.org/abs/1603.05201

        >>> m = nn.ReLU()
        >>> input = torch.randn(2).unsqueeze(0)
        >>> output = torch.cat((m(input),m(-input)))
    """
    __constants__ = ['inplace']

    def __init__(self, inplace=False):
        super(ReLU, self).__init__()
        self.inplace = inplace

    @weak_script_method
    def forward(self, input):
        return F.relu(input, inplace=self.inplace)

    def extra_repr(self):
        inplace_str = 'inplace' if self.inplace else ''
        return inplace_str

However, nn.Tanh() is always implemented out-of-place, so it must be something else.
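
For a concrete picture of what the error message means, here is a minimal stand-alone reproduction (plain PyTorch, nothing cgnet-specific). Tanh saves its own output for the backward pass, so modifying that output inplace afterwards invalidates the saved tensor:

import torch

x = torch.randn(5, requires_grad=True)
y = torch.tanh(x)  # TanhBackward saves y, since d/dx tanh(x) = 1 - y**2
y += 1.0           # inplace op bumps y's version counter
# The saved output is now newer than the version recorded for backward, so
# this raises "one of the variables needed for gradient computation has been
# modified by an inplace operation":
grad = torch.autograd.grad(y.sum(), x)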

nec4 commented 5 years ago

Found it: the addition of the priors was introducing an inplace operation through the following block in nnet.py:

141         if self.priors:
142             for prior in self.priors:
143                 energy += prior(feat[:, prior.feat_idx])

PyTorch treats operations like += as inplace, so this should be fixed by replacing line 143 with

143                 energy = energy + prior(feat[:, prior.feat_idx])
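
As a sanity check, here is a minimal, self-contained version of the pattern (stand-in tensors and priors, not the actual cgnet objects): the out-of-place accumulation lets the force gradient go through, while the += variant reproduces the RuntimeError above.

import torch

coords = torch.randn(100, 3, requires_grad=True)
energy = torch.tanh(coords.sum(dim=1, keepdim=True))    # stand-in for the Tanh output
priors = [lambda c: (c ** 2).sum(dim=1, keepdim=True)]  # stand-in prior terms

for prior in priors:
    # energy += prior(coords)        # inplace: reproduces the RuntimeError
    energy = energy + prior(coords)  # out-of-place: graph stays valid

force = torch.autograd.grad(-torch.sum(energy), coords,
                            create_graph=True, retain_graph=True)[0]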

brookehus commented 5 years ago

Oh yeah, I ran into this once and it was super confusing. I'll add it to #68.

nec4 commented 5 years ago

This has been fixed in https://github.com/coarse-graining/cgnet/commit/a055c8ab94a81bbd7da20587dc7116c194830b3b and is now on master.