acids-ircam / RAVE

Official implementation of the RAVE model: a Realtime Audio Variational autoEncoder

Error when trying to train the prior #55

Closed iamzoltan closed 2 years ago

iamzoltan commented 2 years ago

Hey again,

I just finished training and exporting a new model, but I can't seem to get the prior to train. I am getting the following error when exporting the model:

/home/user/code/RAVE/env3.9/lib/python3.9/site-packages/pytorch_lightning/core/saving.py:217: UserWarning: Found keys that are not in the model state dict but in the checkpoint: ['decoder.net.2.net.0.aligned.paddings.0.pad', 'decoder.net.2.net.0.aligned.paddings.1.pad', 'decoder.net.2.net.1.aligned.paddings.0.pad', 'decoder.net.2.net.1.aligned.paddings.1.pad', 'decoder.net.2.net.2.aligned.paddings.0.pad', 'decoder.net.2.net.2.aligned.paddings.1.pad', 'decoder.net.4.net.0.aligned.paddings.0.pad', 'decoder.net.4.net.0.aligned.paddings.1.pad', 'decoder.net.4.net.1.aligned.paddings.0.pad', 'decoder.net.4.net.1.aligned.paddings.1.pad', 'decoder.net.4.net.2.aligned.paddings.0.pad', 'decoder.net.4.net.2.aligned.paddings.1.pad', 'decoder.net.6.net.0.aligned.paddings.0.pad', 'decoder.net.6.net.0.aligned.paddings.1.pad', 'decoder.net.6.net.1.aligned.paddings.0.pad', 'decoder.net.6.net.1.aligned.paddings.1.pad', 'decoder.net.6.net.2.aligned.paddings.0.pad', 'decoder.net.6.net.2.aligned.paddings.1.pad', 'decoder.net.8.net.0.aligned.paddings.0.pad', 'decoder.net.8.net.0.aligned.paddings.1.pad', 'decoder.net.8.net.1.aligned.paddings.0.pad', 'decoder.net.8.net.1.aligned.paddings.1.pad', 'decoder.net.8.net.2.aligned.paddings.0.pad', 'decoder.net.8.net.2.aligned.paddings.1.pad', 'decoder.synth.paddings.0.pad', 'decoder.synth.paddings.1.pad', 'decoder.synth.paddings.2.pad']
  rank_zero_warn(

Any ideas?

caillonantoine commented 2 years ago

I'm working on a fix for this problem! As a temporary workaround, you can convert your checkpoint using this function:

import os

import torch

def convert_checkpoint(ckpt_path: str):
    """
    Remove pad buffers from a checkpoint and save
    the new converted checkpoint in the same folder.
    """
    # map_location="cpu" lets the conversion run without the original device
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Drop every state dict entry whose name contains "pad"
    keys = filter(lambda n: "pad" not in n, ckpt["state_dict"].keys())
    ckpt["state_dict"] = {k: ckpt["state_dict"][k] for k in keys}
    # Write the cleaned checkpoint next to the original one
    target = os.path.join(os.path.dirname(ckpt_path), "converted.ckpt")
    torch.save(ckpt, target)

# FOR EXAMPLE
# convert_checkpoint("runs/ljspeech/rave/version_0/checkpoints/best.ckpt")
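
You can sanity-check the result before re-exporting, for example (a quick sketch, reusing the example path above):

import torch

# Reload the converted checkpoint and make sure no pad buffer survived
ckpt = torch.load("runs/ljspeech/rave/version_0/checkpoints/converted.ckpt")
assert not any("pad" in k for k in ckpt["state_dict"]), "pad buffers still present"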

It should work for #53 too. I'll update this issue when the final fix is ready!

iamzoltan commented 2 years ago

Thanks, I'll give it a shot.

iamzoltan commented 2 years ago

That worked for the export, thanks! But now, trying to train the prior, I get the following:

File "/home/user/code/RAVE/prior/model.py", line 108, in split_classes x = x.reshape(x.shape[0], x.shape[1], self.data_size, -1) RuntimeError: cannot reshape tensor of 0 elements into shape [8, 0, 128, -1] because the unspecified dimension size -1 can be any value and is ambiguous

I'm not sure how to approach this. I initially got an error complaining about a division, so I replaced the // with torch.div(a, b, rounding_mode="floor") (sketched below). I'm not sure that was the right fix, but that error occurred in addition to the one above.
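
The replacement I made, with placeholder tensors a and b, is just an explicit floor division:

import torch

a, b = torch.tensor([10, 11, 12]), torch.tensor(4)
result = torch.div(a, b, rounding_mode="floor")  # explicit floor division
assert torch.equal(result, a // b)  # same values as the original `//`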

NOTE: I tried to sort out the empty tensor being passed to split_classes from validation_step, and I also tried to change the reshaping in the split_classes function itself; both led to this error:

  File "/home/user/code/RAVE/prior/model.py", line 164, in validation_epoch_end
    y = self.decode(z)
  File "/home/user/code/RAVE/env3.9/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/code/RAVE/prior/model.py", line 76, in decode
    return self.synth.decode(z)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 67, in decode
      latent_pca0 = self.latent_pca
      _9 = torch.unsqueeze(torch.numpy_T(latent_pca0), -1)
      z6 = torch.conv1d(z4, _9)
           ~~~~~~~~~~~~ <--- HERE
      latent_mean0 = self.latent_mean
      z7 = torch.add(z6, torch.unsqueeze(latent_mean0, -1))

Traceback of TorchScript, original code (most recent call last):
  File "/home/user/code/RAVE/export_rave.py", line 187, in decode

        if not self.trained_cropped:  # PERFORM PCA AFTER PADDING
            z = nn.functional.conv1d(z, self.latent_pca.T.unsqueeze(-1))
                ~~~~~~~~~~~~~~~~~~~~ <--- HERE
            z = z + self.latent_mean.unsqueeze(-1)

RuntimeError: Calculated padded input size per channel: (0). Kernel size: (1). Kernel size can't be greater than actual input size
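
Both failures seem to boil down to an empty latent tensor reaching decode; the conv1d error reproduces in isolation (channel count here is a placeholder):

import torch

z = torch.randn(1, 8, 0)  # latent with zero time steps
w = torch.randn(8, 8, 1)  # 1x1 conv weight, like latent_pca.T.unsqueeze(-1)
torch.nn.functional.conv1d(z, w)  # RuntimeError: Kernel size can't be greater than actual input size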

Any ideas?

caillonantoine commented 2 years ago

Should be fixed in d9f55f59627ea02689578d0ddae05b420be8d4d2, can you check?

caillonantoine commented 2 years ago

By the way, I'm closing this since it's a duplicate of #45.

iamzoltan commented 2 years ago

Sounds good, I'll check shortly once this current model is done.

It seems the smaller model successfully got to stage 2, although it looks as though the loss is increasing. Is this normal? And how long should one train in stage 2?