AI4HealthUOL / SSSD

Repository for the paper: 'Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models'
MIT License

Cannot load model. #23

Closed: fd-guo closed 4 months ago

fd-guo commented 5 months ago

Thanks for sharing the interesting paper and releasing the code!!!

I trained a small model on 1k points from the PTB-XL dataset. Training went smoothly. However, when I load the model in inference.py, I hit the following error. Could you let me know how to fix it? Thanks!

No valid model found
  File "/home/ec2-user/SSSD/src/inference.py", line 83, in generate
    net.load_state_dict(checkpoint['model_state_dict'])
RuntimeError: Error(s) in loading state_dict for SSSDS4Imputer:
    size mismatch for residual_layer.residual_blocks.0.S41.s4_layer.kernel.kernel.omega: copying a param with shape torch.Size([201, 2]) from checkpoint, the shape in current model is torch.Size([51, 2]).
    size mismatch for residual_layer.residual_blocks.0.S41.s4_layer.kernel.kernel.z: copying a param with shape torch.Size([201, 2]) from checkpoint, the shape in current model is torch.Size([51, 2]).
    While copying the parameter named "residual_layer.residual_blocks.0.S41.s4_layer.kernel.kernel.B", whose dimensions in the model are torch.Size([1, 512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([1, 512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).
    While copying the parameter named "residual_layer.residual_blocks.0.S41.s4_layer.kernel.kernel.P", whose dimensions in the model are torch.Size([1, 512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([1, 512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).
    While copying the parameter named "residual_layer.residual_blocks.0.S41.s4_layer.kernel.kernel.w", whose dimensions in the model are torch.Size([512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).
    size mismatch for residual_layer.residual_blocks.0.S42.s4_layer.kernel.kernel.omega: copying a param with shape torch.Size([201, 2]) from checkpoint, the shape in current model is torch.Size([51, 2]).
    size mismatch for residual_layer.residual_blocks.0.S42.s4_layer.kernel.kernel.z: copying a param with shape torch.Size([201, 2]) from checkpoint, the shape in current model is torch.Size([51, 2]).
    While copying the parameter named "residual_layer.residual_blocks.0.S42.s4_layer.kernel.kernel.B", whose dimensions in the model are torch.Size([1, 512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([1, 512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).
    While copying the parameter named "residual_layer.residual_blocks.0.S42.s4_layer.kernel.kernel.P", whose dimensions in the model are torch.Size([1, 512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([1, 512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).
    While copying the parameter named "residual_layer.residual_blocks.0.S42.s4_layer.kernel.kernel.w", whose dimensions in the model are torch.Size([512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).
    size mismatch for residual_layer.residual_blocks.1.S41.s4_layer.kernel.kernel.omega: copying a param with shape torch.Size([201, 2]) from checkpoint, the shape in current model is torch.Size([51, 2]).
    size mismatch for residual_layer.residual_blocks.1.S41.s4_layer.kernel.kernel.z: copying a param with shape torch.Size([201, 2]) from checkpoint, the shape in current model is torch.Size([51, 2]).
    While copying the parameter named "residual_layer.residual_blocks.1.S41.s4_layer.kernel.kernel.B", whose dimensions in the model are torch.Size([1, 512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([1, 512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).
    While copying the parameter named "residual_layer.residual_blocks.1.S41.s4_layer.kernel.kernel.P", whose dimensions in the model are torch.Size([1, 512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([1, 512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).
    While copying the parameter named "residual_layer.residual_blocks.1.S41.s4_layer.kernel.kernel.w", whose dimensions in the model are torch.Size([512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).
    size mismatch for residual_layer.residual_blocks.1.S42.s4_layer.kernel.kernel.omega: copying a param with shape torch.Size([201, 2]) from checkpoint, the shape in current model is torch.Size([51, 2]).
    size mismatch for residual_layer.residual_blocks.1.S42.s4_layer.kernel.kernel.z: copying a param with shape torch.Size([201, 2]) from checkpoint, the shape in current model is torch.Size([51, 2]).
    While copying the parameter named "residual_layer.residual_blocks.1.S42.s4_layer.kernel.kernel.B", whose dimensions in the model are torch.Size([1, 512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([1, 512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).
    While copying the parameter named "residual_layer.residual_blocks.1.S42.s4_layer.kernel.kernel.P", whose dimensions in the model are torch.Size([1, 512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([1, 512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).
    While copying the parameter named "residual_layer.residual_blocks.1.S42.s4_layer.kernel.kernel.w", whose dimensions in the model are torch.Size([512, 32, 2]) and whose dimensions in the checkpoint are torch.Size([512, 32, 2]), an exception occurred : ('unsupported operation: more than one element of the written-to tensor refers to a single memory location. Please clone() the tensor before performing the operation.',).

During handling of the above exception, another exception occurred:

  File "/home/ec2-user/SSSD/src/inference.py", line 86, in generate
    raise Exception('No valid model found')
  File "/home/ec2-user/SSSD/src/inference.py", line 208, in <module>
    generate(**gen_config,
Exception: No valid model found

juanlopezcode commented 4 months ago

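Hi, it seems there is a mismatch in the s4_lmax parameter: the checkpoint stores omega and z with shape [201, 2], while the model built in inference.py has them with shape [51, 2]. Make sure to set s4_lmax to the length of your time series.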
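
To see which parameters disagree before calling load_state_dict, it can help to compare the shapes stored in the checkpoint with those of the freshly built model. Below is a minimal sketch of that check. The helper name find_shape_mismatches is hypothetical (it is not part of the SSSD repository), and the example shapes are copied from the traceback in this issue rather than read from a real checkpoint.

```python
def find_shape_mismatches(ckpt_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters
    present in both mappings whose shapes differ.

    Hypothetical helper for illustration; not part of SSSD.
    """
    mismatches = {}
    for name, ckpt_shape in ckpt_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# Shapes copied from the traceback in this issue (block 0, layer S41).
prefix = "residual_layer.residual_blocks.0.S41.s4_layer.kernel.kernel."
ckpt_shapes = {
    prefix + "omega": (201, 2),
    prefix + "z": (201, 2),
    prefix + "B": (1, 512, 32, 2),
}
model_shapes = {
    prefix + "omega": (51, 2),
    prefix + "z": (51, 2),
    prefix + "B": (1, 512, 32, 2),
}

mismatches = find_shape_mismatches(ckpt_shapes, model_shapes)
for name, (saved, current) in sorted(mismatches.items()):
    print(f"{name}: checkpoint {saved} vs model {current}")
```

With PyTorch, the two mappings could be built from the objects named in the traceback, for example `{k: tuple(v.shape) for k, v in checkpoint['model_state_dict'].items()}` and `{k: tuple(v.shape) for k, v in net.state_dict().items()}`. If only omega and z show up, as here, that points at the sequence-length setting (s4_lmax) differing between training and inference rather than at the layer widths.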