I got an error when executing train.py and train_svd.py with a PNG image:
Traceback (most recent call last):
File "/root/animate-anything/train.py", line 1167, in <module>
main_eval(**args_dict)
File "/root/animate-anything/train.py", line 1154, in main_eval
batch_eval(unet, text_encoder, vae, vae_processor, lora_manager, pretrained_model_path,
File "/root/animate-anything/train.py", line 1114, in batch_eval
precision = eval(pipeline, vae_processor,
File "/root/animate-anything/train.py", line 1033, in eval
input_image_latents = tensor_to_vae_latent(input_image, vae)
File "/root/animate-anything/train.py", line 365, in tensor_to_vae_latent
latents = vae.encode(t).latent_dist.sample()
File "/root/animate-anything/venv/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
return method(self, *args, **kwargs)
File "/root/animate-anything/venv/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 259, in encode
h = self.encoder(x)
File "/root/animate-anything/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/animate-anything/venv/lib/python3.10/site-packages/diffusers/models/vae.py", line 141, in forward
sample = self.conv_in(sample)
File "/root/animate-anything/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/animate-anything/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/root/animate-anything/venv/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [128, 3, 3, 3], expected input[1, 4, 512, 512] to have 3 channels, but got 4 channels instead
This happens because a PNG image may contain 4 channels (RGBA), while the VAE encoder expects 3. I think converting the input image from RGBA to RGB would solve this error, just like SVD does: https://github.com/Stability-AI/generative-models/blob/main/scripts/sampling/simple_video_sample.py#L97.
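A minimal sketch of the proposed fix, assuming the image is loaded with Pillow (the helper name `load_rgb_image` is mine, not from the repo):

```python
from PIL import Image

def load_rgb_image(path: str) -> Image.Image:
    """Load an image and drop the alpha channel if present,
    so the VAE encoder receives exactly 3 channels."""
    image = Image.open(path)
    if image.mode != "RGB":
        # Handles RGBA, P, L, etc. -- e.g. RGBA (4 channels) -> RGB (3 channels)
        image = image.convert("RGB")
    return image
```

Applying this (or an equivalent `image.convert("RGB")`) before `tensor_to_vae_latent` should make the input match the expected `[B, 3, H, W]` shape.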