CogVideoX-5B-I2V raises error

THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)

Apache License 2.0


CogVideoX-5B-I2V raises error #480

Open yjhong89 opened 2 weeks ago

yjhong89 commented 2 weeks ago

System Info

Running sample_video.py with the CogVideoX-5B-I2V model raises an error.

patch_size

Since patch_size in the config is an integer, an error occurs here: https://github.com/THUDM/CogVideo/blob/e2987ff565703953b34749db2d1053e26bba2e2c/sat/dit_video_concat.py#L664
After changing patch_size to [2,2,2], a load_checkpoint error occurs instead.

Information

[X] The official example scripts
[ ] My own modified scripts and tasks

Reproduction

Running sample_video.py with CogVideoX-5B-I2V model

Expected behavior

Load_checkpoint error

anpwu commented 2 weeks ago

Same error in patch_size:
[rank0]: self.spatial_length = latent_width * latent_height // reduce(mul, patch_size[1:])
[rank0]: TypeError: 'int' object is not subscriptable
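
For anyone hitting the same traceback, the failure can be reproduced outside the repository. The sketch below is not the project's code: `spatial_length` only mirrors the expression shown in the traceback, `normalize_patch_size` is a hypothetical helper, and the latent sizes 90 and 60 are illustrative values. Which temporal patch value actually matches a given checkpoint is not settled in this thread, so the helper takes it as an explicit argument.

```python
from functools import reduce
from operator import mul


def spatial_length(latent_width, latent_height, patch_size):
    # Mirrors the expression in the traceback: patch_size[1:] needs a sequence.
    return latent_width * latent_height // reduce(mul, patch_size[1:])


def normalize_patch_size(patch_size, temporal=None):
    """Hypothetical helper: return patch_size as a 3-element list.

    An integer p becomes [temporal, p, p]; temporal defaults to p.
    The right temporal value for a checkpoint is an open question here.
    """
    if isinstance(patch_size, int):
        t = patch_size if temporal is None else temporal
        return [t, patch_size, patch_size]
    return list(patch_size)


try:
    spatial_length(90, 60, 2)  # integer patch_size, as in the config
except TypeError as exc:
    print("reproduced:", exc)

# With a list, the spatial part [2, 2] multiplies to 4.
print(spatial_length(90, 60, normalize_patch_size(2)))
```

Normalizing the value only gets past this line; as reported above, a load_checkpoint error can still follow if the resulting shape does not match the weights.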

anpwu commented 2 weeks ago

Another error is:
[rank0]: File "CogVideo/sat/sgm/util.py", line 261, in instantiate_from_config
[rank0]: return get_obj_from_str(config["target"])(**config.get("params", dict()), **extra_kwargs)
[rank0]: TypeError: dit_video_concat.Basic3DPositionEmbeddingMixin() got multiple values for keyword argument 'height_interpolation'
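
This message is what Python raises when two dictionaries unpacked with `**` into one call share a key. Assuming the call in the traceback unpacks both the config params and `extra_kwargs` that way (the asterisks may have been lost in rendering), `height_interpolation` is probably present in both. The sketch below reproduces the error with a hypothetical stand-in class; `instantiate_merged` is one possible workaround, not the maintainers' fix.

```python
class PositionEmbedding:
    # Hypothetical stand-in for the class named in the traceback.
    def __init__(self, height_interpolation=1.0, width_interpolation=1.0):
        self.height_interpolation = height_interpolation
        self.width_interpolation = width_interpolation


def instantiate(cls, params, extra_kwargs):
    # Same call shape as the traceback: two dicts unpacked into one call.
    return cls(**params, **extra_kwargs)


def instantiate_merged(cls, params, extra_kwargs):
    # Assumed workaround: merge first so a duplicate key is resolved
    # (extra_kwargs wins) before the call is made.
    return cls(**{**params, **extra_kwargs})


params = {"height_interpolation": 1.0, "width_interpolation": 1.0}
extra = {"height_interpolation": 1.875}

try:
    instantiate(PositionEmbedding, params, extra)
except TypeError as exc:
    print("reproduced:", exc)

obj = instantiate_merged(PositionEmbedding, params, extra)
print(obj.height_interpolation)
```

Merging hides the conflict rather than explaining it, so it is still worth checking which of the two sources is meant to set the value.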

csf0429 commented 2 weeks ago

Same problem after merging the CogVideo1.5X branch. It seems this update causes the problem: https://github.com/THUDM/CogVideo/commit/3a9af5bdd937faaa7803914aa94e384d7f40af67#diff-7b0a094155a48dc8761cbfd20bdf64fe0cc9021873b7ab23cea5a5cae38a670e

QiqLiang commented 1 week ago

Same problem after merging the CogVideo1.5X branch. It seems this update causes the problem: 3a9af5b#diff-7b0a094155a48dc8761cbfd20bdf64fe0cc9021873b7ab23cea5a5cae38a670e

Hi, did you solve it? Could you tell me how to modify the code to run? Thanks a lot.

AlphaNext commented 4 days ago

@yjhong89 Hi, I have a simple question: how do I prepare a dataset for image-to-video (I2V) fine-tuning? Thanks.