Stability-AI / generative-models

Generative Models by Stability AI

Stable Video Diffusion seems to have problems with reflections #158

Open alexfredo opened 9 months ago

alexfredo commented 9 months ago

I made a few tests with SVD and noticed that reflections seem to stay in the same place too much. I'm not sure, because I haven't generated long enough videos. Did anyone else notice a problem with reflections not reacting correctly? Here are some tests I made:

https://github.com/Stability-AI/generative-models/assets/24534698/3b102ae4-39ee-4e7e-9478-e96bff609e0e

https://github.com/Stability-AI/generative-models/assets/24534698/98bd3a11-5767-493d-8036-9832f564ecf8

https://github.com/Stability-AI/generative-models/assets/24534698/c53dc9cf-9b35-42d3-ade6-0e9a99d8b42d

YooZF commented 9 months ago

How did you do that? I can't even generate it correctly.

alexfredo commented 9 months ago

@YooZF I chose "svd" for the model and set the parameter "Decode t frames at a time" to 1. In general I need to try at least 3 different seeds before getting a correct result; sometimes nothing happens, and after changing the seed I get a good result, but there's often something weird about the reflections.
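For reference, the same settings can be reproduced without the streamlit UI via the repo's plain sampling script. A minimal sketch, assuming the `sample` entry point and argument names from the initial SVD release of `scripts/sampling/simple_video_sample.py` (double-check against your checkout):

```python
# Minimal sketch: drive SVD directly instead of through the streamlit demo.
# Assumes the initial-release signature of scripts/sampling/simple_video_sample.py;
# argument names may differ in your checkout.
from scripts.sampling.simple_video_sample import sample

sample(
    input_path="assets/test_image.png",  # conditioning image (path is an example)
    version="svd",                       # the "svd" checkpoint, as used above
    decoding_t=1,                        # "Decode t frames at a time" = 1 (low VRAM)
    seed=42,                             # results vary a lot per seed; try several
)
```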

NicerY commented 9 months ago

Can you provide some log output? My terminal shut down after showing the following; I'm not sure if running out of memory is what caused it to shut down automatically.

```
PS F:\shared_pc3\generative-models-main> streamlit run f:/shared_pc3/generative-models-main/scripts/demo/video_sampling.py

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501/
  Network URL: http://192.168.168.158:8501/

E:\anaconda\envs\svd\lib\site-packages\streamlit\watcher\local_sources_watcher.py:177: UserWarning: Torchaudio's I/O functions now support per-call backend dispatch. Importing backend implementation directly is no longer guaranteed to work. Please use backend keyword with load/save/info function, instead of calling the underlying implementation directly.
  lambda m: [p for p in m.__path__._path],
VideoTransformerBlock is using checkpointing   [repeated 16x]
Initialized embedder #0: FrozenOpenCLIPImagePredictionEmbedder with 683800065 params. Trainable: False
Initialized embedder #1: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #3: VideoPredictionEmbedderWithEncoder with 83653863 params. Trainable: False
Initialized embedder #4: ConcatTimestepEmbedderND with 0 params. Trainable: False
Loading model from checkpoints/svd.safetensors
PS F:\shared_pc3\generative-models-main>
```

alexfredo commented 9 months ago

@NicerY Here's my log:

```
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
VideoTransformerBlock is using checkpointing   [repeated 16x]
Initialized embedder #0: FrozenOpenCLIPImagePredictionEmbedder with 683800065 params. Trainable: False
Initialized embedder #1: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #3: VideoPredictionEmbedderWithEncoder with 83653863 params. Trainable: False
Initialized embedder #4: ConcatTimestepEmbedderND with 0 params. Trainable: False
Loading model from checkpoints/svd_image_decoder.safetensors
2023-11-23 01:11:17.828 Uncaught app exception
Traceback (most recent call last):
  File "C:\generative-models.pt2\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 534, in _run_script
    exec(code, module.__dict__)
  File "C:\generative-models\scripts\demo\video_sampling.py", line 142, in <module>
    value_dict["cond_frames"] = img + cond_aug * torch.randn_like(img)
                                                 ^^^^^^^^^^^^^^^^^^^^^
TypeError: randn_like(): argument 'input' (position 1) must be Tensor, not NoneType
Seed set to 23
Seed set to 23
Seed set to 23
##############################
       Sampling setting
##############################
Sampler: EulerEDMSampler
Discretization: EDMDiscretization
Guider: LinearPredictionGuider
Sampling with EulerEDMSampler for 26 steps:   0%|          | 0/26 [00:00<?, ?it/s]
C:\generative-models.pt2\Lib\site-packages\torch\utils\checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
C:\generative-models.pt2\Lib\site-packages\torch\utils\checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
Sampling with EulerEDMSampler for 26 steps:  96%|████████████████████████████████████▌ | 25/26 [01:14<00:02, 2.99s/it]
OpenCV: FFMPEG: tag 0x5634504d/'MP4V' is not supported with codec id 12 and format 'mp4 / MP4 (MPEG-4 Part 14)'
OpenCV: FFMPEG: fallback to use tag 0x7634706d/'mp4v'
ffmpeg version 6.0-essentials_build-www.gyan.dev Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 12.2.0 (Rev10, Built by MSYS2 project)
  configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-zlib --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-sdl2 --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxvid --enable-libaom --enable-libopenjpeg --enable-libvpx --enable-mediafoundation --enable-libass --enable-libfreetype --enable-libfribidi --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-ffnvcodec --enable-nvdec --enable-nvenc --enable-d3d11va --enable-dxva2 --enable-libvpl --enable-libgme --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libtheora --enable-libvo-amrwbenc --enable-libgsm --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-librubberband
  libavutil      58.  2.100 / 58.  2.100
  libavcodec     60.  3.100 / 60.  3.100
  libavformat    60.  3.100 / 60.  3.100
  libavdevice    60.  1.100 / 60.  1.100
  libavfilter     9.  3.100 /  9.  3.100
  libswscale      7.  1.100 /  7.  1.100
  libswresample   4. 10.100 /  4. 10.100
  libpostproc    57.  1.100 / 57.  1.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'outputs/demo/vid/svd_image_decoder\samples\000000.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2mp41
    encoder         : Lavf58.76.100
  Duration: 00:00:02.33, start: 0.000000, bitrate: 2118 kb/s
  Stream #0:0[0x1]: Video: mpeg4 (Simple Profile) (mp4v / 0x7634706D), yuv420p, 1024x576 [SAR 1:1 DAR 16:9], 2115 kb/s, 6 fps, 6 tbr, 12288 tbn (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
Stream mapping:
  Stream #0:0 -> #0:0 (mpeg4 (native) -> h264 (libx264))
Press [q] to stop, [?] for help
[libx264 @ 0000024ea02d9980] using SAR=1/1
[libx264 @ 0000024ea02d9980] using cpu capabilities: MMX2 SSE2Fast SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2 AVX512
[libx264 @ 0000024ea02d9980] profile High, level 3.1, 4:2:0, 8-bit
[libx264 @ 0000024ea02d9980] 264 - core 164 r3106 eaa68fa - H.264/MPEG-4 AVC codec - Copyleft 2003-2023 - http://www.videolan.org/x264.html - options: cabac=1 ref=3 deblock=1:0:0 analyse=0x3:0x113 me=hex subme=7 psy=1 psy_rd=1.00:0.00 mixed_ref=1 me_range=16 chroma_me=1 trellis=1 8x8dct=1 cqm=0 deadzone=21,11 fast_pskip=1 chroma_qp_offset=-2 threads=18 lookahead_threads=3 sliced_threads=0 nr=0 decimate=1 interlaced=0 bluray_compat=0 constrained_intra=0 bframes=3 b_pyramid=2 b_adapt=1 b_bias=0 direct=1 weightb=1 open_gop=0 weightp=2 keyint=250 keyint_min=6 scenecut=40 intra_refresh=0 rc_lookahead=40 rc=crf mbtree=1 crf=23.0 qcomp=0.60 qpmin=0 qpmax=69 qpstep=4 ip_ratio=1.40 aq=1:1.00
Output #0, mp4, to 'outputs/demo/vid/svd_image_decoder\samples\000000_h264.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 512
    compatible_brands: isomiso2mp41
    encoder         : Lavf60.3.100
  Stream #0:0(und): Video: h264 (avc1 / 0x31637661), yuv420p(progressive), 1024x576 [SAR 1:1 DAR 16:9], q=2-31, 6 fps, 12288 tbn (default)
    Metadata:
      handler_name    : VideoHandler
      vendor_id       : [0][0][0][0]
      encoder         : Lavc60.3.100 libx264
    Side data:
      cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: N/A
frame=   14 fps=0.0 q=-1.0 Lsize=     438kB time=00:00:01.83 bitrate=1958.8kbits/s speed=9.56x
video:437kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.233119%
[libx264 @ 0000024ea02d9980] frame I:1     Avg QP:17.95  size: 41500
[libx264 @ 0000024ea02d9980] frame P:4     Avg QP:20.26  size: 41412
[libx264 @ 0000024ea02d9980] frame B:9     Avg QP:21.32  size: 26667
[libx264 @ 0000024ea02d9980] consecutive B-frames: 14.3%  0.0%  0.0% 85.7%
[libx264 @ 0000024ea02d9980] mb I  I16..4: 21.1% 75.4%  3.5%
[libx264 @ 0000024ea02d9980] mb P  I16..4:  4.7% 35.1%  4.4%  P16..4: 28.3% 19.1%  7.1%  0.0%  0.0%  skip: 1.3%
[libx264 @ 0000024ea02d9980] mb B  I16..4:  0.9%  5.4%  0.7%  B16..8: 49.5% 20.2%  6.2%  direct: 7.9%  skip: 9.2%  L0:54.1% L1:20.3% BI:25.5%
[libx264 @ 0000024ea02d9980] 8x8 transform intra:77.7% inter:63.4%
[libx264 @ 0000024ea02d9980] coded y,uvDC,uvAC intra: 67.4% 42.2% 16.3% inter: 48.4% 14.7% 1.7%
[libx264 @ 0000024ea02d9980] i16 v,h,dc,p: 51% 14% 23% 12%
[libx264 @ 0000024ea02d9980] i8 v,h,dc,ddl,ddr,vr,hd,vl,hu: 26% 20% 32%  3%  4%  3%  3%  5%  4%
[libx264 @ 0000024ea02d9980] i4 v,h,dc,ddl,ddr,vr,hd,vl,hu: 22% 20% 14%  6% 10%  7%  9%  6%  7%
[libx264 @ 0000024ea02d9980] i8c dc,h,v,p: 60% 18% 21%  1%
[libx264 @ 0000024ea02d9980] Weighted P-Frames: Y:0.0% UV:0.0%
[libx264 @ 0000024ea02d9980] ref P L0: 69.0% 21.0%  7.5%  2.5%
[libx264 @ 0000024ea02d9980] ref B L0: 95.2%  4.1%  0.7%
[libx264 @ 0000024ea02d9980] ref B L1: 97.8%  2.2%
[libx264 @ 0000024ea02d9980] kb/s:1533.09
```
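For what it's worth, the `TypeError` near the top of that log means `img` was still `None` when the demo tried to add conditioning noise, which usually just means no input image had been loaded in the UI before sampling. A defensive sketch of that step (hypothetical, not the repo's actual code):

```python
import torch

# Sketch of the demo's conditioning-noise step, guarding against the case in
# the traceback above: img stays None until an input image has been loaded.
def make_cond_frames(img, cond_aug: float) -> torch.Tensor:
    if img is None:
        raise ValueError("No input image loaded; select an image before sampling.")
    # Add a small amount of Gaussian noise to the conditioning frame(s).
    return img + cond_aug * torch.randn_like(img)
```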

https://github.com/Stability-AI/generative-models/assets/24534698/f20db86b-1b4d-440e-9326-a1b9f142968a

alexfredo commented 9 months ago

@NicerY When there's a memory error, it writes "CUDA out of memory", like with Stable Diffusion.
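That's the key diagnostic: genuine VRAM exhaustion raises a catchable error with that message rather than silently closing the terminal. A small sketch (assumes PyTorch >= 1.13, where `torch.cuda.OutOfMemoryError` is exposed):

```python
import torch

# Sketch: distinguish a real CUDA OOM from a silent crash. If the terminal
# simply closes with no "CUDA out of memory" message, suspect something else,
# e.g. system RAM / pagefile exhaustion on Windows.
def run_with_oom_check(run_sampling):
    try:
        return run_sampling()
    except torch.cuda.OutOfMemoryError:  # exposed in PyTorch >= 1.13
        torch.cuda.empty_cache()
        print("CUDA OOM: lower the resolution or set decoding_t=1 and retry.")
```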

SmileTAT commented 9 months ago

What's the difference between svd_image_decoder.safetensors and svd_xt.safetensors?

wandrzej commented 9 months ago

@alexfredo Can I ask about the vertical video? Did you just force a different resolution and it worked out of the box? There's some info that it was trained specifically on 1024x576. Is anything else needed? They suggest in the code to increase the augmentation conditioning, but your result looks quite good, so I wonder if any other changes were necessary.
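For context, the two knobs in question are the output resolution and the conditioning augmentation strength. A minimal sketch of what a vertical run would change, with hypothetical names mirroring the streamlit demo's options (SVD was trained on 1024x576 landscape, so portrait output is off-distribution, and the code comments suggest raising `cond_aug` in that case):

```python
# Hypothetical settings dict mirroring the streamlit demo's options; the demo
# itself collects these through sliders, so names here are illustrative only.
settings = {
    "H": 1024,                # portrait: swap the trained 1024x576 landscape
    "W": 576,
    "cond_aug": 0.05,         # release default is 0.02; raising it loosens the
                              # conditioning and can help off-distribution sizes
    "motion_bucket_id": 127,  # release default
    "fps_id": 6,              # release default
}
```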