C00reNUT opened 6 days ago
Yeah. I also want to know how much VRAM is required for inference.
Same question. Would be good to know VRAM usage for various dimensions.
8 GiB is not enough :crying_cat_face:
Even 16 GB is not enough.
Even 24 GB is not enough.
Need an 8-bit version.
Needs 32 GB at least? Quant, anyone?
I modified the inference script so that it runs with a maximum of 15264 MiB of VRAM (according to nvtop, for inference at 512x768 and 100 frames). You may need to turn off anything else that uses VRAM if you're using a 16 GiB GPU, but it should work.
I put the modified files here: https://github.com/KT313/LTX_Video_better_vram
It should work if you just drag and drop the files into your LTX-Video folder.
It works by offloading everything that is not currently needed in VRAM to CPU memory during each of the inference steps; the rough idea is sketched below.
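A minimal sketch of that offloading idea, not the actual code in the repo above; the submodule names (text_encoder, transformer, vae) are assumptions about what the pipeline object exposes:

import torch

def keep_only_on_gpu(active, modules, device="cuda"):
    """Move `active` to the GPU and park every other module in CPU memory."""
    for m in modules:
        m.to(device if m is active else "cpu")
    torch.cuda.empty_cache()

# Usage outline: before each phase, keep only the module that phase needs on the GPU.
# modules = [pipeline.text_encoder, pipeline.transformer, pipeline.vae]
# keep_only_on_gpu(pipeline.text_encoder, modules)  # prompt encoding
# keep_only_on_gpu(pipeline.transformer, modules)   # denoising steps
# keep_only_on_gpu(pipeline.vae, modules)           # decoding latents to frames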
@KT313 Cool, I'll try your solution.
Edit: It works. Will it need more VRAM if more frames are generated?
Edit 2: It only works the first time and then it shows this error:
ValueError: Cannot generate a cpu tensor from a generator of type cuda.
Edit 3: Now it works again when using the suggested resolution (previously I was testing at 384x672; it works at 512x768 with 30 frames, and I repeated it). I don't know what caused the error above, though.
Edit 4: The error above appears again when using 60 frames, so maybe it is an OOM error after all.
@x4080 I made some modifications here so the tensors should get generated on the generator's device (cuda): https://github.com/KT313/LTX_Video_better_vram/tree/test. I cannot test it currently, so let me know if that works better.
Regarding your first edit: yes, since the size of the latent tensor (which basically contains the video) depends on the resolution (height x width x frames, plus a bit extra from padding), increasing the number of frames makes the tensor larger, which needs more VRAM. But compared to the VRAM needed for the unet model, the tensor itself is quite small, so you might be able to increase the frames a bit without issues; see the estimate below.
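As a rough back-of-the-envelope illustration of why the latent tensor is small: the downsampling factors and channel count below are illustrative assumptions, not the exact LTX-Video VAE configuration.

def latent_mib(height, width, frames,
               spatial_down=32, temporal_down=8,
               channels=128, bytes_per_elem=2):  # bfloat16
    # Elements in the compressed latent video, times bytes per element.
    elems = (height // spatial_down) * (width // spatial_down) \
        * (frames // temporal_down + 1) * channels
    return elems * bytes_per_elem / 2**20

# 512x768 at 100 frames comes out to roughly 1 MiB under these assumptions,
# which is tiny next to the several GiB taken by the model weights.
print(f"{latent_mib(768, 512, 100):.1f} MiB")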
First of all, thank you for implementing this so that it takes less VRAM. I have tried it out a couple of times (at a resolution of 704x480 and for 257 frames) and it works like a charm, using only around 16 GB of a 4090 GPU. However, it randomly throws an error related to "cpu" and "cuda" tensors. Re-running the script usually works, so it is not a big deal.
This was the error:
Traceback (most recent call last):
File "/home/mrt/Projects/LTX-Video/inference.py", line 452, in <module>
main()
File "/home/mrt/Projects/LTX-Video/inference.py", line 356, in main
images = pipeline(
File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/mrt/Projects/LTX-Video/ltx_video/pipelines/pipeline_ltx_video.py", line 1039, in __call__
noise_pred = self.transformer(
File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mrt/Projects/LTX-Video/ltx_video/models/transformers/transformer3d.py", line 419, in forward
encoder_hidden_states = self.caption_projection(encoder_hidden_states)
File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 1607, in forward
hidden_states = self.linear_1(caption)
File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/home/mrt/Projects/LTX-Video/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
@MarcosRodrigoT Do you use the new test file from @KT313, or the previous one? @KT313, is your new test code for multiple GPUs?
Edit: I tried the test file and it works with more frames than the previous one, but I see the same error; retrying it somehow works. What is really going on? Why does restarting the command work?
Edit 2: @KT313 maybe this line is causing the CUDA/CPU inconsistencies? (in inference.py)
if torch.cuda.is_available() and args.disable_load_needed_only:
    pipeline = pipeline.to("cuda")
Edit 4: I think it works better if the above is replaced with just
pipeline = pipeline.to("cuda")
to prevent
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
@x4080 I changed the code on the test branch to
if torch.cuda.is_available():
    pipeline = pipeline.to("cuda")
as you suggested. You might be able to get away with less than 16 GiB if you don't load the whole pipeline to cuda at the beginning and instead first load only the text encoder, then unload it, and only then load the unet, but that would require more trial and error, so if your suggestion works it's the easiest for now. A rough sketch of that staged approach is below.
I tried it on a single GPU only (a 4090). I'm not sure about multi-GPU, but the original code also doesn't have anything that specifically hints at multi-GPU support, at least not in the parts that I modified.
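Untested sketch of that "load one model at a time" idea; the attribute names and the encode_prompt helper are assumptions about the pipeline, not the repo's actual API:

import gc
import torch

def encode_prompt_then_load_unet(pipeline, prompt, device="cuda"):
    # Phase 1: only the (large) text encoder lives on the GPU.
    pipeline.text_encoder.to(device)
    with torch.no_grad():
        prompt_embeds = pipeline.encode_prompt(prompt)  # assumed helper
    pipeline.text_encoder.to("cpu")
    gc.collect()
    torch.cuda.empty_cache()

    # Phase 2: with the text encoder gone, the denoising model fits more easily.
    pipeline.transformer.to(device)
    return prompt_embeds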
@KT313 thanks
By the way, just for future readers: you might be able to get away with something as low as 8 or 6 GB if the text embedding is done on the CPU or separately somehow. The generation model itself should only need about 4-5 GiB if loaded in bfloat16 (2 bytes per parameter), plus some extra for the latent video tensor. Most of the VRAM currently gets used up by the text embedding model, which is comparatively huge. If the text gets embedded into tensors on the CPU, it might be pretty slow, though. The quick arithmetic behind those numbers is below.
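Quick weight-memory arithmetic; the parameter counts are illustrative assumptions, not measured values:

def weights_gib(num_params, bytes_per_param=2):  # bfloat16: 2 bytes per parameter
    return num_params * bytes_per_param / 2**30

print(f"~2B-param generator   : {weights_gib(2e9):.1f} GiB")  # roughly the few GiB quoted above
print(f"~5B-param text encoder: {weights_gib(5e9):.1f} GiB")  # why the text model dominates VRAM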
@KT313 I tried with width: 1280, height: 704, num_frames: 201, fps: 16. The video is fine up to 160 frames, but the remaining 41 frames are not good and have noise in them. Why?
@anujsinha72094 That's pretty unlikely to be related to the changes I made, lol.
A small passage with VRAM info would be nice :)