I tried an A100 (40 GB SXM4) with 30 vCPUs, 200 GiB RAM, and a 512 GiB SSD, but it immediately ran out of CUDA memory.
Which card / config should I use? 8x A100 80 GB? 1x H100 80 GB? 8x H100 80 GB?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 538.00 MiB (GPU 0; 39.39 GiB total capacity; 37.39 GiB already allocated; 233.94 MiB free; 38.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation
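The OOM fires inside the axial-attention matmul `q @ k.transpose(-1, -2)`, whose score matrix grows with the square of the sequence length. A rough back-of-envelope sketch (the shapes below are illustrative guesses, not the model's actual dimensions) shows why the frame count, not vCPUs or host RAM, dominates the footprint:

```python
def attn_score_bytes(batch, heads, seq_len, dtype_bytes=4):
    """Memory for one attention score matrix of shape
    (batch, heads, seq_len, seq_len) in float32."""
    return batch * heads * seq_len * seq_len * dtype_bytes

# Illustrative shapes only: doubling the sequence length quadruples
# the score matrix, so trimming frames helps far more than adding RAM.
small = attn_score_bytes(batch=1, heads=8, seq_len=4096)
large = attn_score_bytes(batch=1, heads=8, seq_len=8192)
print(small / 2**20, "MiB")   # 512.0 MiB
print(large / small)          # 4.0
```

So before renting a larger card, it is worth re-running with a smaller `--num-frames` (the log below shows only 300 frames can be sampled from the clip anyway).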
(opensora) ubuntu@129-146-126-183:~/opensora-arizona/Open-Sora-Plan$ python ./src/sora/modules/ae/vqvae/videogpt/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
warnings.warn(
/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
Downloading...
From (original): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5
From (redirected): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5&confirm=t&uuid=edea95d1-1e18-41c1-8b57-966749fb41ad
To: /home/ubuntu/opensora-arizona/Open-Sora-Plan/ucf101_stride4x4x4
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 258M/258M [00:05<00:00, 45.4MB/s]
sample_frames_len 500, only can sample 300 assets/origin_video_0.mp4 300
Traceback (most recent call last):
File "./src/sora/modules/ae/vqvae/videogpt/rec_video.py", line 110, in <module>
main(args)
File "./src/sora/modules/ae/vqvae/videogpt/rec_video.py", line 92, in main
encodings, embeddings = vqvae.encode(x_vae, include_embeddings=True)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 38, in encode
h = self.pre_vq_conv(self.encoder(x))
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 241, in forward
h = self.res_stack(h)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 125, in forward
return x + self.block(x)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 104, in forward
x = self.attn_w(x, x, x) + self.attn_h(x, x, x) + self.attn_t(x, x, x)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 193, in forward
a = self.attn(q, k, v, decode_step, decode_idx)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 244, in forward
out = scaled_dot_product_attention(q, k, v, training=self.training)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 500, in scaled_dot_product_attention
attn = torch.matmul(q, k.transpose(-1, -2))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 538.00 MiB (GPU 0; 39.39 GiB total capacity; 37.39 GiB already allocated; 233.94 MiB free; 38.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
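The error message itself suggests one thing to try before scaling hardware: the `max_split_size_mb` allocator hint. It only takes effect if it is in the environment before PyTorch makes its first CUDA allocation. A minimal sketch (the value 128 is an arbitrary starting point, not a tuned recommendation):

```python
import os

# Set the allocator hint before importing torch / before the first
# CUDA allocation; otherwise the caching allocator ignores it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# ...then import torch and run rec_video.py as usual. Equivalently,
# prefix the shell command:
#   PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python rec_video.py ...
```

This only mitigates fragmentation; with 37.39 GiB already allocated on a 40 GiB card, reducing `--num-frames` (or moving to an 80 GiB A100/H100) is likely still needed.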