THUDM / CogVideo

Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432 #233

Open chenxinli001 opened 1 week ago

chenxinli001 commented 1 week ago

System Info / 系統信息

When I fine-tune CogVideoX-2B, I found that almost all steps are skipped and the loss scale is very large.
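As background on these messages: they come from DeepSpeed's dynamic fp16 loss scaler, which multiplies the loss by a large factor before backward() and, whenever the scaled gradients contain Inf/NaN, skips the optimizer step and halves the factor; after a long enough run of clean steps it doubles the factor again. The sketch below is a minimal model of that behavior, not DeepSpeed's actual loss_scaler.py; the numbers (initial scale 2**32, window 400, minimum 1) are read off the logs and config later in this thread.

class DynamicLossScaler:
    """Simplified model of DeepSpeed-style dynamic fp16 loss scaling."""

    def __init__(self, init_scale=2**32, scale_window=400, min_scale=1.0, scale_factor=2.0):
        self.cur_scale = init_scale        # loss is multiplied by this before backward()
        self.scale_window = scale_window   # clean steps required before growing the scale
        self.min_scale = min_scale
        self.scale_factor = scale_factor
        self.good_steps = 0

    def update_scale(self, overflow: bool) -> bool:
        """Return True if the optimizer step should be applied."""
        if overflow:
            # Inf/NaN in the fp16 gradients: halve the scale and skip the step.
            # This is what produces "OVERFLOW! Rank 0 Skipping step" in the log.
            self.cur_scale = max(self.cur_scale / self.scale_factor, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.scale_window == 0:
            # A full window of overflow-free steps: try a larger scale again.
            self.cur_scale *= self.scale_factor
        return True


scaler = DynamicLossScaler()
print(scaler.update_scale(overflow=True), scaler.cur_scale)  # False 2147483648.0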

Information / 问题信息

Reproduction / 复现过程

Just run:

#! /bin/bash

echo "RUN on `hostname`, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"

run_cmd="$environs python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"

echo ${run_cmd}
eval ${run_cmd}

echo "DONE on `hostname`"

Expected behavior / 期待表现

Is this normal?

zRzRzRzRzRzRzR commented 1 week ago

Nope, can you share the log?

chenxinli001 commented 1 week ago

(cogvideo) ubuntu@instance-butter:/data3/cx_workspace/CogV/CogVideo/sat$ bash finetune_single_gpu.sh RUN on instance-butter, CUDA_VISIBLE_DEVICES=6 WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 22338 [2024-09-03 02:36:29,937] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4 [WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible /data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead. def forward(ctx, input, weight, bias=None): /data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead. def backward(ctx, grad_output): /data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/kornia/feature/lightglue.py:44: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead. @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32) /data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_fwd") /data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch. @torch.library.impl_abstract("xformers_flash::flash_bwd") [2024-09-03 02:36:34,890] [INFO] using world size: 1 [2024-09-03 02:36:34,891] [INFO] Will override arguments with manually specified deepspeed_config! [2024-09-03 02:36:34,893] [INFO] [RANK 0] > initializing model parallel with size 1 [2024-09-03 02:36:34,894] [INFO] [comm.py:637:init_distributed] cdb=None [2024-09-03 02:36:34,922] [INFO] [RANK 0] building SATVideoDiffusionEngine model ... 
[2024-09-03 02:36:44,771] [INFO] [RANK 0] replacing layer 0 attention with lora [2024-09-03 02:36:44,904] [INFO] [RANK 0] replacing layer 1 attention with lora [2024-09-03 02:36:45,021] [INFO] [RANK 0] replacing layer 2 attention with lora [2024-09-03 02:36:45,131] [INFO] [RANK 0] replacing layer 3 attention with lora [2024-09-03 02:36:45,241] [INFO] [RANK 0] replacing layer 4 attention with lora [2024-09-03 02:36:45,352] [INFO] [RANK 0] replacing layer 5 attention with lora [2024-09-03 02:36:45,466] [INFO] [RANK 0] replacing layer 6 attention with lora [2024-09-03 02:36:45,575] [INFO] [RANK 0] replacing layer 7 attention with lora [2024-09-03 02:36:45,687] [INFO] [RANK 0] replacing layer 8 attention with lora [2024-09-03 02:36:45,851] [INFO] [RANK 0] replacing layer 9 attention with lora [2024-09-03 02:36:45,957] [INFO] [RANK 0] replacing layer 10 attention with lora [2024-09-03 02:36:46,063] [INFO] [RANK 0] replacing layer 11 attention with lora [2024-09-03 02:36:46,173] [INFO] [RANK 0] replacing layer 12 attention with lora [2024-09-03 02:36:46,280] [INFO] [RANK 0] replacing layer 13 attention with lora [2024-09-03 02:36:46,387] [INFO] [RANK 0] replacing layer 14 attention with lora [2024-09-03 02:36:46,495] [INFO] [RANK 0] replacing layer 15 attention with lora [2024-09-03 02:36:46,606] [INFO] [RANK 0] replacing layer 16 attention with lora [2024-09-03 02:36:46,761] [INFO] [RANK 0] replacing layer 17 attention with lora [2024-09-03 02:36:46,901] [INFO] [RANK 0] replacing layer 18 attention with lora [2024-09-03 02:36:47,044] [INFO] [RANK 0] replacing layer 19 attention with lora [2024-09-03 02:36:47,171] [INFO] [RANK 0] replacing layer 20 attention with lora [2024-09-03 02:36:47,291] [INFO] [RANK 0] replacing layer 21 attention with lora [2024-09-03 02:36:47,397] [INFO] [RANK 0] replacing layer 22 attention with lora [2024-09-03 02:36:47,506] [INFO] [RANK 0] replacing layer 23 attention with lora [2024-09-03 02:36:47,610] [INFO] [RANK 0] replacing layer 24 attention with lora [2024-09-03 02:36:47,774] [INFO] [RANK 0] replacing layer 25 attention with lora [2024-09-03 02:36:47,881] [INFO] [RANK 0] replacing layer 26 attention with lora [2024-09-03 02:36:47,986] [INFO] [RANK 0] replacing layer 27 attention with lora [2024-09-03 02:36:48,095] [INFO] [RANK 0] replacing layer 28 attention with lora [2024-09-03 02:36:48,208] [INFO] [RANK 0] replacing layer 29 attention with lora Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.28s/it] Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False Working with z of shape (1, 16, 32, 32) = 16384 dimensions. /data3/cx_workspace/CogV/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. 
We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. sd = torch.load(path, map_location="cpu")["state_dict"] Deleting key loss.logvar from state_dict. Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict. Deleting key loss.perceptual_loss.scaling_layer.scale from state_dict. Deleting key loss.perceptual_loss.net.slice1.0.weight from state_dict. Deleting key loss.perceptual_loss.net.slice1.0.bias from state_dict. Deleting key loss.perceptual_loss.net.slice1.2.weight from state_dict. Deleting key loss.perceptual_loss.net.slice1.2.bias from state_dict. Deleting key loss.perceptual_loss.net.slice2.5.weight from state_dict. Deleting key loss.perceptual_loss.net.slice2.5.bias from state_dict. Deleting key loss.perceptual_loss.net.slice2.7.weight from state_dict. Deleting key loss.perceptual_loss.net.slice2.7.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.10.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.10.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.12.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.12.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.14.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.14.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.17.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.17.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.19.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.19.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.21.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.21.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.24.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.24.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.26.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.26.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.28.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.28.bias from state_dict. Deleting key loss.perceptual_loss.lin0.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin1.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin2.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin3.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin4.model.1.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.net.0.conv.weight from state_dict. 
Deleting key loss.discriminator.blocks.1.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.downsample.1.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.downsample.1.bias from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.downsample.1.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.downsample.1.bias from state_dict. 
Deleting key loss.discriminator.blocks.5.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.to_logits.0.weight from state_dict. Deleting key loss.discriminator.to_logits.0.bias from state_dict. Deleting key loss.discriminator.to_logits.3.weight from state_dict. Deleting key loss.discriminator.to_logits.3.bias from state_dict. Missing keys: [] Unexpected keys: [] Restored from CogVideoX-2b-sat/vae/3d-vae.pt [2024-09-03 02:36:56,856] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 6764790755 [2024-09-03 02:37:15,810] [INFO] [RANK 0] global rank 0 is loading checkpoint CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt /data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/sat/training/model_io.py:286: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
sd = torch.load(checkpoint_name, map_location='cpu') [2024-09-03 02:37:17,758] [INFO] [RANK 0] > successfully loaded CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt [2024-09-03 02:37:18,437] [INFO] [RANK 0] Total trainable parameters: 58982400 [2024-09-03 02:37:18,437] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay [2024-09-03 02:37:18,440] [INFO] [RANK 0] Syncing initialized parameters... [2024-09-03 02:37:18,503] [INFO] [RANK 0] Finished syncing initialized parameters. [2024-09-03 02:37:18,503] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat. [2024-09-03 02:37:18,503] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown [2024-09-03 02:37:18,503] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead [2024-09-03 02:37:18,646] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /data1/.cache/torch_extensions/py310_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /data1/.cache/torch_extensions/py310_cu121/fused_ema_adam/build.ninja... /data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( Building extension module fused_ema_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_ema_adam... 
Time to load fused_ema_adam op: 0.07278060913085938 seconds [2024-09-03 02:37:18,724] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer [2024-09-03 02:37:18,725] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2024-09-03 02:37:18,762] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam [2024-09-03 02:37:18,763] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type=<class 'sat.ops.fused_ema_adam.FusedEmaAdam'> [2024-09-03 02:37:18,763] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] ** You are using ZeRO with an untested optimizer, proceed with caution *** [2024-09-03 02:37:18,763] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer [2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:148:init] Reduce bucket size 1000000000 [2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:149:init] Allgather bucket size 1000000000 [2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:150:init] CPU Offload: False [2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:151:init] Round robin gradient partitioning: False [2024-09-03 02:37:23,295] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2024-09-03 02:37:23,295] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.97 GB CA 13.23 GB Max_CA 13 GB [2024-09-03 02:37:23,295] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 478.67 GB, percent = 23.7% [2024-09-03 02:37:23,814] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2024-09-03 02:37:23,814] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 13.08 GB CA 13.45 GB Max_CA 13 GB [2024-09-03 02:37:23,814] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 483.06 GB, percent = 24.0% [2024-09-03 02:37:23,815] [INFO] [stage_1_and_2.py:543:init] optimizer state initialized [2024-09-03 02:37:24,129] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2024-09-03 02:37:24,130] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 13.45 GB Max_CA 13 GB [2024-09-03 02:37:24,130] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 485.83 GB, percent = 24.1% [2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None [2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]] [2024-09-03 02:37:24,137] [INFO] [config.py:997:print] DeepSpeedEngine configuration: [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] amp_enabled .................. False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] amp_params ................... 
False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] bfloat16_enabled ............. False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x15507917fbb0> [2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] communication_data_type ...... None [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dataloader_drop_last ......... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] disable_allgather ............ False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dump_state ................... 
False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] elasticity_enabled ........... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_auto_cast ............... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_enabled ................. True [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] global_rank .................. 0 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] grad_accum_dtype ............. None [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_clipping ............ 0.1 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] graph_harvesting ............. False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 65536 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] load_universal_checkpoint .... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] loss_scale ................... 0 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] memory_breakdown ............. False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] mics_shard_size .............. -1 [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] nebula_config ................ 
{ "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_name ............... None [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_params ............. None [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pld_enabled .................. False [2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pld_params ................... False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] prescale_gradients ........... False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] scheduler_name ............... None [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] scheduler_params ............. None [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32 [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] sparse_attention ............. None [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] steps_per_print .............. 50 [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] train_batch_size ............. 2 [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 2 [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] use_data_before_expertparallel False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] use_node_local_storage ....... False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] weight_quantization_config ... None [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] world_size ................... 1 [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_allow_untested_optimizer True [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_config .................. 
stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_enabled ................. True [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True [2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2 [2024-09-03 02:37:24,139] [INFO] [config.py:987:print_user_config] json = { "train_micro_batch_size_per_gpu": 2, "gradient_accumulation_steps": 1, "steps_per_print": 50, "gradient_clipping": 0.1, "zero_optimization": { "stage": 2, "cpu_offload": false, "contiguous_gradients": false, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "allgather_bucket_size": 1.000000e+09, "load_from_fp32_weights": false }, "zero_allow_untested_optimizer": true, "bf16": { "enabled": false }, "fp16": { "enabled": true }, "loss_scale": 0, "loss_scale_window": 400, "hysteresis": 2, "min_loss_scale": 1, "activation_checkpointing": { "partition_activations": false, "contiguous_memory_optimization": false }, "wall_clock_breakdown": false } [2024-09-03 02:37:24,139] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0 [2024-09-03 02:37:24,139] [INFO] [RANK 0] Finetuning Model... [2024-09-03 02:37:24,139] [INFO] [RANK 0] arguments: [2024-09-03 02:37:24,139] [INFO] [RANK 0] base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml'] [2024-09-03 02:37:24,139] [INFO] [RANK 0] model_parallel_size .......... 1 [2024-09-03 02:37:24,139] [INFO] [RANK 0] force_pretrain ............... False [2024-09-03 02:37:24,139] [INFO] [RANK 0] device ....................... 0 [2024-09-03 02:37:24,139] [INFO] [RANK 0] debug ........................ False [2024-09-03 02:37:24,139] [INFO] [RANK 0] log_image .................... True [2024-09-03 02:37:24,139] [INFO] [RANK 0] output_dir ................... samples [2024-09-03 02:37:24,139] [INFO] [RANK 0] input_dir .................... None [2024-09-03 02:37:24,139] [INFO] [RANK 0] input_type ................... cli [2024-09-03 02:37:24,139] [INFO] [RANK 0] input_file ................... input.txt [2024-09-03 02:37:24,139] [INFO] [RANK 0] final_size ................... 2048 [2024-09-03 02:37:24,140] [INFO] [RANK 0] sdedit ....................... False [2024-09-03 02:37:24,140] [INFO] [RANK 0] grid_num_rows ................ 1 [2024-09-03 02:37:24,140] [INFO] [RANK 0] force_inference .............. False [2024-09-03 02:37:24,140] [INFO] [RANK 0] lcm_steps .................... 
None [2024-09-03 02:37:24,140] [INFO] [RANK 0] sampling_num_frames .......... 32 [2024-09-03 02:37:24,140] [INFO] [RANK 0] sampling_fps ................. 8 [2024-09-03 02:37:24,140] [INFO] [RANK 0] only_save_latents ............ False [2024-09-03 02:37:24,140] [INFO] [RANK 0] only_log_video_latents ....... True [2024-09-03 02:37:24,140] [INFO] [RANK 0] latent_channels .............. 32 [2024-09-03 02:37:24,140] [INFO] [RANK 0] image2video .................. False [2024-09-03 02:37:24,140] [INFO] [RANK 0] experiment_name .............. example_data-09-03-02-36 [2024-09-03 02:37:24,140] [INFO] [RANK 0] train_iters .................. 1000 [2024-09-03 02:37:24,140] [INFO] [RANK 0] batch_size ................... 2 [2024-09-03 02:37:24,140] [INFO] [RANK 0] lr ........................... 0.001 [2024-09-03 02:37:24,140] [INFO] [RANK 0] mode ......................... finetune [2024-09-03 02:37:24,140] [INFO] [RANK 0] seed ......................... 22338 [2024-09-03 02:37:24,140] [INFO] [RANK 0] zero_stage ................... 0 [2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_activations ....... True [2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_num_layers ........ 1 [2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_skip_layers ....... 0 [2024-09-03 02:37:24,140] [INFO] [RANK 0] fp16 ......................... True [2024-09-03 02:37:24,140] [INFO] [RANK 0] bf16 ......................... False [2024-09-03 02:37:24,140] [INFO] [RANK 0] gradient_accumulation_steps .. 1 [2024-09-03 02:37:24,140] [INFO] [RANK 0] profiling .................... -1 [2024-09-03 02:37:24,140] [INFO] [RANK 0] epochs ....................... None [2024-09-03 02:37:24,140] [INFO] [RANK 0] log_interval ................. 20 [2024-09-03 02:37:24,140] [INFO] [RANK 0] summary_dir .................. [2024-09-03 02:37:24,140] [INFO] [RANK 0] save_args .................... False [2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_iters ............... None [2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_style ............... linear [2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_ratio ............... 0.1 [2024-09-03 02:37:24,140] [INFO] [RANK 0] warmup ....................... 0.01 [2024-09-03 02:37:24,140] [INFO] [RANK 0] weight_decay ................. 0.0001 [2024-09-03 02:37:24,140] [INFO] [RANK 0] save ......................... ckpts_2b/example_data-09-03-02-36 [2024-09-03 02:37:24,140] [INFO] [RANK 0] load ......................... CogVideoX-2b-sat/transformer [2024-09-03 02:37:24,140] [INFO] [RANK 0] force_train .................. True [2024-09-03 02:37:24,140] [INFO] [RANK 0] save_interval ................ 500 [2024-09-03 02:37:24,140] [INFO] [RANK 0] no_save_rng .................. False [2024-09-03 02:37:24,140] [INFO] [RANK 0] no_load_rng .................. True [2024-09-03 02:37:24,140] [INFO] [RANK 0] resume_dataloader ............ False [2024-09-03 02:37:24,141] [INFO] [RANK 0] distributed_backend .......... nccl [2024-09-03 02:37:24,141] [INFO] [RANK 0] local_rank ................... 0 [2024-09-03 02:37:24,141] [INFO] [RANK 0] exit_interval ................ None [2024-09-03 02:37:24,141] [INFO] [RANK 0] wandb ........................ False [2024-09-03 02:37:24,141] [INFO] [RANK 0] wandb_project_name ........... default_project [2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_batch_size .............. 1 [2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_iters ................... 1 [2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_interval ................ 
100 [2024-09-03 02:37:24,141] [INFO] [RANK 0] strict_eval .................. False [2024-09-03 02:37:24,141] [INFO] [RANK 0] train_data ................... ['toy_data'] [2024-09-03 02:37:24,141] [INFO] [RANK 0] train_data_weights ........... None [2024-09-03 02:37:24,141] [INFO] [RANK 0] iterable_dataset ............. False [2024-09-03 02:37:24,141] [INFO] [RANK 0] iterable_dataset_eval ........ [2024-09-03 02:37:24,141] [INFO] [RANK 0] batch_from_same_dataset ...... False [2024-09-03 02:37:24,141] [INFO] [RANK 0] valid_data ................... ['toy_data'] [2024-09-03 02:37:24,141] [INFO] [RANK 0] test_data .................... None [2024-09-03 02:37:24,141] [INFO] [RANK 0] split ........................ 1,0,0 [2024-09-03 02:37:24,141] [INFO] [RANK 0] num_workers .................. 8 [2024-09-03 02:37:24,141] [INFO] [RANK 0] block_size ................... 10000 [2024-09-03 02:37:24,141] [INFO] [RANK 0] prefetch_factor .............. 4 [2024-09-03 02:37:24,141] [INFO] [RANK 0] deepspeed .................... True [2024-09-03 02:37:24,141] [INFO] [RANK 0] deepspeed_config ............. {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False} [2024-09-03 02:37:24,141] [INFO] [RANK 0] deepscale .................... False [2024-09-03 02:37:24,141] [INFO] [RANK 0] deepscale_config ............. None [2024-09-03 02:37:24,141] [INFO] [RANK 0] model_config ................. 
{'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': 'CogVideoX-2b-sat/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': 'CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': 
{'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}} [2024-09-03 02:37:24,141] [INFO] [RANK 0] data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}} [2024-09-03 02:37:24,141] [INFO] [RANK 0] cuda ......................... True [2024-09-03 02:37:24,142] [INFO] [RANK 0] rank ......................... 0 [2024-09-03 02:37:24,142] [INFO] [RANK 0] world_size ................... 1 [2024-09-03 02:37:24,142] [INFO] [RANK 0] deepspeed_activation_checkpointing True [2024-09-03 02:37:24,142] [INFO] [RANK 0] master_ip .................... localhost [2024-09-03 02:37:24,142] [INFO] [RANK 0] master_port .................. 38137 [2024-09-03 02:37:24,142] [INFO] [RANK 0] log_config ................... [{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': 'CogVideoX-2b-sat/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': 'CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 
'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'example_data', 'mode': 'finetune', 'load': 'CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 1000, 'eval_iters': 1, 'eval_interval': 100, 'eval_batch_size': 1, 'save': 'ckpts_2b', 'save_interval': 500, 'log_interval': 20, 'train_data': ['toy_data'], 'valid_data': ['toy_data'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}] [2024-09-03 02:37:24,142] [INFO] [RANK 0] do_train ..................... True [2024-09-03 02:37:24,142] [INFO] [RANK 0] val_last_shape ............... [] [2024-09-03 02:37:24,142] [INFO] [RANK 0] val_drop_number .............. 0 [2024-09-03 02:37:24,142] [INFO] [RANK 0] do_valid ..................... True [2024-09-03 02:37:24,142] [INFO] [RANK 0] do_test ...................... False [2024-09-03 02:37:24,142] [INFO] [RANK 0] iteration .................... 
[2024-09-03 02:36:44,771] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-09-03 02:36:44,904] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-09-03 02:36:45,021] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-09-03 02:36:45,131] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-09-03 02:36:45,241] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-09-03 02:36:45,352] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-09-03 02:36:45,466] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-09-03 02:36:45,575] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-09-03 02:36:45,687] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-09-03 02:36:45,851] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-09-03 02:36:45,957] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-09-03 02:36:46,063] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-09-03 02:36:46,173] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-09-03 02:36:46,280] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-09-03 02:36:46,387] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-09-03 02:36:46,495] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:545:forward] ----Synchronization False
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-03 02:38:17,054] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-03 02:38:39,582] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-03 02:39:01,823] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-03 02:39:25,772] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-03 02:40:34,024] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-03 02:42:31,555] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-03 02:43:17,654] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
[2024-09-03 02:45:38,242] [INFO] [RANK 0] iteration 20/ 1000 | elapsed time per iteration (ms): 24611.4 | learning rate 5.000E-05 | total loss 2.157213E-01 | loss 2.157214E-01 | loss scale 33554432.0 | speed 4.88 samples/(min*GPU)
[2024-09-03 02:45:38,244] [INFO] [RANK 0] after 20 iterations memory (MB) | allocated: 13974.6455078125 | max allocated: 64453.90478515625 | cached: 22772.0 | max cached: 76136.0
[2024-09-03 02:45:38,244] [INFO] [RANK 0] time (ms) | forward: 15575.33 | backward: 8930.58 | allreduce: 0.00 | optimizer: 101.48 | data loader: 19.31
[2024-09-03 02:53:15,005] [INFO] [RANK 0] iteration 40/ 1000 | elapsed time per iteration (ms): 22838.1 | learning rate 5.000E-05 | total loss 2.180460E-01 | loss 2.180460E-01 | loss scale 33554432.0 | speed 5.25 samples/(min*GPU)
[2024-09-03 02:53:15,006] [INFO] [RANK 0] time (ms) | forward: 13797.73 | backward: 8961.48 | allreduce: 0.00 | optimizer: 74.96 | data loader: 0.37
[2024-09-03 02:54:01,688] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432, reducing to 16777216
[2024-09-03 02:57:02,117] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=8, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-03 03:00:52,438] [INFO] [RANK 0] iteration 60/ 1000 | elapsed time per iteration (ms): 22871.6 | learning rate 5.000E-05 | total loss 2.016617E-01 | loss 2.016617E-01 | loss scale 16777216.0 | speed 5.25 samples/(min*GPU)
[2024-09-03 03:00:52,438] [INFO] [RANK 0] time (ms) | forward: 13888.08 | backward: 8902.19 | allreduce: 0.00 | optimizer: 77.79 | data loader: 0.66
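As a quick sanity check, the speed field in these lines is consistent with the reported per-iteration time and a micro-batch size of 2 per GPU (the value in the sft.yaml posted further down the thread):

```python
# Throughput check for the iteration-20 line: 2 samples per iteration on 1 GPU.
ms_per_iter = 24611.4
samples_per_min = 2 * 60_000 / ms_per_iter
print(f"{samples_per_min:.2f} samples/(min*GPU)")  # -> 4.88, matching the log
```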

chenxinli001 commented 1 week ago

I was running finetune_single_gpu.sh:

export CUDA_VISIBLE_DEVICES=6

echo "RUN on `hostname`, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"

run_cmd="$environs python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"

echo ${run_cmd}
eval ${run_cmd}

echo "DONE on `hostname`"

I keep hitting these huge loss scales like the ones below, with the log saying the step is skipped. Is this normal?

[2024-09-03 02:38:17,054] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-03 02:38:39,582] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-03 02:39:01,823] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-03 02:39:25,772] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-03 02:40:34,024] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-03 02:42:31,555] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-03 02:43:17,654] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
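For context on what these lines mean: with fp16 training DeepSpeed uses dynamic loss scaling. A simplified sketch of the update rule (illustrative only, omitting the hysteresis handling; the real logic lives in deepspeed/runtime/fp16/loss_scaler.py) looks like this:

```python
class DynamicLossScalerSketch:
    """Simplified model of DeepSpeed-style dynamic loss scaling (not the real API)."""

    def __init__(self, init_scale=2**32, scale_window=400, min_scale=1):
        self.cur_scale = init_scale      # this run evidently started at 2**32
        self.scale_window = scale_window # 'loss_scale_window' in sft.yaml
        self.min_scale = min_scale       # 'min_loss_scale' in sft.yaml
        self.good_steps = 0

    def update_scale(self, overflow: bool):
        if overflow:
            # Inf/NaN gradients: skip the optimizer step and halve the scale.
            # This is exactly the "OVERFLOW! ... reducing to ..." log line.
            self.cur_scale = max(self.cur_scale // 2, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.scale_window == 0:
                self.cur_scale *= 2  # ramp back up after a stable window
```

Starting from 2**32 = 4294967296, seven consecutive halvings land on 2**25 = 33554432, exactly the sequence above. The huge number is the scale, not the loss, so a handful of skipped steps right after startup is expected while the scaler searches for a workable value.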

zRzRzRzRzRzRzR commented 1 week ago

That's not normal, and that loss scale isn't normal either. How large is your dataset?

From the log it looks like there are no errors at all, just skipped steps?

TianxingWu commented 1 week ago

Same behavior here on a 4×A100 machine 👀

kyrie111 commented 1 week ago

Same here on 8×A800.

zRzRzRzRzRzRzR commented 1 week ago

Did everyone have all steps skipped? @kyrie111 @TianxingWu Skipping the first few steps and then continuing with normal training, with the loss coming down, is a normal phenomenon; the first few steps are skipped because the loss is indeed too large at that point.
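The skipped counter in the DeepSpeed step log makes this easy to check; reading the figures from the first log above (illustrative arithmetic only):

```python
# From the earlier log line "step=50, skipped=8": 8 of the first 50 optimizer
# steps were dropped for overflow, so 42 real parameter updates still happened.
steps, skipped = 50, 8
print(f"effective updates: {steps - skipped}")  # -> effective updates: 42
```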

KihongK commented 1 week ago

My issue is solved: see #261.

Same issue on 8×A100 80G (tried both a single GPU and all 8 GPUs).

I only tried the 2B model.

sft.yaml

```
args:
  checkpoint_activations: True ## using gradient checkpointing
  model_parallel_size: 1
  experiment_name: lora-test
  mode: finetune
  load: "/root/CogVideo/CogVideoX-2b-sat/transformer"
  no_load_rng: True
  train_iters: 100 # Suggest more than 1000 for LoRA; for SFT 500 is enough
  eval_iters: 1
  eval_interval: 10
  eval_batch_size: 1
  save: ckpts_2b_lora
  save_interval: 50
  log_interval: 20
  train_data: [ "/root/CogVideo/sat/datasets/test" ] # Train data path
  valid_data: [ "/root/CogVideo/sat/datasets/test" ] # Validation data path, can be the same as train_data (not recommended)
  split: 1,0,0
  num_workers: 8
  force_train: True
  only_log_video_latents: True

data:
  target: data_video.SFTDataset
  params:
    video_size: [ 480, 720 ]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.

deepspeed: # Minimum of 16 videos per batch across ALL GPUs; this setting is for 8 x A100 GPUs
  train_micro_batch_size_per_gpu: 2
  gradient_accumulation_steps: 1
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
    enabled: False # For CogVideoX-2B turn to False; for CogVideoX-5B turn to True
  fp16:
    enabled: True # For CogVideoX-2B turn to True; for CogVideoX-5B turn to False
  loss_scale: 0
  loss_scale_window: 400
  hysteresis: 2
  min_loss_scale: 1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.001 # Between 1E-3 and 5E-4 for LoRA, and 1E-5 for SFT
      betas: [ 0.9, 0.95 ]
      eps: 1e-8
      weight_decay: 1e-4
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false
```
cogvideox_2b.yaml

```
model:
  scale_factor: 1.15258426
  disable_first_stage_autocast: true
  log_keys:
    - txt

  denoiser_config:
    target: sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser
    params:
      num_idx: 1000
      quantize_c_noise: False
      weighting_config:
        target: sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting
      scaling_config:
        target: sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling
      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

  network_config:
    target: dit_video_concat.DiffusionTransformer
    params:
      time_embed_dim: 512
      elementwise_affine: True
      num_frames: 49
      time_compressed_rate: 4
      latent_width: 90
      latent_height: 60
      num_layers: 30
      patch_size: 2
      in_channels: 16
      out_channels: 16
      hidden_size: 1920
      adm_in_channels: 256
      num_attention_heads: 30

      transformer_args:
        checkpoint_activations: True ## using gradient checkpointing
        vocab_size: 1
        max_sequence_length: 64
        layernorm_order: pre
        skip_init: false
        model_parallel_size: 1
        is_decoder: false

      modules:
        pos_embed_config:
          target: dit_video_concat.Basic3DPositionEmbeddingMixin
          params:
            text_length: 226
            height_interpolation: 1.875
            width_interpolation: 1.875

        patch_embed_config:
          target: dit_video_concat.ImagePatchEmbeddingMixin
          params:
            text_hidden_size: 4096

        adaln_layer_config:
          target: dit_video_concat.AdaLNMixin
          params:
            qk_ln: True

        final_layer_config:
          target: dit_video_concat.FinalLayerMixin

  conditioner_config:
    target: sgm.modules.GeneralConditioner
    params:
      emb_models:
        - is_trainable: false
          input_key: txt
          ucg_rate: 0.1
          target: sgm.modules.encoders.modules.FrozenT5Embedder
          params:
            model_dir: "/root/CogVideo/t5-v1_1-xxl"
            max_length: 226

  first_stage_config:
    target: vae_modules.autoencoder.VideoAutoencoderInferenceWrapper
    params:
      cp_size: 1
      ckpt_path: "/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt"
      ignore_keys: [ 'loss' ]

      loss_config:
        target: torch.nn.Identity

      regularizer_config:
        target: vae_modules.regularizers.DiagonalGaussianRegularizer

      encoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelEncoder3D
        params:
          double_z: true
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: True

      decoder_config:
        target: vae_modules.cp_enc_dec.ContextParallelDecoder3D
        params:
          double_z: True
          z_channels: 16
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1, 2, 2, 4 ]
          attn_resolutions: [ ]
          num_res_blocks: 3
          dropout: 0.0
          gather_norm: False

  loss_fn_config:
    target: sgm.modules.diffusionmodules.loss.VideoDiffusionLoss
    params:
      offset_noise_level: 0
      sigma_sampler_config:
        target: sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling
        params:
          uniform_sampling: True
          num_idx: 1000
          discretization_config:
            target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
            params:
              shift_scale: 3.0

  sampler_config:
    target: sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler
    params:
      num_steps: 50
      verbose: True

      discretization_config:
        target: sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization
        params:
          shift_scale: 3.0

      guider_config:
        target: sgm.modules.diffusionmodules.guiders.DynamicCFG
        params:
          scale: 6
          exp: 5
          num_steps: 50
```
[1st Trial] finetune_single_gpu.sh ``` RUN on alphacode-ttv-a100-80g-gpu, CUDA_VISIBLE_DEVICES= WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 21247 [2024-09-09 16:39:11,302] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) [WARNING] async_io requires the dev libaio .so object and headers but these were not found. [WARNING] async_io: please install the libaio-dev package with apt [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found. [WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH [WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4 [WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible /root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. @autocast_custom_fwd /root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead. @autocast_custom_bwd /root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead. @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32) no module 'xformers'. Processing without... no module 'xformers'. Processing without... [2024-09-09 16:39:16,571] [INFO] using world size: 1 [2024-09-09 16:39:16,571] [INFO] Will override arguments with manually specified deepspeed_config! [W909 16:39:16.412494279 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [ip6-localhost]:39375 (errno: 97 - Address family not supported by protocol). [W909 16:39:16.413593009 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [alphacode-ttv-a100-80g-gpu]:39375 (errno: 97 - Address family not supported by protocol). [2024-09-09 16:39:16,591] [INFO] [RANK 0] > initializing model parallel with size 1 [2024-09-09 16:39:16,592] [INFO] [comm.py:637:init_distributed] cdb=None [2024-09-09 16:39:16,869] [INFO] [RANK 0] building SATVideoDiffusionEngine model ... 
[2024-09-09 16:39:26,340] [WARNING] [RANK 0] Failed to load bitsandbytes:No module named 'bitsandbytes' [2024-09-09 16:39:26,340] [INFO] [RANK 0] replacing layer 0 attention with lora [2024-09-09 16:39:26,364] [INFO] [RANK 0] replacing layer 1 attention with lora [2024-09-09 16:39:26,387] [INFO] [RANK 0] replacing layer 2 attention with lora [2024-09-09 16:39:26,411] [INFO] [RANK 0] replacing layer 3 attention with lora [2024-09-09 16:39:26,487] [INFO] [RANK 0] replacing layer 4 attention with lora [2024-09-09 16:39:26,518] [INFO] [RANK 0] replacing layer 5 attention with lora [2024-09-09 16:39:26,542] [INFO] [RANK 0] replacing layer 6 attention with lora [2024-09-09 16:39:26,567] [INFO] [RANK 0] replacing layer 7 attention with lora [2024-09-09 16:39:26,591] [INFO] [RANK 0] replacing layer 8 attention with lora [2024-09-09 16:39:26,621] [INFO] [RANK 0] replacing layer 9 attention with lora [2024-09-09 16:39:26,726] [INFO] [RANK 0] replacing layer 10 attention with lora [2024-09-09 16:39:26,870] [INFO] [RANK 0] replacing layer 11 attention with lora [2024-09-09 16:39:26,999] [INFO] [RANK 0] replacing layer 12 attention with lora [2024-09-09 16:39:27,074] [INFO] [RANK 0] replacing layer 13 attention with lora [2024-09-09 16:39:27,127] [INFO] [RANK 0] replacing layer 14 attention with lora [2024-09-09 16:39:27,206] [INFO] [RANK 0] replacing layer 15 attention with lora [2024-09-09 16:39:27,294] [INFO] [RANK 0] replacing layer 16 attention with lora [2024-09-09 16:39:27,379] [INFO] [RANK 0] replacing layer 17 attention with lora [2024-09-09 16:39:27,446] [INFO] [RANK 0] replacing layer 18 attention with lora [2024-09-09 16:39:27,528] [INFO] [RANK 0] replacing layer 19 attention with lora [2024-09-09 16:39:27,642] [INFO] [RANK 0] replacing layer 20 attention with lora [2024-09-09 16:39:27,715] [INFO] [RANK 0] replacing layer 21 attention with lora [2024-09-09 16:39:27,794] [INFO] [RANK 0] replacing layer 22 attention with lora [2024-09-09 16:39:27,854] [INFO] [RANK 0] replacing layer 23 attention with lora [2024-09-09 16:39:27,930] [INFO] [RANK 0] replacing layer 24 attention with lora [2024-09-09 16:39:27,960] [INFO] [RANK 0] replacing layer 25 attention with lora [2024-09-09 16:39:27,982] [INFO] [RANK 0] replacing layer 26 attention with lora [2024-09-09 16:39:28,004] [INFO] [RANK 0] replacing layer 27 attention with lora [2024-09-09 16:39:28,026] [INFO] [RANK 0] replacing layer 28 attention with lora [2024-09-09 16:39:28,048] [INFO] [RANK 0] replacing layer 29 attention with lora Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.13it/s] Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False Working with z of shape (1, 16, 32, 32) = 16384 dimensions. /root/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. sd = torch.load(path, map_location="cpu")["state_dict"] Deleting key loss.logvar from state_dict. Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict. Deleting key loss.perceptual_loss.scaling_layer.scale from state_dict. Deleting key loss.perceptual_loss.net.slice1.0.weight from state_dict. Deleting key loss.perceptual_loss.net.slice1.0.bias from state_dict. Deleting key loss.perceptual_loss.net.slice1.2.weight from state_dict. Deleting key loss.perceptual_loss.net.slice1.2.bias from state_dict. Deleting key loss.perceptual_loss.net.slice2.5.weight from state_dict. Deleting key loss.perceptual_loss.net.slice2.5.bias from state_dict. Deleting key loss.perceptual_loss.net.slice2.7.weight from state_dict. Deleting key loss.perceptual_loss.net.slice2.7.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.10.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.10.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.12.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.12.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.14.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.14.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.17.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.17.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.19.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.19.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.21.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.21.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.24.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.24.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.26.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.26.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.28.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.28.bias from state_dict. Deleting key loss.perceptual_loss.lin0.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin1.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin2.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin3.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin4.model.1.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample.conv.bias from state_dict. 
Deleting key loss.discriminator.blocks.1.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.downsample.1.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.downsample.1.bias from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.net.0.bias from state_dict. 
Deleting key loss.discriminator.blocks.5.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.downsample.1.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.downsample.1.bias from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.to_logits.0.weight from state_dict. Deleting key loss.discriminator.to_logits.0.bias from state_dict. Deleting key loss.discriminator.to_logits.3.weight from state_dict. Deleting key loss.discriminator.to_logits.3.bias from state_dict. Missing keys: [] Unexpected keys: [] Restored from /root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt [2024-09-09 16:39:32,189] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 6764790755 [2024-09-09 16:39:42,369] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt /root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/model_io.py:286: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. sd = torch.load(checkpoint_name, map_location='cpu') [2024-09-09 16:39:43,764] [INFO] [RANK 0] > successfully loaded /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt [2024-09-09 16:39:45,132] [INFO] [RANK 0] ***** Total trainable parameters: 58982400 ***** [2024-09-09 16:39:45,132] [INFO] [RANK 0] [, , ] is set to no_weight_decay [2024-09-09 16:39:45,136] [INFO] [RANK 0] Syncing initialized parameters... [2024-09-09 16:39:45,239] [INFO] [RANK 0] Finished syncing initialized parameters. [2024-09-09 16:39:45,239] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat. [2024-09-09 16:39:45,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown [2024-09-09 16:39:45,240] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead [2024-09-09 16:39:45,337] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /root/.cache/torch_extensions/py312_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py312_cu121/fused_ema_adam/build.ninja... /root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( Building extension module fused_ema_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_ema_adam... 
Time to load fused_ema_adam op: 0.7258331775665283 seconds [2024-09-09 16:39:46,219] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer [2024-09-09 16:39:46,219] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2024-09-09 16:39:46,239] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam [2024-09-09 16:39:46,239] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type= [2024-09-09 16:39:46,239] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution ***** [2024-09-09 16:39:46,239] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer [2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000 [2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000 [2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False [2024-09-09 16:39:46,239] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False [2024-09-09 16:39:48,450] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2024-09-09 16:39:48,450] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.97 GB CA 13.23 GB Max_CA 13 GB [2024-09-09 16:39:48,451] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 31.17 GB, percent = 1.6% [2024-09-09 16:39:48,690] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2024-09-09 16:39:48,691] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 13.08 GB CA 13.45 GB Max_CA 13 GB [2024-09-09 16:39:48,691] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 31.19 GB, percent = 1.6% [2024-09-09 16:39:48,691] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized [2024-09-09 16:39:48,948] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2024-09-09 16:39:48,949] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 13.45 GB Max_CA 13 GB [2024-09-09 16:39:48,949] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 31.25 GB, percent = 1.6% [2024-09-09 16:39:48,953] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2024-09-09 16:39:48,953] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2024-09-09 16:39:48,953] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None [2024-09-09 16:39:48,954] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]] [2024-09-09 16:39:48,956] [INFO] [config.py:997:print] DeepSpeedEngine configuration: [2024-09-09 16:39:48,957] [INFO] [config.py:1001:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-09-09 16:39:48,957] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2024-09-09 16:39:48,957] [INFO] [config.py:1001:print] amp_enabled .................. False [2024-09-09 16:39:48,957] [INFO] [config.py:1001:print] amp_params ................... 
False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] bfloat16_enabled ............. False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] comms_config ................. [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] communication_data_type ...... None [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] dataloader_drop_last ......... False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] disable_allgather ............ False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] dump_state ................... False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... 
None [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1 [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0 [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100 [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06 [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01 [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] elasticity_enabled ........... False [2024-09-09 16:39:48,958] [INFO] [config.py:1001:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] fp16_auto_cast ............... False [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] fp16_enabled ................. True [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] global_rank .................. 0 [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] grad_accum_dtype ............. None [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1 [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] gradient_clipping ............ 0.1 [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0 [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] graph_harvesting ............. False [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 65536 [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] load_universal_checkpoint .... False [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] loss_scale ................... 0 [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] memory_breakdown ............. False [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] mics_shard_size .............. -1 [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... 
False [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] optimizer_name ............... None [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] optimizer_params ............. None [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] pld_enabled .................. False [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] pld_params ................... False [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] prescale_gradients ........... False [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] scheduler_name ............... None [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] scheduler_params ............. None [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32 [2024-09-09 16:39:48,959] [INFO] [config.py:1001:print] sparse_attention ............. None [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] steps_per_print .............. 50 [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] train_batch_size ............. 2 [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 2 [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] use_node_local_storage ....... False [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] weight_quantization_config ... None [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] world_size ................... 1 [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] zero_allow_untested_optimizer True [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] zero_enabled ................. True [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True [2024-09-09 16:39:48,960] [INFO] [config.py:1001:print] zero_optimization_stage ...... 
2 [2024-09-09 16:39:48,960] [INFO] [config.py:987:print_user_config] json = { "train_micro_batch_size_per_gpu": 2, "gradient_accumulation_steps": 1, "steps_per_print": 50, "gradient_clipping": 0.1, "zero_optimization": { "stage": 2, "cpu_offload": false, "contiguous_gradients": false, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "allgather_bucket_size": 1.000000e+09, "load_from_fp32_weights": false }, "zero_allow_untested_optimizer": true, "bf16": { "enabled": false }, "fp16": { "enabled": true }, "loss_scale": 0, "loss_scale_window": 400, "hysteresis": 2, "min_loss_scale": 1, "activation_checkpointing": { "partition_activations": false, "contiguous_memory_optimization": false }, "wall_clock_breakdown": false } [2024-09-09 16:39:48,960] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0 [2024-09-09 16:39:48,960] [INFO] [RANK 0] Finetuning Model... [2024-09-09 16:39:48,960] [INFO] [RANK 0] arguments: [2024-09-09 16:39:48,960] [INFO] [RANK 0] base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml'] [2024-09-09 16:39:48,960] [INFO] [RANK 0] model_parallel_size .......... 1 [2024-09-09 16:39:48,960] [INFO] [RANK 0] force_pretrain ............... False [2024-09-09 16:39:48,961] [INFO] [RANK 0] device ....................... 0 [2024-09-09 16:39:48,961] [INFO] [RANK 0] debug ........................ False [2024-09-09 16:39:48,961] [INFO] [RANK 0] log_image .................... True [2024-09-09 16:39:48,961] [INFO] [RANK 0] output_dir ................... samples [2024-09-09 16:39:48,961] [INFO] [RANK 0] input_dir .................... None [2024-09-09 16:39:48,961] [INFO] [RANK 0] input_type ................... cli [2024-09-09 16:39:48,961] [INFO] [RANK 0] input_file ................... input.txt [2024-09-09 16:39:48,961] [INFO] [RANK 0] final_size ................... 2048 [2024-09-09 16:39:48,961] [INFO] [RANK 0] sdedit ....................... False [2024-09-09 16:39:48,961] [INFO] [RANK 0] grid_num_rows ................ 1 [2024-09-09 16:39:48,961] [INFO] [RANK 0] force_inference .............. False [2024-09-09 16:39:48,961] [INFO] [RANK 0] lcm_steps .................... None [2024-09-09 16:39:48,961] [INFO] [RANK 0] sampling_num_frames .......... 32 [2024-09-09 16:39:48,961] [INFO] [RANK 0] sampling_fps ................. 8 [2024-09-09 16:39:48,961] [INFO] [RANK 0] only_save_latents ............ False [2024-09-09 16:39:48,961] [INFO] [RANK 0] only_log_video_latents ....... True [2024-09-09 16:39:48,961] [INFO] [RANK 0] latent_channels .............. 32 [2024-09-09 16:39:48,961] [INFO] [RANK 0] image2video .................. False [2024-09-09 16:39:48,961] [INFO] [RANK 0] experiment_name .............. lora-test-09-09-16-39 [2024-09-09 16:39:48,961] [INFO] [RANK 0] train_iters .................. 100 [2024-09-09 16:39:48,961] [INFO] [RANK 0] batch_size ................... 2 [2024-09-09 16:39:48,961] [INFO] [RANK 0] lr ........................... 0.001 [2024-09-09 16:39:48,961] [INFO] [RANK 0] mode ......................... finetune [2024-09-09 16:39:48,961] [INFO] [RANK 0] seed ......................... 21247 [2024-09-09 16:39:48,961] [INFO] [RANK 0] zero_stage ................... 0 [2024-09-09 16:39:48,961] [INFO] [RANK 0] checkpoint_activations ....... True [2024-09-09 16:39:48,961] [INFO] [RANK 0] checkpoint_num_layers ........ 1 [2024-09-09 16:39:48,961] [INFO] [RANK 0] checkpoint_skip_layers ....... 0 [2024-09-09 16:39:48,961] [INFO] [RANK 0] fp16 ......................... 
True [2024-09-09 16:39:48,961] [INFO] [RANK 0] bf16 ......................... False [2024-09-09 16:39:48,962] [INFO] [RANK 0] gradient_accumulation_steps .. 1 [2024-09-09 16:39:48,962] [INFO] [RANK 0] profiling .................... -1 [2024-09-09 16:39:48,962] [INFO] [RANK 0] epochs ....................... None [2024-09-09 16:39:48,962] [INFO] [RANK 0] log_interval ................. 20 [2024-09-09 16:39:48,962] [INFO] [RANK 0] summary_dir .................. [2024-09-09 16:39:48,962] [INFO] [RANK 0] save_args .................... False [2024-09-09 16:39:48,962] [INFO] [RANK 0] lr_decay_iters ............... None [2024-09-09 16:39:48,962] [INFO] [RANK 0] lr_decay_style ............... linear [2024-09-09 16:39:48,962] [INFO] [RANK 0] lr_decay_ratio ............... 0.1 [2024-09-09 16:39:48,962] [INFO] [RANK 0] warmup ....................... 0.01 [2024-09-09 16:39:48,962] [INFO] [RANK 0] weight_decay ................. 0.0001 [2024-09-09 16:39:48,962] [INFO] [RANK 0] save ......................... ckpts_2b_lora/lora-test-09-09-16-39 [2024-09-09 16:39:48,962] [INFO] [RANK 0] load ......................... /root/CogVideo/CogVideoX-2b-sat/transformer [2024-09-09 16:39:48,962] [INFO] [RANK 0] force_train .................. True [2024-09-09 16:39:48,962] [INFO] [RANK 0] save_interval ................ 50 [2024-09-09 16:39:48,962] [INFO] [RANK 0] no_save_rng .................. False [2024-09-09 16:39:48,962] [INFO] [RANK 0] no_load_rng .................. True [2024-09-09 16:39:48,962] [INFO] [RANK 0] resume_dataloader ............ False [2024-09-09 16:39:48,962] [INFO] [RANK 0] distributed_backend .......... nccl [2024-09-09 16:39:48,962] [INFO] [RANK 0] local_rank ................... 0 [2024-09-09 16:39:48,962] [INFO] [RANK 0] exit_interval ................ None [2024-09-09 16:39:48,962] [INFO] [RANK 0] wandb ........................ False [2024-09-09 16:39:48,962] [INFO] [RANK 0] wandb_project_name ........... default_project [2024-09-09 16:39:48,962] [INFO] [RANK 0] eval_batch_size .............. 1 [2024-09-09 16:39:48,962] [INFO] [RANK 0] eval_iters ................... 1 [2024-09-09 16:39:48,962] [INFO] [RANK 0] eval_interval ................ 10 [2024-09-09 16:39:48,962] [INFO] [RANK 0] strict_eval .................. False [2024-09-09 16:39:48,962] [INFO] [RANK 0] train_data ................... ['/root/CogVideo/sat/datasets/test'] [2024-09-09 16:39:48,962] [INFO] [RANK 0] train_data_weights ........... None [2024-09-09 16:39:48,962] [INFO] [RANK 0] iterable_dataset ............. False [2024-09-09 16:39:48,963] [INFO] [RANK 0] iterable_dataset_eval ........ [2024-09-09 16:39:48,963] [INFO] [RANK 0] batch_from_same_dataset ...... False [2024-09-09 16:39:48,963] [INFO] [RANK 0] valid_data ................... ['/root/CogVideo/sat/datasets/test'] [2024-09-09 16:39:48,963] [INFO] [RANK 0] test_data .................... None [2024-09-09 16:39:48,963] [INFO] [RANK 0] split ........................ 1,0,0 [2024-09-09 16:39:48,963] [INFO] [RANK 0] num_workers .................. 8 [2024-09-09 16:39:48,963] [INFO] [RANK 0] block_size ................... 10000 [2024-09-09 16:39:48,963] [INFO] [RANK 0] prefetch_factor .............. 4 [2024-09-09 16:39:48,963] [INFO] [RANK 0] deepspeed .................... True [2024-09-09 16:39:48,963] [INFO] [RANK 0] deepspeed_config ............. 
{'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False} [2024-09-09 16:39:48,963] [INFO] [RANK 0] deepscale .................... False [2024-09-09 16:39:48,963] [INFO] [RANK 0] deepscale_config ............. None [2024-09-09 16:39:48,963] [INFO] [RANK 0] model_config ................. {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 
4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}} [2024-09-09 16:39:48,963] [INFO] [RANK 0] data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}} [2024-09-09 16:39:48,963] [INFO] [RANK 0] cuda ......................... True [2024-09-09 16:39:48,963] [INFO] [RANK 0] rank ......................... 0 [2024-09-09 16:39:48,963] [INFO] [RANK 0] world_size ................... 1 [2024-09-09 16:39:48,964] [INFO] [RANK 0] deepspeed_activation_checkpointing True [2024-09-09 16:39:48,964] [INFO] [RANK 0] master_ip .................... localhost [2024-09-09 16:39:48,964] [INFO] [RANK 0] master_port .................. 39375 [2024-09-09 16:39:48,964] [INFO] [RANK 0] log_config ................... 
[{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 
'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'lora-test', 'mode': 'finetune', 'load': '/root/CogVideo/CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 100, 'eval_iters': 1, 'eval_interval': 10, 'eval_batch_size': 1, 'save': 'ckpts_2b_lora', 'save_interval': 50, 'log_interval': 20, 'train_data': ['/root/CogVideo/sat/datasets/test'], 'valid_data': ['/root/CogVideo/sat/datasets/test'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}] [2024-09-09 16:39:48,964] [INFO] [RANK 0] do_train ..................... True [2024-09-09 16:39:48,964] [INFO] [RANK 0] val_last_shape ............... [] [2024-09-09 16:39:48,964] [INFO] [RANK 0] val_drop_number .............. 0 [2024-09-09 16:39:48,964] [INFO] [RANK 0] do_valid ..................... True [2024-09-09 16:39:48,964] [INFO] [RANK 0] do_test ...................... False [2024-09-09 16:39:48,964] [INFO] [RANK 0] iteration .................... 0 [2024-09-09 16:40:39,276] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information [2024-09-09 16:40:39,276] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False [2024-09-09 16:40:39,276] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers [2024-09-09 16:40:39,276] [INFO] [checkpointing.py:545:forward] ----Synchronization False [2024-09-09 16:40:39,276] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False [2024-09-09 16:40:49,239] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648 [2024-09-09 16:41:14,908] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 2147483648, reducing to 1073741824
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/CogVideo/sat/train_video.py", line 226, in <module>
[rank0]:     training_main(
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 157, in training_main
[rank0]:     iteration, skipped = train(model, optimizer,
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 359, in train
[rank0]:     lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
[rank0]:                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 443, in train_step
[rank0]:     forward_ret = forward_step(data_iterator, model, args, timers, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/CogVideo/sat/train_video.py", line 176, in forward_step
[rank0]:     batch = next(data_iterator)
[rank0]:             ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]:     data = self._next_data()
[rank0]:            ^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
[rank0]:     return self._process_data(data)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
[rank0]:     raise exception
[rank0]: ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 2.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:             ~~~~~~~~~~~~^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 360, in __getitem__
[rank0]:     return self.wrapped_data[index]
[rank0]:            ~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 342, in __getitem__
[rank0]:     return self.datasets[dataset_idx][sample_idx]
[rank0]:            ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
[rank0]:   File "/root/CogVideo/sat/data_video.py", line 411, in __getitem__
[rank0]:     indices = np.arange(start, end, (end - start) // num_frames).astype(int)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ZeroDivisionError: division by zero
```
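The `ZeroDivisionError` is a separate problem from the loss-scale overflow above it: at `data_video.py` line 411, the sampling stride `(end - start) // num_frames` truncates to 0 whenever the usable span of the video (its frame count minus `skip_frms_num` at each end) is shorter than `num_frames`, and `np.arange` raises `ZeroDivisionError` for a zero step, as the traceback shows. Below is a minimal sketch of a guard; the helper name and the exact start/end computation are illustrative assumptions, not the repo's code:

```python
import numpy as np

def sample_frame_indices(ori_vlen: int, num_frames: int, skip_frms_num: float) -> np.ndarray:
    """Pick num_frames indices from a video of ori_vlen frames, skipping
    skip_frms_num frames at each end (illustrative helper; assumes
    ori_vlen > 2 * skip_frms_num, shorter clips should be dropped upstream)."""
    start = int(skip_frms_num)
    end = int(ori_vlen - skip_frms_num)
    # (end - start) // num_frames truncates to 0 for clips shorter than
    # num_frames; clamp the step so np.arange never sees a zero stride.
    step = max((end - start) // num_frames, 1)
    indices = np.arange(start, end, step).astype(int)
    # Short clips yield fewer than num_frames indices; repeat the last
    # frame so every sample has a fixed temporal length.
    if len(indices) < num_frames:
        indices = np.pad(indices, (0, num_frames - len(indices)), mode="edge")
    return indices[:num_frames]
```

Padding is only a stopgap; the cleaner fix is to filter out clips that are too short when building the dataset.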
[2nd Trial] Selecting videos with no more than 50 frames

```
(cogvideo) root@alphacode-ttv-a100-80g-gpu:~/CogVideo/sat# bash finetune_single_gpu.sh
RUN on alphacode-ttv-a100-80g-gpu, CUDA_VISIBLE_DEVICES=
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 5243
[2024-09-09 16:57:30,500] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @autocast_custom_fwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @autocast_custom_bwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
[2024-09-09 16:57:35,259] [INFO] using world size: 1
[2024-09-09 16:57:35,259] [INFO] Will override arguments with manually specified deepspeed_config!
[W909 16:57:35.100558963 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [ip6-localhost]:57495 (errno: 97 - Address family not supported by protocol).
[W909 16:57:35.104642776 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [alphacode-ttv-a100-80g-gpu]:57495 (errno: 97 - Address family not supported by protocol).
[2024-09-09 16:57:35,282] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-09-09 16:57:35,283] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-09 16:57:35,516] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-09 16:57:44,744] [WARNING] [RANK 0] Failed to load bitsandbytes:No module named 'bitsandbytes' [2024-09-09 16:57:44,744] [INFO] [RANK 0] replacing layer 0 attention with lora [2024-09-09 16:57:44,781] [INFO] [RANK 0] replacing layer 1 attention with lora [2024-09-09 16:57:44,816] [INFO] [RANK 0] replacing layer 2 attention with lora [2024-09-09 16:57:44,841] [INFO] [RANK 0] replacing layer 3 attention with lora [2024-09-09 16:57:44,863] [INFO] [RANK 0] replacing layer 4 attention with lora [2024-09-09 16:57:44,885] [INFO] [RANK 0] replacing layer 5 attention with lora [2024-09-09 16:57:44,907] [INFO] [RANK 0] replacing layer 6 attention with lora [2024-09-09 16:57:44,982] [INFO] [RANK 0] replacing layer 7 attention with lora [2024-09-09 16:57:45,090] [INFO] [RANK 0] replacing layer 8 attention with lora [2024-09-09 16:57:45,159] [INFO] [RANK 0] replacing layer 9 attention with lora [2024-09-09 16:57:45,273] [INFO] [RANK 0] replacing layer 10 attention with lora [2024-09-09 16:57:45,422] [INFO] [RANK 0] replacing layer 11 attention with lora [2024-09-09 16:57:45,550] [INFO] [RANK 0] replacing layer 12 attention with lora [2024-09-09 16:57:45,658] [INFO] [RANK 0] replacing layer 13 attention with lora [2024-09-09 16:57:45,774] [INFO] [RANK 0] replacing layer 14 attention with lora [2024-09-09 16:57:45,905] [INFO] [RANK 0] replacing layer 15 attention with lora [2024-09-09 16:57:46,027] [INFO] [RANK 0] replacing layer 16 attention with lora [2024-09-09 16:57:46,102] [INFO] [RANK 0] replacing layer 17 attention with lora [2024-09-09 16:57:46,195] [INFO] [RANK 0] replacing layer 18 attention with lora [2024-09-09 16:57:46,302] [INFO] [RANK 0] replacing layer 19 attention with lora [2024-09-09 16:57:46,347] [INFO] [RANK 0] replacing layer 20 attention with lora [2024-09-09 16:57:46,375] [INFO] [RANK 0] replacing layer 21 attention with lora [2024-09-09 16:57:46,397] [INFO] [RANK 0] replacing layer 22 attention with lora [2024-09-09 16:57:46,419] [INFO] [RANK 0] replacing layer 23 attention with lora [2024-09-09 16:57:46,440] [INFO] [RANK 0] replacing layer 24 attention with lora [2024-09-09 16:57:46,461] [INFO] [RANK 0] replacing layer 25 attention with lora [2024-09-09 16:57:46,483] [INFO] [RANK 0] replacing layer 26 attention with lora [2024-09-09 16:57:46,504] [INFO] [RANK 0] replacing layer 27 attention with lora [2024-09-09 16:57:46,526] [INFO] [RANK 0] replacing layer 28 attention with lora [2024-09-09 16:57:46,547] [INFO] [RANK 0] replacing layer 29 attention with lora Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.01s/it] Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False Working with z of shape (1, 16, 32, 32) = 16384 dimensions. /root/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. sd = torch.load(path, map_location="cpu")["state_dict"] Deleting key loss.logvar from state_dict. Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict. Deleting key loss.perceptual_loss.scaling_layer.scale from state_dict. Deleting key loss.perceptual_loss.net.slice1.0.weight from state_dict. Deleting key loss.perceptual_loss.net.slice1.0.bias from state_dict. Deleting key loss.perceptual_loss.net.slice1.2.weight from state_dict. Deleting key loss.perceptual_loss.net.slice1.2.bias from state_dict. Deleting key loss.perceptual_loss.net.slice2.5.weight from state_dict. Deleting key loss.perceptual_loss.net.slice2.5.bias from state_dict. Deleting key loss.perceptual_loss.net.slice2.7.weight from state_dict. Deleting key loss.perceptual_loss.net.slice2.7.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.10.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.10.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.12.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.12.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.14.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.14.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.17.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.17.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.19.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.19.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.21.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.21.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.24.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.24.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.26.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.26.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.28.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.28.bias from state_dict. Deleting key loss.perceptual_loss.lin0.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin1.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin2.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin3.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin4.model.1.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample.conv.bias from state_dict. 
Deleting key loss.discriminator.blocks.1.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.downsample.1.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.downsample.1.bias from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.net.0.bias from state_dict. 
Deleting key loss.discriminator.blocks.5.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.downsample.1.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.downsample.1.bias from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.to_logits.0.weight from state_dict. Deleting key loss.discriminator.to_logits.0.bias from state_dict. Deleting key loss.discriminator.to_logits.3.weight from state_dict. Deleting key loss.discriminator.to_logits.3.bias from state_dict. Missing keys: [] Unexpected keys: [] Restored from /root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt [2024-09-09 16:57:50,806] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 6764790755 [2024-09-09 16:58:00,971] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt /root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/model_io.py:286: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. sd = torch.load(checkpoint_name, map_location='cpu') [2024-09-09 16:58:02,528] [INFO] [RANK 0] > successfully loaded /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt [2024-09-09 16:58:03,506] [INFO] [RANK 0] ***** Total trainable parameters: 58982400 ***** [2024-09-09 16:58:03,506] [INFO] [RANK 0] [, , ] is set to no_weight_decay [2024-09-09 16:58:03,509] [INFO] [RANK 0] Syncing initialized parameters... [2024-09-09 16:58:03,623] [INFO] [RANK 0] Finished syncing initialized parameters. [2024-09-09 16:58:03,624] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat. [2024-09-09 16:58:03,624] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown [2024-09-09 16:58:03,625] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead [2024-09-09 16:58:03,717] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /root/.cache/torch_extensions/py312_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py312_cu121/fused_ema_adam/build.ninja... /root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( Building extension module fused_ema_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_ema_adam... 
Time to load fused_ema_adam op: 0.6912670135498047 seconds [2024-09-09 16:58:04,567] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer [2024-09-09 16:58:04,567] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2024-09-09 16:58:04,587] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam [2024-09-09 16:58:04,587] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type= [2024-09-09 16:58:04,587] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution ***** [2024-09-09 16:58:04,587] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer [2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000 [2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000 [2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False [2024-09-09 16:58:04,587] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False [2024-09-09 16:58:06,802] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2024-09-09 16:58:06,803] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.97 GB CA 13.23 GB Max_CA 13 GB [2024-09-09 16:58:06,803] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 32.85 GB, percent = 1.7% [2024-09-09 16:58:07,025] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2024-09-09 16:58:07,025] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 13.08 GB CA 13.45 GB Max_CA 13 GB [2024-09-09 16:58:07,025] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 32.71 GB, percent = 1.7% [2024-09-09 16:58:07,025] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized [2024-09-09 16:58:07,246] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2024-09-09 16:58:07,246] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 13.45 GB Max_CA 13 GB [2024-09-09 16:58:07,246] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 32.93 GB, percent = 1.7% [2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None [2024-09-09 16:58:07,251] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]] [2024-09-09 16:58:07,254] [INFO] [config.py:997:print] DeepSpeedEngine configuration: [2024-09-09 16:58:07,254] [INFO] [config.py:1001:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-09-09 16:58:07,254] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2024-09-09 16:58:07,254] [INFO] [config.py:1001:print] amp_enabled .................. False [2024-09-09 16:58:07,254] [INFO] [config.py:1001:print] amp_params ................... 
False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] bfloat16_enabled ............. False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] comms_config ................. [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] communication_data_type ...... None [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] dataloader_drop_last ......... False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] disable_allgather ............ False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] dump_state ................... False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... 
None [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1 [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0 [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100 [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06 [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01 [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] elasticity_enabled ........... False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] fp16_auto_cast ............... False [2024-09-09 16:58:07,255] [INFO] [config.py:1001:print] fp16_enabled ................. True [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] global_rank .................. 0 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] grad_accum_dtype ............. None [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] gradient_clipping ............ 0.1 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] graph_harvesting ............. False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 65536 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] load_universal_checkpoint .... False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] loss_scale ................... 0 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] memory_breakdown ............. False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] mics_shard_size .............. -1 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... 
False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] optimizer_name ............... None [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] optimizer_params ............. None [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] pld_enabled .................. False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] pld_params ................... False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] prescale_gradients ........... False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] scheduler_name ............... None [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] scheduler_params ............. None [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] sparse_attention ............. None [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] steps_per_print .............. 50 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] train_batch_size ............. 2 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 2 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] use_node_local_storage ....... False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] weight_quantization_config ... None [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] world_size ................... 1 [2024-09-09 16:58:07,256] [INFO] [config.py:1001:print] zero_allow_untested_optimizer True [2024-09-09 16:58:07,257] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-09-09 16:58:07,257] [INFO] [config.py:1001:print] zero_enabled ................. True [2024-09-09 16:58:07,257] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True [2024-09-09 16:58:07,257] [INFO] [config.py:1001:print] zero_optimization_stage ...... 
2 [2024-09-09 16:58:07,257] [INFO] [config.py:987:print_user_config] json = { "train_micro_batch_size_per_gpu": 2, "gradient_accumulation_steps": 1, "steps_per_print": 50, "gradient_clipping": 0.1, "zero_optimization": { "stage": 2, "cpu_offload": false, "contiguous_gradients": false, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "allgather_bucket_size": 1.000000e+09, "load_from_fp32_weights": false }, "zero_allow_untested_optimizer": true, "bf16": { "enabled": false }, "fp16": { "enabled": true }, "loss_scale": 0, "loss_scale_window": 400, "hysteresis": 2, "min_loss_scale": 1, "activation_checkpointing": { "partition_activations": false, "contiguous_memory_optimization": false }, "wall_clock_breakdown": false } [2024-09-09 16:58:07,257] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0 [2024-09-09 16:58:07,257] [INFO] [RANK 0] Finetuning Model... [2024-09-09 16:58:07,257] [INFO] [RANK 0] arguments: [2024-09-09 16:58:07,257] [INFO] [RANK 0] base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml'] [2024-09-09 16:58:07,257] [INFO] [RANK 0] model_parallel_size .......... 1 [2024-09-09 16:58:07,257] [INFO] [RANK 0] force_pretrain ............... False [2024-09-09 16:58:07,257] [INFO] [RANK 0] device ....................... 0 [2024-09-09 16:58:07,257] [INFO] [RANK 0] debug ........................ False [2024-09-09 16:58:07,257] [INFO] [RANK 0] log_image .................... True [2024-09-09 16:58:07,257] [INFO] [RANK 0] output_dir ................... samples [2024-09-09 16:58:07,257] [INFO] [RANK 0] input_dir .................... None [2024-09-09 16:58:07,257] [INFO] [RANK 0] input_type ................... cli [2024-09-09 16:58:07,257] [INFO] [RANK 0] input_file ................... input.txt [2024-09-09 16:58:07,257] [INFO] [RANK 0] final_size ................... 2048 [2024-09-09 16:58:07,257] [INFO] [RANK 0] sdedit ....................... False [2024-09-09 16:58:07,257] [INFO] [RANK 0] grid_num_rows ................ 1 [2024-09-09 16:58:07,257] [INFO] [RANK 0] force_inference .............. False [2024-09-09 16:58:07,257] [INFO] [RANK 0] lcm_steps .................... None [2024-09-09 16:58:07,257] [INFO] [RANK 0] sampling_num_frames .......... 32 [2024-09-09 16:58:07,257] [INFO] [RANK 0] sampling_fps ................. 8 [2024-09-09 16:58:07,258] [INFO] [RANK 0] only_save_latents ............ False [2024-09-09 16:58:07,258] [INFO] [RANK 0] only_log_video_latents ....... True [2024-09-09 16:58:07,258] [INFO] [RANK 0] latent_channels .............. 32 [2024-09-09 16:58:07,258] [INFO] [RANK 0] image2video .................. False [2024-09-09 16:58:07,258] [INFO] [RANK 0] experiment_name .............. lora-test-09-09-16-57 [2024-09-09 16:58:07,258] [INFO] [RANK 0] train_iters .................. 100 [2024-09-09 16:58:07,258] [INFO] [RANK 0] batch_size ................... 2 [2024-09-09 16:58:07,258] [INFO] [RANK 0] lr ........................... 0.001 [2024-09-09 16:58:07,258] [INFO] [RANK 0] mode ......................... finetune [2024-09-09 16:58:07,258] [INFO] [RANK 0] seed ......................... 5243 [2024-09-09 16:58:07,258] [INFO] [RANK 0] zero_stage ................... 0 [2024-09-09 16:58:07,258] [INFO] [RANK 0] checkpoint_activations ....... True [2024-09-09 16:58:07,258] [INFO] [RANK 0] checkpoint_num_layers ........ 1 [2024-09-09 16:58:07,258] [INFO] [RANK 0] checkpoint_skip_layers ....... 0 [2024-09-09 16:58:07,258] [INFO] [RANK 0] fp16 ......................... 
True [2024-09-09 16:58:07,258] [INFO] [RANK 0] bf16 ......................... False [2024-09-09 16:58:07,258] [INFO] [RANK 0] gradient_accumulation_steps .. 1 [2024-09-09 16:58:07,258] [INFO] [RANK 0] profiling .................... -1 [2024-09-09 16:58:07,258] [INFO] [RANK 0] epochs ....................... None [2024-09-09 16:58:07,258] [INFO] [RANK 0] log_interval ................. 20 [2024-09-09 16:58:07,258] [INFO] [RANK 0] summary_dir .................. [2024-09-09 16:58:07,258] [INFO] [RANK 0] save_args .................... False [2024-09-09 16:58:07,258] [INFO] [RANK 0] lr_decay_iters ............... None [2024-09-09 16:58:07,258] [INFO] [RANK 0] lr_decay_style ............... linear [2024-09-09 16:58:07,258] [INFO] [RANK 0] lr_decay_ratio ............... 0.1 [2024-09-09 16:58:07,258] [INFO] [RANK 0] warmup ....................... 0.01 [2024-09-09 16:58:07,258] [INFO] [RANK 0] weight_decay ................. 0.0001 [2024-09-09 16:58:07,258] [INFO] [RANK 0] save ......................... ckpts_2b_lora/lora-test-09-09-16-57 [2024-09-09 16:58:07,258] [INFO] [RANK 0] load ......................... /root/CogVideo/CogVideoX-2b-sat/transformer [2024-09-09 16:58:07,258] [INFO] [RANK 0] force_train .................. True [2024-09-09 16:58:07,258] [INFO] [RANK 0] save_interval ................ 50 [2024-09-09 16:58:07,258] [INFO] [RANK 0] no_save_rng .................. False [2024-09-09 16:58:07,258] [INFO] [RANK 0] no_load_rng .................. True [2024-09-09 16:58:07,259] [INFO] [RANK 0] resume_dataloader ............ False [2024-09-09 16:58:07,259] [INFO] [RANK 0] distributed_backend .......... nccl [2024-09-09 16:58:07,259] [INFO] [RANK 0] local_rank ................... 0 [2024-09-09 16:58:07,259] [INFO] [RANK 0] exit_interval ................ None [2024-09-09 16:58:07,259] [INFO] [RANK 0] wandb ........................ False [2024-09-09 16:58:07,259] [INFO] [RANK 0] wandb_project_name ........... default_project [2024-09-09 16:58:07,259] [INFO] [RANK 0] eval_batch_size .............. 1 [2024-09-09 16:58:07,259] [INFO] [RANK 0] eval_iters ................... 1 [2024-09-09 16:58:07,259] [INFO] [RANK 0] eval_interval ................ 10 [2024-09-09 16:58:07,259] [INFO] [RANK 0] strict_eval .................. False [2024-09-09 16:58:07,259] [INFO] [RANK 0] train_data ................... ['/root/CogVideo/sat/datasets/test'] [2024-09-09 16:58:07,259] [INFO] [RANK 0] train_data_weights ........... None [2024-09-09 16:58:07,259] [INFO] [RANK 0] iterable_dataset ............. False [2024-09-09 16:58:07,259] [INFO] [RANK 0] iterable_dataset_eval ........ [2024-09-09 16:58:07,259] [INFO] [RANK 0] batch_from_same_dataset ...... False [2024-09-09 16:58:07,259] [INFO] [RANK 0] valid_data ................... ['/root/CogVideo/sat/datasets/test'] [2024-09-09 16:58:07,259] [INFO] [RANK 0] test_data .................... None [2024-09-09 16:58:07,259] [INFO] [RANK 0] split ........................ 1,0,0 [2024-09-09 16:58:07,259] [INFO] [RANK 0] num_workers .................. 8 [2024-09-09 16:58:07,259] [INFO] [RANK 0] block_size ................... 10000 [2024-09-09 16:58:07,259] [INFO] [RANK 0] prefetch_factor .............. 4 [2024-09-09 16:58:07,259] [INFO] [RANK 0] deepspeed .................... True [2024-09-09 16:58:07,259] [INFO] [RANK 0] deepspeed_config ............. 
{'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False} [2024-09-09 16:58:07,259] [INFO] [RANK 0] deepscale .................... False [2024-09-09 16:58:07,259] [INFO] [RANK 0] deepscale_config ............. None [2024-09-09 16:58:07,260] [INFO] [RANK 0] model_config ................. {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 
4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}} [2024-09-09 16:58:07,260] [INFO] [RANK 0] data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}} [2024-09-09 16:58:07,260] [INFO] [RANK 0] cuda ......................... True [2024-09-09 16:58:07,260] [INFO] [RANK 0] rank ......................... 0 [2024-09-09 16:58:07,260] [INFO] [RANK 0] world_size ................... 1 [2024-09-09 16:58:07,260] [INFO] [RANK 0] deepspeed_activation_checkpointing True [2024-09-09 16:58:07,260] [INFO] [RANK 0] master_ip .................... localhost [2024-09-09 16:58:07,260] [INFO] [RANK 0] master_port .................. 57495 [2024-09-09 16:58:07,260] [INFO] [RANK 0] log_config ................... 
[{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 
'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'lora-test', 'mode': 'finetune', 'load': '/root/CogVideo/CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 100, 'eval_iters': 1, 'eval_interval': 10, 'eval_batch_size': 1, 'save': 'ckpts_2b_lora', 'save_interval': 50, 'log_interval': 20, 'train_data': ['/root/CogVideo/sat/datasets/test'], 'valid_data': ['/root/CogVideo/sat/datasets/test'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}] [2024-09-09 16:58:07,260] [INFO] [RANK 0] do_train ..................... True [2024-09-09 16:58:07,260] [INFO] [RANK 0] val_last_shape ............... [] [2024-09-09 16:58:07,260] [INFO] [RANK 0] val_drop_number .............. 0 [2024-09-09 16:58:07,260] [INFO] [RANK 0] do_valid ..................... True [2024-09-09 16:58:07,260] [INFO] [RANK 0] do_test ...................... False [2024-09-09 16:58:07,260] [INFO] [RANK 0] iteration .................... 0 [2024-09-09 16:58:56,248] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information [2024-09-09 16:58:56,248] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False [2024-09-09 16:58:56,248] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers [2024-09-09 16:58:56,248] [INFO] [checkpointing.py:545:forward] ----Synchronization False [2024-09-09 16:58:56,248] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False [2024-09-09 16:59:06,008] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648 [2024-09-09 16:59:29,703] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824 [2024-09-09 16:59:53,115] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912 [2024-09-09 17:01:04,649] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456 [2024-09-09 17:01:51,938] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. 
Attempted loss scale: 268435456, reducing to 134217728 /root/CogVideo/sat/train_video.py:67: DeprecationWarning: torch.get_autocast_gpu_dtype() is deprecated. Please use torch.get_autocast_dtype('cuda') instead. (Triggered internally at ../torch/csrc/autograd/init.cpp:733.) "dtype": torch.get_autocast_gpu_dtype(), /root/CogVideo/sat/train_video.py:70: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead. with torch.no_grad(), torch.cuda.amp.autocast(**gpu_autocast_kwargs): ############################## Sampling setting ############################## Sampler: VPSDEDPMPP2MSampler Discretization: ZeroSNRDDPMDiscretization Guider: DynamicCFG Sampling with VPSDEDPMPP2MSampler for 51 steps: 98%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 50/51 [01:24<00:01, 1.69s/it] [2024-09-09 17:04:19,474] [INFO] [RANK 0] ---------------------------------------------------------------------------------------------------- [2024-09-09 17:04:19,474] [INFO] [RANK 0] ---------------------------------------------------------------------------------------------- [2024-09-09 17:04:19,474] [INFO] [RANK 0] validation loss at iteration 10 | loss: 1.002032E-01 | PPL: 1.105395E+00 loss 1.002032E-01 | [2024-09-09 17:04:19,474] [INFO] [RANK 0] ---------------------------------------------------------------------------------------------- [2024-09-09 17:05:49,038] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864 [rank0]: Traceback (most recent call last): [rank0]: File "/root/CogVideo/sat/train_video.py", line 226, in [rank0]: training_main( [rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 157, in training_main [rank0]: iteration, skipped = train(model, optimizer, [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 359, in train [rank0]: lm_loss, skipped_iter, metrics = train_step(train_data_iterator, [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 443, in train_step [rank0]: forward_ret = forward_step(data_iterator, model, args, timers, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/root/CogVideo/sat/train_video.py", line 176, in forward_step [rank0]: batch = next(data_iterator) [rank0]: ^^^^^^^^^^^^^^^^^^^ [rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__ [rank0]: data = self._next_data() [rank0]: ^^^^^^^^^^^^^^^^^ [rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data [rank0]: return self._process_data(data) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data [rank0]: data.reraise() [rank0]: File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise [rank0]: raise exception 
[rank0]: ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 7.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:             ~~~~~~~~~~~~^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 360, in __getitem__
[rank0]:     return self.wrapped_data[index]
[rank0]:            ~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 342, in __getitem__
[rank0]:     return self.datasets[dataset_idx][sample_idx]
[rank0]:            ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
[rank0]:   File "/root/CogVideo/sat/data_video.py", line 411, in __getitem__
[rank0]:     indices = np.arange(start, end, (end - start) // num_frames).astype(int)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ZeroDivisionError: division by zero
```
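The ZeroDivisionError above is separate from the overflow messages. In sat/data_video.py (line 411), the sampling step (end - start) // num_frames floors to 0 whenever the decoded clip spans fewer than num_frames frames after skip_frms_num frames are trimmed from each end, and np.arange raises ZeroDivisionError on a zero step. With this config (max_num_frames: 49, skip_frms_num: 3.0), any video shorter than 49 + 2 * 3 = 55 frames should trigger it. Below is a minimal sketch of a guard; it is illustrative only, not the repo's actual code, and sample_frame_indices is a made-up helper name.

```
# Sketch of a defensive version of the frame sampling in SFTDataset.__getitem__
# (sat/data_video.py). Hypothetical helper, not the repository's implementation.
import numpy as np

def sample_frame_indices(start: int, end: int, num_frames: int) -> np.ndarray:
    """Pick num_frames frame indices from the half-open range [start, end)."""
    span = end - start
    if span <= 0:
        raise ValueError(f"empty clip after trimming: start={start}, end={end}")
    if span < num_frames:
        # Clip is shorter than requested: take every frame and pad with the
        # last one, instead of letting (end - start) // num_frames hit zero.
        indices = np.arange(start, end)
        pad = np.full(num_frames - span, end - 1)
        return np.concatenate([indices, pad]).astype(int)
    step = span // num_frames  # >= 1 here, so np.arange gets a valid step
    return np.arange(start, start + step * num_frames, step).astype(int)

# A 37-frame clip no longer crashes; it is padded up to 49 indices:
print(len(sample_frame_indices(3, 40, 49)))  # -> 49
```

Alternatively, filtering videos shorter than roughly 55 frames out of the dataset directory avoids the crash with no code change.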
[3rd Trial] Reduced train_micro_batch_size_per_gpu from 2 to 1:

```
(cogvideo) root@alphacode-ttv-a100-80g-gpu:~/CogVideo/sat# bash finetune_single_gpu.sh
RUN on alphacode-ttv-a100-80g-gpu, CUDA_VISIBLE_DEVICES=0
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 27481
[2024-09-10 13:30:54,235] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @autocast_custom_fwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @autocast_custom_bwd
/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
[2024-09-10 13:30:59,512] [INFO] using world size: 1
[2024-09-10 13:30:59,512] [INFO] Will override arguments with manually specified deepspeed_config!
[W910 13:30:59.341356778 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [ip6-localhost]:44107 (errno: 97 - Address family not supported by protocol).
[W910 13:30:59.342068481 socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [alphacode-ttv-a100-80g-gpu]:44107 (errno: 97 - Address family not supported by protocol).
[2024-09-10 13:30:59,519] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-09-10 13:30:59,520] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-10 13:30:59,755] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-10 13:31:08,092] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-09-10 13:31:08,207] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-09-10 13:31:08,324] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-09-10 13:31:08,439] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-09-10 13:31:08,558] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-09-10 13:31:08,691] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-09-10 13:31:08,819] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-09-10 13:31:08,939] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-09-10 13:31:09,068] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-09-10 13:31:09,184] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-09-10 13:31:09,238] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-09-10 13:31:09,258] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-09-10 13:31:09,280] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-09-10 13:31:09,302] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-09-10 13:31:09,324] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-09-10 13:31:09,346] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-09-10 13:31:09,368] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-09-10 13:31:09,389] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-09-10 13:31:09,445] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-09-10 13:31:09,471] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-09-10 13:31:09,494] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-09-10 13:31:09,513] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-09-10 13:31:09,532] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-09-10 13:31:09,551] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-09-10 13:31:09,570] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-09-10 13:31:09,589] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-09-10 13:31:09,609] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-09-10 13:31:09,661] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-09-10 13:31:09,682] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-09-10 13:31:09,705] [INFO] [RANK 0] replacing layer 29 attention with lora
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.13it/s]
Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
/root/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling.
Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. sd = torch.load(path, map_location="cpu")["state_dict"] Deleting key loss.logvar from state_dict. Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict. Deleting key loss.perceptual_loss.scaling_layer.scale from state_dict. Deleting key loss.perceptual_loss.net.slice1.0.weight from state_dict. Deleting key loss.perceptual_loss.net.slice1.0.bias from state_dict. Deleting key loss.perceptual_loss.net.slice1.2.weight from state_dict. Deleting key loss.perceptual_loss.net.slice1.2.bias from state_dict. Deleting key loss.perceptual_loss.net.slice2.5.weight from state_dict. Deleting key loss.perceptual_loss.net.slice2.5.bias from state_dict. Deleting key loss.perceptual_loss.net.slice2.7.weight from state_dict. Deleting key loss.perceptual_loss.net.slice2.7.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.10.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.10.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.12.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.12.bias from state_dict. Deleting key loss.perceptual_loss.net.slice3.14.weight from state_dict. Deleting key loss.perceptual_loss.net.slice3.14.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.17.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.17.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.19.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.19.bias from state_dict. Deleting key loss.perceptual_loss.net.slice4.21.weight from state_dict. Deleting key loss.perceptual_loss.net.slice4.21.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.24.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.24.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.26.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.26.bias from state_dict. Deleting key loss.perceptual_loss.net.slice5.28.weight from state_dict. Deleting key loss.perceptual_loss.net.slice5.28.bias from state_dict. Deleting key loss.perceptual_loss.lin0.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin1.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin2.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin3.model.1.weight from state_dict. Deleting key loss.perceptual_loss.lin4.model.1.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.0.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.0.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.downsample_res.conv.weight from state_dict. 
Deleting key loss.discriminator.blocks.1.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.1.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.1.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.2.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.2.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.downsample_res.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.downsample_res.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.net.0.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.net.0.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.net.2.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.net.2.conv.bias from state_dict. Deleting key loss.discriminator.blocks.3.downsample.conv.weight from state_dict. Deleting key loss.discriminator.blocks.3.downsample.conv.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.4.0.downsample.1.weight from state_dict. Deleting key loss.discriminator.blocks.4.0.downsample.1.bias from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.5.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.net.2.bias from state_dict. 
Deleting key loss.discriminator.blocks.5.0.downsample.1.weight from state_dict. Deleting key loss.discriminator.blocks.5.0.downsample.1.bias from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.conv_res.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.conv_res.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.6.0.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.6.0.net.2.bias from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_q.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_kv.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_out.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.norm.gamma from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.bias from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.weight from state_dict. Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.bias from state_dict. Deleting key loss.discriminator.to_logits.0.weight from state_dict. Deleting key loss.discriminator.to_logits.0.bias from state_dict. Deleting key loss.discriminator.to_logits.3.weight from state_dict. Deleting key loss.discriminator.to_logits.3.bias from state_dict. Missing keys: [] Unexpected keys: [] Restored from /root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt [2024-09-10 13:31:15,450] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 6764790755 [2024-09-10 13:31:26,160] [INFO] [RANK 0] global rank 0 is loading checkpoint /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt /root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/model_io.py:286: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. 
Please open an issue on GitHub for any issues related to this experimental feature. sd = torch.load(checkpoint_name, map_location='cpu') [2024-09-10 13:31:27,666] [INFO] [RANK 0] > successfully loaded /root/CogVideo/CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt [2024-09-10 13:31:28,191] [INFO] [RANK 0] ***** Total trainable parameters: 58982400 ***** [2024-09-10 13:31:28,191] [INFO] [RANK 0] [, , ] is set to no_weight_decay [2024-09-10 13:31:28,194] [INFO] [RANK 0] Syncing initialized parameters... [2024-09-10 13:31:28,302] [INFO] [RANK 0] Finished syncing initialized parameters. [2024-09-10 13:31:28,302] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat. [2024-09-10 13:31:28,302] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown [2024-09-10 13:31:28,303] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead [2024-09-10 13:31:28,390] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /root/.cache/torch_extensions/py312_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py312_cu121/fused_ema_adam/build.ninja... /root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( Building extension module fused_ema_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_ema_adam... 
Time to load fused_ema_adam op: 0.7197697162628174 seconds [2024-09-10 13:31:29,264] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer [2024-09-10 13:31:29,264] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2024-09-10 13:31:29,284] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam [2024-09-10 13:31:29,284] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type= [2024-09-10 13:31:29,284] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution ***** [2024-09-10 13:31:29,284] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer [2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000 [2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000 [2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False [2024-09-10 13:31:29,284] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False [2024-09-10 13:31:31,672] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2024-09-10 13:31:31,673] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.97 GB CA 13.23 GB Max_CA 13 GB [2024-09-10 13:31:31,673] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 28.67 GB, percent = 1.5% [2024-09-10 13:31:31,880] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2024-09-10 13:31:31,880] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 13.08 GB CA 13.45 GB Max_CA 13 GB [2024-09-10 13:31:31,880] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 28.41 GB, percent = 1.5% [2024-09-10 13:31:31,880] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized [2024-09-10 13:31:32,107] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2024-09-10 13:31:32,107] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 13.45 GB Max_CA 13 GB [2024-09-10 13:31:32,107] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 28.58 GB, percent = 1.5% [2024-09-10 13:31:32,111] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2024-09-10 13:31:32,112] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2024-09-10 13:31:32,112] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None [2024-09-10 13:31:32,112] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]] [2024-09-10 13:31:32,114] [INFO] [config.py:997:print] DeepSpeedEngine configuration: [2024-09-10 13:31:32,114] [INFO] [config.py:1001:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] amp_enabled .................. False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] amp_params ................... 
False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] bfloat16_enabled ............. False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] comms_config ................. [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] communication_data_type ...... None [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] dataloader_drop_last ......... False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] disable_allgather ............ False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] dump_state ................... False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... 
None [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1 [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0 [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100 [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06 [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01 [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] elasticity_enabled ........... False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] fp16_auto_cast ............... False [2024-09-10 13:31:32,115] [INFO] [config.py:1001:print] fp16_enabled ................. True [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] global_rank .................. 0 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] grad_accum_dtype ............. None [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] gradient_clipping ............ 0.1 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] graph_harvesting ............. False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 65536 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] load_universal_checkpoint .... False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] loss_scale ................... 0 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] memory_breakdown ............. False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] mics_shard_size .............. -1 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... 
False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] optimizer_name ............... None [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] optimizer_params ............. None [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] pld_enabled .................. False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] pld_params ................... False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] prescale_gradients ........... False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] scheduler_name ............... None [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] scheduler_params ............. None [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] sparse_attention ............. None [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] steps_per_print .............. 50 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] train_batch_size ............. 1 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 1 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] use_data_before_expert_parallel_ False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] use_node_local_storage ....... False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] weight_quantization_config ... None [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] world_size ................... 1 [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] zero_allow_untested_optimizer True [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] zero_enabled ................. True [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True [2024-09-10 13:31:32,116] [INFO] [config.py:1001:print] zero_optimization_stage ...... 
2 [2024-09-10 13:31:32,117] [INFO] [config.py:987:print_user_config] json = { "train_micro_batch_size_per_gpu": 1, "gradient_accumulation_steps": 1, "steps_per_print": 50, "gradient_clipping": 0.1, "zero_optimization": { "stage": 2, "cpu_offload": false, "contiguous_gradients": false, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "allgather_bucket_size": 1.000000e+09, "load_from_fp32_weights": false }, "zero_allow_untested_optimizer": true, "bf16": { "enabled": false }, "fp16": { "enabled": true }, "loss_scale": 0, "loss_scale_window": 400, "hysteresis": 2, "min_loss_scale": 1, "activation_checkpointing": { "partition_activations": false, "contiguous_memory_optimization": false }, "wall_clock_breakdown": false } [2024-09-10 13:31:32,117] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0 [2024-09-10 13:31:32,117] [INFO] [RANK 0] Finetuning Model... [2024-09-10 13:31:32,117] [INFO] [RANK 0] arguments: [2024-09-10 13:31:32,117] [INFO] [RANK 0] base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml'] [2024-09-10 13:31:32,117] [INFO] [RANK 0] model_parallel_size .......... 1 [2024-09-10 13:31:32,117] [INFO] [RANK 0] force_pretrain ............... False [2024-09-10 13:31:32,117] [INFO] [RANK 0] device ....................... 0 [2024-09-10 13:31:32,117] [INFO] [RANK 0] debug ........................ False [2024-09-10 13:31:32,117] [INFO] [RANK 0] log_image .................... True [2024-09-10 13:31:32,117] [INFO] [RANK 0] output_dir ................... samples [2024-09-10 13:31:32,117] [INFO] [RANK 0] input_dir .................... None [2024-09-10 13:31:32,117] [INFO] [RANK 0] input_type ................... cli [2024-09-10 13:31:32,117] [INFO] [RANK 0] input_file ................... input.txt [2024-09-10 13:31:32,117] [INFO] [RANK 0] final_size ................... 2048 [2024-09-10 13:31:32,117] [INFO] [RANK 0] sdedit ....................... False [2024-09-10 13:31:32,117] [INFO] [RANK 0] grid_num_rows ................ 1 [2024-09-10 13:31:32,117] [INFO] [RANK 0] force_inference .............. False [2024-09-10 13:31:32,117] [INFO] [RANK 0] lcm_steps .................... None [2024-09-10 13:31:32,117] [INFO] [RANK 0] sampling_num_frames .......... 32 [2024-09-10 13:31:32,117] [INFO] [RANK 0] sampling_fps ................. 8 [2024-09-10 13:31:32,117] [INFO] [RANK 0] only_save_latents ............ False [2024-09-10 13:31:32,117] [INFO] [RANK 0] only_log_video_latents ....... True [2024-09-10 13:31:32,117] [INFO] [RANK 0] latent_channels .............. 32 [2024-09-10 13:31:32,117] [INFO] [RANK 0] image2video .................. False [2024-09-10 13:31:32,117] [INFO] [RANK 0] experiment_name .............. lora-test-09-10-13-30 [2024-09-10 13:31:32,117] [INFO] [RANK 0] train_iters .................. 100 [2024-09-10 13:31:32,117] [INFO] [RANK 0] batch_size ................... 1 [2024-09-10 13:31:32,117] [INFO] [RANK 0] lr ........................... 0.001 [2024-09-10 13:31:32,117] [INFO] [RANK 0] mode ......................... finetune [2024-09-10 13:31:32,117] [INFO] [RANK 0] seed ......................... 27481 [2024-09-10 13:31:32,117] [INFO] [RANK 0] zero_stage ................... 0 [2024-09-10 13:31:32,117] [INFO] [RANK 0] checkpoint_activations ....... True [2024-09-10 13:31:32,117] [INFO] [RANK 0] checkpoint_num_layers ........ 1 [2024-09-10 13:31:32,117] [INFO] [RANK 0] checkpoint_skip_layers ....... 0 [2024-09-10 13:31:32,118] [INFO] [RANK 0] fp16 ......................... 
True [2024-09-10 13:31:32,118] [INFO] [RANK 0] bf16 ......................... False [2024-09-10 13:31:32,118] [INFO] [RANK 0] gradient_accumulation_steps .. 1 [2024-09-10 13:31:32,118] [INFO] [RANK 0] profiling .................... -1 [2024-09-10 13:31:32,118] [INFO] [RANK 0] epochs ....................... None [2024-09-10 13:31:32,118] [INFO] [RANK 0] log_interval ................. 20 [2024-09-10 13:31:32,118] [INFO] [RANK 0] summary_dir .................. [2024-09-10 13:31:32,118] [INFO] [RANK 0] save_args .................... False [2024-09-10 13:31:32,118] [INFO] [RANK 0] lr_decay_iters ............... None [2024-09-10 13:31:32,118] [INFO] [RANK 0] lr_decay_style ............... linear [2024-09-10 13:31:32,118] [INFO] [RANK 0] lr_decay_ratio ............... 0.1 [2024-09-10 13:31:32,118] [INFO] [RANK 0] warmup ....................... 0.01 [2024-09-10 13:31:32,118] [INFO] [RANK 0] weight_decay ................. 0.0001 [2024-09-10 13:31:32,118] [INFO] [RANK 0] save ......................... ckpts_2b_lora/lora-test-09-10-13-30 [2024-09-10 13:31:32,118] [INFO] [RANK 0] load ......................... /root/CogVideo/CogVideoX-2b-sat/transformer [2024-09-10 13:31:32,118] [INFO] [RANK 0] force_train .................. True [2024-09-10 13:31:32,118] [INFO] [RANK 0] save_interval ................ 50 [2024-09-10 13:31:32,118] [INFO] [RANK 0] no_save_rng .................. False [2024-09-10 13:31:32,118] [INFO] [RANK 0] no_load_rng .................. True [2024-09-10 13:31:32,118] [INFO] [RANK 0] resume_dataloader ............ False [2024-09-10 13:31:32,118] [INFO] [RANK 0] distributed_backend .......... nccl [2024-09-10 13:31:32,118] [INFO] [RANK 0] local_rank ................... 0 [2024-09-10 13:31:32,118] [INFO] [RANK 0] exit_interval ................ None [2024-09-10 13:31:32,118] [INFO] [RANK 0] wandb ........................ False [2024-09-10 13:31:32,118] [INFO] [RANK 0] wandb_project_name ........... default_project [2024-09-10 13:31:32,118] [INFO] [RANK 0] eval_batch_size .............. 1 [2024-09-10 13:31:32,118] [INFO] [RANK 0] eval_iters ................... 1 [2024-09-10 13:31:32,118] [INFO] [RANK 0] eval_interval ................ 10 [2024-09-10 13:31:32,118] [INFO] [RANK 0] strict_eval .................. False [2024-09-10 13:31:32,118] [INFO] [RANK 0] train_data ................... ['/root/CogVideo/sat/datasets/test'] [2024-09-10 13:31:32,118] [INFO] [RANK 0] train_data_weights ........... None [2024-09-10 13:31:32,118] [INFO] [RANK 0] iterable_dataset ............. False [2024-09-10 13:31:32,118] [INFO] [RANK 0] iterable_dataset_eval ........ [2024-09-10 13:31:32,118] [INFO] [RANK 0] batch_from_same_dataset ...... False [2024-09-10 13:31:32,118] [INFO] [RANK 0] valid_data ................... ['/root/CogVideo/sat/datasets/test'] [2024-09-10 13:31:32,118] [INFO] [RANK 0] test_data .................... None [2024-09-10 13:31:32,118] [INFO] [RANK 0] split ........................ 1,0,0 [2024-09-10 13:31:32,118] [INFO] [RANK 0] num_workers .................. 8 [2024-09-10 13:31:32,118] [INFO] [RANK 0] block_size ................... 10000 [2024-09-10 13:31:32,118] [INFO] [RANK 0] prefetch_factor .............. 4 [2024-09-10 13:31:32,118] [INFO] [RANK 0] deepspeed .................... True [2024-09-10 13:31:32,118] [INFO] [RANK 0] deepspeed_config ............. 
{'train_micro_batch_size_per_gpu': 1, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False} [2024-09-10 13:31:32,118] [INFO] [RANK 0] deepscale .................... False [2024-09-10 13:31:32,118] [INFO] [RANK 0] deepscale_config ............. None [2024-09-10 13:31:32,119] [INFO] [RANK 0] model_config ................. {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 
4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}} [2024-09-10 13:31:32,119] [INFO] [RANK 0] data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}} [2024-09-10 13:31:32,119] [INFO] [RANK 0] cuda ......................... True [2024-09-10 13:31:32,119] [INFO] [RANK 0] rank ......................... 0 [2024-09-10 13:31:32,119] [INFO] [RANK 0] world_size ................... 1 [2024-09-10 13:31:32,119] [INFO] [RANK 0] deepspeed_activation_checkpointing True [2024-09-10 13:31:32,119] [INFO] [RANK 0] master_ip .................... localhost [2024-09-10 13:31:32,119] [INFO] [RANK 0] master_port .................. 44107 [2024-09-10 13:31:32,119] [INFO] [RANK 0] log_config ................... 
[{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': '/root/CogVideo/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': '/root/CogVideo/CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 
'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'lora-test', 'mode': 'finetune', 'load': '/root/CogVideo/CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 100, 'eval_iters': 1, 'eval_interval': 10, 'eval_batch_size': 1, 'save': 'ckpts_2b_lora', 'save_interval': 50, 'log_interval': 20, 'train_data': ['/root/CogVideo/sat/datasets/test'], 'valid_data': ['/root/CogVideo/sat/datasets/test'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 1, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}]
[2024-09-10 13:31:32,119] [INFO] [RANK 0] do_train ..................... True
[2024-09-10 13:31:32,119] [INFO] [RANK 0] val_last_shape ............... []
[2024-09-10 13:31:32,119] [INFO] [RANK 0] val_drop_number .............. 0
[2024-09-10 13:31:32,119] [INFO] [RANK 0] do_valid ..................... True
[2024-09-10 13:31:32,119] [INFO] [RANK 0] do_test ...................... False
[2024-09-10 13:31:32,119] [INFO] [RANK 0] iteration .................... 0
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:545:forward] ----Synchronization False
[2024-09-10 13:32:00,623] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-10 13:32:06,525] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-10 13:32:15,902] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-10 13:32:24,779] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-10 13:32:33,800] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-10 13:32:43,291] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-10 13:33:28,030] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
/root/CogVideo/sat/train_video.py:67: DeprecationWarning: torch.get_autocast_gpu_dtype() is deprecated. Please use torch.get_autocast_dtype('cuda') instead. (Triggered internally at ../torch/csrc/autograd/init.cpp:733.)
  "dtype": torch.get_autocast_gpu_dtype(),
/root/CogVideo/sat/train_video.py:70: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.no_grad(), torch.cuda.amp.autocast(**gpu_autocast_kwargs):
############################## Sampling setting ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps: 98%|█████████▏| 50/51 [01:24<00:01, 1.70s/it]
[2024-09-10 13:34:59,554] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-10 13:34:59,555] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:34:59,555] [INFO] [RANK 0] validation loss at iteration 10 | loss: 1.391026E-01 | PPL: 1.149242E+00 loss 1.391026E-01 |
[2024-09-10 13:34:59,555] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:35:16,965] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
[2024-09-10 13:36:28,365] [INFO] [RANK 0] iteration 20/ 100 | elapsed time per iteration (ms): 14758.9 | learning rate 5.000E-05 | total loss 1.892787E-01 | loss 1.892786E-01 | loss scale 33554432.0 |speed 4.07 samples/(min*GPU)
[2024-09-10 13:36:28,366] [INFO] [RANK 0] after 20 iterations memory (MB) | allocated: 13974.6455078125 | max allocated: 38562.94677734375 | cached: 18572.0 | max cached: 53186.0
[2024-09-10 13:36:28,367] [INFO] [RANK 0] time (ms) | forward: 4717.04 | backward: 5432.07 | allreduce: 0.00 | optimizer: 32.39 | data loader: 90.08
############################## Sampling setting ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps: 98%|█████████▏| 50/51 [01:24<00:01, 1.70s/it]
[2024-09-10 13:37:59,450] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-10 13:37:59,450] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:37:59,450] [INFO] [RANK 0] validation loss at iteration 20 | loss: 1.256772E-01 | PPL: 1.133916E+00 loss 1.256772E-01 |
[2024-09-10 13:37:59,450] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
############################## Sampling setting ##############################
Sampler: VPSDEDPMPP2MSampler
Discretization: ZeroSNRDDPMDiscretization
Guider: DynamicCFG
Sampling with VPSDEDPMPP2MSampler for 51 steps: 98%|█████████▏| 50/51 [01:25<00:01, 1.70s/it]
[2024-09-10 13:40:59,756] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------------
[2024-09-10 13:40:59,756] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[2024-09-10 13:40:59,756] [INFO] [RANK 0] validation loss at iteration 30 | loss: 2.129551E-01 | PPL: 1.237329E+00 loss 2.129551E-01 |
[2024-09-10 13:40:59,756] [INFO] [RANK 0] ----------------------------------------------------------------------------------------------
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/CogVideo/sat/train_video.py", line 226, in <module>
[rank0]:     training_main(
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 157, in training_main
[rank0]:     iteration, skipped = train(model, optimizer,
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 359, in train
[rank0]:     lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
[rank0]:                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/training/deepspeed_training.py", line 443, in train_step
[rank0]:     forward_ret = forward_step(data_iterator, model, args, timers, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/CogVideo/sat/train_video.py", line 176, in forward_step
[rank0]:     batch = next(data_iterator)
[rank0]:             ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank0]:     data = self._next_data()
[rank0]:            ^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
[rank0]:     return self._process_data(data)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank0]:     data.reraise()
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
[rank0]:     raise exception
[rank0]: ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 6.
[rank0]: Original Traceback (most recent call last):
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
[rank0]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank0]:             ~~~~~~~~~~~~^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 360, in __getitem__
[rank0]:     return self.wrapped_data[index]
[rank0]:            ~~~~~~~~~~~~~~~~~^^^^^^^
[rank0]:   File "/root/miniconda3/envs/cogvideo/lib/python3.12/site-packages/sat/data_utils/configure_data.py", line 342, in __getitem__
[rank0]:     return self.datasets[dataset_idx][sample_idx]
[rank0]:            ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
[rank0]:   File "/root/CogVideo/sat/data_video.py", line 411, in __getitem__
[rank0]:     indices = np.arange(start, end, (end - start) // num_frames).astype(int)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ZeroDivisionError: division by zero
DONE on alphacode-ttv-a100-80g-gpu
(cogvideo) root@alphacode-ttv-a100-80g-gpu:~/CogVideo/sat#
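As an aside, this run ultimately died not from the overflow warnings but from the `ZeroDivisionError` at the end: in `data_video.py` line 411, the sampling stride `(end - start) // num_frames` is 0 whenever a clip decodes to fewer usable frames than `num_frames` (the config above uses `max_num_frames: 49` with `skip_frms_num: 3.0` trimming). A minimal guard, sketched with the same variable names as the traceback; the `np.linspace` fallback is an illustrative assumption, not the repository's actual fix:

```python
import numpy as np

def sample_frame_indices(start: int, end: int, num_frames: int) -> np.ndarray:
    """Pick num_frames indices from [start, end) without dividing by zero."""
    stride = (end - start) // num_frames
    if stride == 0:
        # Clip is shorter than num_frames: fall back to evenly spaced
        # (possibly repeated) indices instead of crashing the worker.
        return np.linspace(start, max(start, end - 1), num_frames).astype(int)
    return np.arange(start, end, stride).astype(int)[:num_frames]
```

Alternatively, filtering out videos shorter than `max_num_frames` plus twice `skip_frms_num` before training avoids the crash at the dataset level.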
AoqunJin commented 6 days ago

Same issue on an A100 80G. I tried both the 2B and 5B versions (fp16 & bf16) and reduced the lr from 1e-3 to 1e-5 (see https://github.com/THUDM/ChatGLM-6B/issues/1008), but hit the same error.

tengjiayan20 commented 5 days ago

It is normal for steps to be skipped while the loss scale is still too large at the beginning of training; you will typically see a small number of skipped steps within the first 50. Once training stabilizes, it will not happen again.
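If you want to shorten that warm-up burst, DeepSpeed's fp16 settings accept an `initial_scale_power` field: the dynamic loss scale starts at `2**initial_scale_power`, and the logs above evidently started at `2**32`. A hedged sketch of the relevant section, written as a Python dict in the same style as the config dump above; the value 20 is illustrative, and you should check where the SAT YAML expects these keys before copying:

```python
# fp16 section of a DeepSpeed config. Only initial_scale_power is new here;
# the other values mirror the dumped config above. 20 is an assumed value:
# the loss scale then starts at 2**20 = 1048576 instead of 2**32.
fp16_config = {
    "enabled": True,
    "initial_scale_power": 20,
    "loss_scale": 0,           # 0 = dynamic loss scaling
    "loss_scale_window": 400,  # clean steps before the scale doubles
    "hysteresis": 2,
    "min_loss_scale": 1,
}
```

A lower starting point just means fewer skipped steps at the top; the scaler grows the scale back up by itself if the gradients allow it.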

AoqunJin commented 5 days ago

Yes, that's right, @tengjiayan20. It recovered after a few training steps:

[2024-09-11 17:52:11,320] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-11 17:52:18,030] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-11 17:52:32,563] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-11 17:52:47,082] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-11 17:53:15,865] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-11 17:53:58,933] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-11 17:58:33,295] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-11 18:00:42,520] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
[2024-09-11 18:04:05,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=7, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-11 18:07:56,739] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432, reducing to 16777216
[2024-09-11 18:16:06,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=8, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-11 18:16:06,712] [INFO] [RANK 0]  iteration      100/   10000 | elapsed time per iteration (ms): 14623.5 | learning rate 5.000E-05 | total loss 1.992110E-01 | loss 1.992110E-01 | loss scale 16777216.0 |speed 8.21 samples/(min*GPU)
[2024-09-11 18:16:06,713] [INFO] [RANK 0] after 100 iterations memory (MB) | allocated: 13974.6455078125 | max allocated: 64453.90478515625 | cached: 22772.0 | max cached: 79914.0
[2024-09-11 18:16:06,713] [INFO] [RANK 0] time (ms) | forward: 9524.11 | backward: 5073.59 | allreduce: 0.00 | optimizer: 24.71 | data loader: 67.04
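For anyone reading later: the pattern in this log is dynamic loss scaling doing its job: halve the scale and skip the step whenever gradients overflow in fp16, then double it again after `loss_scale_window` (400 here) consecutive clean steps. The halving run from 4294967296 down to 16777216 is 8 skipped steps, matching the `skipped=8` counter at step 100. A simplified sketch of the rule, not DeepSpeed's actual `loss_scaler.py` (which also implements the `hysteresis` option):

```python
class DynamicLossScaler:
    """Minimal halve-on-overflow / double-after-window rule (simplified)."""

    def __init__(self, init_scale: float = 2**32, window: int = 400,
                 min_scale: float = 1.0):
        self.scale = float(init_scale)
        self.window = window       # clean steps required before doubling
        self.min_scale = min_scale
        self._clean_steps = 0

    def update(self, grads_overflowed: bool) -> bool:
        """Return True if the optimizer step should be skipped."""
        if grads_overflowed:
            # Matches the "OVERFLOW! ... reducing to ..." lines above.
            self.scale = max(self.scale / 2.0, self.min_scale)
            self._clean_steps = 0
            return True
        self._clean_steps += 1
        if self._clean_steps >= self.window:
            self.scale *= 2.0
            self._clean_steps = 0
        return False
```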

Thanks a lot.