Closed · chenxinli001 closed this issue 4 weeks ago
Nope, can you share the log?
(cogvideo) ubuntu@instance-butter:/data3/cx_workspace/CogV/CogVideo/sat$ bash finetune_single_gpu.sh
RUN on instance-butter, CUDA_VISIBLE_DEVICES=6
WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1 python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed 22338
[2024-09-03 02:36:29,937] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(ctx, input, weight, bias=None):
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
[2024-09-03 02:36:34,890] [INFO] using world size: 1
[2024-09-03 02:36:34,891] [INFO] Will override arguments with manually specified deepspeed_config!
[2024-09-03 02:36:34,893] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-09-03 02:36:34,894] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-09-03 02:36:34,922] [INFO] [RANK 0] building SATVideoDiffusionEngine model ...
[2024-09-03 02:36:44,771] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-09-03 02:36:44,904] [INFO] [RANK 0] replacing layer 1 attention with lora
[2024-09-03 02:36:45,021] [INFO] [RANK 0] replacing layer 2 attention with lora
[2024-09-03 02:36:45,131] [INFO] [RANK 0] replacing layer 3 attention with lora
[2024-09-03 02:36:45,241] [INFO] [RANK 0] replacing layer 4 attention with lora
[2024-09-03 02:36:45,352] [INFO] [RANK 0] replacing layer 5 attention with lora
[2024-09-03 02:36:45,466] [INFO] [RANK 0] replacing layer 6 attention with lora
[2024-09-03 02:36:45,575] [INFO] [RANK 0] replacing layer 7 attention with lora
[2024-09-03 02:36:45,687] [INFO] [RANK 0] replacing layer 8 attention with lora
[2024-09-03 02:36:45,851] [INFO] [RANK 0] replacing layer 9 attention with lora
[2024-09-03 02:36:45,957] [INFO] [RANK 0] replacing layer 10 attention with lora
[2024-09-03 02:36:46,063] [INFO] [RANK 0] replacing layer 11 attention with lora
[2024-09-03 02:36:46,173] [INFO] [RANK 0] replacing layer 12 attention with lora
[2024-09-03 02:36:46,280] [INFO] [RANK 0] replacing layer 13 attention with lora
[2024-09-03 02:36:46,387] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-09-03 02:36:46,495] [INFO] [RANK 0] replacing layer 15 attention with lora
[2024-09-03 02:36:46,606] [INFO] [RANK 0] replacing layer 16 attention with lora
[2024-09-03 02:36:46,761] [INFO] [RANK 0] replacing layer 17 attention with lora
[2024-09-03 02:36:46,901] [INFO] [RANK 0] replacing layer 18 attention with lora
[2024-09-03 02:36:47,044] [INFO] [RANK 0] replacing layer 19 attention with lora
[2024-09-03 02:36:47,171] [INFO] [RANK 0] replacing layer 20 attention with lora
[2024-09-03 02:36:47,291] [INFO] [RANK 0] replacing layer 21 attention with lora
[2024-09-03 02:36:47,397] [INFO] [RANK 0] replacing layer 22 attention with lora
[2024-09-03 02:36:47,506] [INFO] [RANK 0] replacing layer 23 attention with lora
[2024-09-03 02:36:47,610] [INFO] [RANK 0] replacing layer 24 attention with lora
[2024-09-03 02:36:47,774] [INFO] [RANK 0] replacing layer 25 attention with lora
[2024-09-03 02:36:47,881] [INFO] [RANK 0] replacing layer 26 attention with lora
[2024-09-03 02:36:47,986] [INFO] [RANK 0] replacing layer 27 attention with lora
[2024-09-03 02:36:48,095] [INFO] [RANK 0] replacing layer 28 attention with lora
[2024-09-03 02:36:48,208] [INFO] [RANK 0] replacing layer 29 attention with lora
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.28s/it]
Initialized embedder #0: FrozenT5Embedder with 4762310656 params. Trainable: False
Working with z of shape (1, 16, 32, 32) = 16384 dimensions.
/data3/cx_workspace/CogV/CogVideo/sat/vae_modules/autoencoder.py:565: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
sd = torch.load(path, map_location="cpu")["state_dict"]
Deleting key loss.logvar from state_dict.
Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict.
Deleting key loss.perceptual_loss.scaling_layer.scale from state_dict.
Deleting key loss.perceptual_loss.net.slice1.0.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice1.0.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice1.2.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice1.2.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice2.5.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice2.5.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice2.7.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice2.7.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.10.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.10.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.12.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.12.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice3.14.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice3.14.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.17.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.17.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.19.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.19.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice4.21.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice4.21.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.24.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.24.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.26.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.26.bias from state_dict.
Deleting key loss.perceptual_loss.net.slice5.28.weight from state_dict.
Deleting key loss.perceptual_loss.net.slice5.28.bias from state_dict.
Deleting key loss.perceptual_loss.lin0.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin1.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin2.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin3.model.1.weight from state_dict.
Deleting key loss.perceptual_loss.lin4.model.1.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.0.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.0.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.1.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.1.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.2.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.2.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.downsample_res.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.downsample_res.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.net.0.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.net.0.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.net.2.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.net.2.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.3.downsample.conv.weight from state_dict.
Deleting key loss.discriminator.blocks.3.downsample.conv.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.4.0.downsample.1.weight from state_dict.
Deleting key loss.discriminator.blocks.4.0.downsample.1.bias from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.4.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.5.0.downsample.1.weight from state_dict.
Deleting key loss.discriminator.blocks.5.0.downsample.1.bias from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.5.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.conv_res.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.conv_res.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.0.net.2.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_q.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_kv.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.0.fn.attn.to_out.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.norm.gamma from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.0.bias from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.weight from state_dict.
Deleting key loss.discriminator.blocks.6.1.1.fn.net.2.bias from state_dict.
Deleting key loss.discriminator.to_logits.0.weight from state_dict.
Deleting key loss.discriminator.to_logits.0.bias from state_dict.
Deleting key loss.discriminator.to_logits.3.weight from state_dict.
Deleting key loss.discriminator.to_logits.3.bias from state_dict.
Missing keys: []
Unexpected keys: []
Restored from CogVideoX-2b-sat/vae/3d-vae.pt
[2024-09-03 02:36:56,856] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 6764790755
[2024-09-03 02:37:15,810] [INFO] [RANK 0] global rank 0 is loading checkpoint CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/sat/training/model_io.py:286: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
sd = torch.load(checkpoint_name, map_location='cpu')
[2024-09-03 02:37:17,758] [INFO] [RANK 0] > successfully loaded CogVideoX-2b-sat/transformer/1000/mp_rank_00_model_states.pt
[2024-09-03 02:37:18,437] [INFO] [RANK 0] Total trainable parameters: 58982400
[2024-09-03 02:37:18,437] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay
[2024-09-03 02:37:18,440] [INFO] [RANK 0] Syncing initialized parameters...
[2024-09-03 02:37:18,503] [INFO] [RANK 0] Finished syncing initialized parameters.
[2024-09-03 02:37:18,503] [INFO] [RANK 0] Using optimizer sat.ops.FusedEmaAdam from sat.
[2024-09-03 02:37:18,503] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.4, git-hash=unknown, git-branch=unknown
[2024-09-03 02:37:18,503] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-09-03 02:37:18,646] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /data1/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /data1/.cache/torch_extensions/py310_cu121/fused_ema_adam/build.ninja...
/data1/anaconda3/envs/cogvideo/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_ema_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_ema_adam...
Time to load fused_ema_adam op: 0.07278060913085938 seconds
[2024-09-03 02:37:18,724] [INFO] [logging.py:96:log_dist] [Rank 0] Using client callable to create basic optimizer
[2024-09-03 02:37:18,725] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-09-03 02:37:18,762] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedEmaAdam
[2024-09-03 02:37:18,763] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedEmaAdam type=<class 'sat.ops.fused_ema_adam.FusedEmaAdam'>
[2024-09-03 02:37:18,763] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] ** You are using ZeRO with an untested optimizer, proceed with caution ***
[2024-09-03 02:37:18,763] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 1000000000
[2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 1000000000
[2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-09-03 02:37:18,763] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-09-03 02:37:23,295] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-09-03 02:37:23,295] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.97 GB CA 13.23 GB Max_CA 13 GB
[2024-09-03 02:37:23,295] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 478.67 GB, percent = 23.7%
[2024-09-03 02:37:23,814] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-09-03 02:37:23,814] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 13.08 GB CA 13.45 GB Max_CA 13 GB
[2024-09-03 02:37:23,814] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 483.06 GB, percent = 24.0%
[2024-09-03 02:37:23,815] [INFO] [stage_1_and_2.py:543:__init__] optimizer state initialized
[2024-09-03 02:37:24,129] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-09-03 02:37:24,130] [INFO] [utils.py:782:see_memory_usage] MA 12.86 GB Max_MA 12.86 GB CA 13.45 GB Max_CA 13 GB
[2024-09-03 02:37:24,130] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 485.83 GB, percent = 24.1%
[2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-09-03 02:37:24,134] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[1.0], mom=[[0.9, 0.95]]
[2024-09-03 02:37:24,137] [INFO] [config.py:997:print] DeepSpeedEngine configuration:
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] amp_enabled .................. False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] amp_params ................... False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] bfloat16_enabled ............. False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] bfloat16_immediate_grad_update False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_parallel_write_pipeline False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_tag_validation_enabled True
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] checkpoint_tag_validation_fail False
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x15507917fbb0>
[2024-09-03 02:37:24,137] [INFO] [config.py:1001:print] communication_data_type ...... None
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] curriculum_enabled_legacy .... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] curriculum_params_legacy ..... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] data_efficiency_enabled ...... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dataloader_drop_last ......... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] disable_allgather ............ False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dump_state ................... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] dynamic_loss_scale_args ...... None
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_enabled ........... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_gas_boundary_resolution 1
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_layer_num ......... 0
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_max_iter .......... 100
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_stability ......... 1e-06
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_tol ............... 0.01
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] eigenvalue_verbose ........... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] elasticity_enabled ........... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_auto_cast ............... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_enabled ................. True
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] fp16_master_weights_and_gradients False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] global_rank .................. 0
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] grad_accum_dtype ............. None
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_accumulation_steps .. 1
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_clipping ............ 0.1
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] gradient_predivide_factor .... 1.0
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] graph_harvesting ............. False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] initial_dynamic_scale ........ 65536
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] load_universal_checkpoint .... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] loss_scale ................... 0
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] memory_breakdown ............. False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] mics_hierarchial_params_gather False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] mics_shard_size .............. -1
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_legacy_fusion ...... False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_name ............... None
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] optimizer_params ............. None
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pld_enabled .................. False
[2024-09-03 02:37:24,138] [INFO] [config.py:1001:print] pld_params ................... False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] prescale_gradients ........... False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] scheduler_name ............... None
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] scheduler_params ............. None
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] seq_parallel_communication_data_type torch.float32
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] sparse_attention ............. None
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] sparse_gradients_enabled ..... False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] steps_per_print .............. 50
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] timers_config ................ enabled=True synchronized=True
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] train_batch_size ............. 2
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] train_micro_batch_size_per_gpu 2
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] use_data_before_expertparallel False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] use_node_local_storage ....... False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] wall_clock_breakdown ......... False
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] weight_quantization_config ... None
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] world_size ................... 1
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_allow_untested_optimizer True
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_config .................. stage=2 contiguous_gradients=False reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=False elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_enabled ................. True
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_force_ds_cpu_optimizer .. True
[2024-09-03 02:37:24,139] [INFO] [config.py:1001:print] zero_optimization_stage ...... 2
[2024-09-03 02:37:24,139] [INFO] [config.py:987:print_user_config] json = {
"train_micro_batch_size_per_gpu": 2,
"gradient_accumulation_steps": 1,
"steps_per_print": 50,
"gradient_clipping": 0.1,
"zero_optimization": {
"stage": 2,
"cpu_offload": false,
"contiguous_gradients": false,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+09,
"allgather_bucket_size": 1.000000e+09,
"load_from_fp32_weights": false
},
"zero_allow_untested_optimizer": true,
"bf16": {
"enabled": false
},
"fp16": {
"enabled": true
},
"loss_scale": 0,
"loss_scale_window": 400,
"hysteresis": 2,
"min_loss_scale": 1,
"activation_checkpointing": {
"partition_activations": false,
"contiguous_memory_optimization": false
},
"wall_clock_breakdown": false
}
[2024-09-03 02:37:24,139] [INFO] [RANK 0] learning rate decaying style linear, ratio 10.0
[2024-09-03 02:37:24,139] [INFO] [RANK 0] Finetuning Model...
[2024-09-03 02:37:24,139] [INFO] [RANK 0] arguments:
[2024-09-03 02:37:24,139] [INFO] [RANK 0] base ......................... ['configs/cogvideox_2b_lora.yaml', 'configs/sft.yaml']
[2024-09-03 02:37:24,139] [INFO] [RANK 0] model_parallel_size .......... 1
[2024-09-03 02:37:24,139] [INFO] [RANK 0] force_pretrain ............... False
[2024-09-03 02:37:24,139] [INFO] [RANK 0] device ....................... 0
[2024-09-03 02:37:24,139] [INFO] [RANK 0] debug ........................ False
[2024-09-03 02:37:24,139] [INFO] [RANK 0] log_image .................... True
[2024-09-03 02:37:24,139] [INFO] [RANK 0] output_dir ................... samples
[2024-09-03 02:37:24,139] [INFO] [RANK 0] input_dir .................... None
[2024-09-03 02:37:24,139] [INFO] [RANK 0] input_type ................... cli
[2024-09-03 02:37:24,139] [INFO] [RANK 0] input_file ................... input.txt
[2024-09-03 02:37:24,139] [INFO] [RANK 0] final_size ................... 2048
[2024-09-03 02:37:24,140] [INFO] [RANK 0] sdedit ....................... False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] grid_num_rows ................ 1
[2024-09-03 02:37:24,140] [INFO] [RANK 0] force_inference .............. False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] lcm_steps .................... None
[2024-09-03 02:37:24,140] [INFO] [RANK 0] sampling_num_frames .......... 32
[2024-09-03 02:37:24,140] [INFO] [RANK 0] sampling_fps ................. 8
[2024-09-03 02:37:24,140] [INFO] [RANK 0] only_save_latents ............ False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] only_log_video_latents ....... True
[2024-09-03 02:37:24,140] [INFO] [RANK 0] latent_channels .............. 32
[2024-09-03 02:37:24,140] [INFO] [RANK 0] image2video .................. False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] experiment_name .............. example_data-09-03-02-36
[2024-09-03 02:37:24,140] [INFO] [RANK 0] train_iters .................. 1000
[2024-09-03 02:37:24,140] [INFO] [RANK 0] batch_size ................... 2
[2024-09-03 02:37:24,140] [INFO] [RANK 0] lr ........................... 0.001
[2024-09-03 02:37:24,140] [INFO] [RANK 0] mode ......................... finetune
[2024-09-03 02:37:24,140] [INFO] [RANK 0] seed ......................... 22338
[2024-09-03 02:37:24,140] [INFO] [RANK 0] zero_stage ................... 0
[2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_activations ....... True
[2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_num_layers ........ 1
[2024-09-03 02:37:24,140] [INFO] [RANK 0] checkpoint_skip_layers ....... 0
[2024-09-03 02:37:24,140] [INFO] [RANK 0] fp16 ......................... True
[2024-09-03 02:37:24,140] [INFO] [RANK 0] bf16 ......................... False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] gradient_accumulation_steps .. 1
[2024-09-03 02:37:24,140] [INFO] [RANK 0] profiling .................... -1
[2024-09-03 02:37:24,140] [INFO] [RANK 0] epochs ....................... None
[2024-09-03 02:37:24,140] [INFO] [RANK 0] log_interval ................. 20
[2024-09-03 02:37:24,140] [INFO] [RANK 0] summary_dir ..................
[2024-09-03 02:37:24,140] [INFO] [RANK 0] save_args .................... False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_iters ............... None
[2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_style ............... linear
[2024-09-03 02:37:24,140] [INFO] [RANK 0] lr_decay_ratio ............... 0.1
[2024-09-03 02:37:24,140] [INFO] [RANK 0] warmup ....................... 0.01
[2024-09-03 02:37:24,140] [INFO] [RANK 0] weight_decay ................. 0.0001
[2024-09-03 02:37:24,140] [INFO] [RANK 0] save ......................... ckpts_2b/example_data-09-03-02-36
[2024-09-03 02:37:24,140] [INFO] [RANK 0] load ......................... CogVideoX-2b-sat/transformer
[2024-09-03 02:37:24,140] [INFO] [RANK 0] force_train .................. True
[2024-09-03 02:37:24,140] [INFO] [RANK 0] save_interval ................ 500
[2024-09-03 02:37:24,140] [INFO] [RANK 0] no_save_rng .................. False
[2024-09-03 02:37:24,140] [INFO] [RANK 0] no_load_rng .................. True
[2024-09-03 02:37:24,140] [INFO] [RANK 0] resume_dataloader ............ False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] distributed_backend .......... nccl
[2024-09-03 02:37:24,141] [INFO] [RANK 0] local_rank ................... 0
[2024-09-03 02:37:24,141] [INFO] [RANK 0] exit_interval ................ None
[2024-09-03 02:37:24,141] [INFO] [RANK 0] wandb ........................ False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] wandb_project_name ........... default_project
[2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_batch_size .............. 1
[2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_iters ................... 1
[2024-09-03 02:37:24,141] [INFO] [RANK 0] eval_interval ................ 100
[2024-09-03 02:37:24,141] [INFO] [RANK 0] strict_eval .................. False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] train_data ................... ['toy_data']
[2024-09-03 02:37:24,141] [INFO] [RANK 0] train_data_weights ........... None
[2024-09-03 02:37:24,141] [INFO] [RANK 0] iterable_dataset ............. False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] iterable_dataset_eval ........
[2024-09-03 02:37:24,141] [INFO] [RANK 0] batch_from_same_dataset ...... False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] valid_data ................... ['toy_data']
[2024-09-03 02:37:24,141] [INFO] [RANK 0] test_data .................... None
[2024-09-03 02:37:24,141] [INFO] [RANK 0] split ........................ 1,0,0
[2024-09-03 02:37:24,141] [INFO] [RANK 0] num_workers .................. 8
[2024-09-03 02:37:24,141] [INFO] [RANK 0] block_size ................... 10000
[2024-09-03 02:37:24,141] [INFO] [RANK 0] prefetch_factor .............. 4
[2024-09-03 02:37:24,141] [INFO] [RANK 0] deepspeed .................... True
[2024-09-03 02:37:24,141] [INFO] [RANK 0] deepspeed_config ............. {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}
[2024-09-03 02:37:24,141] [INFO] [RANK 0] deepscale .................... False
[2024-09-03 02:37:24,141] [INFO] [RANK 0] deepscale_config ............. None
[2024-09-03 02:37:24,141] [INFO] [RANK 0] model_config ................. {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False, 'num_layers': 30, 'hidden_size': 1920, 'num_attention_heads': 30, 'parallel_output': True}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}, 'dtype': 'fp16'}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': 'CogVideoX-2b-sat/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': 'CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 
'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}
[2024-09-03 02:37:24,141] [INFO] [RANK 0] data_config .................. {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}
[2024-09-03 02:37:24,141] [INFO] [RANK 0] cuda ......................... True
[2024-09-03 02:37:24,142] [INFO] [RANK 0] rank ......................... 0
[2024-09-03 02:37:24,142] [INFO] [RANK 0] world_size ................... 1
[2024-09-03 02:37:24,142] [INFO] [RANK 0] deepspeed_activation_checkpointing True
[2024-09-03 02:37:24,142] [INFO] [RANK 0] master_ip .................... localhost
[2024-09-03 02:37:24,142] [INFO] [RANK 0] master_port .................. 38137
[2024-09-03 02:37:24,142] [INFO] [RANK 0] log_config ................... [{'model': {'scale_factor': 1.15258426, 'disable_first_stage_autocast': True, 'not_trainable_prefixes': ['all'], 'log_keys': ['txt'], 'denoiser_config': {'target': 'sgm.modules.diffusionmodules.denoiser.DiscreteDenoiser', 'params': {'num_idx': 1000, 'quantize_c_noise': False, 'weighting_config': {'target': 'sgm.modules.diffusionmodules.denoiser_weighting.EpsWeighting'}, 'scaling_config': {'target': 'sgm.modules.diffusionmodules.denoiser_scaling.VideoScaling'}, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}, 'network_config': {'target': 'dit_video_concat.DiffusionTransformer', 'params': {'time_embed_dim': 512, 'elementwise_affine': True, 'num_frames': 49, 'time_compressed_rate': 4, 'latent_width': 90, 'latent_height': 60, 'num_layers': 30, 'patch_size': 2, 'in_channels': 16, 'out_channels': 16, 'hidden_size': 1920, 'adm_in_channels': 256, 'num_attention_heads': 30, 'transformer_args': {'checkpoint_activations': True, 'vocab_size': 1, 'max_sequence_length': 64, 'layernorm_order': 'pre', 'skip_init': False, 'model_parallel_size': 1, 'is_decoder': False}, 'modules': {'pos_embed_config': {'target': 'dit_video_concat.Basic3DPositionEmbeddingMixin', 'params': {'text_length': 226, 'height_interpolation': 1.875, 'width_interpolation': 1.875}}, 'lora_config': {'target': 'sat.model.finetune.lora2.LoraMixin', 'params': {'r': 128}}, 'patch_embed_config': {'target': 'dit_video_concat.ImagePatchEmbeddingMixin', 'params': {'text_hidden_size': 4096}}, 'adaln_layer_config': {'target': 'dit_video_concat.AdaLNMixin', 'params': {'qk_ln': True}}, 'final_layer_config': {'target': 'dit_video_concat.FinalLayerMixin'}}}}, 'conditioner_config': {'target': 'sgm.modules.GeneralConditioner', 'params': {'emb_models': [{'is_trainable': False, 'input_key': 'txt', 'ucg_rate': 0.1, 'target': 'sgm.modules.encoders.modules.FrozenT5Embedder', 'params': {'model_dir': 'CogVideoX-2b-sat/t5-v1_1-xxl', 'max_length': 226}}]}}, 'first_stage_config': {'target': 'vae_modules.autoencoder.VideoAutoencoderInferenceWrapper', 'params': {'cp_size': 1, 'ckpt_path': 'CogVideoX-2b-sat/vae/3d-vae.pt', 'ignore_keys': ['loss'], 'loss_config': {'target': 'torch.nn.Identity'}, 'regularizer_config': {'target': 'vae_modules.regularizers.DiagonalGaussianRegularizer'}, 'encoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelEncoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': True}}, 'decoder_config': {'target': 'vae_modules.cp_enc_dec.ContextParallelDecoder3D', 'params': {'double_z': True, 'z_channels': 16, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 2, 4], 'attn_resolutions': [], 'num_res_blocks': 3, 'dropout': 0.0, 'gather_norm': False}}}}, 'loss_fn_config': {'target': 'sgm.modules.diffusionmodules.loss.VideoDiffusionLoss', 'params': {'offset_noise_level': 0, 'sigma_sampler_config': {'target': 'sgm.modules.diffusionmodules.sigma_sampling.DiscreteSampling', 'params': {'uniform_sampling': True, 'num_idx': 1000, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}}}}}, 'sampler_config': {'target': 'sgm.modules.diffusionmodules.sampling.VPSDEDPMPP2MSampler', 'params': {'num_steps': 50, 'verbose': 
True, 'discretization_config': {'target': 'sgm.modules.diffusionmodules.discretizer.ZeroSNRDDPMDiscretization', 'params': {'shift_scale': 3.0}}, 'guider_config': {'target': 'sgm.modules.diffusionmodules.guiders.DynamicCFG', 'params': {'scale': 6, 'exp': 5, 'num_steps': 50}}}}}}, {'args': {'checkpoint_activations': True, 'model_parallel_size': 1, 'experiment_name': 'example_data', 'mode': 'finetune', 'load': 'CogVideoX-2b-sat/transformer', 'no_load_rng': True, 'train_iters': 1000, 'eval_iters': 1, 'eval_interval': 100, 'eval_batch_size': 1, 'save': 'ckpts_2b', 'save_interval': 500, 'log_interval': 20, 'train_data': ['toy_data'], 'valid_data': ['toy_data'], 'split': '1,0,0', 'num_workers': 8, 'force_train': True, 'only_log_video_latents': True}, 'data': {'target': 'data_video.SFTDataset', 'params': {'video_size': [480, 720], 'fps': 8, 'max_num_frames': 49, 'skip_frms_num': 3.0}}, 'deepspeed': {'train_micro_batch_size_per_gpu': 2, 'gradient_accumulation_steps': 1, 'steps_per_print': 50, 'gradient_clipping': 0.1, 'zero_optimization': {'stage': 2, 'cpu_offload': False, 'contiguous_gradients': False, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 1000000000, 'allgather_bucket_size': 1000000000, 'load_from_fp32_weights': False}, 'zero_allow_untested_optimizer': True, 'bf16': {'enabled': False}, 'fp16': {'enabled': True}, 'loss_scale': 0, 'loss_scale_window': 400, 'hysteresis': 2, 'min_loss_scale': 1, 'optimizer': {'type': 'sat.ops.FusedEmaAdam', 'params': {'lr': 0.001, 'betas': [0.9, 0.95], 'eps': '1e-8', 'weight_decay': '1e-4'}}, 'activation_checkpointing': {'partition_activations': False, 'contiguous_memory_optimization': False}, 'wall_clock_breakdown': False}}]
[2024-09-03 02:37:24,142] [INFO] [RANK 0] do_train ..................... True
[2024-09-03 02:37:24,142] [INFO] [RANK 0] val_last_shape ............... []
[2024-09-03 02:37:24,142] [INFO] [RANK 0] val_drop_number .............. 0
[2024-09-03 02:37:24,142] [INFO] [RANK 0] do_valid ..................... True
[2024-09-03 02:37:24,142] [INFO] [RANK 0] do_test ...................... False
[2024-09-03 02:37:24,142] [INFO] [RANK 0] iteration .................... 0
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:541:forward] Activation Checkpointing Information
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:542:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:543:forward] ----contiguous Memory Checkpointing False with None total layers
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:545:forward] ----Synchronization False
[2024-09-03 02:38:05,330] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-03 02:38:17,054] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-03 02:38:39,582] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-03 02:39:01,823] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-03 02:39:25,772] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-03 02:40:34,024] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-03 02:42:31,555] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-03 02:43:17,654] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
[2024-09-03 02:45:38,242] [INFO] [RANK 0] iteration 20/ 1000 | elapsed time per iteration (ms): 24611.4 | learning rate 5.000E-05 | total loss 2.157213E-01 | loss 2.157214E-01 | loss scale 33554432.0 |speed 4.88 samples/(min*GPU)
[2024-09-03 02:45:38,244] [INFO] [RANK 0] after 20 iterations memory (MB) | allocated: 13974.6455078125 | max allocated: 64453.90478515625 | cached: 22772.0 | max cached: 76136.0
[2024-09-03 02:45:38,244] [INFO] [RANK 0] time (ms) | forward: 15575.33 | backward: 8930.58 | allreduce: 0.00 | optimizer: 101.48 | data loader: 19.31
[2024-09-03 02:53:15,005] [INFO] [RANK 0] iteration 40/ 1000 | elapsed time per iteration (ms): 22838.1 | learning rate 5.000E-05 | total loss 2.180460E-01 | loss 2.180460E-01 | loss scale 33554432.0 |speed 5.25 samples/(min*GPU)
[2024-09-03 02:53:15,006] [INFO] [RANK 0] time (ms) | forward: 13797.73 | backward: 8961.48 | allreduce: 0.00 | optimizer: 74.96 | data loader: 0.37
[2024-09-03 02:54:01,688] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432, reducing to 16777216
[2024-09-03 02:57:02,117] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=8, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-03 03:00:52,438] [INFO] [RANK 0] iteration 60/ 1000 | elapsed time per iteration (ms): 22871.6 | learning rate 5.000E-05 | total loss 2.016617E-01 | loss 2.016617E-01 | loss scale 16777216.0 |speed 5.25 samples/(min*GPU)
[2024-09-03 03:00:52,438] [INFO] [RANK 0] time (ms) | forward: 13888.08 | backward: 8902.19 | allreduce: 0.00 | optimizer: 77.79 | data loader: 0.66
What I'm running is finetune_single_gpu.sh:
export CUDA_VISIBLE_DEVICES=6
echo "RUN on `hostname`, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"
run_cmd="$environs python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
echo ${run_cmd}
eval ${run_cmd}
echo "DONE on `hostname`"
I keep running into the huge loss scale shown below, and the log says the step is being skipped. Is this normal?
[2024-09-03 02:38:17,054] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-03 02:38:39,582] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-03 02:39:01,823] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-03 02:39:25,772] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-03 02:40:34,024] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-03 02:42:31,555] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-03 02:43:17,654] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
That's not normal, and this loss scale isn't normal either. How large is your dataset?
It looks like there are no errors at all; the steps are just being skipped?
Same behavior here on 4xA100 device 👀
same on 8*A800
Did everyone skip all the steps? @kyrie111 @TianxingWu Skipping the first few steps and then continuing with normal training, with the loss decreasing, is normal behavior; the first few steps are skipped because the loss is indeed too large at the start.
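If you want to check whether all of your steps were actually skipped, the periodic step=..., skipped=... lines in the log already tell you. A small helper like the sketch below can summarize them; this is just an illustration, and train.log is a placeholder name assuming you saved the console output to a file.

# count_skipped.py -- summarize skipped fp16 steps from a saved training log.
# "train.log" is a hypothetical file name; pass your own log path as the first argument.
import re
import sys

log_path = sys.argv[1] if len(sys.argv) > 1 else "train.log"

overflow_lines = 0
last_step, last_skipped = 0, 0
with open(log_path, encoding="utf-8") as f:
    for line in f:
        if "OVERFLOW!" in line:            # loss_scaler messages like the ones above
            overflow_lines += 1
        m = re.search(r"step=(\d+), skipped=(\d+)", line)
        if m:                              # DeepSpeed's periodic step summary
            last_step, last_skipped = int(m.group(1)), int(m.group(2))

print(f"OVERFLOW messages: {overflow_lines}")
print(f"latest summary: step={last_step}, skipped={last_skipped}, "
      f"applied={last_step - last_skipped}")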
Same issue on 8 × A100 80G (tried both single-GPU and 8-GPU runs).
I only tried the 2B model.
Same issue on A100 80G. I tried the 2B and 5B versions (fp16 & bf16) and reduced the lr from 1e-3 to 1e-5 (see https://github.com/THUDM/ChatGLM-6B/issues/1008), but got the same error.
It is normal for steps to be skipped when the loss is large at the beginning of training. You will find that only a small number of steps are skipped within the first 50 steps. Once training is stable, it will not happen again.
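For anyone unfamiliar with why those OVERFLOW lines appear: DeepSpeed uses dynamic fp16 loss scaling, and the sketch below is only a rough, illustrative model of that mechanism, not DeepSpeed's actual code. On a gradient overflow the step is skipped and the scale is halved; after loss_scale_window (400 in the config above) consecutive clean steps it is doubled again, which is why only the first few steps get skipped once training stabilizes.

# Illustrative sketch of dynamic fp16 loss scaling (not DeepSpeed's implementation).
class DynamicLossScaler:
    def __init__(self, init_scale=2 ** 32, scale_window=400, min_scale=1.0):
        self.scale = float(init_scale)    # "Attempted loss scale" in the log
        self.scale_window = scale_window  # corresponds to loss_scale_window
        self.min_scale = min_scale
        self.good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Return True if the optimizer step should be applied."""
        if found_overflow:
            # OVERFLOW! Skip this step and halve the scale, as in the log above.
            self.scale = max(self.scale / 2, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.scale_window == 0:
            self.scale *= 2               # grow the scale back once stable
        return True

if __name__ == "__main__":
    scaler = DynamicLossScaler()
    # First few steps overflow because the loss is large, then training stabilizes.
    for step, overflow in enumerate([True] * 7 + [False] * 13, start=1):
        applied = scaler.update(overflow)
        print(f"step {step:2d} | scale {scaler.scale:>12.0f} | "
              f"{'applied' if applied else 'skipped'}")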
Yes, that is right. @tengjiayan20 It recovered after a few training steps:
[2024-09-11 17:52:11,320] [INFO] [checkpointing.py:546:forward] ----Profiling time in checkpointing False
[2024-09-11 17:52:18,030] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648
[2024-09-11 17:52:32,563] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648, reducing to 1073741824
[2024-09-11 17:52:47,082] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824, reducing to 536870912
[2024-09-11 17:53:15,865] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456
[2024-09-11 17:53:58,933] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728
[2024-09-11 17:58:33,295] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728, reducing to 67108864
[2024-09-11 18:00:42,520] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864, reducing to 33554432
[2024-09-11 18:04:05,636] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=7, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-11 18:07:56,739] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432, reducing to 16777216
[2024-09-11 18:16:06,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=8, lr=[5e-05], mom=[[0.9, 0.95]]
[2024-09-11 18:16:06,712] [INFO] [RANK 0] iteration 100/ 10000 | elapsed time per iteration (ms): 14623.5 | learning rate 5.000E-05 | total loss 1.992110E-01 | loss 1.992110E-01 | loss scale 16777216.0 |speed 8.21 samples/(min*GPU)
[2024-09-11 18:16:06,713] [INFO] [RANK 0] after 100 iterations memory (MB) | allocated: 13974.6455078125 | max allocated: 64453.90478515625 | cached: 22772.0 | max cached: 79914.0
[2024-09-11 18:16:06,713] [INFO] [RANK 0] time (ms) | forward: 9524.11 | backward: 5073.59 | allreduce: 0.00 | optimizer: 24.71 | data loader: 67.04
Thanks a lot.
System Info / 系統信息
When I fine-tune CogVideoX-2B, I found that almost all steps are skipped and the loss scale is very large.
Information / 问题信息
Reproduction / 复现过程
just run:
#! /bin/bash
echo "RUN on `hostname`, CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
environs="WORLD_SIZE=1 RANK=0 LOCAL_RANK=0 LOCAL_WORLD_SIZE=1"
run_cmd="$environs python train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
echo ${run_cmd}
eval ${run_cmd}
echo "DONE on `hostname`"
Expected behavior / 期待表现
Is this normal?