aigc-apps / EasyAnimate

📺 An End-to-End Solution for High-Resolution and Long Video Generation Based on Transformer Diffusion
Apache License 2.0

960x960 Model's Training Script #85

Open radna0 opened 2 months ago

radna0 commented 2 months ago

Here is what I have for the 960x960 training script. Do you need to change video_sample_size and image_sample_size to 960 to train the 960x960 model?

export MODEL_NAME="models/Diffusion_Transformer/EasyAnimateV3-XL-2-InP-960x960"
export DATASET_NAME="datasets/0/"
export DATASET_META_NAME="datasets/0/train_anime.json"
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO

# When training on multiple machines, use "--config_file accelerate.yaml" instead of "--mixed_precision='bf16'".
accelerate launch --mixed_precision="bf16" scripts/train.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATASET_NAME \
  --train_data_meta=$DATASET_META_NAME \
  --config_path "config/easyanimate_video_slicevae_motion_module_v3.yaml" \
  --image_sample_size=512 \
  --video_sample_size=512 \
  --video_sample_stride=1 \
  --video_sample_n_frames=72 \
  --train_batch_size=1 \
  --video_repeat=1 \
  --gradient_accumulation_steps=1 \
  --dataloader_num_workers=8 \
  --num_train_epochs=100 \
  --checkpointing_steps=500 \
  --learning_rate=2e-05 \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=100 \
  --seed=42 \
  --output_dir="output_dir" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --mixed_precision="bf16" \
  --adam_weight_decay=3e-2 \
  --adam_epsilon=1e-10 \
  --max_grad_norm=1 \
  --vae_mini_batch=1 \
  --random_frame_crop \
  --enable_bucket \
  --train_mode="inpaint" \
  --trainable_modules "transformer_blocks" "proj_out" "pos_embed" "long_connect_fc"

yunkchen commented 2 months ago

Yes, set both to 960.

radna0 commented 2 months ago

@yunkchen What settings did you use to train the 960x960 model? I am training with the script below, and it runs out of memory at a video length of 144 frames. I have also changed the aspect-ratio buckets for 960 training (listed after the script). I'm running the script on 80 GB A100s.

export MODEL_NAME="models/Diffusion_Transformer/EasyAnimateV3-XL-2-InP-960x960"
export DATASET_NAME="datasets/0/"
export DATASET_META_NAME="datasets/0/train_anime.json"
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO

# When training on multiple machines, use "--config_file accelerate.yaml" instead of "--mixed_precision='bf16'".
accelerate launch --mixed_precision="bf16" scripts/train_960.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATASET_NAME \
  --train_data_meta=$DATASET_META_NAME \
  --config_path "config/easyanimate_video_slicevae_motion_module_v3.yaml" \
  --image_sample_size=960 \
  --video_sample_size=960 \
  --video_sample_stride=1 \
  --video_sample_n_frames=72 \
  --train_batch_size=1 \
  --video_repeat=1 \
  --gradient_accumulation_steps=1 \
  --dataloader_num_workers=16 \
  --num_train_epochs=1 \
  --checkpointing_steps=500 \
  --learning_rate=2e-05 \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=100 \
  --seed=42 \
  --low_vram \
  --output_dir="output_dir" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --mixed_precision="bf16" \
  --adam_weight_decay=3e-2 \
  --adam_epsilon=1e-10 \
  --max_grad_norm=1 \
  --vae_mini_batch=1 \
  --random_frame_crop \
  --enable_bucket \
  --train_mode="inpaint" \
  --trainable_modules "transformer_blocks" "proj_out" "pos_embed" "long_connect_fc"

ASPECT_RATIO_960 = {
    "0.25": [480.0, 1920.0],
    "0.26": [480.0, 1862.0],
    "0.27": [480.0, 1800.0],
    "0.28": [480.0, 1738.0],
    "0.32": [540.0, 1688.0],
    "0.33": [540.0, 1620.0],
    "0.35": [540.0, 1560.0],
    "0.4": [600.0, 1500.0],
    "0.42": [600.0, 1440.0],
    "0.48": [660.0, 1380.0],
    "0.5": [660.0, 1320.0],
    "0.52": [660.0, 1260.0],
    "0.57": [720.0, 1260.0],
    "0.6": [720.0, 1200.0],
    "0.68": [784.0, 1152.0],
    "0.72": [784.0, 1082.0],
    "0.78": [842.0, 1082.0],
    "0.82": [842.0, 1020.0],
    "0.88": [900.0, 1020.0],
    "0.94": [900.0, 960.0],
    "1.0": [960.0, 960.0],
    "1.07": [960.0, 900.0],
    "1.13": [1020.0, 900.0],
    "1.21": [1020.0, 842.0],
    "1.29": [1082.0, 842.0],
    "1.38": [1082.0, 784.0],
    "1.46": [1144.0, 784.0],
    "1.67": [1200.0, 720.0],
    "1.75": [1260.0, 720.0],
    "2.0": [1320.0, 660.0],
    "2.09": [1380.0, 660.0],
    "2.4": [1440.0, 600.0],
    "2.5": [1500.0, 600.0],
    "2.89": [1560.0, 540.0],
    "3.0": [1620.0, 540.0],
    "3.11": [1688.0, 540.0],
    "3.62": [1738.0, 480.0],
    "3.75": [1800.0, 480.0],
    "3.88": [1862.0, 480.0],
    "4.0": [1920.0, 480.0],
}

ASPECT_RATIO_RANDOM_CROP_960 = {
    "0.42": [600.0, 1440.0],
    "0.5": [660.0, 1320.0],
    "0.57": [720.0, 1260.0],
    "0.68": [784.0, 1152.0],
    "0.78": [842.0, 1082.0],
    "0.88": [900.0, 1020.0],
    "0.94": [900.0, 960.0],
    "1.0": [960.0, 960.0],
    "1.07": [960.0, 900.0],
    "1.13": [1020.0, 900.0],
    "1.29": [1082.0, 842.0],
    "1.46": [1144.0, 784.0],
    "1.75": [1260.0, 720.0],
    "2.0": [1320.0, 660.0],
    "2.4": [1440.0, 600.0],
}
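
A quick way to sanity-check hand-edited buckets like the ones above is to confirm that each key matches the ratio of its two values. The sketch below assumes each value is `[height, width]` and each key is `height / width` rounded to two decimals, which is what the entries above suggest. `check_buckets`, `pixel_area_range`, and the 0.02 tolerance are hypothetical helpers for illustration, not part of EasyAnimate, and `SAMPLE_BUCKETS` copies only a few entries from `ASPECT_RATIO_960`.

```python
# Sketch only: verify bucket keys against height / width (assumed convention).
SAMPLE_BUCKETS = {
    "0.25": [480.0, 1920.0],
    "0.68": [784.0, 1152.0],
    "1.0": [960.0, 960.0],
    "1.46": [1144.0, 784.0],
    "4.0": [1920.0, 480.0],
}


def check_buckets(buckets, tolerance=0.02):
    """Return (key, height, width, actual_ratio) for keys that are off by more than tolerance."""
    mismatches = []
    for key, (height, width) in buckets.items():
        actual = height / width
        if abs(actual - float(key)) > tolerance:
            mismatches.append((key, height, width, round(actual, 3)))
    return mismatches


def pixel_area_range(buckets):
    """Return (smallest, largest) height * width over all buckets."""
    areas = [height * width for height, width in buckets.values()]
    return min(areas), max(areas)


if __name__ == "__main__":
    print("mismatched keys:", check_buckets(SAMPLE_BUCKETS))
    print("min/max pixel area:", pixel_area_range(SAMPLE_BUCKETS))
```

Passing the full `ASPECT_RATIO_960` dict instead of the sample runs the same check on every bucket. The area range shows how far individual buckets drift from the 960x960 pixel budget.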

bubbliiiing commented 2 months ago

You need to use the low_vram mode. DeepSpeed support will be added in the next version to reduce GPU memory usage.

radna0 commented 2 months ago

@bubbliiiing Yes, I have added the --low_vram option to my script above, but it still can't train at 960x960 with 144 frames.
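
For a rough sense of scale, the raw input volume (height x width x frames) of the failing configuration can be compared with the 512x512, 72-frame settings from the first script in this thread. This is only back-of-the-envelope arithmetic: actual GPU memory depends on the VAE compression, the attention implementation, and gradient checkpointing, none of which are modeled here. `input_volume` is a hypothetical helper.

```python
# Sketch only: compare raw input sizes, not actual memory usage.
def input_volume(height, width, frames):
    """Number of pixel positions in one training clip."""
    return height * width * frames


baseline = input_volume(512, 512, 72)     # settings from the first script
same_frames = input_volume(960, 960, 72)  # 960x960 at 72 frames
target = input_volume(960, 960, 144)      # configuration reported to run out of memory

if __name__ == "__main__":
    print("960x960x72  vs 512x512x72:", same_frames / baseline)
    print("960x960x144 vs 512x512x72:", target / baseline)
```

Under this crude measure, the 144-frame 960x960 clip is about seven times larger than the 512x512 baseline, so running out of memory on a setup that handles the baseline is not surprising.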

radna0 commented 2 months ago

@bubbliiiing Could you also implement model parallelism with DeepSpeed? I believe it has good support for that. It would be really nice to still be able to run inference on many smaller, memory-constrained GPUs.

bubbliiiing commented 2 months ago

> @bubbliiiing Yes, I have added the --low_vram option in my script above. But it still can't train on 960x960 144 frames

Let me look into why.

bubbliiiing commented 2 months ago

> @bubbliiiing Could you also implement model parallelism with DeepSpeed? I believe it has good support for that. It would be great to still be able to run inference on many smaller, memory-constrained GPUs.

There is a small difference. We have code that offloads transformer3d to the CPU, but I forgot to upload it. We will soon upload code that supports training with DeepSpeed, which will be better.

radna0 commented 2 months ago

Training fails? @bubbliiiing

Steps: 100%|██████████████████████████████████████████████████████████| 315/315 [5:58:58<00:00, 68.13s/it, lr=2e-5, step_loss=0.000223
The config attributes {'slice_compression_vae': True, 'use_tiling': True, 'mid_block_attention_type': '3d', 'mini_batch_encoder': 8, 'mini_batch_decoder': 2} were passed to AutoencoderKL, but are not expected and will be ignored. Please verify your config.json configuration file.
{'latents_std', 'latents_mean'} was not found in config. Values will be initialized to default values.
Loading pipeline components...:   0%|                                                                            | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/EasyAnimate/scripts/train_960.py", line 2437, in <module>
    main()
  File "/home/EasyAnimate/scripts/train_960.py", line 2426, in main
    pipeline = EasyAnimatePipeline.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/pipeline_utils.py", line 881, in from_pretrained
    loaded_sub_model = load_sub_model(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/pipeline_loading_utils.py", line 703, in load_sub_model
    loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/modeling_utils.py", line 632, in from_pretrained
    model = cls.from_config(config, **unused_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/configuration_utils.py", line 260, in from_config
    model = cls(**init_dict)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/configuration_utils.py", line 658, in inner_init
    init(self, *args, **init_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 91, in __init__
    self.encoder = Encoder(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoders/vae.py", line 103, in __init__
    down_block = get_down_block(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_2d_blocks.py", line 249, in get_down_block
    raise ValueError(f"{down_block_type} does not exist.")
ValueError: SpatialDownBlock3D does not exist.
Steps: 100%|██████████████████████████████████████████████████████████| 315/315 [5:59:01<00:00, 68.38s/it, lr=2e-5, step_loss=0.000223]
[2024-08-01 20:26:40,156] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4376) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1066, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train_960.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-01_20:26:40
  host      : dpm4
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4376)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
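
The traceback shows `EasyAnimatePipeline.from_pretrained` building the VAE with the stock diffusers `AutoencoderKL`, which then rejects the `SpatialDownBlock3D` block type. One thing worth checking (an assumption, since the thread does not confirm the cause) is which class the checkpoint's `model_index.json` declares for each component. In the diffusers layout, that file maps a component name to a `[library, class_name]` pair. The sketch below just prints that mapping; `component_classes` is a hypothetical helper, and the default path is the `MODEL_NAME` from the script above.

```python
# Sketch only: list which class each pipeline component is declared to load as.
import json
import sys
from pathlib import Path


def component_classes(model_dir):
    """Map component name -> (library, class_name) from model_index.json."""
    index = json.loads((Path(model_dir) / "model_index.json").read_text())
    return {
        name: tuple(value)
        for name, value in index.items()
        # Keys starting with "_" are metadata such as _class_name.
        if not name.startswith("_") and isinstance(value, list) and len(value) == 2
    }


if __name__ == "__main__":
    default_dir = "models/Diffusion_Transformer/EasyAnimateV3-XL-2-InP-960x960"
    model_dir = Path(sys.argv[1] if len(sys.argv) > 1 else default_dir)
    if (model_dir / "model_index.json").exists():
        for name, (library, class_name) in component_classes(model_dir).items():
            print(f"{name}: {library}.{class_name}")
    else:
        print(f"no model_index.json found in {model_dir}")
```

If the `vae` entry points at diffusers' `AutoencoderKL`, a possible workaround is to pass an already-loaded VAE instance to `from_pretrained` as a keyword argument, so that component is not reloaded from disk. Whether that fits `train_960.py` depends on how the script builds the pipeline.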