Open radna0 opened 2 months ago
@yunkchen What settings did you use to train at 960x960? I am training with the script below, and it runs out of memory at a video length of 144 frames. I have also changed the buckets for 960 model training. I'm running the script on 80 GB A100s.
export MODEL_NAME="models/Diffusion_Transformer/EasyAnimateV3-XL-2-InP-960x960"
export DATASET_NAME="datasets/0/"
export DATASET_META_NAME="datasets/0/train_anime.json"
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export NCCL_DEBUG=INFO
# When training across multiple machines, use "--config_file accelerate.yaml" instead of "--mixed_precision='bf16'".
accelerate launch --mixed_precision="bf16" scripts/train_960.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--train_data_dir=$DATASET_NAME \
--train_data_meta=$DATASET_META_NAME \
--config_path "config/easyanimate_video_slicevae_motion_module_v3.yaml" \
--image_sample_size=960 \
--video_sample_size=960 \
--video_sample_stride=1 \
--video_sample_n_frames=72 \
--train_batch_size=1 \
--video_repeat=1 \
--gradient_accumulation_steps=1 \
--dataloader_num_workers=16 \
--num_train_epochs=1 \
--checkpointing_steps=500 \
--learning_rate=2e-05 \
--lr_scheduler="constant_with_warmup" \
--lr_warmup_steps=100 \
--seed=42 \
--low_vram \
--output_dir="output_dir" \
--enable_xformers_memory_efficient_attention \
--gradient_checkpointing \
--mixed_precision="bf16" \
--adam_weight_decay=3e-2 \
--adam_epsilon=1e-10 \
--max_grad_norm=1 \
--vae_mini_batch=1 \
--random_frame_crop \
--enable_bucket \
--train_mode="inpaint" \
--trainable_modules "transformer_blocks" "proj_out" "pos_embed" "long_connect_fc"
ASPECT_RATIO_960 = {
"0.25": [480.0, 1920.0],
"0.26": [480.0, 1862.0],
"0.27": [480.0, 1800.0],
"0.28": [480.0, 1738.0],
"0.32": [540.0, 1688.0],
"0.33": [540.0, 1620.0],
"0.35": [540.0, 1560.0],
"0.4": [600.0, 1500.0],
"0.42": [600.0, 1440.0],
"0.48": [660.0, 1380.0],
"0.5": [660.0, 1320.0],
"0.52": [660.0, 1260.0],
"0.57": [720.0, 1260.0],
"0.6": [720.0, 1200.0],
"0.68": [784.0, 1152.0],
"0.72": [784.0, 1082.0],
"0.78": [842.0, 1082.0],
"0.82": [842.0, 1020.0],
"0.88": [900.0, 1020.0],
"0.94": [900.0, 960.0],
"1.0": [960.0, 960.0],
"1.07": [960.0, 900.0],
"1.13": [1020.0, 900.0],
"1.21": [1020.0, 842.0],
"1.29": [1082.0, 842.0],
"1.38": [1082.0, 784.0],
"1.46": [1144.0, 784.0],
"1.67": [1200.0, 720.0],
"1.75": [1260.0, 720.0],
"2.0": [1320.0, 660.0],
"2.09": [1380.0, 660.0],
"2.4": [1440.0, 600.0],
"2.5": [1500.0, 600.0],
"2.89": [1560.0, 540.0],
"3.0": [1620.0, 540.0],
"3.11": [1688.0, 540.0],
"3.62": [1738.0, 480.0],
"3.75": [1800.0, 480.0],
"3.88": [1862.0, 480.0],
"4.0": [1920.0, 480.0],
}
ASPECT_RATIO_RANDOM_CROP_960 = {
"0.42": [600.0, 1440.0],
"0.5": [660.0, 1320.0],
"0.57": [720.0, 1260.0],
"0.68": [784.0, 1152.0],
"0.78": [842.0, 1082.0],
"0.88": [900.0, 1020.0],
"0.94": [900.0, 960.0],
"1.0": [960.0, 960.0],
"1.07": [960.0, 900.0],
"1.13": [1020.0, 900.0],
"1.29": [1082.0, 842.0],
"1.46": [1144.0, 784.0],
"1.75": [1260.0, 720.0],
"2.0": [1320.0, 660.0],
"2.4": [1440.0, 600.0],
}
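One quick sanity check on bucket tables like the two above is sketched below. It assumes each value is [height, width], that the key is height/width rounded to two decimals, and that both sides should be divisible by 16 (for example an 8x VAE downsample followed by a patch size of 2). None of these assumptions is confirmed in this thread, and `check_buckets` is a hypothetical helper, not part of EasyAnimate.

```python
def check_buckets(buckets, multiple=16, tol=0.01):
    """Return (key, field, value) tuples for entries that look inconsistent.

    Assumptions (not confirmed by the thread): values are [height, width],
    keys are height/width, and sizes should be divisible by `multiple`.
    """
    problems = []
    for key, (height, width) in buckets.items():
        ratio = height / width
        # The key should match the actual ratio within a small tolerance.
        if abs(ratio - float(key)) > tol:
            problems.append((key, "ratio", round(ratio, 2)))
        # Each side should be an exact multiple of the assumed stride.
        for name, size in (("height", height), ("width", width)):
            if int(size) % multiple != 0:
                problems.append((key, name, int(size)))
    return problems


# A three-entry sample copied from the table above.
sample = {
    "1.0": [960.0, 960.0],
    "1.29": [1082.0, 842.0],
    "1.46": [1144.0, 784.0],
}
problems = check_buckets(sample)
for problem in problems:
    print(problem)
```

Running it over the full table, with whatever multiple the model actually requires, would show which buckets, if any, need rounding.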
You need to use the low_vram mode. DeepSpeed support will be added in the next version to reduce GPU memory usage.
@bubbliiiing Yes, I have added the --low_vram option in my script above, but it still can't train at 960x960 with 144 frames.
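A rough way to see why the frame count matters is to count the tokens the transformer has to process per sample. The sketch below is only an estimate: the compression factors (8x spatial, 4x temporal) and the patch size of 2 are assumptions for illustration, not values confirmed in this thread, and `latent_tokens` is a hypothetical helper.

```python
import math


def latent_tokens(frames, height, width,
                  spatial_factor=8, temporal_factor=4, patch_size=2):
    """Estimate the transformer sequence length for one video sample.

    All three factors are assumed defaults, not confirmed model settings.
    """
    latent_frames = math.ceil(frames / temporal_factor)
    latent_h = height // spatial_factor
    latent_w = width // spatial_factor
    # Each patch_size x patch_size latent patch becomes one token.
    return latent_frames * (latent_h // patch_size) * (latent_w // patch_size)


tokens_72 = latent_tokens(72, 960, 960)
tokens_144 = latent_tokens(144, 960, 960)
print(tokens_72, tokens_144, tokens_144 / tokens_72)
```

Under these assumptions, going from 72 to 144 frames doubles the token count, so activation memory grows at least in proportion, even with gradient checkpointing and memory-efficient attention enabled.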
@bubbliiiing Could you also implement model parallelism with DeepSpeed? I believe it has good support for that. It would be really nice to be able to run inference across many smaller, memory-bound GPUs.
Let me see why
There is a small difference. We have code that offloads transformer3d to the CPU, but I forgot to upload it. We will soon upload training code based on DeepSpeed, which will work better.
Training fails? @bubbliiiing
Steps: 100%|██████████| 315/315 [5:58:58<00:00, 68.13s/it, lr=2e-5, step_loss=0.000223]
The config attributes {'slice_compression_vae': True, 'use_tiling': True, 'mid_block_attention_type': '3d', 'mini_batch_encoder': 8, 'mini_batch_decoder': 2} were passed to AutoencoderKL, but are not expected and will be ignored. Please verify your config.json configuration file.
{'latents_std', 'latents_mean'} was not found in config. Values will be initialized to default values.
Loading pipeline components...: 0%| | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/EasyAnimate/scripts/train_960.py", line 2437, in <module>
main()
File "/home/EasyAnimate/scripts/train_960.py", line 2426, in main
pipeline = EasyAnimatePipeline.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/pipeline_utils.py", line 881, in from_pretrained
loaded_sub_model = load_sub_model(
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/pipeline_loading_utils.py", line 703, in load_sub_model
loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/modeling_utils.py", line 632, in from_pretrained
model = cls.from_config(config, **unused_kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/configuration_utils.py", line 260, in from_config
model = cls(**init_dict)
File "/usr/local/lib/python3.10/dist-packages/diffusers/configuration_utils.py", line 658, in inner_init
init(self, *args, **init_kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 91, in __init__
self.encoder = Encoder(
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/autoencoders/vae.py", line 103, in __init__
down_block = get_down_block(
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_2d_blocks.py", line 249, in get_down_block
raise ValueError(f"{down_block_type} does not exist.")
ValueError: SpatialDownBlock3D does not exist.
Steps: 100%|██████████| 315/315 [5:59:01<00:00, 68.38s/it, lr=2e-5, step_loss=0.000223]
[2024-08-01 20:26:40,156] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 4376) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1066, in launch_command
multi_gpu_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
distrib_run.run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train_960.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-01_20:26:40
host : dpm4
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 4376)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
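The log above shows the stock diffusers `AutoencoderKL` being built from a config that names `SpatialDownBlock3D`, a block type that `get_down_block` in `unet_2d_blocks.py` does not recognise. A minimal sketch for spotting that kind of mismatch before a long run is to scan the VAE config for block types that do not look like 2D blocks. The `2D` suffix test is a heuristic, `non_2d_block_types` is a hypothetical helper, and the sample config is made up for illustration (only `SpatialDownBlock3D` comes from the traceback).

```python
def non_2d_block_types(vae_config):
    """Return block type names in a VAE config that do not end in '2D'.

    Heuristic only: names that do not end in '2D' are treated as custom
    blocks that the stock 2D autoencoder is unlikely to know about.
    """
    names = []
    for key in ("down_block_types", "up_block_types"):
        names.extend(vae_config.get(key, []))
    return sorted({name for name in names if not name.endswith("2D")})


# Hypothetical config for illustration, not the real EasyAnimate VAE config.
sample_config = {
    "down_block_types": ["SpatialDownBlock3D", "DownEncoderBlock2D"],
    "up_block_types": ["UpDecoderBlock2D"],
}
suspicious = non_2d_block_types(sample_config)
print(suspicious)
```

A non-empty result suggests the VAE needs to be loaded with the project's own autoencoder class rather than the stock one. Note that the failure here happens after step 315/315, when the script assembles the final pipeline, not during the training steps themselves.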
Here is what I have for the 960x960 training script. Do you need to change video_sample_size and image_sample_size to 960 to train the 960x960 model?