Open lijain opened 3 days ago
@zhipuch
Please share your exact runtime environment and the error message.
Hardware: A100 80GB
Environment (pip list):
absl-py 2.0.0
accelerate 1.0.1
aiofiles 23.2.1
aiohttp 3.8.6
aiosignal 1.3.1
albumentations 0.4.3
altair 5.1.2
annotated-types 0.6.0
antlr4-python3-runtime 4.8
anyio 4.3.0
aoss-python-sdk 2.2.6
appdirs 1.4.4
asttokens 2.2.1
async-timeout 4.0.3
attrs 23.1.0
backcall 0.2.0
backports.zoneinfo 0.2.1
bitsandbytes 0.41.1
blinker 1.6.3
boto3 1.28.75
botocore 1.31.75
cachetools 5.3.2
certifi 2022.12.7
cffi 1.16.0
charset-normalizer 2.1.1
clean-fid 0.1.35
click 8.1.7
clip 1.0
clip-anytorch 2.5.2
cmake 3.25.0
coloredlogs 15.0.1
comm 0.1.3
contourpy 1.1.1
cycler 0.12.1
Cython 3.0.9
dctorch 0.1.2
debugpy 1.6.7
decorator 4.4.2
decord 0.6.0
deepspeed 0.9.5
diffusers 0.32.0.dev0
docker-pycreds 0.4.0
easydict 1.13
einops 0.3.0
environs 9.5.0
exceptiongroup 1.2.0
executing 1.2.0
fastapi 0.110.1
ffmpy 0.3.2
filelock 3.9.0
flash-attn 2.3.3
flatbuffers 24.3.7
fonttools 4.49.0
frozenlist 1.4.0
fsspec 2023.10.0
ftfy 6.1.1
future 0.18.3
gitdb 4.0.11
GitPython 3.1.40
google-auth 2.23.4
google-auth-oauthlib 1.0.0
gradio 3.48.0
gradio_client 0.6.1
grpcio 1.59.2
h11 0.14.0
hjson 3.1.0
httpcore 1.0.5
httpx 0.27.0
huggingface-hub 0.26.2
humanfriendly 10.0
humanize 4.8.0
idna 3.4
imageio 2.9.0
imageio-ffmpeg 0.4.2
imgaug 0.2.6
importlib-metadata 6.8.0
importlib-resources 6.1.0
insightface 0.7.3
install 1.3.5
ipykernel 6.24.0
ipython 8.12.2
jax 0.4.13
jaxlib 0.4.13
jedi 0.18.2
Jinja2 3.1.2
jmespath 1.0.1
joblib 1.3.2
jsonmerge 1.9.2
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
jupyter_client 8.3.0
jupyter_core 5.3.1
k-diffusion 0.1.1
kiwisolver 1.4.5
kornia 0.6.0
lazy_loader 0.3
lit 15.0.7
llm-infer 0.0.0
Markdown 3.5.1
markdown-it-py 3.0.0
MarkupSafe 2.1.2
marshmallow 3.20.1
matplotlib 3.7.5
matplotlib-inline 0.1.6
mdurl 0.1.2
mediapipe 0.10.11
ml-dtypes 0.2.0
moviepy 1.0.3
mpmath 1.2.1
multidict 6.0.4
multiprocessing-logging 0.3.4
nest-asyncio 1.5.6
networkx 3.0
ninja 1.11.1
numpy 1.24.1
oauthlib 3.2.2
omegaconf 2.1.1
onnx 1.15.0
onnxruntime-gpu 1.17.1
opencv-contrib-python 4.10.0.84
opencv-python-headless 4.9.0.80
opt-einsum 3.3.0
orjson 3.10.0
packaging 23.1
pandas 2.0.3
parso 0.8.3
pathtools 0.1.2
peft 0.12.0
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.3.0
pip 23.1.2
pkgutil_resolve_name 1.3.10
platformdirs 3.9.1
prettytable 3.10.0
proglog 0.1.10
prompt-toolkit 3.0.39
protobuf 3.20.3
psutil 5.9.5
ptyprocess 0.7.0
pudb 2019.2
pure-eval 0.2.2
py-cpuinfo 9.0.0
pyarrow 13.0.0
pyasn1 0.5.0
pyasn1-modules 0.3.0
pycparser 2.22
pydantic 1.10.11
pydantic_core 2.16.3
pydeck 0.8.1b0
pyDeprecate 0.3.1
pydub 0.25.1
Pygments 2.15.1
pyparsing 3.1.2
python-dateutil 2.8.2
python-dotenv 1.0.0
python-multipart 0.0.9
pytorch-lightning 1.4.2
pytorch-msssim 1.0.0
pytz 2023.3.post1
PyWavelets 1.4.1
PyYAML 6.0.1
pyzmq 25.1.0
referencing 0.30.2
regex 2023.10.3
requests 2.28.1
requests-oauthlib 1.3.1
rich 13.6.0
rpds-py 0.10.6
rsa 4.9
ruff 0.3.5
s3transfer 0.7.0
safetensors 0.4.5
scikit-image 0.20.0
scikit-learn 1.3.2
scipy 1.9.1
semantic-version 2.10.0
sentencepiece 0.1.99
sentry-sdk 1.34.0
setproctitle 1.3.3
setuptools 67.8.0
shellingham 1.5.4
six 1.16.0
smmap 5.0.1
sniffio 1.3.1
sounddevice 0.4.7
stack-data 0.6.2
starlette 0.37.2
streamlit 1.28.0
sympy 1.11.1
taming-transformers 0.0.1
tenacity 8.2.3
tensorboard 2.14.0
tensorboard-data-server 0.7.2
test-tube 0.7.5
threadpoolctl 3.3.0
tifffile 2023.7.10
timm 0.9.8
tokenizers 0.20.3
toml 0.10.2
tomlkit 0.12.0
toolz 0.12.0
torch 2.0.0+cu118
torchaudio 2.0.1+cu118
torchdiffeq 0.2.3
torchmetrics 0.6.0
torchsde 0.2.6
torchvision 0.15.1+cu118
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
trampoline 0.1.2
transformers 4.46.0
triton 2.0.0
typer 0.12.3
typing_extensions 4.11.0
tzdata 2023.3
tzlocal 5.2
urllib3 1.26.13
urwid 2.2.3
uvicorn 0.29.0
validators 0.22.0
wandb 0.15.12
watchdog 3.0.0
wcwidth 0.2.6
websockets 11.0.3
Werkzeug 3.0.1
wheel 0.38.4
yarl 1.9.2
zipp 3.16.2
deepspeed.yaml:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
sh setting:
export TORCH_LOGS="+dynamo,recompiles,graph_breaks"
export TORCHDYNAMO_VERBOSE=1
export WANDB_MODE="offline"
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0

GPU_IDS="0"
LEARNING_RATES="1e-4"
LR_SCHEDULES="cosine_with_restarts"
OPTIMIZERS="adamw"
MAX_TRAIN_STEPS="20000"

ACCELERATE_CONFIG_FILE="accelerate_configs/deepspeed.yaml"
DATA_ROOT="/dataset/gen_gif/Dance-VideoGeneration-Dataset"
CAPTION_COLUMN="captions.txt"
VIDEO_COLUMN="videos.txt"
output_dir="/nProject/cogvideox-factory/log/cogvideox_sftoptimizer${optimizer}steps${steps}lr-schedule${lr_schedule}_lr${learning_rate}/"

accelerate launch --config_file $ACCELERATE_CONFIG_FILE --gpu_ids $GPU_IDS training/cogvideox_image_to_video_sft2.py \
  --pretrained_model_name_or_path /nProject/zpretrain_ckpt/CogVideoX1.5-5B-I2V \
  --data_root $DATA_ROOT \
  --caption_column $CAPTION_COLUMN \
  --video_column $VIDEO_COLUMN \
  --height_buckets 480 \
  --width_buckets 720 \
  --frame_buckets 53 \
  --dataloader_num_workers 8 \
  --pin_memory \
  --num_validation_videos 1 \
  --validation_epochs 1 \
  --seed 42 \
  --mixed_precision bf16 \
  --output_dir $output_dir \
  --max_num_frames 53 \
  --train_batch_size 1 \
  --max_train_steps $MAX_TRAIN_STEPS \
  --checkpointing_steps 2000 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --learning_rate $LEARNING_RATES \
  --lr_scheduler $LR_SCHEDULES \
  --lr_warmup_steps 800 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --optimizer $OPTIMIZERS \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 0.001 \
  --max_grad_norm 1.0 \
  --allow_tf32 \
  --nccl_timeout 1800
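(Note: output_dir above references lowercase variables ${optimizer}, ${steps}, ${lr_schedule} and ${learning_rate} that are not set in this snippet; presumably the full script loops over the uppercase lists. A hypothetical sketch of that loop structure, for readability only:

for learning_rate in $LEARNING_RATES; do
  for lr_schedule in $LR_SCHEDULES; do
    for optimizer in $OPTIMIZERS; do
      steps=$MAX_TRAIN_STEPS
      output_dir="/nProject/cogvideox-factory/log/cogvideox_sftoptimizer${optimizer}steps${steps}lr-schedule${lr_schedule}_lr${learning_rate}/"
      # ... run the accelerate launch command shown above ...
    done
  done
done
)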
Thanks for the reply. Single-GPU training runs fine now, but it is very slow:

Epoch 0, global step 1, data_time: 2.645, model_time: 16.966, step_loss: 0.36216
Epoch 0, global step 2, data_time: 1.029, model_time: 14.720, step_loss: 0.16998
Epoch 0, global step 3, data_time: 0.975, model_time: 14.779, step_loss: 0.11621
Epoch 0, global step 4, data_time: 0.986, model_time: 14.710, step_loss: 0.14389
Epoch 0, global step 5, data_time: 1.085, model_time: 14.789, step_loss: 0.07038
Epoch 0, global step 6, data_time: 1.279, model_time: 15.023, step_loss: 0.06519
Epoch 0, global step 7, data_time: 0.963, model_time: 14.690, step_loss: 0.08907
Epoch 0, global step 8, data_time: 1.067, model_time: 14.795, step_loss: 0.06434
Epoch 0, global step 9, data_time: 0.977, model_time: 14.616, step_loss: 0.09751
Epoch 0, global step 10, data_time: 1.059, model_time: 14.692, step_loss: 0.07744
Epoch 0, global step 11, data_time: 1.269, model_time: 14.896, step_loss: 0.04926
Epoch 0, global step 12, data_time: 0.877, model_time: 14.500, step_loss: 0.16864
Epoch 0, global step 13, data_time: 1.073, model_time: 14.702, step_loss: 0.15544
Epoch 0, global step 14, data_time: 0.968, model_time: 14.605, step_loss: 0.09388
Epoch 0, global step 15, data_time: 1.164, model_time: 14.892, step_loss: 0.06555
Epoch 0, global step 16, data_time: 0.967, model_time: 14.607, step_loss: 0.06692
Epoch 0, global step 17, data_time: 0.878, model_time: 14.595, step_loss: 0.06572
Epoch 0, global step 18, data_time: 1.072, model_time: 14.699, step_loss: 0.11654
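For scale, using the numbers in the log above: at roughly 1.0 s data_time plus ~14.7 s model_time per step, the configured 20000 steps work out to about 20000 × 15.7 s ≈ 314,000 s ≈ 87 hours of training.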
OK. For multi-GPU training DeepSpeed is recommended; for single-GPU training it is not recommended.
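For reference, a multi-GPU variant of the deepspeed.yaml shown above could look roughly like the sketch below; the single-machine, 8-GPU numbers are placeholder assumptions and must be adapted to the actual hardware:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  # cpu offload trades step speed for memory; with 80 GB GPUs it may be feasible to set these to none
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8  # one process per GPU; adjust to the number of GPUs actually used
rdzv_backend: static
same_network: true
use_cpu: false

The launch command would then pass the matching device ids, e.g. --gpu_ids 0,1,2,3,4,5,6,7.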
We have 2 machines with 16 GPUs. We used the DeepSpeed framework before as well, but it was never this slow, so may I ask what the problem might be? Do you have single-GPU or multi-GPU training logs on your side? Could you share a screenshot showing your model_time?
Hello, a question: with the # Single GPU setting in train_text_to_video_sft.sh, ACCELERATE_CONFIG_FILE="accelerate_configs/uncompiled_1.yaml", training works fine. When I switch uncompiled_1.yaml --> deepspeed.yaml to use DeepSpeed, I get an error (screenshot: 企业微信截图_17325422283473). How should the setup be changed to use DeepSpeed?