Closed BruceLeeeee closed 4 months ago
Hi Bruce,
Could you share the full error message you get when running our code? @BruceLeeeee
TypeError: DiT_Llama.__init__() got an unexpected keyword argument 'max_seq_len'
Okay, we will check the training code again to fix this problem.
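Until the training code is fixed, one possible workaround is to drop keyword arguments that a constructor does not accept before calling it. The sketch below is only an illustration: `filter_kwargs` and `TinyModel` are hypothetical names, not part of Lumina-T2X, and the stand-in class merely mimics a model whose `__init__` has no `max_seq_len` parameter.

```python
import inspect


def filter_kwargs(callable_obj, kwargs):
    """Return only the entries of kwargs that callable_obj accepts by name."""
    params = inspect.signature(callable_obj).parameters
    # If the callable itself takes **kwargs, every keyword is accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}


class TinyModel:
    """Hypothetical stand-in for a model class without a max_seq_len parameter."""

    def __init__(self, qk_norm=False, rope_scaling_factor=1.0):
        self.qk_norm = qk_norm
        self.rope_scaling_factor = rope_scaling_factor


requested = {"qk_norm": True, "rope_scaling_factor": 1.0, "max_seq_len": 4224}
accepted = filter_kwargs(TinyModel, requested)
model = TinyModel(**accepted)  # no TypeError: max_seq_len was dropped
```

Silently dropping arguments can hide real configuration mistakes, so logging the discarded keys is usually a good idea.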
Hi @PommesPeter, I tried to fine-tune the model, but no matter how I set the image size, batch size, or precision (after removing the 'max_seq_len' argument), I get a CUDA out-of-memory error. I am running on a single 80 GB A100.
Hi Bruce,
We are checking the training/tuning code. Since China is on holiday now, we will fix all bugs after the holiday.
Could you share more details, such as the full error traceback, the command you ran, and the dataset you used? We will then fix the problem.
Traceback (most recent call last):
  File "/mnt/new_sfs_turbo/lsh2/train_projects/Lumina-T2X/lumina_t2i/train.py", line 753, in <module>
    main(args)
  File "/mnt/new_sfs_turbo/lsh2/train_projects/Lumina-T2X/lumina_t2i/train.py", line 514, in main
    loss_dict = transport.training_losses(model, x_mb, model_kwargs)
  File "/mnt/new_sfs_turbo/lsh2/train_projects/Lumina-T2X/lumina_t2i/transport/transport.py", line 148, in training_losses
    model_output = model(xt, t, **model_kwargs)
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/new_sfs_turbo/lsh2/train_projects/Lumina-T2X/lumina_t2i/models/model.py", line 751, in forward
    x = layer(
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
    output = self._fsdp_wrapped_module(*args, **kwargs)
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/new_sfs_turbo/lsh2/train_projects/Lumina-T2X/lumina_t2i/models/model.py", line 555, in forward
    modulate(self.ffn_norm(x), shift_mlp, scale_mlp),
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/Lumina_T2X/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/new_sfs_turbo/lsh2/train_projects/Lumina-T2X/lumina_t2i/models/components.py", line 53, in forward
    output = self._norm(x.float()).type_as(x)
  File "/mnt/new_sfs_turbo/lsh2/train_projects/Lumina-T2X/lumina_t2i/models/components.py", line 40, in _norm
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacty of 79.35 GiB of which 32.19 MiB is free. Process 30630 has 79.31 GiB memory in use. Of the allocated memory 76.58 GiB is allocated by PyTorch, and 848.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Training arguments:
{
  "results_dir": "checkpoints_Lumina-T2I/train_run_0",
  "model": "DiT_Llama_5B_patch2",
  "image_size": 1024,
  "max_steps": 1000000,
  "global_batch_size": 1,
  "micro_batch_size": 1,
  "global_seed": 0,
  "vae": "sdxl",
  "num_workers": 2,
  "log_every": 1000,
  "ckpt_every": 1000,
  "master_port": 8964,
  "model_parallel_size": 1,
  "data_parallel": "fsdp",
  "precision": "fp16",
  "grad_precision": "fp16",
  "local_diffusers_model_root": "sdxl-vae-fp16-fix",
  "lr": 0.0001,
  "auto_resume": true,
  "resume": null,
  "init_from": "Lumina-T2I",
  "grad_clip": 2.0,
  "wd": 0.0,
  "qk_norm": true,
  "tokenizer_path": "Llama-2-7b-hf",
  "lm": "Llama-2-7b-hf",
  "caption_dropout_prob": 0.1,
  "max_text_tokens": 128,
  "rope_scaling_factor": 1.0,
  "snr_type": "uniform",
  "max_seq_len": 4224 (unused)
}

Package versions:
torch 2.1.0+cu118
transformers 4.40.1
diffusers 0.27.2
flash-attn 2.3.1.post1
accelerate 0.30.0
@PommesPeter
@BruceLeeeee We use precision: bf16 and grad_precision: fp32. You can try these hyperparameters, and we will check whether the model.py code is correct.
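As a small illustration of that suggestion, the sketch below applies the two precision settings to a subset of the training arguments reported earlier in this thread. The variable names are hypothetical, and how train.py actually consumes these values is not shown in the thread, so treat this only as a way to see which keys would change.

```python
# Subset of the training arguments reported in this thread.
reported_args = {
    "model": "DiT_Llama_5B_patch2",
    "data_parallel": "fsdp",
    "precision": "fp16",
    "grad_precision": "fp16",
}

# Values the maintainers say they use.
suggested = {"precision": "bf16", "grad_precision": "fp32"}

# Later keys win, so the suggested values override the reported ones.
updated_args = {**reported_args, **suggested}

# Record (old, new) for every key whose value actually changes.
changed = {
    key: (reported_args[key], new_value)
    for key, new_value in suggested.items()
    if reported_args[key] != new_value
}
print(changed)
```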
We will try to reproduce the problem with your environment settings.
Hi @PommesPeter, can you provide the training arguments used for the DiT_Llama_5B_patch2 checkpoint?
Training uses FSDP, which shards the parameters and gradients, so per-GPU memory usage goes down as the number of GPUs goes up. It will not run on a single A100.
FSDP needs multi-GPU training to reduce per-GPU memory usage (for DiT_Llama_5B_patch2).
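The sharding argument above can be made concrete with a rough back-of-the-envelope estimate. The numbers below are assumptions, not measurements: the parameter count is taken from the "5B" in the model name, and 16 bytes per parameter assumes fp32 parameters, fp32 gradients, and two fp32 Adam-style optimizer moments. Activations, the text encoder, and the VAE are not included.

```python
GIB = 1024 ** 3


def sharded_state_gib(n_params, n_gpus, bytes_per_param=16):
    """Per-GPU memory (GiB) for parameters, gradients and optimizer state
    when they are sharded evenly across n_gpus.

    bytes_per_param=16 assumes fp32 parameters (4) + fp32 gradients (4)
    + two fp32 Adam moments (8). This is an assumption, not a measured value.
    """
    return n_params * bytes_per_param / n_gpus / GIB


N_PARAMS = 5e9  # assumed from the "5B" in DiT_Llama_5B_patch2

for n_gpus in (1, 2, 4, 8):
    print(f"{n_gpus} GPU(s): ~{sharded_state_gib(N_PARAMS, n_gpus):.1f} GiB per GPU")
```

Under these assumptions, a single GPU would need roughly 74.5 GiB for model state alone, which is already close to the 79.35 GiB capacity reported in the traceback, while eight GPUs would need about 9.3 GiB each.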