VectorSpaceLab / OmniGen

OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340
MIT License

Full finetune doesn't work #143

Open megatomik opened 5 days ago

megatomik commented 5 days ago

I'm trying to reproduce training using the configs in the readme as-is and the toy dataset. So far I can train a LoRA (though the loss doesn't seem to be going anywhere even after 10 epochs, but that might be a separate issue). However, launching a full finetune fails right before the epoch should start:

...
[2024-11-23 19:45:31] Downloaded model to /root/.cache/huggingface/hub/models--Shitao--OmniGen-v1/snapshots/58e249c7c7634423c0ba41c34a774af79aa87889
Loading safetensors
[2024-11-23 19:46:25] Dataset contains 11
[rank0]: Traceback (most recent call last):
[rank0]:   File "/OmniGen/train.py", line 398, in <module>
[rank0]:     main(args)
[rank0]:   File "/OmniGen/train.py", line 179, in main
[rank0]:     model = accelerator.prepare(model)
[rank0]:   File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/accelerate/accelerator.py", line 1329, in prepare
[rank0]:     result = tuple(
[rank0]:   File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/accelerate/accelerator.py", line 1330, in <genexpr>
[rank0]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]:   File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/accelerate/accelerator.py", line 1205, in _prepare_one
[rank0]:     return self.prepare_model(obj, device_placement=device_placement)
[rank0]:   File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/accelerate/accelerator.py", line 1482, in prepare_model
[rank0]:     fsdp_plugin.param_init_fn = ensure_weights_retied(
[rank0]:   File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/accelerate/utils/fsdp_utils.py", line 331, in ensure_weights_retied
[rank0]:     _tied_names = model._tied_weights_keys
[rank0]:   File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'OmniGen' object has no attribute '_tied_weights_keys'
[rank0]:[W1123 19:46:25.227136747 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
E1123 19:46:26.321000 139877418833728 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 644) of binary: /opt/conda/envs/omnigen/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/omnigen/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1155, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/omnigen/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-23_19:46:26
  host      : 83c34070446a
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 644)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I have tried both multi- and single-GPU setups on several machines, with up to 32 GB of VRAM per card.
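
For context on the failing frame: accelerate's ensure_weights_retied reads a transformers-style _tied_weights_keys attribute, which the OmniGen nn.Module does not define, hence the AttributeError. A minimal, untested workaround sketch, assuming the model really has no tied weights, would be to declare the attribute before handing the model to accelerator.prepare (only the attribute name and call site are taken from the traceback above; the rest is an assumption):

```python
# Untested sketch: newer accelerate releases expect a transformers-style
# _tied_weights_keys attribute when preparing a model for FSDP.
# Declaring it as None ("no tied weights") before prepare() lets
# ensure_weights_retied() fall through without re-tying anything --
# assuming OmniGen genuinely has no tied weights.
model._tied_weights_keys = None  # plain attribute assignment on the nn.Module
model = accelerator.prepare(model)
```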

staoxiao commented 4 days ago

Can you show me the versions of your accelerate and transformers packages?

megatomik commented 4 days ago

transformers: 4.46.3
accelerate: 1.1.1

staoxiao commented 2 days ago

I found that this error is raised when using the latest accelerate. You can solve it by installing a lower version: pip install accelerate==0.26.1
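
A quick way to verify the downgrade took effect in the training environment (nothing OmniGen-specific, just a version check):

```python
# Run inside the same conda env used for training, after
# `pip install accelerate==0.26.1` as suggested above.
import accelerate

print("accelerate", accelerate.__version__)  # expect 0.26.1
```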

megatomik commented 2 days ago

I found that this error is raised when using the latest accelerate. You can solve it by installing a lower version: pip install accelerate==0.26.1

Thanks, that solved it for now. A quick follow-up question if you don't mind: why is the recommended learning rate in the standard finetune config in the readme so high compared to the ones in the paper? Where the paper uses the same learning rate, the batch size is 50x larger.

staoxiao commented 2 days ago

@megatomik, if you use LoRA, the learning rate should be high because there are very few learnable parameters. For full fine-tuning, the learning rate is normal.
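
To put a rough number on the "very few learnable parameters" point, here is an illustrative back-of-the-envelope sketch; the hidden size and LoRA rank below are hypothetical placeholders, not OmniGen's actual values:

```python
# Illustrative only: share of weights a rank-r LoRA adapter trains on a
# single d x d linear layer (adapter matrices A: d x r and B: r x d),
# compared to full fine-tuning of that same layer.
d, r = 3072, 16              # hypothetical hidden size and LoRA rank
full_params = d * d          # weights updated by full fine-tuning
lora_params = 2 * d * r      # weights updated by the LoRA adapter
print(f"LoRA trains {lora_params / full_params:.2%} of the layer")  # ~1%
```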