CompVis / taming-transformers

Taming Transformers for High-Resolution Image Synthesis
https://arxiv.org/abs/2012.09841
MIT License

Can't train with multiple GPUs #184

Open ryhhtn opened 1 year ago

ryhhtn commented 1 year ago

Thank you for making this code public.

I want to train VQGAN from scratch, but I get the error below when running this command:

python3 main.py --base configs/imagenet_vqgan.yaml -t True --gpus 0,1,2,3,4,5,6,7

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 208, in _wrapped_function
    result = function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 236, in new_process
    results = trainer.run_stage()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1306, in _run_train
    self._pre_training_routine()
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1301, in _pre_training_routine
    self.call_hook("on_pretrain_routine_start")
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
    callback_fx(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 148, in on_pretrain_routine_start
    callback.on_pretrain_routine_start(self, self.lightning_module)
  File "/****/Projects/taming-transformers/main.py", line 200, in on_pretrain_routine_start
    OmegaConf.save(self.config,
  File "/usr/local/lib/python3.8/dist-packages/omegaconf/omegaconf.py", line 220, in save
    with io.open(os.path.abspath(f), "w", encoding="utf-8") as file:
FileNotFoundError: [Errno 2] No such file or directory: '/***/Projects/taming-transformers/logs/2022-10-18T19-25-09_imagenet_vqgan/configs/2022-10-18T19-25-09-project.yaml'

samsunq commented 1 year ago

I had this problem too. I fixed it like this: delete this file in the logs directory (or, better, delete all files in the logs directory) and try again.

ControlNet commented 1 year ago

I had the same issue on Slurm. I guess the problem is caused by the following code: https://github.com/CompVis/taming-transformers/blob/3ba01b241669f5ade541ce990f7650a3b8f65318/main.py#L206-L215 After the log directory is created, another process moves it to "child_runs", so OmegaConf cannot create a new file in a directory that no longer exists.

So I just removed that code, and it seems to run now.
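
To illustrate the sequence described above, here is a minimal sketch that uses only the Python standard library. It is not the code from main.py: the function names (`simulate_race`, `save_text_safely`), the directory names, and the file contents are all hypothetical, and the two "processes" are simulated sequentially in a single process. It shows that writing into a directory that has just been moved raises `FileNotFoundError`, and that re-creating the parent directory right before writing avoids the error in this simplified setting.

```python
import os
import tempfile


def simulate_race(root):
    """Replay the failure sequence with plain files (no Lightning, no OmegaConf).

    Returns True if the write fails with FileNotFoundError.
    All paths here are made up for illustration.
    """
    logdir = os.path.join(root, "logs", "run")
    cfgdir = os.path.join(logdir, "configs")
    os.makedirs(cfgdir, exist_ok=True)  # one process creates the directories

    # Another process moves the whole log directory under "child_runs".
    dst = os.path.join(root, "logs", "child_runs", "run")
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    os.rename(logdir, dst)

    # The first process now tries to write into the directory that moved away.
    try:
        with open(os.path.join(cfgdir, "project.yaml"), "w", encoding="utf-8") as f:
            f.write("model: {}\n")
    except FileNotFoundError:
        return True
    return False


def save_text_safely(path, text):
    """Re-create the parent directory immediately before writing.

    This narrows the window for the race but does not close it: the
    directory could still be moved between makedirs() and open().
    """
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print("write failed after move:", simulate_race(root))
        target = os.path.join(root, "logs", "run", "configs", "project.yaml")
        save_text_safely(target, "model: {}\n")
        print("file written:", os.path.exists(target))
```

Re-creating the directory only treats the symptom. Removing the move, as described above, or making sure that only one process ever touches the log directory, addresses the cause.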

Polaris0421 commented 3 months ago

You saved my night!!!! A big thanks from 1 year later lol