Mikubill / naifu

Train generative models with pytorch lightning
MIT License

Colab TPU Failed #9

Closed by Lime-Cakes 1 year ago

Lime-Cakes commented 1 year ago

I tested with a modified Colab notebook based on your example (the notebook is here). It installs requirement_tpu and runs with the default TPU config.

Got error:

/content/naifu-diffusion
Downloading: "https://pub-2fdef7a2969f43289c42ac5ae3412fd4.r2.dev/animesfw.tgz" to /tmp/model

100% 3.58G/3.58G [02:16<00:00, 28.2MB/s]
BucketManager initialized with base_res = [512, 512], max_size = [768, 512]
Downloading: "https://pub-2fdef7a2969f43289c42ac5ae3412fd4.r2.dev/mmk.tgz" to /tmp/dataset-0

100% 52.7M/52.7M [00:02<00:00, 19.2MB/s]
Loading resolutions: 34it [00:00, 992.53it/s]
Loading captions: 68it [00:00, 26123.16it/s]
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 3 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:452: LightningDeprecationWarning: Setting `Trainer(tpu_cores=8)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='tpu', devices=8)` instead.
  f"Setting `Trainer(tpu_cores={tpu_cores!r})` is deprecated in v1.7 and will be removed"
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:467: UserWarning: The flag `devices=-1` will be ignored, instead the device specific number 8 will be used
  f"The flag `devices={devices}` will be ignored, "
/usr/local/lib/python3.7/dist-packages/lightning_lite/accelerators/cuda.py:159: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
GPU available: False, used: False
TPU available: True, using: 8 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 296) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 755, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
====================================================
trainer.py FAILED
----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-19_18:29:43
  host      : 21021965512a
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 296)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 296
====================================================
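
As an aside, the deprecation warning in the log already names the replacement flags. A minimal sketch of the non-deprecated Trainer setup (PyTorch Lightning >= 1.7), shown purely for illustration and not as naifu's actual configuration:

import pytorch_lightning as pl

# Non-deprecated equivalent of `Trainer(tpu_cores=8)` per the warning above.
# All other arguments are naifu-specific and omitted from this sketch.
trainer = pl.Trainer(accelerator="tpu", devices=8)

The same log also suggests keeping DataLoader workers at 2 or fewer on this Colab VM.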
Mikubill commented 1 year ago

Looks like the current torch_xla implementation only works with TPUs=1 and is very slow. I'm switching it to the Diffusers Flax pipeline; hopefully that improves performance.
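
As a rough sketch of that direction (not naifu's actual implementation), loading a Flax UNet from Diffusers and replicating its parameters across the TPU cores looks roughly like this; the checkpoint name is a stand-in, since naifu downloads its own converted weights as seen in the log above:

import jax
import jax.numpy as jnp
from flax.jax_utils import replicate
from diffusers import FlaxUNet2DConditionModel

# Stand-in public checkpoint; depending on the repo, loading Flax weights may
# require revision="flax" or from_pt=True to convert the PyTorch checkpoint.
unet, unet_params = FlaxUNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet", dtype=jnp.bfloat16
)

# Replicate the parameters onto every TPU core so a pmapped train step can
# run data-parallel across all 8 devices instead of being pinned to one.
unet_params = replicate(unet_params)
print(f"training across {jax.device_count()} devices")

A full training loop would also load the text encoder and VAE and wrap the train step in jax.pmap, similar to Diffusers' Flax text-to-image training example.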