VVRPanda / ExpPoint-MAE


Training with more than one GPU #2

Open LucasOyarzun opened 10 months ago

LucasOyarzun commented 10 months ago

Hello, I want to try training with 3 GPUs. Can you help me set up the repository for this?

I just naively changed the trainer's devices to 3, but it threw errors.
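Concretely, the only edit I made was roughly the following (a sketch from memory; the exact Trainer arguments in pretrain.py may differ, and `wandb_logger` / `lr_monitor` here just stand in for the repo's logger and callbacks):

```python
import pytorch_lightning as pl

# Illustrative only: the single-GPU setup presumably used devices=1,
# and the device count was the only thing I changed.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=3,               # changed from 1
    logger=wandb_logger,     # placeholder for the repo's WandbLogger instance
    callbacks=[lr_monitor],  # placeholder for the repo's callbacks
)
trainer.fit(model, train_dataloaders=train_loader)
```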

Here are the logs:

INFO - 2023-10-14 21:59:55,428 - distributed_c10d - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
INFO - 2023-10-14 21:59:55,437 - distributed_c10d - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 3 nodes.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name           | Type                      | Params
-------------------------------------------------------------
0 | loss_func      | ChamferDistanceL2         | 0     
1 | group_devider  | Group                     | 0     
2 | mask_generator | Mask                      | 0     
3 | MAE_encoder    | TransformerWithEmbeddings | 21.8 M
4 | MAE_decoder    | TransformerWithEmbeddings | 7.1 M 
5 | increase_dim   | Conv1d                    | 37.0 K
-------------------------------------------------------------
29.0 M    Trainable params
0         Non-trainable params
29.0 M    Total params
116.023   Total estimated model params size (MB)

miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: UserWarning: num_workers>0, persistent_workers=False, and strategy=ddp_spawn may result in data loading bottlenecks. Consider setting persistent_workers=True (this is a limitation of Python .spawn() and PyTorch)
  rank_zero_warn(
Training: 0it [00:00, ?it/s]INFO - 2023-10-14 21:59:56,950 - backend - multiprocessing start_methods=fork,spawn,forkserver, using: spawn
Traceback (most recent call last):
  File "pretrain.py", line 146, in <module>
    main(args)
  File "pretrain.py", line 134, in main
    trainer.fit(model, train_dataloaders=train_loader)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/spawn.py", line 78, in launch
    mp.spawn(
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/miniconda3/envs/exppointmae/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/spawn.py", line 101, in _wrapping_function
    results = function(*args, **kwargs)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1353, in _run_train
    self.fit_loop.run()
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 203, in run
    self.on_advance_start(*args, **kwargs)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 254, in on_advance_start
    self.trainer._call_callback_hooks("on_train_epoch_start")
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1636, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/callbacks/lr_monitor.py", line 170, in on_train_epoch_start
    logger.log_metrics(latest_stat, step=trainer.fit_loop.epoch_loop._batches_that_stepped)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 382, in log_metrics
    self.experiment.log({**metrics, "trainer/global_step": step})
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 41, in experiment
    return get_experiment() or DummyExperiment()
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/utilities/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loggers/base.py", line 39, in get_experiment
    return fn(self)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/pytorch_lightning/loggers/wandb.py", line 354, in experiment
    self._experiment = wandb._attach(attach_id)
  File "miniconda3/envs/exppointmae/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 762, in _attach
    raise UsageError("problem")
wandb.errors.UsageError: problem

wandb: Waiting for W&B process to finish... (failed 1).
wandb:                                                                                
wandb: You can sync this run to the cloud by running:
wandb: wandb sync ExpPoint-MAE/wandb/offline-run-20231014_215940-ghfv9t5o
wandb: Find logs at: ./wandb/offline-run-20231014_215940-ghfv9t5o/logs
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/miniconda3/envs/exppointmae/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "miniconda3/envs/exppointmae/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
(exppointmae) ExpPoint-MAE$ Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "miniconda3/envs/exppointmae/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "miniconda3/envs/exppointmae/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
miniconda3/envs/exppointmae/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 30 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
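From the warning and the traceback, I suspect the relevant knobs are the DataLoader's `persistent_workers` flag and the default `ddp_spawn` strategy, since the crash happens when the spawned process tries to re-attach to the W&B run. Is something like the following the right direction? (Rough sketch only; the dataset, batch size, worker count, and logger are placeholders, not the repo's actual values.)

```python
from torch.utils.data import DataLoader
import pytorch_lightning as pl

# Placeholder dataset and logger; only the two commented arguments matter here.
train_loader = DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=8,
    persistent_workers=True,  # suggested by the data_connector UserWarning above
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=3,
    strategy="ddp",       # subprocess launch instead of the default ddp_spawn,
                          # which is where wandb._attach fails in the traceback
    logger=wandb_logger,
)
trainer.fit(model, train_dataloaders=train_loader)
```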
JohnRomanelis commented 10 months ago

Hello, unfortunately we do not have a multi-GPU machine available in our lab, so we cannot test this case. However, I believe you should meet the memory requirements of pretraining on a single GPU if you are using a 24GB card (a 16GB GPU will probably also be enough). For finetuning, 8GB of GPU RAM is enough.