AutodeskAILab / UV-Net

Code for UV-Net: Learning from Boundary Representations, CVPR 2021.
MIT License

Question on recommended cuda + pytorch + lightning #14

Closed dclanmaster closed 1 year ago

dclanmaster commented 1 year ago

Hi, I tried the configuration in environment.yaml. Since dgl-cuda11.0 requires CUDA 11.0, I installed that. However, PyTorch only supports CUDA 11.0 up to version 1.7.1, and the matching PyTorch Lightning support only goes up to version 0.8.5. Lightning 0.8.5, in turn, is missing "self.log", which causes the following error:

python classification.py train --dataset solidletters --dataset_path ../uv_net_dataset/SolidLetters --max_epochs 100 --batch_size 64 --experiment_name classification
Using backend: pytorch
GPU available: True, used: True
TPU available: False, using: 0 TPU cores

-----------------------------------------------------------------------------------
UV-Net Classification
-----------------------------------------------------------------------------------
Logs written to results/classification/0724/230341

To monitor the logs, run:
tensorboard --logdir results/classification/0724/230341

The trained model with the best validation loss will be written to:
results/classification/0724/230341/best.ckpt
-----------------------------------------------------------------------------------

Loading train data...
  0%|                                                                                                                                                                                                 | 0/61964 [00:00<?, ?it/s]/home/zpc/projectbed/UV-Net-main/venv2/lib/python3.9/site-packages/dgl/base.py:45: DGLWarning: You are loading a graph file saved by old version of dgl.              Please consider saving it again with the current format.
  return warnings.warn(message, category=category, stacklevel=1)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 61964/61964 [00:26<00:00, 2367.91it/s]
Done loading 61964 files
Loading val data...
  0%|                                                                                                                                                                                                 | 0/15492 [00:00<?, ?it/s]/home/zpc/projectbed/UV-Net-main/venv2/lib/python3.9/site-packages/dgl/base.py:45: DGLWarning: You are loading a graph file saved by old version of dgl.              Please consider saving it again with the current format.
  return warnings.warn(message, category=category, stacklevel=1)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15492/15492 [00:10<00:00, 1521.90it/s]
Done loading 15492 files
/home/zpc/projectbed/UV-Net-main/venv2/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:25: RuntimeWarning: You have defined a `val_dataloader()` and have defined a `validation_step()`, you may also want to define `validation_epoch_end()` for accumulating stats.
  warnings.warn(*args, **kwargs)

  | Name      | Type            | Params
----------------------------------------------
0 | model     | UVNetClassifier | 1 M   
1 | train_acc | Accuracy        | 0     
2 | val_acc   | Accuracy        | 0     
3 | test_acc  | Accuracy        | 0     
/home/zpc/projectbed/UV-Net-main/venv2/lib/python3.9/site-packages/pytorch_lightning/utilities/distributed.py:25: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
Validation sanity check: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/home/zpc/projectbed/UV-Net-main/classification.py", line 97, in <module>
    trainer.fit(model, train_loader, val_loader)
  File "/home/zpc/projectbed/UV-Net-main/venv2/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in fit
    results = self.run_pretrain_routine(model)
  File "/home/zpc/projectbed/UV-Net-main/venv2/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1193, in run_pretrain_routine
    eval_results = self._evaluate(model,
  File "/home/zpc/projectbed/UV-Net-main/venv2/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 293, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/home/zpc/projectbed/UV-Net-main/venv2/lib/python3.9/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 470, in evaluation_forward
    output = model.validation_step(*args)
  File "/home/zpc/projectbed/UV-Net-main/uvnet/models.py", line 163, in validation_step
    self.log("val_loss", loss, on_step=False, on_epoch=True)
  File "/home/zpc/projectbed/UV-Net-main/venv2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 778, in __getattr__
    raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
torch.nn.modules.module.ModuleAttributeError: 'Classification' object has no attribute 'log'
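
The traceback above ends because the installed LightningModule has no "log" attribute. A minimal sketch for checking this before starting a training run is shown below. The helper name lightning_has_self_log is hypothetical and is not part of UV-Net or Lightning; it only assumes that pytorch_lightning exposes a LightningModule class, as the traceback suggests.

```python
def lightning_has_self_log():
    """Report whether the installed Lightning provides LightningModule.log.

    Returns True or False when pytorch_lightning can be imported,
    and None when it is not installed in the current environment.
    """
    try:
        import pytorch_lightning as pl
    except ImportError:
        return None  # Lightning is not importable here
    # `log` is looked up on the class, so no model needs to be built.
    return hasattr(pl.LightningModule, "log")


if __name__ == "__main__":
    print("LightningModule.log present:", lightning_has_self_log())
```

If this prints False, models.py will fail at the first self.log call, as in the traceback above.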

So I tried upgrading Lightning to 0.9.0 and then 0.10.0. With those versions the code runs without the error above, but the GPU is no longer used, as shown below:

Using backend: pytorch
GPU available: True, used: False
TPU available: False, using: 0 TPU cores

Can I upgrade CUDA to 11.3 and use a newer version of PyTorch? If so, should I also switch to dgl-cuda11.3?
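
When comparing version combinations like the ones above, it can help to print what is actually installed. The following is a rough sketch using only the standard library plus an optional torch import. The function names are hypothetical, and the distribution names passed to report_versions are assumptions: a conda-installed dgl-cuda11.0 may be registered under a different name, in which case it is reported as None.

```python
from importlib import metadata


def report_versions(packages=("torch", "pytorch-lightning", "dgl")):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed under this name
    return versions


def torch_cuda_info():
    """Return the CUDA version torch was built with and whether a GPU is usable.

    Returns None when torch cannot be imported.
    """
    try:
        import torch
    except ImportError:
        return None
    return {
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version}")
    print("torch CUDA info:", torch_cuda_info())
```

A mismatch between the CUDA version torch was built with and the one the DGL package targets would be one thing to look for in this output.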