When I run the trainer on a recent installation (CUDA Toolkit 11.0, cuDNN 8.0, PyTorch 1.7.0), I get the following error:
Start training epoch 1
0%| | 0/391 [00:01<?, ?it/s]
Traceback (most recent call last):
File "phd_lab/extract_latent_representations.py", line 26, in <module>
main()
File "/space/conda/user/ulf/envs/phd-lab/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/space/conda/user/ulf/envs/phd-lab/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/space/conda/user/ulf/envs/phd-lab/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/space/conda/user/ulf/envs/phd-lab/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "phd_lab/extract_latent_representations.py", line 22, in main
main(config_path=Path(config), run_id=run_id, device=device)
File "/home/ulf/projects/github/phd-lab/phd_lab/experiments/main.py", line 81, in __call__
executor(
File "/home/ulf/projects/github/phd-lab/phd_lab/experiments/train_test_executor.py", line 130, in __call__
trainer.train()
File "/home/ulf/projects/github/phd-lab/phd_lab/experiments/trainer.py", line 229, in train
train_metric = self.train_epoch()
File "/home/ulf/projects/github/phd-lab/phd_lab/experiments/trainer.py", line 265, in train_epoch
self._eval_metrics(labels, outputs)
File "/home/ulf/projects/github/phd-lab/phd_lab/experiments/trainer.py", line 170, in _eval_metrics
metric.update(y_true, y_pred)
File "/home/ulf/projects/github/phd-lab/phd_lab/metrics/classification.py", line 35, in update
self.accuracy_accumulator += self._accuracy(y_pred, y_true, (5,))[0]
File "/home/ulf/projects/github/phd-lab/phd_lab/metrics/classification.py", line 25, in _accuracy
correct_k = correct[:k].view(-1).float().sum(0)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
Downgrading to CUDA Toolkit 10.2, cuDNN 7.6.5, and PyTorch 1.5.1 makes the error go away.
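As the error message itself suggests, swapping `.view(-1)` for `.reshape(-1)` on the failing line may be enough to avoid the downgrade. Below is a minimal sketch of a top-k accuracy helper with that change. Only the `correct[:k]...` line is taken from the traceback. The function name `topk_accuracy` and the rest of the body are assumptions modeled on the common PyTorch top-k accuracy pattern, not the actual code in `phd_lab/metrics/classification.py`. The likely cause is that `correct` is derived from a transposed (non-contiguous) tensor, which `.view` cannot flatten.

```python
import torch


def topk_accuracy(output, target, topk=(1,)):
    """Hypothetical stand-in for `_accuracy`: top-k accuracy in percent."""
    maxk = max(topk)
    batch_size = target.size(0)

    # Indices of the maxk highest scores per sample, shape (batch, maxk).
    _, pred = output.topk(maxk, dim=1, largest=True, sorted=True)
    # Transpose to (maxk, batch); the result is a non-contiguous view.
    pred = pred.t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))

    results = []
    for k in topk:
        # .reshape(-1) copies when necessary, so it also accepts
        # non-contiguous tensors where .view(-1) raises a RuntimeError.
        correct_k = correct[:k].reshape(-1).float().sum(0)
        results.append(correct_k * (100.0 / batch_size))
    return results


# Toy example: 4 samples, 3 classes.
output = torch.tensor([
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])
target = torch.tensor([1, 0, 1, 2])
top1, top2 = topk_accuracy(output, target, topk=(1, 2))
```

`.reshape` returns a view when the layout allows it and falls back to a copy otherwise, so the change should behave the same on PyTorch 1.5.1 and on 1.7.0.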