Closed FrankXinqi closed 3 years ago
I believe this PR #277 fixed this
Cool. I see that an 'output' arg was added. I will install the latest version from GitHub.
Just an off-topic question: if I do not want to install the bolts package again with
pip install git+https://github.com/PytorchLightning/pytorch-lightning-bolts.git@master --upgrade
and instead download the package from GitHub, put it in my project folder, and call its functions via a relative path, would this work for a project of this size?
I have never tried that, but you should be able to make it work. You will probably have to add it to the Python path or similar to get the imports working. We don't recommend this, because you won't get any updates and will have to repeat the process manually whenever you need a new version.
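For illustration only, here is a minimal sketch of the "add it to the Python path" approach mentioned above. The folder name third_party and the helper add_vendored_package are hypothetical; the sketch assumes the repository's pl_bolts directory has been copied to <project>/third_party/pl_bolts.

```python
import sys
from pathlib import Path


def add_vendored_package(project_root, folder_name="third_party"):
    """Put <project_root>/<folder_name> at the front of sys.path (hypothetical helper).

    Packages copied into that folder then shadow any installed copy of the same name.
    """
    vendored = str(Path(project_root).resolve() / folder_name)
    if vendored not in sys.path:
        sys.path.insert(0, vendored)
    return vendored


# Assumption: the script is started from the project root.
vendored_dir = add_vendored_package(Path.cwd())

# After this call, `import pl_bolts` would resolve to third_party/pl_bolts
# if that directory exists; otherwise Python falls back to the installed copy.
```

This has to run before the first import of the package, and, as noted above, the local copy will not receive updates.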
Cool. There is a new bug after I upgraded to the master version. A matrix multiplication in the network fails with a dimension mismatch:
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\functional.py", line 1676, in linear
output = input.matmul(weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0
The output information
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
------------------------------------------------------
0 | encoder | ResNet | 25 M
1 | projection | Projection | 4 M
2 | non_linear_evaluator | SSLEvaluator | 8 M
Epoch 0: 0%| | 0/383 [00:00<?, ?it/s] Traceback (most recent call last):
File "H:/Project/PycharmProject/FER_Long-tailed/train_SimCLR.py", line 276, in <module>
trainer.fit(model, dm)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 54, in train
results = self.train_or_test()
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 66, in train_or_test
results = self.trainer.train()
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 483, in train
self.train_loop.run_training_epoch()
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 557, in run_training_epoch
self.on_train_batch_end(epoch_output, epoch_end_outputs, batch, batch_idx, dataloader_idx)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 249, in on_train_batch_end
self.trainer.call_hook('on_train_batch_end', epoch_end_outputs, batch, batch_idx, dataloader_idx)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 823, in call_hook
trainer_hook(*args, **kwargs)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 147, in on_train_batch_end
callback.on_train_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pl_bolts\callbacks\ssl_online.py", line 85, in on_train_batch_end
mlp_preds = pl_module.non_linear_evaluator(representations)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pl_bolts\models\self_supervised\evaluator.py", line 30, in forward
logits = self.block_forward(x)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\container.py", line 117, in forward
input = module(input)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\linear.py", line 91, in forward
return F.linear(input, self.weight, self.bias)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\functional.py", line 1676, in linear
output = input.matmul(weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0
Epoch 0: 0%| | 0/383 [00:00<?, ?it/s]
This bug only happens if I add the callback. It works fine a few times (not sure whether that means steps, iterations, ...), but then it gives input.shape=32,100352, weight.t().shape=8192,1024.
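A small sketch of why these two shapes cannot be multiplied. It only restates the rule behind the error message (the last dimension of the input must equal the first dimension of the transposed weight) using the shapes reported above. The reading of 100352 as 2048 * 7 * 7 is plain arithmetic; that it corresponds to an unpooled 2048-channel, 7x7 feature map is an assumption, not something the log confirms.

```python
def linear_shapes_compatible(input_shape, weight_t_shape):
    """Return True if input @ weight.t() is a valid matrix product (hypothetical helper)."""
    return input_shape[-1] == weight_t_shape[0]


reported_input = (32, 100352)      # shape reported in the thread
reported_weight_t = (8192, 1024)   # shape reported in the thread

print(linear_shapes_compatible(reported_input, reported_weight_t))  # False

# Arithmetic only: both feature sizes are multiples of 2048.
# Assumption: they are flattened 2048-channel feature maps of different spatial size.
print(100352 == 2048 * 7 * 7)  # True
print(8192 == 2048 * 2 * 2)    # True
```

If that assumption holds, the evaluator would have been built for a smaller flattened feature size than the one the encoder actually produced on this batch.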
The issue mentioned above occurs with my custom datamodule. Here is a different issue with the official CIFAR10 datamodule.
It seems to be related to CUDA, and I am not sure whether the problem is on my side. Just to be clear, all of the code runs fine when no callbacks are passed in.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Files already downloaded and verified
Files already downloaded and verified
| Name | Type | Params
------------------------------------------------------
0 | encoder | ResNet | 25 M
1 | projection | Projection | 4 M
2 | non_linear_evaluator | SSLEvaluator | 8 M
Epoch 0: 0%| | 0/1562 [00:00<?, ?it/s] C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [24,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [27,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "H:/Project/PycharmProject/FER_Long-tailed/example/SimCLR.py", line 36, in <module>
trainer.fit(model, dm)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 440, in fit
results = self.accelerator_backend.train()
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 54, in train
results = self.train_or_test()
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 66, in train_or_test
results = self.trainer.train()
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 483, in train
self.train_loop.run_training_epoch()
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 557, in run_training_epoch
self.on_train_batch_end(epoch_output, epoch_end_outputs, batch, batch_idx, dataloader_idx)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 249, in on_train_batch_end
self.trainer.call_hook('on_train_batch_end', epoch_end_outputs, batch, batch_idx, dataloader_idx)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 823, in call_hook
trainer_hook(*args, **kwargs)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 147, in on_train_batch_end
callback.on_train_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pl_bolts\callbacks\ssl_online.py", line 95, in on_train_batch_end
acc = accuracy(mlp_preds, y, num_classes=trainer.datamodule.num_classes)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\metrics\functional\classification.py", line 286, in accuracy
tps, fps, tns, fns, sups = stat_scores_multiple_classes(
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\metrics\functional\classification.py", line 197, in stat_scores_multiple_classes
num_classes = get_num_classes(pred=pred, target=target, num_classes=num_classes)
File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\metrics\functional\classification.py", line 96, in get_num_classes
num_target_classes = int(target.max().detach().item() + 1)
RuntimeError: CUDA error: device-side assert triggered
Epoch 0: 0%| | 0/1562 [00:00<?, ?it/s]
@FrankXinqi what is the number of classes being passed to SSLOnlineEvaluator? It looks to me like the num_classes being passed does not match the class labels found in the target tensor.
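As a rough way to test that hypothesis, the device-side assertion `t >= 0 && t < n_classes` can be checked on the CPU before training. The sketch below uses plain Python lists. The helper name find_out_of_range_labels is hypothetical, and the example labels are made up for illustration.

```python
def find_out_of_range_labels(labels, num_classes):
    """Return the sorted distinct labels that violate 0 <= t < num_classes."""
    return sorted({t for t in labels if not 0 <= t < num_classes})


# Labels 0..6 are valid for 7 classes.
print(find_out_of_range_labels([0, 3, 6], 7))      # []

# Labels numbered 1..7, or a -1 placeholder for "unlabeled", would fail the check.
print(find_out_of_range_labels([1, 7, -1, 7], 7))  # [-1, 7]
```

With real tensors, the same check could be applied to the targets of a few batches. Running on CPU, or setting the environment variable CUDA_LAUNCH_BLOCKING=1, usually gives a more precise error location than the asynchronous CUDA assert.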
@FrankXinqi for the first issue, I am merging a fix which should solve it. The output of the ResNet is now a flattened 2048-dimensional feature vector, which is what the online evaluation callback takes as input.
My customized dataset has 7 output classes. For CIFAR10, it is set to 10. However, neither my own dataset nor the pl CIFAR10 works.
Many thanks. For the first issue, do you mean this issue "RuntimeError: mat1 dim 1 must match mat2 dim 0"? If so, it may take some time to merge your PR to the master branch. Would you mind pointing out the solution, so I can fix it locally?
Thx.
@FrankXinqi I changed quite a few things around, essentially a whole revamp of SimCLR, so it is difficult to point to a single change.
https://github.com/PyTorchLightning/pytorch-lightning-bolts/pull/329
Thanks. It seems there was a merge 14 days ago, but nothing more recent. If I reinstall the latest version from GitHub, will the problem be fixed?
@FrankXinqi
https://github.com/PyTorchLightning/pytorch-lightning-bolts/tree/fix/simclr
This is the branch that will be merged. The PR will close this issue because we have removed all the complexity of the SSLOnlineEvaluator callback. Feel free to open new issues based on the new code base if something still doesn't work.
Thanks. I will try it later and reply in this issue if I run into other bugs.
❓ Questions and Help
Before asking:
Lightning Forum Posted Question: https://forums.pytorchlightning.ai/t/how-to-use-sslonlineevaluator/309
What is your question?
I studied the code provided in the SSL SimCLR example on Colab and implemented similar code.
I modified the datamodule to load my own data, and the code runs. However, I cannot make it work with SSLOnlineEvaluator. If I specify it in the callbacks, the following error is shown:
TypeError: on_train_batch_end() takes 6 positional arguments but 7 were given
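The error message itself can be reproduced without Lightning. The sketch below is a stand-alone illustration, not the actual library code: both callback classes and call_hook are hypothetical. It assumes, based on the traceback in this thread, that the trainer calls the hook with (trainer, model, outputs, batch, batch_idx, dataloader_idx), while the installed callback was written before the outputs argument existed.

```python
class OldStyleCallback:
    # Six positional parameters including self: no `outputs` argument.
    def on_train_batch_end(self, trainer, pl_module, batch, batch_idx, dataloader_idx):
        return "old"


class NewStyleCallback:
    # Seven positional parameters including self: accepts `outputs`.
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx):
        return "new"


def call_hook(callback):
    """Mimic a trainer that passes six arguments plus the implicit self."""
    return callback.on_train_batch_end("trainer", "model", "outputs", "batch", 0, 0)


error_message = None
try:
    call_hook(OldStyleCallback())
except TypeError as err:
    error_message = str(err)
    print(error_message)  # "... takes 6 positional arguments but 7 were given"

print(call_hook(NewStyleCallback()))  # new
```

Under that assumption, the error points to a version mismatch between the trainer and the callback rather than to a mistake in the user's code.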
Code
I have the following code:
Full Output information:
What have you tried?
I tried using the OnlineEvaluator as the only callback, but that does not work either.
What's your environment?
pytorch-lightning        1.0.2   py_0     conda-forge
pytorch-lightning-bolts  0.2.5   pypi_0   pypi