Lightning-Universe / lightning-bolts

Toolbox of models, callbacks, and datasets for AI/ML researchers.
https://lightning-bolts.readthedocs.io
Apache License 2.0
1.69k stars 321 forks source link

How to use SSLOnlineEvaluator callback #309

Closed FrankXinqi closed 3 years ago

FrankXinqi commented 3 years ago

❓ Questions and Help

Before asking:

Lightning Forum Posted Question: https://forums.pytorchlightning.ai/t/how-to-use-sslonlineevaluator/309

What is your question?

I study the code provided here SSL SimCLR on colab, and implemented a similar code.

I have modified the datamodule to load mine, and I can run the code. However, if I use SSLOnlineEvaluator. I cannot make it work. If I specify the callbacks for model, the error shows: TypeError: on_train_batch_end() takes 6 positional arguments but 7 were given

Code

I have the following code:

def to_device(batch, device):
    (img1, _), y = batch
    img1 = img1.to(device)
    y = y.to(device)
    return img1, y

online_finetuner = SSLOnlineEvaluator(z_dim=2048 * 2 * 2, num_classes = 7)
online_finetuner.to_device = to_device

lr_logger = LearningRateMonitor()

callbacks = [online_finetuner, lr_logger]

# pick data
rafdb_height=100
batch_size=64

# data
dm = RAFDBDataModule(num_workers=0)
dm.train_transforms = SimCLRTrainDataTransform(input_height=rafdb_height)
dm.val_transforms = SimCLREvalDataTransform(input_height=rafdb_height)
dm.test_transforms = SimCLREvalDataTransform(input_height=rafdb_height)

# model
model = SimCLR(num_samples=dm.num_samples, batch_size=batch_size)

# fit
trainer = pl.Trainer(gpus=1, max_epochs=20, callbacks=callbacks)
trainer.fit(model, dm)

Full Output information:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                 | Type         | Params
------------------------------------------------------
0 | encoder              | ResNet       | 25 M  
1 | projection           | Projection   | 4 M   
2 | non_linear_evaluator | SSLEvaluator | 8 M   
Epoch 0:   0%|          | 0/383 [00:00<?, ?it/s] Traceback (most recent call last):
  File "H:/Project/PycharmProject/FER_Long-tailed/train_SimCLR.py", line 236, in <module>
    trainer.fit(model, dm)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 483, in train
    self.train_loop.run_training_epoch()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 557, in run_training_epoch
    self.on_train_batch_end(epoch_output, epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 249, in on_train_batch_end
    self.trainer.call_hook('on_train_batch_end', epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 823, in call_hook
    trainer_hook(*args, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 147, in on_train_batch_end
    callback.on_train_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
TypeError: on_train_batch_end() takes 6 positional arguments but 7 were given
Epoch 0:   0%|          | 0/383 [00:02<?, ?it/s]

Process finished with exit code 1

What have you tried?

Try to put OnlineEvaluator as the only callback, but it does not work as well.

What's your environment?

awaelchli commented 3 years ago

I believe this PR #277 fixed this

FrankXinqi commented 3 years ago

Cool. I see that an 'output' arg was added. I will install the latest version from GitHub.

Just an offset question: If I do not want to install the blots package again by pip install git+https://github.com/PytorchLightning/pytorch-lightning-bolts.git@master --upgrade, instead I download the package from GitHub, put it in my project folder, and just call any function by its relative path. Would this work with this kind of big project?

awaelchli commented 3 years ago

I never tried that but you should be able to make it work. You probably have to add it to the python path or similar to have the imports working. We don't recommend this, because you won't get any updates and have to manually repeat the process if you need to get new version.

FrankXinqi commented 3 years ago

Cool. There is a new bug, when I upgrade to the master version. The network maltiplication has a dimension not match as

  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0

The output information

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                 | Type         | Params
------------------------------------------------------
0 | encoder              | ResNet       | 25 M  
1 | projection           | Projection   | 4 M   
2 | non_linear_evaluator | SSLEvaluator | 8 M   
Epoch 0:   0%|          | 0/383 [00:00<?, ?it/s] Traceback (most recent call last):
  File "H:/Project/PycharmProject/FER_Long-tailed/train_SimCLR.py", line 276, in <module>
    trainer.fit(model, dm)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 483, in train
    self.train_loop.run_training_epoch()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 557, in run_training_epoch
    self.on_train_batch_end(epoch_output, epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 249, in on_train_batch_end
    self.trainer.call_hook('on_train_batch_end', epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 823, in call_hook
    trainer_hook(*args, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 147, in on_train_batch_end
    callback.on_train_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pl_bolts\callbacks\ssl_online.py", line 85, in on_train_batch_end
    mlp_preds = pl_module.non_linear_evaluator(representations)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pl_bolts\models\self_supervised\evaluator.py", line 30, in forward
    logits = self.block_forward(x)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\container.py", line 117, in forward
    input = module(input)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\modules\linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\torch\nn\functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: mat1 dim 1 must match mat2 dim 0
Epoch 0:   0%|          | 0/383 [00:00<?, ?it/s]

This bug only happens if I add callback. I works fine for a few time(not sure if it is step, iterations ...), but then it gives input.shape=32,100352, weight.t().shape=8192,1024.

FrankXinqi commented 3 years ago

The above mentioned issue is with my customed datamodule. Here is a different issue with official CIFAR10.

The issue seems to relate with CUDA, not sure if the problem is on my side. Just to make it clear, all codes run well without callbacks passed in.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Files already downloaded and verified
Files already downloaded and verified

  | Name                 | Type         | Params
------------------------------------------------------
0 | encoder              | ResNet       | 25 M  
1 | projection           | Projection   | 4 M   
2 | non_linear_evaluator | SSLEvaluator | 8 M   
Epoch 0:   0%|          | 0/1562 [00:00<?, ?it/s] C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [24,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [27,0,0] Assertion `t >= 0 && t < n_classes` failed.
C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [31,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "H:/Project/PycharmProject/FER_Long-tailed/example/SimCLR.py", line 36, in <module>
    trainer.fit(model, dm)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 483, in train
    self.train_loop.run_training_epoch()
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 557, in run_training_epoch
    self.on_train_batch_end(epoch_output, epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 249, in on_train_batch_end
    self.trainer.call_hook('on_train_batch_end', epoch_end_outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 823, in call_hook
    trainer_hook(*args, **kwargs)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 147, in on_train_batch_end
    callback.on_train_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pl_bolts\callbacks\ssl_online.py", line 95, in on_train_batch_end
    acc = accuracy(mlp_preds, y, num_classes=trainer.datamodule.num_classes)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\metrics\functional\classification.py", line 286, in accuracy
    tps, fps, tns, fns, sups = stat_scores_multiple_classes(
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\metrics\functional\classification.py", line 197, in stat_scores_multiple_classes
    num_classes = get_num_classes(pred=pred, target=target, num_classes=num_classes)
  File "F:\Program\Anaconda3\envs\pytorch-lightning\lib\site-packages\pytorch_lightning\metrics\functional\classification.py", line 96, in get_num_classes
    num_target_classes = int(target.max().detach().item() + 1)
RuntimeError: CUDA error: device-side assert triggered
Epoch 0:   0%|          | 0/1562 [00:00<?, ?it/s]
ananyahjha93 commented 3 years ago

@FrankXinqi what is the number of classes being passed to SSLOnlineEvaluator. What looks to me is the case when num_classes being passed is different than the class labels found in target tensor.

ananyahjha93 commented 3 years ago

@FrankXinqi for the first issue, I am merging a fix which should solve this issue. The output of resnet now is a flattened 2048 dimensional feature vector which is taken by the online evaluation callback.

FrankXinqi commented 3 years ago

@FrankXinqi what is the number of classes being passed to SSLOnlineEvaluator. What looks to me is the case when num_classes being passed is different than the class labels found in target tensor.

My cutomized dataset has 7 output classes. For CIFAR10, it put is as 10. However, neither my own dataset or pl CIFAR10 do not work.

FrankXinqi commented 3 years ago

@FrankXinqi for the first issue, I am merging a fix which should solve this issue. The output of resnet now is a flattened 2048 dimensional feature vector which is taken by the online evaluation callback.

Many thanks. For the first issue, do you mean this issue "RuntimeError: mat1 dim 1 must match mat2 dim 0"? If so, it may take some time to merge your PR to the master branch. Would you mind pointing out the solution, so I can fix it locally?

Thx.

ananyahjha93 commented 3 years ago

@FrankXinqi changed quite a few things around, like a whole revamp of simclr, difficult to point to a single change.

https://github.com/PyTorchLightning/pytorch-lightning-bolts/pull/329

FrankXinqi commented 3 years ago

@FrankXinqi changed quite a few things around, like a whole revamp of simclr, difficult to point to a single change.

329

Thanks. Seems like there was a merge 14 days ago, but not recently. Wanna to know if I reinstall the lastest one from Github, will the problem be fixed?

ananyahjha93 commented 3 years ago

@FrankXinqi

https://github.com/PyTorchLightning/pytorch-lightning-bolts/tree/fix/simclr

This is the branch that will be merged. The PR will close this issue because we have removed all the complexity of SSLOnlineEvaluator callback. Feel free to open new issues based on the new code base if something doesn't work still.

FrankXinqi commented 3 years ago

Thanks. I will try it later and response to this issue still, if there are other bugs met.