cooooorn / Pytorch-XNOR-Net

XNOR-Net, with binary gemm and binary conv2d kernels, supporting both CPU and GPU.
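As background on the "binary gemm" part: when both operands have entries in {-1, +1} and are packed as sign bits, a dot product reduces to an XOR followed by a popcount. A minimal pure-Python sketch of that identity (illustrative only; the repo's kernels implement the same idea in C and CUDA):

# For a, b in {-1, +1}^n packed as sign bits:
#     a . b = n - 2 * popcount(bits_a XOR bits_b)
def pack(v):
    # sign-bit encoding: +1 -> bit 1, -1 -> bit 0
    bits = 0
    for i, x in enumerate(v):
        if x > 0:
            bits |= 1 << i
    return bits

a = [1, -1, -1, 1, 1]
b = [-1, -1, 1, 1, 1]
xor = pack(a) ^ pack(b)
print(len(a) - 2 * bin(xor).count('1'))     # 1, via XOR/popcount
print(sum(x * y for x, y in zip(a, b)))     # 1, the exact dot product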

Using the code with AlexNet but failing when saving the binary model #9

Closed: flyingpot closed this issue 5 years ago

flyingpot commented 5 years ago

Your code is great! However, when I use it with an AlexNet model, an error occurs when saving the binary model after one epoch. The log is here:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1523242347739/work/torch/csrc/generic/serialization.cpp line=38 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "main.py", line 396, in <module>
    train_bin(epoch)
  File "main.py", line 128, in train_bin
    bin_save_state(args, model_train)
  File "../util/util.py", line 36, in bin_save_state
    torch.save(state, 'models/' + args.arch + '.pth')
  File "/home/fjb/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 135, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/fjb/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 117, in _with_file_like
    return body(f)
  File "/home/fjb/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 135, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/fjb/miniconda3/envs/pytorch0.3/lib/python3.5/site-packages/torch/serialization.py", line 204, in _save
    serialized_storages[key]._write_file(f)
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1523242347739/work/torch/csrc/generic/serialization.cpp:38
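
A detail worth knowing when reading this traceback: CUDA reports errors asynchronously, so torch.save (which copies GPU storage) is usually just where an earlier illegal access finally surfaces, not where it happened. A hedged sketch for localizing the real failure point:

# Force synchronous kernel launches so the error is raised by the kernel
# that actually faulted instead of by a later torch.save.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before CUDA initializes

import torch

def checked_forward(model, inputs):
    out = model(inputs)
    torch.cuda.synchronize()  # an illegal access would raise here, at the culprit
    return out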

My environment is the same as yours, and the other architectures you provide work for me.

The binary AlexNet code is here:

import torch
import torch.nn as nn
import torch.nn.functional as F
import sys
sys.path.append("..")
from util import BinLinear
from util import BinConv2d

class Bin_AlexNet_train(nn.Module):

    def __init__(self):
        super(Bin_AlexNet_train, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=0),  # first conv kept full-precision (not binarized)
            nn.BatchNorm2d(96, eps=1e-4, momentum=0.1, affine=True),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            BinConv2d(96, 256, kernel_size=5, stride=1, padding=2, istrain=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            BinConv2d(256, 384, kernel_size=3, stride=1, padding=1, istrain=True),
            BinConv2d(384, 384, kernel_size=3, stride=1, padding=1, istrain=True),
            BinConv2d(384, 256, kernel_size=3, stride=1, padding=1, istrain=True),
            nn.MaxPool2d(kernel_size=3, stride=2)
        )
        self.classifier = nn.Sequential(
            BinLinear(256 * 6 * 6, 4096, istrain=True),
            BinLinear(4096, 4096, istrain=True),
            nn.BatchNorm1d(4096, eps=1e-3, momentum=0.1, affine=True),
            nn.Linear(4096, 10)  # final classifier kept full-precision (not binarized)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(-1, 256 * 6 * 6)
        x = self.classifier(x)
        return x

class Bin_AlexNet_test(nn.Module):

    def __init__(self):
        super(Bin_AlexNet_test, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=0),
            nn.BatchNorm2d(96, eps=1e-4, momentum=0.1, affine=True),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            BinConv2d(96, 256, kernel_size=5, stride=1, padding=2, istrain=False),
            nn.MaxPool2d(kernel_size=3, stride=2),
            BinConv2d(256, 384, kernel_size=3, stride=1, padding=1, istrain=False),
            BinConv2d(384, 384, kernel_size=3, stride=1, padding=1, istrain=False),
            BinConv2d(384, 256, kernel_size=3, stride=1, padding=1, istrain=False),
            nn.MaxPool2d(kernel_size=3, stride=2)
        )
        self.classifier = nn.Sequential(
            BinLinear(256 * 6 * 6, 4096, istrain=False),
            BinLinear(4096, 4096, istrain=False),
            nn.BatchNorm1d(4096, eps=1e-3, momentum=0.1, affine=True),
            nn.Linear(4096, 10)
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(-1, 256 * 6 * 6)
        x = self.classifier(x)
        return x

Also, the non-binarized AlexNet runs successfully.

Could you please tell me how to solve the problem? Thank you!

cooooorn commented 5 years ago

Actually, looking back a year later, this code is not so great, but I have neither the time nor the motivation to rewrite it for the newest PyTorch version.

The deadliest bug in this 'gemm+im2col' method is that when padding > 0 it cannot keep the padded zeros out of the computation: the sign-bit encoding has no third state for 0, so each padded zero acts as -1 in the calculation. This causes the accuracy to drop by about 1% on the VGG models.
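
To see the effect concretely, here is a hedged sketch using F.unfold as the im2col step (modern PyTorch; the thresholding below is illustrative, not the repo's exact kernel):

import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 3, 3)
cols = F.unfold(x, kernel_size=3, padding=1)  # im2col with padding=1
patch = cols[0, :, 0]                         # top-left patch: includes padded zeros

w = torch.randn(9).sign()                     # a binarized kernel, entries in {-1, +1}

exact = (patch.sign() * w).sum()              # sign(0) = 0: padding contributes nothing
bits = torch.where(patch > 0, torch.ones_like(patch), -torch.ones_like(patch))
binary = (bits * w).sum()                     # what a packed binary gemm computes

print(exact.item(), binary.item())            # differ whenever padding is touched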

Back to your problem:

  1. Check whether every weight is correctly binarized by the function 'binop.encode_rows' (see the sketch after this list).
  2. Rewrite this model in the style of ./Cifar10/models/Bin_VGG.py.
  3. Maybe 'padding=2' is what raises the error? Try a network with padding <= 1 in all conv layers.
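
For check 1, a minimal sketch of the invariant to verify ('binop.encode_rows' itself is the repo's packing API and is not reproduced here; 'bin_prefixes' is a hypothetical argument listing the names of the model's BinConv2d/BinLinear layers):

import torch

def check_binarized(state_dict, bin_prefixes):
    # Before packing with binop.encode_rows, every weight that is supposed to
    # be binary should contain only +1/-1 and be contiguous, since packed
    # binary kernels generally assume both.
    for name, w in state_dict.items():
        if name.endswith('weight') and any(name.startswith(p) for p in bin_prefixes):
            assert w.is_contiguous(), name + ' is not contiguous'
            assert bool(((w == 1) | (w == -1)).all()), name + ' is not binary'

Calling this on model_train.state_dict() right before bin_save_state would narrow down whether the saved storages themselves are malformed.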
flyingpot commented 5 years ago

Thank you! I found that nn.BatchNorm2d was the cause. Although I don't know why, the code runs well after removing that layer.
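
The thread never establishes why nn.BatchNorm2d triggers the failure. For anyone digging further, a hedged sketch for inspecting BatchNorm running buffers before saving (non-finite values, or a CUDA error raised by the reads themselves, would implicate the earlier illegal access):

import torch
import torch.nn as nn

def inspect_batchnorm(model):
    # Print whether each BatchNorm layer's running statistics are finite.
    for name, m in model.named_modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            print(name,
                  bool(torch.isfinite(m.running_mean).all()),
                  bool(torch.isfinite(m.running_var).all()))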