How to make checkpoint and save the model?

leocd91 commented 4 years ago

Great work. Clear documentation and easy to set up. How do I make checkpoints and continue training later? I'm currently in a place where the electricity is unreliable.

IanTaehoonYoo commented 4 years ago

Hi, @leocd91

Thanks for your interest in this project. I added instructions for saving and loading checkpoints to the README.

Saving and loading checkpoints

The Trainer class saves checkpoints automatically, controlled by the 'check_point_epoch_stride' argument: a checkpoint is saved every 'check_point_epoch_stride' epochs in the runs folder, ./segmentation/runs/models.

You can also load a checkpoint using the Logger class. Please refer to the example code below.

"""
Save check point.
Please check the runs folder, ./segmentation/runs/models
"""
check_point_stride = 30  # Save a checkpoint every 30 epochs.

trainer = Trainer(model, optimizer, logger, num_epochs,
                  train_loader, test_loader, check_point_epoch_stride=check_point_stride)

"""
Load check point.
"""
model_name = "pspnet_mobilenet_v2"
n_classes = 33
logger = Logger(model_name=model_name, data_name='example')

model = all_models.model_from_name[model_name](n_classes)
logger.load_models(model, 'epoch_253')
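
To make the stride concrete, here is a minimal sketch of which epochs would produce a checkpoint. It assumes epochs are counted from 1 and that a checkpoint is written whenever the epoch number is a multiple of the stride; the library's exact rule may differ, and `checkpoint_epochs` is a hypothetical helper, not part of seg-torch.

```python
def checkpoint_epochs(num_epochs, stride):
    """Return the epochs at which a checkpoint would be written.

    Assumption (not taken from the library source): epochs are numbered
    1..num_epochs and a checkpoint is saved when epoch % stride == 0.
    """
    if stride <= 0:
        raise ValueError("stride must be a positive integer")
    return [epoch for epoch in range(1, num_epochs + 1) if epoch % stride == 0]


if __name__ == "__main__":
    print(checkpoint_epochs(num_epochs=100, stride=30))  # [30, 60, 90]
```

When the power supply is unreliable, a smaller stride limits how much training is lost on an interruption, at the cost of more checkpoint files on disk.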

leocd91 commented 4 years ago

Hi, I tried your example and had to change one line.

I changed: logger.load_models(model, 'epoch_253')

to

logger = Logger(model_name=model_name, data_name='test1')
Logger.load_models(logger, model, 'epoch_240')

That fixed the "need more arguments" error, but the loaded model gives me this error when I access its size, as your example in predict.py does:

AttributeError: 'PSPnet' object has no attribute 'img_height'

Here's my code ...

# Model
model_name = "pspnet_resnet50"
device = 'cuda'
batch_size = 1
n_classes = 6
num_epochs = 300
image_axis_minimum_size = 200
pretrained = True
fixed_feature = False
batch_norm = False if batch_size == 1 else True
model = all_models.model_from_name[model_name](n_classes,
                                               batch_norm=batch_norm,
                                               pretrained=pretrained,
                                               fixed_feature=fixed_feature)
logger = Logger(model_name=model_name, data_name='test1')
Logger.load_models(logger,model, 'epoch_240')    

model_width = model.img_width
model_height = model.img_height

if model_width != ori_width or model_height != ori_height:
  img = cv2.resize(img, (model_width, model_height), interpolation=cv2.INTER_NEAREST)

data = img.transpose((2, 0, 1))
data = data[None, :, :, :]
data = torch.from_numpy(data).float()

if next(model.parameters()).is_cuda:
  if not torch.cuda.is_available():
    raise ValueError("A model was trained via .cuda(), but this system can not support cuda.")
  data = data.cuda()

score = model(data)
lbl_pred = score.data.max(1)[1].cpu().numpy()[:, :, :]
lbl_pred = lbl_pred.transpose((1, 2, 0))
n_classes = np.max(lbl_pred)

Am I doing it wrong? I want to validate the model on another set of images/labels.

IanTaehoonYoo commented 4 years ago

No, it was my fault. I have updated the project. Please update the package and save a new checkpoint. Use this command:

pip install --upgrade seg-torch

I tested saving and loading with all models, but if you still have problems, please let me know.

In addition, the function name was changed.

Logger.load_models(logger,model, 'epoch_240')

to

Logger.load_model(logger,model, 'epoch_240')

Thanks. Best regards,
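
For scripts that may run against either the old or the new method name, a small wrapper can pick whichever exists. This is only a sketch: `load_checkpoint` is a hypothetical helper, and the two logger classes below are stand-ins written for this example, not the library's Logger. It assumes both methods take the model and the checkpoint name, as the calls in this thread suggest.

```python
def load_checkpoint(logger, model, checkpoint_name):
    """Call logger.load_model if it exists, otherwise logger.load_models."""
    loader = getattr(logger, "load_model", None) or getattr(logger, "load_models", None)
    if loader is None:
        raise AttributeError("logger has neither load_model nor load_models")
    return loader(model, checkpoint_name)


class OldStyleLogger:
    """Stand-in for a logger that only has the old method name."""

    def load_models(self, model, name):
        return ("load_models", name)


class NewStyleLogger:
    """Stand-in for a logger that only has the new method name."""

    def load_model(self, model, name):
        return ("load_model", name)


if __name__ == "__main__":
    print(load_checkpoint(OldStyleLogger(), None, "epoch_240"))
    print(load_checkpoint(NewStyleLogger(), None, "epoch_240"))
```

The wrapper only bridges the rename. Checkpoints written before the update may still need to be saved again with the upgraded package, as noted above.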

leocd91 commented 4 years ago

Thank you for the response.

Still got some errors when loading the checkpoints after successfully running it.

RuntimeError: Error(s) in loading state_dict for PSPnet:
    Missing key(s) in state_dict: "PSP.spatial_blocks.0.2.weight", "PSP.spatial_blocks.0.2.bias", "PSP.spatial_blocks.0.2.running_mean", "PSP.spatial_blocks.0.2.running_var", "PSP.spatial_blocks.1.2.weight", "PSP.spatial_blocks.1.2.bias", "PSP.spatial_blocks.1.2.running_mean", "PSP.spatial_blocks.1.2.running_var", "PSP.spatial_blocks.2.2.weight", "PSP.spatial_blocks.2.2.bias", "PSP.spatial_blocks.2.2.running_mean", "PSP.spatial_blocks.2.2.running_var", "PSP.spatial_blocks.3.2.weight", "PSP.spatial_blocks.3.2.bias", "PSP.spatial_blocks.3.2.running_mean", "PSP.spatial_blocks.3.2.running_var", "PSP.bottleneck.1.weight", "PSP.bottleneck.1.bias", "PSP.bottleneck.1.running_mean", "PSP.bottleneck.1.running_var", "upsampling1.layer.1.weight", "upsampling1.layer.1.bias", "upsampling1.layer.1.running_mean", "upsampling1.layer.1.running_var", "upsampling2.layer.1.weight", "upsampling2.layer.1.bias", "upsampling2.layer.1.running_mean", "upsampling2.layer.1.running_var", "upsampling3.layer.1.weight", "upsampling3.layer.1.bias", "upsampling3.layer.1.running_mean", "upsampling3.layer.1.running_var".

Here's my code:

model_name = "pspnet_resnet50"
device = 'cuda'
batch_size = 1  
n_classes = 6
num_epochs = 22
check_point_stride = 21 
image_axis_minimum_size = 200
pretrained = True
fixed_feature = True
logger = Logger(model_name="pspnet_resnet50", data_name='test2')

model = all_models.model_from_name[model_name](n_classes)
logger.load_model(model, 'epoch_21')       

model.to(device)

# Loader
compose = transforms.Compose([
    Rescale(image_axis_minimum_size),
    ToTensor()
])

train_datasets = SegmentationDataset(train_images, train_labled, n_classes, compose)
train_loader = torch.utils.data.DataLoader(train_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

test_datasets = SegmentationDataset(test_images, test_labeled, n_classes, compose)
test_loader = torch.utils.data.DataLoader(test_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

trainer = Trainer(model, optimizer, logger, num_epochs, train_loader, test_loader,check_point_epoch_stride=check_point_stride)
trainer.train()

IanTaehoonYoo commented 4 years ago

Could you check the Logger's arguments? When you load the checkpoint, 'model_name' and 'data_name' should be the same as when you train the model.
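
One way to guard against mismatched settings is to record them next to the checkpoint and compare them before loading. The sketch below does this with a JSON file; `save_settings`, `check_settings`, and the file name are hypothetical and not part of seg-torch, which may store its checkpoints differently.

```python
import json
import tempfile
from pathlib import Path


def save_settings(path, **settings):
    """Write the settings used to build the model to a JSON file."""
    Path(path).write_text(json.dumps(settings, indent=2, sort_keys=True))


def check_settings(path, **settings):
    """Return {name: (saved, current)} for every setting that differs."""
    saved = json.loads(Path(path).read_text())
    names = sorted(set(saved) | set(settings))
    return {
        name: (saved.get(name), settings.get(name))
        for name in names
        if saved.get(name) != settings.get(name)
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sidecar = Path(tmp) / "epoch_21.settings.json"  # hypothetical file name
        save_settings(sidecar, model_name="pspnet_resnet50", data_name="test2",
                      n_classes=6, batch_norm=False)
        # Later, before loading: pass the settings of the model about to be built.
        diff = check_settings(sidecar, model_name="pspnet_resnet50", data_name="test2",
                              n_classes=6, batch_norm=True)
        print(diff)  # {'batch_norm': (False, True)}
```

An empty result means the recorded and current settings agree; any entry names a setting to fix before loading the checkpoint.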

leocd91 commented 4 years ago

Yeah, it's the same.

Strangely, after I switched to this code instead, it works 🤣

    model_name = "pspnet_resnet50"
    device = 'cuda'
    batch_size = 1  
    n_classes = 6
    num_epochs = 22
    check_point_stride = 21 
    image_axis_minimum_size = 200
    pretrained = True
    fixed_feature = True

    logger = Logger(model_name=model_name, data_name='test3')

    # Loader
    compose = transforms.Compose([
        Rescale(image_axis_minimum_size),
        ToTensor()
    ])

    train_datasets = SegmentationDataset(train_images, train_labled, n_classes, compose)
    train_loader = torch.utils.data.DataLoader(train_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

    test_datasets = SegmentationDataset(test_images, test_labeled, n_classes, compose)
    test_loader = torch.utils.data.DataLoader(test_datasets, batch_size=batch_size, shuffle=True, drop_last=True)

    # Model
    batch_norm = False if batch_size == 1 else True
    model = all_models.model_from_name[model_name](n_classes,
                                                   batch_norm=batch_norm,
                                                   pretrained=pretrained,
                                                   fixed_feature=fixed_feature)
    logger.load_model(model, 'epoch_21')  
    model.to(device)

Thank you..

IanTaehoonYoo commented 4 years ago

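I guess the checkpoint was saved from a model built with batch_norm=False, but you loaded it into a model built with the default arguments, which include the batch-norm layers. The network layers change depending on batch_norm, so the keys in the checkpoint no longer match the model. Sorry for the error...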

Thanks,
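
A mismatch like this can be spotted before loading by comparing the keys the model expects with the keys the checkpoint provides. The helper below is a minimal sketch on plain key lists; `diff_state_dict_keys` is a hypothetical name, and the first two key names in the demo are invented for illustration, while the rest are copied from the error message in this thread.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Compare the keys a model expects with the keys a checkpoint provides.

    Returns (missing, unexpected):
      missing    - keys the model expects but the checkpoint lacks
      unexpected - keys the checkpoint has but the model does not use
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return sorted(model_keys - checkpoint_keys), sorted(checkpoint_keys - model_keys)


if __name__ == "__main__":
    # Invented keys standing in for layers present in both model and checkpoint.
    checkpoint_keys = ["PSP.bottleneck.0.weight", "PSP.bottleneck.0.bias"]
    # Batch-norm keys taken from the error message above.
    model_keys = checkpoint_keys + [
        "PSP.bottleneck.1.weight",
        "PSP.bottleneck.1.bias",
        "PSP.bottleneck.1.running_mean",
        "PSP.bottleneck.1.running_var",
    ]
    missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
    print("missing:", missing)
    print("unexpected:", unexpected)
```

In PyTorch, `model.state_dict().keys()` gives the model's keys. If the checkpoint file holds a plain state_dict (an assumption about how the Logger saves it), `torch.load(path, map_location="cpu").keys()` gives the checkpoint's keys. Missing keys ending in running_mean or running_var belong to batch-norm layers, which points at a differing batch_norm setting.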

IanTaehoonYoo / semantic-segmentation-pytorch

How to make checkpoint and save the model? #2