analogdevicesinc / ai8x-synthesis

Quantization and Synthesis (Device Specific Code Generation) for ADI's MAX78000 and MAX78002 Edge AI Devices
Apache License 2.0

BUG: Using multiple GPUs to train a model will cause model evaluation errors!!! #172

Closed ZhugeKongan closed 2 years ago

ZhugeKongan commented 2 years ago

I want to know what the purpose of `update_old_model_params` in train.py is:

```python
elif args.load_model_path:
    update_old_model_params(args.load_model_path, model)
    if qat_policy is not None:
        checkpoint = torch.load(args.load_model_path,
                                map_location=lambda storage, loc: storage)
        if checkpoint.get('epoch', None) >= qat_policy['start_epoch']:
            ai8x.fuse_bn_layers(model)
    model = apputils.load_lean_checkpoint(model, args.load_model_path,
                                          model_device=args.device)
    ai8x.update_model(model)
```

This can lead to incorrect parameter loading when using multi-GPU training. This may require optimization.
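For context, here is a minimal sketch (not code from this repository) of the mismatch being described: `torch.nn.DataParallel` registers the wrapped model as a submodule named `module`, so every state-dict key gains a `module.` prefix, and a checkpoint saved that way no longer matches an unwrapped model's keys.

```python
import torch.nn as nn

# A plain module has unprefixed parameter names.
model = nn.Linear(4, 2)
print(list(model.state_dict().keys()))      # ['weight', 'bias']

# Wrapping the model for multi-GPU training prefixes every key with 'module.'.
parallel = nn.DataParallel(model)
print(list(parallel.state_dict().keys()))   # ['module.weight', 'module.bias']

# Loading such a state dict into an unwrapped model fails unless the prefix
# is stripped (or added) so that the keys match.
try:
    nn.Linear(4, 2).load_state_dict(parallel.state_dict())
except RuntimeError as err:
    print('load_state_dict failed:', err)
```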

seldauyanik-maxim commented 2 years ago

update_old_model_params is used when train.py's --resume-from option is provided to resume from a previous checkpoint and that checkpoint was created with an earlier version of the repository. The implementation handles the 'module.' prefix in the state-dictionary keys, which is present when DataParallel is used on multi-GPU systems and absent on single-GPU systems.
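For reference, a hedged sketch of what this kind of prefix handling typically looks like; the helper name `strip_data_parallel_prefix` and the checkpoint layout below are assumptions for illustration, not the repository's actual implementation of `update_old_model_params`:

```python
import torch

def strip_data_parallel_prefix(checkpoint_path):
    """Load a checkpoint and drop the 'module.' prefix (if present) from its
    state-dictionary keys, so it matches a model that is not wrapped in
    DataParallel."""
    checkpoint = torch.load(checkpoint_path,
                            map_location=lambda storage, loc: storage)
    state_dict = checkpoint.get('state_dict', checkpoint)
    normalized = {
        (key[len('module.'):] if key.startswith('module.') else key): value
        for key, value in state_dict.items()
    }
    if 'state_dict' in checkpoint:
        checkpoint['state_dict'] = normalized
        return checkpoint
    return normalized
```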

I could not reproduce an evaluation problem on a multi-GPU system:

1. A sample model is trained on the multi-GPU system with `./scripts/train_mnist.sh`.
2. The trained model is evaluated on the same system with `./scripts/evaluate_mnist.sh`.
3. The trained model (`/ai8x-synthesis/trained/ai85-mnist-qat8-q.pth.tar`) is copied to the same location on a single-GPU system and evaluated again with `./scripts/evaluate_mnist.sh`.
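As a quick cross-check on the single-GPU machine (a sketch that assumes the checkpoint stores its weights under a 'state_dict' entry, using the path cited above), one can verify that the copied file loads on CPU and that its keys carry no 'module.' prefix:

```python
import torch

checkpoint = torch.load('/ai8x-synthesis/trained/ai85-mnist-qat8-q.pth.tar',
                        map_location='cpu')
state_dict = checkpoint.get('state_dict', checkpoint)
prefixed = [key for key in state_dict if key.startswith('module.')]
print('keys with module. prefix:', len(prefixed))   # expected: 0
```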

seldauyanik-maxim commented 2 years ago

Could not reproduce any evaluation error on multi-GPU systems. Closing the issue.