HabanaAI / Model-References

Reference models for Intel(R) Gaudi(R) AI Accelerator

Unet - inference issues #42

Open kkurzacz-intel opened 5 months ago

kkurzacz-intel commented 5 months ago

I'm getting an error when running UNet2D inference:

root@ip-172-31-0-126:/Model-References/PyTorch/computer_vision/segmentation/Unet# python main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --seed 123 --val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d --results=/tmp/Unet/results/fold_3 --autocast --inference_mode lazy --ckpt_path pretrained_checkpoint/pretrained_checkpoint.pt
Namespace(framework='pytorch-lightning', exec_mode='predict', data='/data/pytorch/unet/01_2d', results='/tmp/Unet/results/fold_3', logname=None, task='01', gpus=0, hpus=1, learning_rate=0.001, gradient_clip_val=0, negative_slope=0.01, tta=False, gradient_clip=False, gradient_clip_norm=12, amp=False, benchmark=False, deep_supervision=False, drop_block=False, attention=False, residual=False, focal=False, sync_batchnorm=False, save_ckpt=False, nfolds=5, seed=123, skip_first_n_eval=0, ckpt_path='pretrained_checkpoint/pretrained_checkpoint.pt', fold=3, patience=100, lr_patience=70, batch_size=2, val_batch_size=64, steps=None, profile=False, profile_steps='90:95', momentum=0.99, weight_decay=0.0001, save_preds=False, dim=2, resume_training=False, factor=0.3, num_workers=8, min_epochs=30, max_epochs=10000, warmup=5, norm='instance', nvol=1, run_lazy_mode=True, inference_mode='lazy', is_autocast=True, hpu_graphs=True, habana_loader=False, bucket_cap_mb=130, data2d_dim=3, oversampling=0.33, overlap=0.5, affinity='disabled', scheduler='none', optimizer='adamw', blend='gaussian', train_batches=0, test_batches=0, progress_bar_refresh_rate=25, set_aug_seed=False, augment=True, measurement_type='throughput', use_torch_compile=False, enable_tensorboard_logging=False)
Seed set to 123
Seed set to 123
Seed set to 123
Seed set to 773630
Number of test examples: 266
Seed set to 28030
Traceback (most recent call last):
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/main.py", line 218, in <module>
    main()
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/main.py", line 209, in main
    ptlrun(args)
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/lightning_trainer/ptl.py", line 211, in ptlrun
    model = NNUnet.load_from_checkpoint(ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/model_helpers.py", line 125, in wrapper
    return self.method(cls, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/module.py", line 1581, in load_from_checkpoint
    loaded = _load_from_checkpoint(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/saving.py", line 91, in _load_from_checkpoint
    model = _load_state(cls, checkpoint, strict=strict, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/saving.py", line 158, in _load_state
    obj = cls(**_cls_kwargs)
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/models/nn_unet.py", line 72, in __init__
    self.build_nnunet()
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/models/nn_unet.py", line 189, in build_nnunet
    in_channels, n_class, kernels, strides, self.patch_size = get_unet_params(self.args)
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/utils/utils.py", line 132, in get_unet_params
    config = get_config_file(args)
  File "/Model-References/PyTorch/computer_vision/segmentation/Unet/utils/utils.py", line 102, in get_config_file
    return pickle.load(open(path, "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/weka/data/pytorch/unet/01_2d/config.pkl'

This command comes from the README examples (Single Card Inference Examples / Inference / UNet2D, Lazy mode, BF16 mixed precision, batch size 64, 1 HPU on a single server).

Environment:

The environment is an AWS DL1 instance. I followed the Gaudi AWS quickstart to start the instance and run the Habana Docker runtime environment.

Command for benchmark inference:

$PYTHON main.py --exec_mode predict --task 01 --hpus 1 --fold 3 --val_batch_size 64 --dim 2 --data=/data/pytorch/unet/01_2d --results=/tmp/Unet/results/fold_3 --autocast --inference_mode lazy --benchmark --test_batches 150

works without errors.
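Note that the path in the FileNotFoundError (/mnt/weka/data/pytorch/unet/01_2d/config.pkl) is not the directory passed via --data (/data/pytorch/unet/01_2d). One possible explanation, which is an assumption and not confirmed in this thread, is that the data path is taken from hyperparameters stored in the checkpoint, since the traceback goes through load_from_checkpoint and the benchmark command without --ckpt_path works. A minimal sketch of a check that would make this kind of failure easier to diagnose is shown below; `load_config` is a hypothetical helper, not the repository's `get_config_file`.

```python
import os
import pickle
import tempfile


def load_config(data_dir):
    """Load <data_dir>/config.pkl, failing early with a descriptive message.

    Hypothetical helper for illustration; it is not the repository's code.
    """
    path = os.path.join(data_dir, "config.pkl")
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"config.pkl not found under {data_dir!r}; "
            "check which data directory is actually being used"
        )
    with open(path, "rb") as f:
        return pickle.load(f)


# Demo on a throwaway directory; the dictionary contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "config.pkl"), "wb") as f:
        pickle.dump({"example_key": 1}, f)
    print(load_config(tmp))

    try:
        load_config(os.path.join(tmp, "missing"))
    except FileNotFoundError as err:
        print(err)
```

Printing the directory that is actually used makes it immediately visible when it differs from the --data argument.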

Alberto-Villarreal commented 3 days ago

@kkurzacz-intel Could you please point us to the exact command you used from https://github.com/HabanaAI/Model-References/tree/master/PyTorch/computer_vision/segmentation/Unet#single-card-inference-examples, that is, the one that produced the error above?