FishYuLi / BalancedGroupSoftmax

CVPR 2020 oral paper: Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax.
Apache License 2.0
352 stars 64 forks source link

OSError: ./data/weneed/mask_r50/epoch_12.pth is not a checkpoint file #19

Open hhhhhhxl opened 3 years ago

hhhhhhxl commented 3 years ago

I followed the steps and tried to train with 2 GPUs and got this error. ./tools/dist_train.sh ./configs/bags/gs_mask_rcnn_r50_fpn_1x_lvis.py 2


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[07/09 11:22:20] root WARNING: The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

Train fc_cls only. --Dist-train--IS:False--ISout:False Dist-train --- Not using image sampling. Train fc_cls only. --Dist-train--IS:False--ISout:False Dist-train --- Not using image sampling. [07/09 11:22:39] mmcv.runner.runner INFO: load checkpoint from ./data/weneed/mask_r50/epoch_12.pth Traceback (most recent call last): File "./tools/train.py", line 169, in main() File "./tools/train.py", line 165, in main logger=logger) File "/home/sy/Desktop/xlhuang/BalancedGroupSoftmax-master/mmdet/apis/train.py", line 58, in train_detector _dist_train(model, dataset, cfg, validate=validate) File "/home/sy/Desktop/xlhuang/BalancedGroupSoftmax-master/mmdet/apis/train.py", line 204, in _dist_train runner.load_checkpoint(cfg.load_from) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/runner.py", line 234, in load_checkpoint self.logger) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 171, in load_checkpoint raise IOError('{} is not a checkpoint file'.format(filename)) OSError: ./data/weneed/mask_r50/epoch_12.pth is not a checkpoint file [07/09 11:22:39] mmcv.runner.runner INFO: load checkpoint from ./data/weneed/mask_r50/epoch_12.pth Traceback (most recent call last): File "./tools/train.py", line 169, in main() File "./tools/train.py", line 165, in main logger=logger) File "/home/sy/Desktop/xlhuang/BalancedGroupSoftmax-master/mmdet/apis/train.py", line 58, in train_detector _dist_train(model, dataset, cfg, validate=validate) File "/home/sy/Desktop/xlhuang/BalancedGroupSoftmax-master/mmdet/apis/train.py", line 204, in _dist_train runner.load_checkpoint(cfg.load_from) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/runner.py", line 234, in load_checkpoint self.logger) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 171, in load_checkpoint raise IOError('{} is not a checkpoint file'.format(filename)) OSError: ./data/weneed/mask_r50/epoch_12.pth is not a checkpoint file Traceback (most recent call last): File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/runpy.py", line 85, in _run_code exec(code, run_globals) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in main() File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/torch/distributed/launch.py", line 249, in main cmd=cmd) subprocess.CalledProcessError: Command '['/home/sy/anaconda3/envs/mmdet/bin/python', '-u', './tools/train.py', '--local_rank=1', './configs/bags/gs_mask_rcnn_r50_fpn_1x_lvis.py', '--launcher', 'pytorch']' returned non-zero exit status 1.

hhhhhhxl commented 3 years ago

it seems this file './data/weneed/mask_r50/epoch_12.pth' is missing.

xiaohe6 commented 2 years ago

I followed the steps and tried to train with 2 GPUs and got this error. ./tools/dist_train.sh ./configs/bags/gs_mask_rcnn_r50_fpn_1x_lvis.py 2

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

[07/09 11:22:20] root WARNING: The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

Train fc_cls only. --Dist-train--IS:False--ISout:False Dist-train --- Not using image sampling. Train fc_cls only. --Dist-train--IS:False--ISout:False Dist-train --- Not using image sampling. [07/09 11:22:39] mmcv.runner.runner INFO: load checkpoint from ./data/weneed/mask_r50/epoch_12.pth Traceback (most recent call last): File "./tools/train.py", line 169, in main() File "./tools/train.py", line 165, in main logger=logger) File "/home/sy/Desktop/xlhuang/BalancedGroupSoftmax-master/mmdet/apis/train.py", line 58, in train_detector _dist_train(model, dataset, cfg, validate=validate) File "/home/sy/Desktop/xlhuang/BalancedGroupSoftmax-master/mmdet/apis/train.py", line 204, in _dist_train runner.load_checkpoint(cfg.load_from) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/runner.py", line 234, in load_checkpoint self.logger) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 171, in load_checkpoint raise IOError('{} is not a checkpoint file'.format(filename)) OSError: ./data/weneed/mask_r50/epoch_12.pth is not a checkpoint file [07/09 11:22:39] mmcv.runner.runner INFO: load checkpoint from ./data/weneed/mask_r50/epoch_12.pth Traceback (most recent call last): File "./tools/train.py", line 169, in main() File "./tools/train.py", line 165, in main logger=logger) File "/home/sy/Desktop/xlhuang/BalancedGroupSoftmax-master/mmdet/apis/train.py", line 58, in train_detector _dist_train(model, dataset, cfg, validate=validate) File "/home/sy/Desktop/xlhuang/BalancedGroupSoftmax-master/mmdet/apis/train.py", line 204, in _dist_train runner.load_checkpoint(cfg.load_from) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/runner.py", line 234, in load_checkpoint self.logger) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 171, in load_checkpoint raise IOError('{} is not a checkpoint file'.format(filename)) OSError: ./data/weneed/mask_r50/epoch_12.pth is not a checkpoint file Traceback (most recent call last): File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/runpy.py", line 85, in _run_code exec(code, run_globals) File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in main() File "/home/sy/anaconda3/envs/mmdet/lib/python3.7/site-packages/torch/distributed/launch.py", line 249, in main cmd=cmd) subprocess.CalledProcessError: Command '['/home/sy/anaconda3/envs/mmdet/bin/python', '-u', './tools/train.py', '--local_rank=1', './configs/bags/gs_mask_rcnn_r50_fpn_1x_lvis.py', '--launcher', 'pytorch']' returned non-zero exit status 1.

Hello, I can't download this ./data/download_models/faster_rcnn_r50_fpn_2x_20181010-443129e1.pth file right now, I dare ask if you have downloaded it