SamsungLabs / fbrs_interactive_segmentation

[CVPR2020] f-BRS: Rethinking Backpropagating Refinement for Interactive Segmentation https://arxiv.org/abs/2001.10331
Mozilla Public License 2.0

Error happens with worker>=1 in training phase #22

Closed Shaosifan closed 4 years ago

Shaosifan commented 4 years ago

When I run train.py with worker=4, an error occurs. Does anybody know what causes it?

Traceback (most recent call last):
  File "F:/research/codes/Others-projects/fbrs_interactive_segmentation-master/train.py", line 69, in <module>
    main()
  File "F:/research/codes/Others-projects/fbrs_interactive_segmentation-master/train.py", line 16, in main
    model_script.main(cfg)
  File "models/sbd/r34_dh128.py", line 24, in main
    train(model, cfg, model_cfg, start_epoch=cfg.start_epoch)
  File "models/sbd/r34_dh128.py", line 132, in train
    trainer.training(epoch)
  File "F:\research\codes\Others-projects\fbrs_interactive_segmentation-master\isegm\engine\trainer.py", line 119, in training
    for i, batch_data in enumerate(tbar):
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\tqdm\std.py", line 1129, in __iter__
    for obj in iterable:
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
    w.start()
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.scale_func'

ptrvilya commented 4 years ago

Hi! I suspect this issue comes from the torch.multiprocessing implementation on Windows. Consider moving scale_func out of the train function to the top level of the training script.
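
For readers hitting the same error: on Windows, worker processes are started with the spawn method, so everything handed to a worker has to be pickled, and pickle stores a function only by its module and qualified name. A minimal sketch of why a nested function fails, using only the standard library (the names top_level_scale, make_local_scale and can_pickle are hypothetical stand-ins, not code from this repository):

```python
import pickle


def top_level_scale(image_shape):
    """Hypothetical stand-in for scale_func, defined at module level."""
    return 1.0


def make_local_scale():
    # A function defined inside another function gets a qualified name such as
    # 'make_local_scale.<locals>.local_scale', which pickle cannot look up.
    def local_scale(image_shape):
        return 1.0

    return local_scale


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# Module-level functions are pickled by reference to an importable name.
top_level_ok = can_pickle(top_level_scale)
# Nested functions have no importable name, so pickling them fails.
local_ok = can_pickle(make_local_scale())
```

When this is run as an ordinary script, the module-level function should pickle and the nested one should not, which matches the "Can't pickle local object" message in the traceback above.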

Shaosifan commented 4 years ago

When I move scale_func to the top level of the training script, a similar error appears:

Traceback (most recent call last):
  File "F:/research/codes/Others-projects/fbrs_interactive_segmentation-master/train.py", line 69, in <module>
    main()
  File "F:/research/codes/Others-projects/fbrs_interactive_segmentation-master/train.py", line 16, in main
    model_script.main(cfg)
  File "models/sbd/r34_dh128.py", line 26, in main
    train(model, cfg, model_cfg, start_epoch=cfg.start_epoch)
  File "models/sbd/r34_dh128.py", line 135, in train
    trainer.training(epoch)
  File "F:\research\codes\Others-projects\fbrs_interactive_segmentation-master\isegm\engine\trainer.py", line 119, in training
    for i, batch_data in enumerate(tbar):
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\tqdm\std.py", line 1129, in __iter__
    for obj in iterable:
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
    w.start()
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\ll\Anaconda3\envs\fbrs\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function scale_func at 0x000001CDBF3F38B8>: import of module 'model_script' failed

Process finished with exit code 1

ptrvilya commented 4 years ago

I believe the issue is still with torch.multiprocessing on Windows. I suggest you replace this line with this one and remove scale_func completely. You can also try running training with nvidia-docker on Ubuntu.
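
The second traceback is consistent with how pickle handles functions: it records only the module name and function name, and whoever pickles or unpickles the function must be able to import that module by name. The sketch below assumes the training script is loaded dynamically under a name that is not importable, which the 'model_script' in the error message suggests but the thread does not confirm. The module name dynamic_model_script_demo and the helper try_pickle are hypothetical.

```python
import pickle
import sys
import types

# Hypothetical module name, chosen so that no real importable module matches it.
MODULE_NAME = "dynamic_model_script_demo"

SOURCE = "def scale_func(image_shape):\n    return 1.0\n"

# Build a module object by hand, roughly what a dynamic script loader does.
module = types.ModuleType(MODULE_NAME)
exec(SOURCE, module.__dict__)


def try_pickle(obj):
    """Return (payload, None) on success or (None, exception) on failure."""
    try:
        return pickle.dumps(obj), None
    except (pickle.PicklingError, AttributeError) as exc:
        return None, exc


# The module is not in sys.modules and not on disk, so pickle cannot import it.
payload, error_unregistered = try_pickle(module.scale_func)

# Once registered, pickle can resolve the function by name in this process.
sys.modules[MODULE_NAME] = module
payload_registered, error_registered = try_pickle(module.scale_func)
```

Registering the module only helps inside the current process. A spawned worker starts from a fresh interpreter and would still need to import the module by that name itself, which may be why moving scale_func to the top level was not enough here and removing it was.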

Shaosifan commented 4 years ago

Thank you for your help! I removed scale_func completely and it works now.