About multiGPU - GitHub issues

yaoing commented 5 years ago

Hello, I tried to add data parallelism to train.py (because train_dist.py doesn't work properly on my cluster) by wrapping the network in DataParallel and creating the optimizer after the wrapped net:

    net.to(device)
    net = torch.nn.DataParallel(net,device_ids=[0,1,2,3,4,6,7,8])
    optimizer = optim.SGD(net.parameters(), lr=cfg.TRAIN.LEARNING_RATE,
                          momentum=cfg.TRAIN.MOMENTUM, weight_decay=cfg.TRAIN.WEIGHT_DECAY)
    net.train()
    train(net, optimizer, imdb, roidb, arg)

and I commented out:

    #assert len(str(arg.gpu_ids)) == 1, "only single gpu is supported, " \
                  #                     "use train_dist.py for multiple gpu support"

    # os.environ['CUDA_VISIBLE_DEVICES'] = str(arg.gpu_ids)

but it still runs only on device 0.

What is wrong with my code? Thanks~

dechunwang commented 5 years ago

Hi yaoing, DataParallel in PyTorch works as follows: it splits the input across the specified devices by chunking along the batch dimension (other objects are copied once per device).

The batch size should be larger than the number of GPUs used.

Currently my code does not support a batch size other than one, so you will not be able to use DataParallel. If you want to use DataParallel, you need to modify the data loader and the anchor layer to support batches of more than one image. If you have a cluster and want to train on multiple GPUs, try distributed training; see the example in train_dist.sh.
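To make the batch-size point concrete, here is a minimal pure-Python sketch of how a batch gets divided when it is chunked along the batch dimension. It assumes the ceil-division behaviour of `torch.chunk`, which as far as I understand is what DataParallel's scatter step relies on; `chunk_sizes` is a hypothetical helper written for illustration, not part of PyTorch or of this repo.

```python
import math


def chunk_sizes(batch_size, num_devices):
    """Return the per-device chunk sizes for a batch split along dim 0.

    Assumption: mirrors torch.chunk, i.e. each chunk holds
    ceil(batch_size / num_devices) samples and fewer chunks are
    produced when the batch runs out early.
    """
    if batch_size <= 0 or num_devices <= 0:
        raise ValueError("batch_size and num_devices must be positive")
    per_chunk = math.ceil(batch_size / num_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(per_chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


if __name__ == "__main__":
    # With 8 requested devices, a batch of one sample yields a single chunk,
    # so only the first device receives any work.
    for batch_size in (1, 6, 16):
        print(batch_size, chunk_sizes(batch_size, 8))
```

With a batch of one there is only one chunk to hand out, which would explain why everything ends up on device 0 no matter how many device ids are listed.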

yaoing commented 5 years ago

OK, I see. Thanks~

yaoing commented 5 years ago

Hello dechunwang, sorry to bother you again.

Have you provided a saved model for us to test?

I get an error when saving checkpoints on my CentOS cluster:

THCudaCheck FAIL file=/pytorch/torch/csrc/generic/serialization.cpp line=15 error=30 : unknown error
Traceback (most recent call last):
  File "train.py", line 222, in <module>
    train(net, optimizer, imdb, roidb, arg)
  File "train.py", line 183, in train
    print("check point saved")
  File "/home/yao/apps/SSH-pytorch/model/network.py", line 116, in save_check_point
    }, path)
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 218, in save
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 143, in _with_file_like
    return body(f)
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 218, in <lambda>
    return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
  File "/home/yao/anaconda3/envs/yao/lib/python3.6/site-packages/torch/serialization.py", line 297, in _save
    serialized_storages[key]._write_file(f, _should_read_directly(f))
RuntimeError: cuda runtime error (30) : unknown error at /pytorch/torch/csrc/generic/serialization.cpp:15

At the same time, I can save other models that I coded myself without any problem.

dechunwang commented 5 years ago

Please check whether the model save directory exists. I am traveling right now; I will upload the model as soon as I get back. But you should be able to get the same results with 4 GPUs and 22000 iterations. When you evaluate, set the threshold to 0.05.
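For the directory check, a small sketch of creating the checkpoint directory before saving. `ensure_parent_dir` and the `checkpoints/ssh_demo.pth` path are hypothetical names used only for illustration; they are not taken from model/network.py. Whether a missing directory is really what triggers the CUDA error above is not confirmed, this only rules that cause out.

```python
import os
import tempfile


def ensure_parent_dir(path):
    """Create the directory that will contain `path` if it is missing.

    Returns the absolute directory path. Hypothetical helper, meant to be
    called right before the checkpoint is written.
    """
    parent = os.path.dirname(os.path.abspath(path))
    os.makedirs(parent, exist_ok=True)  # no error if it already exists
    return parent


if __name__ == "__main__":
    # Demo in a throwaway directory; the checkpoint path is made up.
    with tempfile.TemporaryDirectory() as root:
        ckpt_path = os.path.join(root, "checkpoints", "ssh_demo.pth")
        print(os.path.isdir(ensure_parent_dir(ckpt_path)))
```

Calling something like this on the checkpoint path just before the save call makes the save independent of whether the folder was created by hand on each cluster node.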

yaoing commented 5 years ago

May good people be safe and well all their lives!

dechunwang commented 5 years ago

https://drive.google.com/file/d/19bmuol6CbSqL3pj9SBzUL6UhrxC3XYbC/view Sorry for the late reply.

westnight commented 5 years ago

> Please check whether the model save directory exists. I am traveling right now; I will upload the model as soon as I get back. But you should be able to get the same results with 4 GPUs and 22000 iterations. When you evaluate, set the threshold to 0.05.

Hi, if you set the threshold to 0.05, there will be many wrong bboxes, which makes the output unusable in practice.

dechunwang commented 5 years ago

This is a very common threshold setting for the WIDER FACE benchmark. It boosts recall. The official SSH repo also uses this setting.
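A toy sketch of the trade-off, to show why a low threshold helps a benchmark score but hurts practical use. The detection scores and labels below are invented purely for illustration and are not results from this model; `precision_recall` is a hypothetical helper.

```python
def precision_recall(detections, num_gt, threshold):
    """Precision and recall after dropping detections scored below threshold.

    detections: list of (score, is_true_positive) pairs.
    num_gt: number of ground-truth faces.
    """
    kept = [is_tp for score, is_tp in detections if score >= threshold]
    true_pos = sum(kept)
    precision = true_pos / len(kept) if kept else 0.0
    recall = true_pos / num_gt
    return precision, recall


# Made-up detections: (confidence score, matches a ground-truth face?)
TOY_DETECTIONS = [
    (0.95, True), (0.80, True), (0.60, False), (0.30, True),
    (0.10, True), (0.07, False), (0.06, False),
]
TOY_NUM_GT = 5

if __name__ == "__main__":
    for thr in (0.5, 0.05):
        p, r = precision_recall(TOY_DETECTIONS, TOY_NUM_GT, thr)
        print(f"threshold={thr}: precision={p:.2f} recall={r:.2f}")
```

In this toy data the low threshold keeps more true faces (higher recall) but also more false boxes (lower precision). A benchmark that scores the whole precision-recall curve benefits from keeping the low-confidence tail, while a deployed detector would normally pick a higher operating threshold.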

westnight commented 5 years ago

OK, I see, thank you. So when testing on WIDER FACE, what we care about is recall. When I test MTCNN, can the threshold also be set very low, like 0.05?

dechunwang / SSH-pytorch

About multiGPU #2