chenxin061 / pdarts

Code for our paper "Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation"

Multi-gpu support #5

Open R00Kie-Liu opened 5 years ago

R00Kie-Liu commented 5 years ago

How do I use multiple GPUs for the search?

198808xc commented 5 years ago

Thanks for this question!

I think multi-GPU training works just like single-GPU. Since our search on CIFAR takes only a few hours, we did not consider multi-GPU training. However, in our recent work that generalizes P-DARTS to search directly on ImageNet, we did use 8 GPUs for acceleration.

@chenxin061 more experiences to share?

chenxin061 commented 5 years ago

To search with multiple GPUs, you need to change a few lines in `train_search.py`.

  1. Delete all lines related to GPU ID setting. Instead, set the GPU ids with `CUDA_VISIBLE_DEVICES`.
  2. Add `model = nn.DataParallel(model)` before `model = model.cuda()`, and add `model = model.module` after it (see the sketch below).
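
Roughly, the two edits look like this (a minimal sketch with a toy `Network` standing in for the real class from `model_search.py`; an illustration, not the exact repo diff):

```python
import torch.nn as nn

class Network(nn.Module):  # toy stand-in for the search network (illustration only)
    def __init__(self):
        super(Network, self).__init__()
        self.stem = nn.Linear(8, 8)

    def forward(self, x):
        return self.stem(x)

# Step 1: GPU ids are chosen via CUDA_VISIBLE_DEVICES, so any
# torch.cuda.set_device(...) line is removed.

# Step 2: wrap the model before moving it to the GPU.
model = Network()
model = nn.DataParallel(model)  # wrap before the .cuda() call
model = model.cuda()            # parameters go to the default device; forward() scatters across visible GPUs
model = model.module            # unwrap so the lines that follow address the inner Network
```
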
zihaozhang9 commented 5 years ago

> To search with multiple GPUs, you need to change a few lines in `train_search.py`.
>
> 1. Delete all lines related to GPU ID setting. Instead, you can set GPU ids with `CUDA_VISIBLE_DEVICES`.
> 2. Add `model = nn.DataParallel(model)` before `model = model.cuda()` and `model = model.module` after it.

I added `model = nn.DataParallel(model)` to `train_search.py` and got this error:

```
Traceback (most recent call last):
  File "train_search.py", line 469, in <module>
    main()
  File "train_search.py", line 142, in main
    optimizer_a = torch.optim.Adam(model.arch_parameters(),
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 518, in __getattr__
    type(self).__name__, name))
AttributeError: 'DataParallel' object has no attribute 'arch_parameters'
```

anhcda-study commented 5 years ago

@zihaozhang9 To fix that, change `model.arch_parameters()` to `model.module.arch_parameters()`.
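
For context: `nn.DataParallel` only forwards `forward()` calls, so custom methods such as `arch_parameters()` have to be reached through the wrapped module. A minimal runnable sketch (the toy `arch_parameters()` here is an assumption standing in for the real search network):

```python
import torch
import torch.nn as nn

class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.stem = nn.Linear(8, 8)
        # DARTS-style architecture parameters (toy shape)
        self.alphas = nn.Parameter(1e-3 * torch.randn(14, 8))

    def arch_parameters(self):
        return [self.alphas]

    def forward(self, x):
        return self.stem(x)

model = nn.DataParallel(Network())
# model.arch_parameters()  # AttributeError: 'DataParallel' object has no attribute 'arch_parameters'
optimizer_a = torch.optim.Adam(model.module.arch_parameters(), lr=3e-4)  # works: go through .module
```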

JarveeLee commented 5 years ago

I did several things. I commented out the device setting:

```python
torch.cuda.set_device(args.gpu)
```

and then

```python
model = nn.DataParallel(model)
model = model.cuda()
model = model.module
```

and then set

```python
os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3,4,5,6,7'
parser.add_argument('--batch_size', type=int, default=192, help='batch size')
```

but I still cannot run `train_search.py` on multiple GPUs; it still tries to put everything onto a single GPU and then runs out of memory. What is wrong here?

I am using PyTorch 1.0.0 with Python 3.6, and `print(torch.cuda.device_count())` returns 4.

If I use

```python
model = nn.DataParallel(model)
model = model.cuda()
model = model.module
```

and `model.module.arch_parameters()`, I get this error: [screenshot]

chenxin061 commented 5 years ago

The new version of our code now supports multi-GPU search! @JarveeLee You can try it. Use `CUDA_VISIBLE_DEVICES` to assign GPU ids.
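
For example, launching the search as `CUDA_VISIBLE_DEVICES=0,1 python train_search.py` (an illustrative invocation; add whatever flags the README lists) restricts it to the first two GPUs.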

JarveeLee commented 5 years ago

I saw your modification; I did the same to support multi-GPU. What is more, in

```python
class MixedOp(nn.Module):
    def forward(self, x, weights):
        return sum(w * op(x) for w, op in zip(weights, self.m_ops))
```

the forward pass should change to

```python
class MixedOp(nn.Module):
    def forward(self, x, weights):
        return sum(w.cuda() * op(x.cuda()) for w, op in zip(weights, self.m_ops))
```

otherwise the error I encountered will still happen... I am working on a complex, awful GPU server where the environment is hard to control; that is my experience.

davidrpugh commented 5 years ago

@chenxin061 Thanks for sharing your code! Can you confirm whether you used 8 V100 GPUs with 16 GB of memory per card or 8 V100 GPUs with 32 GB memory per card? Thanks!

chenxin061 commented 5 years ago

@davidrpugh The search code was tested on two P100 GPUs, and the evaluation code was tested on 8 V100s with 16 GB of memory each.

davidrpugh commented 5 years ago

@chenxin061 Thanks! I suspected as much for the V100s. I didn't realize that you used 2 P100s. I was able to complete the search process on either CIFAR-10 or CIFAR-100 using a single P100 with 16 GB in 7-8 hours (as advertised in the paper and README).