R00Kie-Liu opened 5 years ago
Thanks for this question!
I think multi-GPU works just like single-GPU. Since our search on CIFAR takes a few hours, we did not consider multi-GPU training. However, during our recent work that generalizes P-DARTS in searching on ImageNet directly, we did use 8 GPUs for acceleration.
@chenxin061 more experiences to share?
To search with multiple GPUs, you need to change a few lines in train_search.py.
- Delete all lines related to GPU ID setting. Instead, you can set GPU ids with CUDA_VISIBLE_DEVICES.
- Add
model = nn.DataParallel(model)
before model = model.cuda()
and model = model.module
after it.
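A minimal sketch of those edits, using a hypothetical TinyNet as a stand-in for the search network (like the real model in train_search.py, it exposes a custom arch_parameters() method):

```python
import torch
import torch.nn as nn

# TinyNet is a hypothetical stand-in for the search network in train_search.py
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)
        self.alphas = nn.Parameter(torch.zeros(3))

    def forward(self, x):
        return self.fc(x)

    def arch_parameters(self):
        return [self.alphas]

model = TinyNet()
model = nn.DataParallel(model)   # wrap for multi-GPU data parallelism
if torch.cuda.is_available():
    model = model.cuda()         # move the wrapped model to GPU
model = model.module             # unwrap so custom methods stay reachable

# custom attribute access works again after unwrapping
params = model.arch_parameters()
```

GPU ids are then selected from the shell, e.g. CUDA_VISIBLE_DEVICES=0,1 python train_search.py.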
I added model = nn.DataParallel(model)
to train_search.py and got this error:
Traceback (most recent call last):
File "train_search.py", line 469, in
@zihaozhang9 To fix that, change model.arch_parameters() to model.module.arch_parameters().
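The reason for that fix is that nn.DataParallel does not forward custom methods to the wrapped model; only the .module attribute gives the original object back. A short sketch (Net is a hypothetical minimal model with a custom method, like the search model's arch_parameters()):

```python
import torch
import torch.nn as nn

# hypothetical minimal model with a custom method
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.alphas = nn.Parameter(torch.zeros(2))

    def forward(self, x):
        return x

    def arch_parameters(self):
        return [self.alphas]

wrapped = nn.DataParallel(Net())

# the wrapper does not expose the custom method...
print(hasattr(wrapped, "arch_parameters"))         # False
# ...but the underlying module still does
print(hasattr(wrapped.module, "arch_parameters"))  # True
```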
I did several things: I commented out the device setting, then added
model = nn.DataParallel(model)
model = model.cuda()
model = model.module
and then set os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2,3,4,5,6,7' and parser.add_argument('--batch_size', type=int, default=192, help='batch size').
But I still cannot run train_search.py on multiple GPUs; it still tries to overwhelm a single GPU and runs out of memory. What is wrong here?
I am using PyTorch 1.0.0 with Python 3.6, and print(torch.cuda.device_count()) returns 4.
If I use
model = nn.DataParallel(model)
model = model.cuda()
together with model.module.arch_parameters(),
I get this error...
The new version of our code now supports multi-GPU search!
@JarveeLee You can try it.
Use CUDA_VISIBLE_DEVICES to assign GPU ids.
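For example, assuming the script name and the --batch_size flag mentioned earlier in this thread, a launch restricted to four GPUs might look like:

```shell
# Hypothetical launch line (script name and flag from this thread):
#   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_search.py --batch_size 192
#
# CUDA_VISIBLE_DEVICES is a plain environment variable read by the CUDA
# runtime, so the launched process only ever sees the listed devices:
CUDA_VISIBLE_DEVICES=0,1 python -c "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES'))"
```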
I saw your modification and did the same to support multi-GPU. What is more, in
class MixedOp(nn.Module):
    def forward(self, x, weights):
        return sum(w * op(x) for w, op in zip(weights, self.m_ops))
the forward should change to
class MixedOp(nn.Module):
    def forward(self, x, weights):
        return sum(w.cuda() * op(x.cuda()) for w, op in zip(weights, self.m_ops))
otherwise the error I encountered will still happen. I am working on a complex, awful GPU server where the environment is hard to control; that is my experience.
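A CPU-only sketch of that MixedOp pattern, with nn.Identity standing in for the real search-space primitives (the .cuda() calls from the comment above are omitted so it runs anywhere):

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """Weighted sum over candidate operations, DARTS-style."""

    def __init__(self):
        super().__init__()
        # nn.Identity is a stand-in for the real primitives (conv, pool, ...)
        self.m_ops = nn.ModuleList([nn.Identity(), nn.Identity()])

    def forward(self, x, weights):
        # With DataParallel, weights and op(x) can land on different devices;
        # the fix above moves both onto the same GPU before multiplying.
        return sum(w * op(x) for w, op in zip(weights, self.m_ops))

x = torch.ones(2, 3)
weights = torch.tensor([0.25, 0.75])
out = MixedOp()(x, weights)   # 0.25*x + 0.75*x == x for identity ops
```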
@chenxin061 Thanks for sharing your code! Can you confirm whether you used 8 V100 GPUs with 16 GB of memory per card or 8 V100 GPUs with 32 GB memory per card? Thanks!
@davidrpugh The search code is tested on two P100 GPUs and the evaluation code is tested on 8 V100 GPUs with 16 GB memory each.
@chenxin061 Thanks! I suspected as much for the V100s. I didn't realize that you used 2 P100s. I was able to complete the search process on CIFAR-10 or CIFAR-100 using a single P100 with 16 GB in 7-8 hours (as advertised in the paper and README).
How to use multi-GPU to search?