chenxin061 / pdarts

Codes for our paper "Progressive Differentiable Architecture Search: Bridging the Depth Gap between Search and Evaluation"

Does your method only work on CUDA 10? #23

Open · kids0cn opened this issue 4 years ago

kids0cn commented 4 years ago

Hi there, I tried to reproduce your code on an NVIDIA V100 with CUDA 9, but it does not work.

Experiment dir : /home/limingnie/logsearch-note_of_this_run-20191027-142718
10/27 02:27:18 PM args = Namespace(add_layers=['0', '6', '12'], add_width=['0'], arch_learning_rate=0.0006, arch_weight_decay=0.001, batch_size=64, cifar100=False, cutout=False, cutout_length=16, drop_path_prob=0.3, dropout_rate=['0.1', '0.4', '0.7'], epochs=25, grad_clip=5, init_channels=16, layers=5, learning_rate=0.025, learning_rate_min=0.0, momentum=0.9, note='note_of_this_run', report_freq=50, save='/home/limingnie/logsearch-note_of_this_run-20191027-142718', seed=2, tmp_data_dir='/home/limingnie/cifar-10-batches-py', train_portion=0.5, weight_decay=0.0003, workers=2)
Files already downloaded and verified
10/27 02:27:33 PM param size = 1.276058MB
/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:100: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
10/27 02:27:33 PM Epoch: 0 lr: 2.490143e-02
Traceback (most recent call last):
  File "train_search.py", line 465, in <module>
    main() 
  File "train_search.py", line 155, in main
    train_acc, train_obj = train(train_queue, valid_queue, model, network_params, criterion, optimizer, optimizer_a, lr, train_arch=False)
  File "train_search.py", line 292, in train
    logits = model(input)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 148, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 159, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
    res = scatter_map(inputs)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/cuda/comm.py", line 147, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: out of memory (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:241)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x2b36b153c813 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1cb50 (0x2b36af129b50 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1de6e (0x2b36af12ae6e in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x279 (0x2b36f19a1eb9 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x41c27c8 (0x2b36f03ae7c8 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x3c7beb8 (0x2b36efe67eb8 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x1bd17b1 (0x2b36eddbd7b1 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #7: at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) + 0x272 (0x2b36eddbe152 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x1efadd0 (0x2b36ee0e6dd0 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #9: <unknown function> + 0x3a8db03 (0x2b36efc79b03 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #10: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef<long>, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<c10::cuda::CUDAStream>, std::allocator<c10::optional<c10::cuda::CUDAStream> > > > const&) + 0x4db (0x2b36f07a438b in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #11: <unknown function> + 0x7846a3 (0x2b36ebbd06a3 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x20ffc4 (0x2b36eb65bfc4 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: THPFunction_apply(_object*, _object*) + 0x936 (0x2b36eb8ecb86 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
chenxin061 commented 4 years ago

Our test environment was CUDA 10, Python 3.6, and PyTorch 0.4 and 1.0. However, one of my colleagues tested the code with CUDA 9 and it worked well. Since the error in your log is an out-of-memory (OOM) error, I suggest you check the hyper-parameters first, e.g., the batch size. Also, this code has not yet been tested on Python 3.7.
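
For illustration, lowering the batch size only needs the corresponding command-line flag. The flag names below are inferred from the args dump in the log above (batch_size=64, tmp_data_dir=...), so treat this as a sketch of the invocation rather than a verified one:

```
python train_search.py --batch_size 32 --tmp_data_dir /path/to/cifar-10-batches-py
```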

Jeffrey-JDong commented 4 years ago

Hi, I have the same issue. I tried to reproduce the code with CUDA 9, Python 3.6, and PyTorch 0.4, but it still runs out of memory.
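
For anyone hitting the same failure: the traceback shows the OOM occurs inside torch.nn.DataParallel's input scatter, i.e., while the batch is being copied to the target GPUs, so a device that is already partly occupied by another job can trigger it even with a modest batch size. Below is a minimal diagnostic sketch (plain PyTorch calls, nothing specific to this repo) to check which GPUs the process can see and how much memory they have before launching the search:

```python
import torch

# List every GPU this process can see, with its total memory and what
# this process has already allocated. Memory held by *other* processes
# does not show up here; use nvidia-smi to check for that.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    mb = 1024 ** 2
    print(f"cuda:{i} ({props.name}): "
          f"{torch.cuda.memory_allocated(i) / mb:.0f} MB allocated by this process, "
          f"{props.total_memory / mb:.0f} MB total")
```

If another job is occupying one of the cards, restricting the run to a single idle device, e.g. with `CUDA_VISIBLE_DEVICES=0`, keeps DataParallel from scattering the batch onto the busy GPU.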