fastai / course-v3

The 3rd edition of course.fast.ai
https://course.fast.ai/
Apache License 2.0
4.9k stars 3.55k forks source link

lesson 7 superres cuda error #126

Closed Data-drone closed 5 years ago

Data-drone commented 5 years ago

When trying to run Lesson 7 superres

learn.lr_find

in the train section gives:

RuntimeErrorTraceback (most recent call last)
<ipython-input-20-d81c6bd29d71> in <module>()
----> 1 learn.lr_find()

/usr/local/lib/python3.6/site-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, **kwargs)
     29     cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
     30     a = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 31     learn.fit(a, start_lr, callbacks=[cb], **kwargs)
     32 
     33 def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:

/usr/local/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    164         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    165         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 166             callbacks=self.callbacks+callbacks)
    167 
    168     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/usr/local/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     92     except Exception as e:
     93         exception = e
---> 94         raise e
     95     finally: cb_handler.on_train_end(exception)
     96 

/usr/local/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     82             for xb,yb in progress_bar(data.train_dl, parent=pbar):
     83                 xb, yb = cb_handler.on_batch_begin(xb, yb)
---> 84                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     85                 if cb_handler.on_batch_end(loss): break
     86 

/usr/local/lib/python3.6/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     16     if not is_listy(xb): xb = [xb]
     17     if not is_listy(yb): yb = [yb]
---> 18     out = model(*xb)
     19     out = cb_handler.on_loss_begin(out)
     20 

/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/site-packages/fastai/layers.py in forward(self, x)
    113         for l in self.layers:
    114             res.orig = x
--> 115             nres = l(res)
    116             # We have to remove res.orig to avoid hanging refs and therefore memory leaks
    117             res.orig = None

/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/site-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/site-packages/torch/nn/modules/conv.py in forward(self, input)
    318     def forward(self, input):
    319         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320                         self.padding, self.dilation, self.groups)
    321 
    322 

RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:405

this is with 1.0.37 has anyone else gotten this error?

stas00 commented 5 years ago

You're running out of CUDA memory. Either you have other open notebooks taking up the GPU RAM, or you have too little to start with. Use nvidia-smi to debug this.

If you still can't figure it out post in https://forums.fast.ai/c/part1-v3, including the output of:

python -c 'import fastai; fastai.show_install(1)'
Data-drone commented 5 years ago

I think it was actually a cuda 9.1 issue. I moved to cuda 9.2 and now it seems fine

yang-yk commented 5 years ago

I have the same question with you. My cuda version is 9.0. When I run the pytorch example code of MNIST, everything is ok. But when I run the pytorch example code of imagenet. this problem appears. Can you tell me how to solve this problem. Thanks.

stas00 commented 5 years ago

@yang-yk, please post at forums, each lesson has its own thread where you can ask for help, describing specifically what doesn't work, showing your code if it's custom one or which notebook you have a difficulty with if it's unmodified. And including your setup information (3 comments up).