fastai / fastai

The fastai deep learning library
http://docs.fast.ai
Apache License 2.0

ConvLearner fit function does not work #803

Closed: cSchubes closed this issue 6 years ago

cSchubes commented 6 years ago

Hi,

I am following lesson 1 of the DLC course and have encountered what appears to be a bug in FastAI itself. The following lines start the training, but it then fails with an error related to NumPy:

arch=resnet34
data = ImageClassifierData.from_paths(PATH, bs=8, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 2)

Batch size is 8 because I have only 2GB on my GPU. The error is below:

TypeError                                 Traceback (most recent call last)
<ipython-input-11-676345a6c308> in <module>()
      2 data = ImageClassifierData.from_paths(PATH, bs=8, tfms=tfms_from_model(arch, sz))
      3 learn = ConvLearner.pretrained(arch, data, precompute=True)
----> 4 learn.fit(0.01, 2)

~/fastai/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
    300         self.sched = None
    301         layer_opt = self.get_layer_opt(lrs, wds)
--> 302         return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    303 
    304     def warm_up(self, lr, wds=None):

~/fastai/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, use_clr_beta, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, use_swa, swa_start, swa_eval_freq, **kwargs)
    247             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, fp16=self.fp16,
    248             swa_model=self.swa_model if use_swa else None, swa_start=swa_start,
--> 249             swa_eval_freq=swa_eval_freq, **kwargs)
    250 
    251     def get_layer_groups(self): return self.models.get_layer_groups()

~/fastai/fastai/model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, visualize, **kwargs)
    160 
    161         if not all_val:
--> 162             vals = validate(model_stepper, cur_data.val_dl, metrics, epoch, seq_first=seq_first, validate_skip = validate_skip)
    163             stop=False
    164             for cb in callbacks: stop = stop or cb.on_epoch_end(vals)

~/fastai/fastai/model.py in validate(stepper, dl, metrics, epoch, seq_first, validate_skip)
    240             loss.append(to_np(l))
    241             res.append([f(datafy(preds), datafy(y)) for f in metrics])
--> 242     return [np.average(loss, 0, weights=batch_cnts)] + list(np.average(np.stack(res), 0, weights=batch_cnts))
    243 
    244 def get_prediction(x):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/numpy/lib/function_base.py in average(a, axis, weights, returned)
    381             wgt = wgt.swapaxes(-1, axis)
    382 
--> 383         scl = wgt.sum(axis=axis, dtype=result_dtype)
    384         if np.any(scl == 0.0):
    385             raise ZeroDivisionError(

~/anaconda3/envs/fastai/lib/python3.6/site-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims, initial)
     34 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     35          initial=_NoValue):
---> 36     return umr_sum(a, axis, dtype, out, keepdims, initial)
     37 
     38 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,

TypeError: No loop matching the specified signature and casting was found for ufunc add

This looks like something internal to FastAI and its interactions with Numpy. Any help would be greatly appreciated!
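For reference, the line that fails in validate (model.py, line 242) calls np.average with weights=batch_cnts, i.e. it averages the per-batch losses and metrics weighted by batch size. Below is a minimal pure-Python sketch of that same arithmetic, not fastai's code. The function name weighted_average and the sample numbers are hypothetical; the float() coercion is an assumption about what well-behaved inputs look like, not a confirmed diagnosis of this error.

```python
def weighted_average(values, weights):
    """Average `values`, weighting each entry by the matching entry of `weights`."""
    # Coerce everything to plain Python floats so the arithmetic below
    # never sees an unexpected numeric type (illustrative assumption).
    values = [float(v) for v in values]
    weights = [float(w) for w in weights]
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0.0:
        raise ZeroDivisionError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical validation run: two full batches of 8 images and a last batch of 4.
batch_losses = [0.5, 0.25, 1.0]
batch_cnts = [8, 8, 4]
epoch_loss = weighted_average(batch_losses, batch_cnts)
print(epoch_loss)
```

Weighting by batch size matters because the last validation batch is usually smaller than bs=8, so a plain mean over batches would over-count it.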

chasak commented 6 years ago

Getting the same error. I was able to solve it by updating the repo on my local machine and setting learn.metrics = []. The relevant discussion and answer by fizx are here: http://forums.fast.ai/t/typeerror-on-the-first-exemple-in-lesson-1-no-loop-matching-the-specified-signature-and-casting/22403/5

jph00 commented 6 years ago

Is this on py37? If so, there's a known bug with numpy.

cSchubes commented 6 years ago

Python 3.6 actually. learn.metrics = [] works, but then I don't get nice metrics... and if I try and use metrics, it fails.
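One idea for keeping metrics, offered only as an untested sketch: wrap each metric so that it always returns a plain Python float before the values reach np.average. This assumes the failure comes from metric results that are not plain numbers, which this thread does not confirm. The names as_plain_float and toy_accuracy are hypothetical; only learn.metrics comes from the discussion above.

```python
import functools


def as_plain_float(metric):
    """Wrap a metric(preds, targs) function so its result is a built-in float."""
    @functools.wraps(metric)
    def wrapped(preds, targs):
        # float() accepts any object that defines __float__, including
        # zero-dimensional array-like scalars.
        return float(metric(preds, targs))
    return wrapped


# Hypothetical stand-in metric so the sketch runs without fastai installed.
def toy_accuracy(preds, targs):
    hits = sum(1 for p, t in zip(preds, targs) if p == t)
    return hits / len(targs)


safe_accuracy = as_plain_float(toy_accuracy)
print(safe_accuracy([1, 0, 1, 1], [1, 1, 1, 1]))
```

If this helps at all, usage would presumably look like learn.metrics = [as_plain_float(some_metric)] in place of learn.metrics = [], where some_metric is whichever metric function you already use.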

sgugger commented 6 years ago

I didn't have the bug (which may be due to the fact that my PyTorch is 0.4), but I think the fix from @ncihnegn should work. Please reopen if that's not the case.

Sammy-iiitb commented 5 years ago

If that doesn't work now, "create_cnn" can be used in place of ConvLearner.

zszazi commented 5 years ago

So there are two options, since ConvLearner doesn't work now. Replace ConvLearner with one of:

  1. create_cnn
  2. cnn_learner - link