KeyError on learn.lr_find()

lvaleriu commented 4 years ago

I'm getting this error when calling learn.lr_find() on a text_classifier_learner right after initialising it. This is the first time I've seen it (it worked fine a few hours ago).

`---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
in
----> 1 learn.lr_find()

~/workspace/fastai2/fastai2/callback/schedule.py in lr_find(self, start_lr, end_lr, num_it, stop_div, show_plot, suggestions)
    195 n_epoch = num_it//len(self.dls.train) + 1
    196 cb=LRFinder(start_lr=start_lr, end_lr=end_lr, num_it=num_it, stop_div=stop_div)
--> 197 with self.no_logging(): self.fit(n_epoch, cbs=cb)
    198 if show_plot: self.recorder.plot_lr_find()
    199 if suggestions:

~/workspace/fastai2/fastai2/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
    180
    181 except CancelFitException: self('after_cancel_fit')
--> 182 finally: self('after_fit')
    183
    184 def validate(self, ds_idx=1, dl=None, cbs=None):

~/workspace/fastai2/fastai2/learner.py in __call__(self, event_name)
    106 def ordered_cbs(self, cb_func): return [cb for cb in sort_by_run(self.cbs) if hasattr(cb, cb_func)]
    107
--> 108 def __call__(self, event_name): L(event_name).map(self._call_one)
    109 def _call_one(self, event_name):
    110 assert hasattr(event, event_name)

~/workspace/fastcore/fastcore/foundation.py in map(self, f, *args, **kwargs)
    360 else f.format if isinstance(f,str)
    361 else f.__getitem__)
--> 362 return self._new(map(g, self))
    363
    364 def filter(self, f, negate=False, **kwargs):

~/workspace/fastcore/fastcore/foundation.py in _new(self, items, *args, **kwargs)
    313 @property
    314 def _xtra(self): return None
--> 315 def _new(self, items, *args, **kwargs): return type(self)(items, *args, use_list=None, **kwargs)
    316 def __getitem__(self, idx): return self._get(idx) if is_indexer(idx) else L(self._get(idx), use_list=None)
    317 def copy(self): return self._new(self.items.copy())

~/workspace/fastcore/fastcore/foundation.py in __call__(cls, x, *args, **kwargs)
     39 return x
     40
---> 41 res = super().__call__(*((x,) + args), **kwargs)
     42 res._newchk = 0
     43 return res

~/workspace/fastcore/fastcore/foundation.py in __init__(self, items, use_list, match, *rest)
    304 if items is None: items = []
    305 if (use_list is not None) or not _is_array(items):
--> 306 items = list(items) if use_list else _listify(items)
    307 if match is not None:
    308 if is_coll(match): match = len(match)

~/workspace/fastcore/fastcore/foundation.py in _listify(o)
    240 if isinstance(o, list): return o
    241 if isinstance(o, str) or _is_array(o): return [o]
--> 242 if is_iter(o): return list(o)
    243 return [o]
    244

~/workspace/fastcore/fastcore/foundation.py in __call__(self, *args, **kwargs)
    206 if isinstance(v,_Arg): kwargs[k] = args.pop(v.i)
    207 fargs = [args[x.i] if isinstance(x, _Arg) else x for x in self.pargs] + args[self.maxi+1:]
--> 208 return self.fn(*fargs, **kwargs)
    209
    210 # Cell

~/workspace/fastai2/fastai2/learner.py in _call_one(self, event_name)
    109 def _call_one(self, event_name):
    110 assert hasattr(event, event_name)
--> 111 [cb(event_name) for cb in sort_by_run(self.cbs)]
    112
    113 def _bn_bias_state(self, with_bias): return bn_bias_params(self.model, with_bias).map(self.opt.state)

~/workspace/fastai2/fastai2/learner.py in (.0)
    109 def _call_one(self, event_name):
    110 assert hasattr(event, event_name)
--> 111 [cb(event_name) for cb in sort_by_run(self.cbs)]
    112
    113 def _bn_bias_state(self, with_bias): return bn_bias_params(self.model, with_bias).map(self.opt.state)

~/workspace/fastai2/fastai2/callback/core.py in __call__(self, event_name)
     21 _run = (event_name not in _inner_loop or (self.run_train and getattr(self, 'training', True)) or
     22 (self.run_valid and not getattr(self, 'training', False)))
---> 23 if self.run and _run: getattr(self, event_name, noop)()
     24 if event_name=='after_fit': self.run=True #Reset self.run to True at each end of fit
     25

~/workspace/fastai2/fastai2/callback/fp16.py in after_fit(self)
    120
    121 def after_fit(self):
--> 122 _copy_state(self.learn.opt, self.master_pgs, self.model_pgs)
    123 self.learn.opt.param_groups = self.old_pgs
    124 delattr(self, "master_pgs")

~/workspace/fastai2/fastai2/callback/fp16.py in _copy_state(opt, pgs1, pgs2)
     58 for pg1,pg2 in zip(pgs1, pgs2):
     59 for p1,p2 in zip(pg1, pg2):
---> 60 opt.state[p2] = copy_clone(opt.state[p1])
     61 del opt.state[p1]
     62

KeyError: Parameter containing:
tensor([[ 1.3822e-04, -1.7729e-04, 6.5048e-05, ..., 3.4043e-05, -1.2695e-04, 2.1729e-05],
        [-1.2790e-06, -2.8387e-06, 1.1174e-06, ..., -5.1155e-07, 2.3744e-06, 3.4990e-06],
        [-3.5445e-04, -3.1944e-04, 2.0357e-04, ..., -2.1868e-04, -5.3635e-04, 1.5944e-04],
        ...,
        [ 4.6585e-05, 7.3835e-05, 3.0947e-05, ..., -8.5723e-05, -3.0673e-05, -5.1899e-05],
        [-1.4656e-06, -3.2445e-06, 1.2902e-06, ..., -5.9240e-07, 2.7339e-06, 4.0058e-06],
        [-1.4656e-06, -3.2445e-06, 1.2902e-06, ..., -5.9240e-07, 2.7339e-06, 4.0058e-06]], device='cuda:0', requires_grad=True)`
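For reference, the last frame of the traceback can be mimicked in isolation. The sketch below is not fastai2 code and is not a diagnosis of the root cause, which this thread never establishes. It only shows how a loop shaped like the one in `_copy_state` raises KeyError when a parameter has no entry in the state mapping. `Param`, `copy_state` and `copy_state_guarded` are hypothetical stand-ins, a plain dict plays the role of `opt.state`, and `dict(...)` stands in for `copy_clone`.

```python
# Hypothetical stand-ins: plain objects play the role of parameters and a
# plain dict plays the role of the optimizer's per-parameter state.
class Param:
    """Placeholder for a parameter; hashed by identity, so usable as a dict key."""
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"Param({self.name})"


def copy_state(state, pgs1, pgs2):
    """Move each entry from a param in pgs1 to its partner in pgs2.

    Same loop shape as the failing frame: indexing state[p1] raises
    KeyError when p1 has no entry.
    """
    for pg1, pg2 in zip(pgs1, pgs2):
        for p1, p2 in zip(pg1, pg2):
            state[p2] = dict(state[p1])  # dict(...) stands in for copy_clone
            del state[p1]


def copy_state_guarded(state, pgs1, pgs2):
    """Same loop, but skip params without state (illustration only, not the library's fix)."""
    for pg1, pg2 in zip(pgs1, pgs2):
        for p1, p2 in zip(pg1, pg2):
            if p1 in state:
                state[p2] = dict(state.pop(p1))


master = [[Param("m0"), Param("m1")]]
model = [[Param("w0"), Param("w1")]]
state = {master[0][0]: {"step": 3}}  # m1 never received any optimizer state

try:
    copy_state(dict(state), master, model)  # work on a copy of the mapping
    raised = False
except KeyError as err:
    raised = True
    print("KeyError for", err.args[0])

guarded = dict(state)
copy_state_guarded(guarded, master, model)
print(guarded)
```

If something like this is what happens here, the question to investigate is why a parameter in `master_pgs` ended up without optimizer state by the time `after_fit` ran.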

sgugger commented 4 years ago

I need more than the lr_find line of code to be able to reproduce this. Running lr_find on a new text classifier works perfectly fine on my side.

lvaleriu commented 4 years ago

I understand your point very well, and this is indeed a real issue. A few things come to mind:

The experiment I'm running is a very close copy of the imdb tutorial/example from the notebook. It is mainly a dataframe with text and label columns. As I said, it worked during the day and broke after some git updates. I was hoping that the stack trace and this "special" error, combined with the recent commit history, would help you pinpoint the error quickly. If that is not the case, I can see a few options:
Give you the dataset (though I'm sure you won't find any error in it. It worked on my side too until recently)
Give you the exact environment installation (official pytorch docker image tag + apt/pip installs). I always work in a docker container and rebuild the images each week to make sure my environment is reproducible. That is a good thing, but it means I run into many temporary issues (pytorch related: RNN issue with 1.4.0; torchvision related: vision import error on pytorch 1.3.0; etc.)
Give you private access to my server so that you can "observe" the issue (I never work on a local machine).

What are your thoughts on this? I also understand, of course, that you are very busy and don't have much time to dig deeply into each issue. It is also the community's role to assist you.

sgugger commented 4 years ago

The first thing that could help is to try to reproduce this bug inside one of the fastai2 notebooks, since anyone can then have access to the dataset and reproduce (and we can know for sure it's an environment problem if different people don't all have the bug). Then make sure you have the latest fastai and fastcore installed as the library still evolves quickly.
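A quick way to report which versions are actually installed is sketched below. It uses only the standard library (`importlib.metadata`, Python 3.8+). The distribution names `fastai2`, `fastcore` and `torch` are assumptions about how the packages were installed, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("fastai2", "fastcore", "torch")):
    """Return {distribution name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so the output can be pasted into the issue.
for name, ver in installed_versions().items():
    print(f"{name}: {ver or 'not installed'}")
```

For installs made from a git checkout, the version string may not change between commits. In that case the output of `git rev-parse HEAD` in each checkout is probably more informative.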

sgugger commented 4 years ago

Closing the issue while waiting for a reproducer, as there is nothing I can do on this right now. Please reopen with an example if needed.

fastai / fastai2

KeyError on learn.lr_find() #121