bearpelican / musicautobot

Using deep learning to generate music in MIDI format.
MIT License

CUDA error: device-side assert triggered when trying to run the Train notebook #21

Closed abdallah197 closed 4 years ago

abdallah197 commented 4 years ago

Hi, I have run into an error while trying to replicate the train.ipynb notebook for the music transformer. I installed the library using the instructions in the repo and tried to run the notebook. The error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-13-495233eaf2b4> in <module>
----> 1 learn.fit_one_cycle(4)

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
     21     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
     22                                        final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 23     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     24 
     25 def fit_fc(learn:Learner, tot_epochs:int=1, lr:float=defaults.lr,  moms:Tuple[float,float]=(0.95,0.85), start_pct:float=0.72,

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    198         else: self.opt.lr,self.opt.wd = lr,wd
    199         callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
--> 200         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    201 
    202     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
     99             for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
    100                 xb, yb = cb_handler.on_batch_begin(xb, yb)
--> 101                 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
    102                 if cb_handler.on_batch_end(loss): break
    103 

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     24     if not is_listy(xb): xb = [xb]
     25     if not is_listy(yb): yb = [yb]
---> 26     out = model(*xb)
     27     out = cb_handler.on_loss_begin(out)
     28 

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     98     def forward(self, input):
     99         for module in self:
--> 100             input = module(input)
    101         return input
    102 

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

/GW/Health-Corpus/work/nn/musicautobot/musicautobot/music_transformer/model.py in forward(self, x)
     33         seq_len = m_len + x_len
     34 
---> 35         mask = rand_window_mask(x_len, m_len, inp.device, max_size=self.mask_steps, is_eval=not self.training) if self.mask else None
     36         if m_len == 0: mask[...,0,0] = 0
     37         #[None,:,:None] for einsum implementation of attention

/GW/Health-Corpus/work/nn/musicautobot/musicautobot/utils/attention_mask.py in rand_window_mask(x_len, m_len, device, max_size, p, is_eval)
     15         win_size,k = (1,1)
     16     else: win_size,k = (np.random.randint(0,max_size)+1,0)
---> 17     return window_mask(x_len, device, m_len, size=(win_size,k))
     18 
     19 def lm_mask(x_len, device):

/GW/Health-Corpus/work/nn/musicautobot/musicautobot/utils/attention_mask.py in window_mask(x_len, device, m_len, size)
      6     mem_mask = torch.zeros((x_len,m_len), device=device)
      7     tri_mask = torch.triu(torch.ones((x_len//win_size+1,x_len//win_size+1), device=device),diagonal=k)
----> 8     window_mask = tri_mask.repeat_interleave(win_size,dim=0).repeat_interleave(win_size,dim=1)[:x_len,:x_len]
      9     if x_len: window_mask[...,0] = 0 # Always allowing first index to see. Otherwise you'll get NaN loss
     10     mask = torch.cat((mem_mask, window_mask), dim=1)[None,None]

RuntimeError: CUDA error: device-side assert triggered 
bearpelican commented 4 years ago

Can you re-run the notebook with this line at the beginning?

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

It'll give a more informative error than the generic RuntimeError.
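Concretely, something like this at the very top of the notebook, before anything touches the GPU (the variable is read when the CUDA context is created, so it has to be set early):

import os

# Make CUDA kernel launches synchronous, so the device-side assert is
# reported at the line that triggered it instead of at a later,
# unrelated call.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'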

Also, does this happen on the first batch or after training for a few batches?

abdallah197 commented 4 years ago

@bearpelican that's the error after I use os.environ['CUDA_LAUNCH_BLOCKING'] = '1':

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-6-495233eaf2b4> in <module>
----> 1 learn.fit_one_cycle(4)

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/fastai/train.py in fit_one_cycle(learn, cyc_len, max_lr, moms, div_factor, pct_start, final_div, wd, callbacks, tot_epochs, start_epoch)
     21     callbacks.append(OneCycleScheduler(learn, max_lr, moms=moms, div_factor=div_factor, pct_start=pct_start,
     22                                        final_div=final_div, tot_epochs=tot_epochs, start_epoch=start_epoch))
---> 23     learn.fit(cyc_len, max_lr, wd=wd, callbacks=callbacks)
     24 
     25 def fit_fc(learn:Learner, tot_epochs:int=1, lr:float=defaults.lr,  moms:Tuple[float,float]=(0.95,0.85), start_pct:float=0.72,

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    198         else: self.opt.lr,self.opt.wd = lr,wd
    199         callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
--> 200         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    201 
    202     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
     99             for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
    100                 xb, yb = cb_handler.on_batch_begin(xb, yb)
--> 101                 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
    102                 if cb_handler.on_batch_end(loss): break
    103 

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     24     if not is_listy(xb): xb = [xb]
     25     if not is_listy(yb): yb = [yb]
---> 26     out = model(*xb)
     27     out = cb_handler.on_loss_begin(out)
     28 

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     98     def forward(self, input):
     99         for module in self:
--> 100             input = module(input)
    101         return input
    102 

~/anaconda3/envs/musicautobot/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

/GW/Health-Corpus/work/nn/musicautobot/musicautobot/music_transformer/model.py in forward(self, x)
     29 
     30         bs,x_len = x.size()
---> 31         inp = self.drop_emb(self.encoder(x) + benc) #.mul_(self.d_model ** 0.5)
     32         m_len = self.hidden[0].size(1) if hasattr(self, 'hidden') and len(self.hidden[0].size()) > 1 else 0
     33         seq_len = m_len + x_len

RuntimeError: CUDA error: device-side assert triggered
abdallah197 commented 4 years ago

@bearpelican A side note: training starts normally and crashes every time at around 20% of the first epoch.

bearpelican commented 4 years ago

Looks like it's failing on the encoder. That usually happens when you have a token that is out of range of the vocab/embedding size. Are you training on custom data?
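For intuition, this failure mode can be reproduced standalone with plain PyTorch (illustrative only, not project code):

import torch
import torch.nn as nn

emb = nn.Embedding(312, 64)      # e.g. a 312-token vocab
emb(torch.tensor([0, 311]))      # ids in range: fine
# emb(torch.tensor([312]))       # id out of range: IndexError on CPU,
#                                # "device-side assert triggered" on CUDA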

Try looping through your data to make sure all tokens are within range:

for i in data.train_ds:
    assert i[0].data.max() < len(learn.data.vocab)
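If the assert fires, a variation that records which items are offending instead of stopping at the first one:

bad_idxs = [i for i, item in enumerate(data.train_ds)
            if item[0].data.max() >= len(learn.data.vocab)]
print(len(bad_idxs), bad_idxs[:10])   # how many items offend, and the first few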
abdallah197 commented 4 years ago

It returns an assertion error. The data used was the Lakh MIDI dataset. What would be a suggested fix in this situation? To clip the tokens that are larger than the embedding size (which is 312 when printed), or is there a way to extend the embedding size?

bearpelican commented 4 years ago

By default, the tokens should be clipped by duration, so I'm not sure why you are getting an out-of-bounds error. Have you tried checking whether the data was encoded correctly? data.train_ds[idx][0].play(), where idx is the index of the file that breaks the assertion. If the playback doesn't sound right, then something must be off.

One way, as you suggested, is to increase the embedding length.

Currently the embedding length is calculated from the vocab length:

model = get_language_model(arch, len(data.vocab.itos), config=config, drop_mult=drop_mult)

To handle longer note durations, you can increase the default DUR_SIZE, and the vocab will adjust accordingly.

Unfortunately these settings are hardcoded at the moment.
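Roughly, the dependency looks like this (the numbers below are placeholders, not the real constants, and the vocab may include other token groups as well):

# Illustrative only: N_SPECIAL/NOTE_SIZE/DUR_SIZE stand in for the
# hardcoded constants in the musicautobot source; the values are made up.
N_SPECIAL, NOTE_SIZE, DUR_SIZE = 8, 128, 48
vocab_len = N_SPECIAL + NOTE_SIZE + DUR_SIZE   # vocab grows with DUR_SIZE
# The embedding matrix has vocab_len rows:
# model = get_language_model(arch, vocab_len, config=config, drop_mult=drop_mult)
# so a larger DUR_SIZE makes longer durations map to valid token ids.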

abdallah197 commented 4 years ago

@bearpelican It seems some of the MIDI files produced tokens outside the expected range. One solution that worked was to eliminate those files; the other was to re-run the preprocessing notebook, although I'm not sure which fixes it applies.
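For anyone hitting the same thing, the elimination step amounts to something like this sketch (encode_file is a hypothetical stand-in for the preprocessing step that turns a MIDI file into a token id array, and the data path is assumed):

from pathlib import Path

vocab_len = 312                                  # vocab size reported above
midi_files = Path('data/midi').rglob('*.mid')    # assumed data location

good_files = []
for path in midi_files:
    try:
        tokens = encode_file(path)               # hypothetical: MIDI -> token ids
    except Exception:
        continue                                 # skip files that fail to parse
    if tokens.max() < vocab_len:                 # keep only in-range files
        good_files.append(path)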