appledora closed this issue 2 years ago
n_ctx must be at most 512 for emb_class=bert. Try running preprocess.py again.
The input tensor of pad_packed_sequence() should be (batch_size, seq_len, emb_dim). Could you check the shape?
Ah yes, I was thinking about the 512-token limit on BERT. But it seems my dataset contains a max_sequence_length of 879. Would you suggest removing those sequences, or is there any workaround?
My batch_size is 2 and token_emb_dimension is 300. (As per the config JSON; I set it assuming it actually meant the dimension of the GloVe embedding I used?)
I have printed out the dimensions here, and it gives me:
embed out torch.Size([2, 180, 768])
seq size 879
Additionally, @dsindex, could you suggest any reference for the config file keys? Thanks :smiley:
Yes, as you mentioned, 300 is the dimension of the GloVe embeddings. I suggest setting n_ctx to 512 for training. For inference, you may need to split the input into pieces of up to 512 tokens and combine the results in post-processing.
Another method you could try is using BigBird for long sequences.
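The split-and-combine idea for inference could look roughly like the sketch below. It is only an illustration, not ntagger code: `predict_in_chunks` and the `predict` callable are hypothetical names, and `max_len` counts whatever units `predict` accepts (with BERT, the real budget per piece is smaller once [CLS], [SEP] and subword splitting are accounted for). Entities that cross a piece boundary may get cut, so overlapping windows may be worth trying.

```python
from typing import Callable, List, Sequence


def predict_in_chunks(
    tokens: Sequence[str],
    predict: Callable[[Sequence[str]], List[str]],
    max_len: int = 512,
) -> List[str]:
    """Tag a long token sequence piece by piece and concatenate the labels.

    `predict` is a stand-in for a model call that returns one label per token.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    labels: List[str] = []
    # Walk over the input in consecutive, non-overlapping windows.
    for start in range(0, len(tokens), max_len):
        chunk = tokens[start:start + max_len]
        chunk_labels = predict(chunk)
        if len(chunk_labels) != len(chunk):
            raise ValueError("predict must return one label per token")
        labels.extend(chunk_labels)
    return labels
```

With an 879-token input and `max_len=512`, this calls `predict` twice (512 tokens, then 367) and returns 879 labels.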
Okay, so something interesting happened. I double-checked all the sentences in the train, test, and validation files. None of them exceeds 182. However, I got this error:
Traceback (most recent call last):
  File "/content/drive/MyDrive/ntagger/train.py", line 759, in <module>
    main()
  File "/content/drive/MyDrive/ntagger/train.py", line 756, in main
    train(args)
  File "/content/drive/MyDrive/ntagger/train.py", line 591, in train
    eval_loss, eval_f1, best_eval_f1 = train_epoch(model, config, train_loader, valid_loader, epoch_i, best_eval_f1)
  File "/content/drive/MyDrive/ntagger/train.py", line 104, in train_epoch
    logits, prediction = model(x, freeze_bert=freeze_bert)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/ntagger/model/model.py", line 1102, in forward
    lstm_out, _ = torch.nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True, total_length=self.seq_size)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/utils/rnn.py", line 312, in pad_packed_sequence
    .format(total_length, max_seq_length))
ValueError: Expected total_length to be at least the length of the longest sequence in input, but got total_length=183 and max sequence length being 879
For ModelClass I am using the BertLSTMCRF class from model.py. However, I had to make some changes to it since my dataset doesn't contain any POS tags, so I had to remove all mentions of them there. I wonder whether that is causing the problem?
According to the error messages, you got the error in https://github.com/dsindex/ntagger/blob/master/model/model.py#L784, raised from https://github.com/pytorch/pytorch/blob/master/torch/nn/utils/rnn.py#L308: "got total_length=183 and max sequence length being 879".
I guess n_ctx in your config file is still not changed, or you did not run preprocess.py again for the new n_ctx.
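The ValueError fires when total_length is smaller than the longest sequence in the packed input, so one way to narrow this down is to check the per-example lengths before they reach the LSTM. The sketch below is a plain-Python illustration of that condition; `find_too_long` is a hypothetical helper (not part of ntagger or PyTorch), and `lengths` stands for whatever lengths are handed to the packing step.

```python
from typing import List, Sequence, Tuple


def find_too_long(lengths: Sequence[int], total_length: int) -> List[Tuple[int, int]]:
    """Return (index, length) for every sequence longer than total_length.

    Mirrors the condition behind the error message: padding back to
    total_length only works if no sequence is longer than it.
    """
    return [(i, n) for i, n in enumerate(lengths) if n > total_length]


# Example with the numbers from the traceback: one length of 879 vs. total_length=183.
offenders = find_too_long([120, 879, 183], total_length=183)
print(offenders)  # [(1, 879)]
```

An empty result means the lengths are consistent with n_ctx; a non-empty one points at examples that were likely preprocessed with a different n_ctx or measured in a different unit (for example subword tokens instead of words).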
I double checked the config file this time, and then ran preprocess.py on it. Getting the same error for a different dimension now :3
  File "/content/drive/MyDrive/ntagger/model/model.py", line 782, in forward
    lstm_out, _ = torch.nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True, total_length=self.seq_size)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/utils/rnn.py", line 312, in pad_packed_sequence
    .format(total_length, max_seq_length))
ValueError: Expected total_length to be at least the length of the longest sequence in input, but got total_length=182 and max sequence length being 1815
Really confused about the arbitrarily large value of max sequence length.
This time I used the default config-bert.json provided in the repo, only changing the n_ctx value to 182 and setting lowercase to FALSE.
I have solved this error :'3 I was being dumb and wasn't using the --bert_use_word_embedding flag with --bert_use_subword_pooling. For now this issue is resolved.
However, I am now facing an issue in:
  File "/content/drive/MyDrive/ntagger/model/model.py", line 752, in forward
    token_ids = x[base_idx+2]
IndexError: list index out of range
It is probably because my data doesn't have any POS tags in it. Do you have any suggestions regarding this? I previously modified util-bert.py to fix the POS issue during preprocessing and it worked properly. But I am kind of at a loss in the model.py file :sweat_smile:
I think if you remove the POS feature then you need to change base_idx to 4. https://github.com/dsindex/ntagger/blob/master/model/model.py#L726
The original input index looks like:
# x[0,1,2] : [batch_size, seq_size], input_ids / input_mask / segment_ids == input_ids / attention_mask / token_type_ids
# x[3] : [batch_size, seq_size], pos_ids
# x[4] : [batch_size, seq_size, char_n_ctx], char_ids
# with --bert_use_doc_context
# x[5] : [batch_size, seq_size], doc2sent_idx
# x[6] : [batch_size, seq_size], doc2sent_mask
# x[7] : [batch_size, seq_size], word2token_idx with --bert_use_subword_pooling
# x[8] : [batch_size, seq_size], word2token_mask with --bert_use_subword_pooling
# x[9] : [batch_size, seq_size], word_ids with --bert_use_word_embedding
# without --bert_use_doc_context
# x[5] : [batch_size, seq_size], word2token_idx with --bert_use_subword_pooling
# x[6] : [batch_size, seq_size], word2token_mask with --bert_use_subword_pooling
# x[7] : [batch_size, seq_size], word_ids with --bert_use_word_embedding
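To see how the positions shift once pos_ids is dropped, the listing above can be turned into a small lookup. This is only a sketch under the assumption that features are appended in exactly the listed order; `feature_indices` is a hypothetical helper, not a function in ntagger, and flag combinations that the listing does not show are a guess.

```python
from typing import Dict


def feature_indices(
    has_pos: bool = True,
    use_doc_context: bool = False,
    use_subword_pooling: bool = False,
    use_word_embedding: bool = False,
) -> Dict[str, int]:
    """Map each input feature name to its position in x, following the listing."""
    names = ["input_ids", "input_mask", "segment_ids"]
    if has_pos:
        names.append("pos_ids")  # x[3] in the original layout
    names.append("char_ids")
    if use_doc_context:
        names += ["doc2sent_idx", "doc2sent_mask"]
    if use_subword_pooling:
        names += ["word2token_idx", "word2token_mask"]
    if use_word_embedding:
        names.append("word_ids")
    return {name: i for i, name in enumerate(names)}


# Without POS tags every later feature moves down by one,
# so the first index after char_ids becomes 4 instead of 5.
layout = feature_indices(has_pos=False, use_subword_pooling=True, use_word_embedding=True)
print(layout["char_ids"], layout["word2token_idx"], layout["word_ids"])  # 3 4 6
```

Comparing the length of x against the largest index in this mapping is a quick way to catch an IndexError like the one above before the forward pass.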
Hello @dsindex!! Sorry for the late update; for the last week I was struggling to get hold of a machine large enough to run this repo faster. I finally ran it successfully after making the modifications you suggested. It works now!! Thank you for your amazing patience and cooperation!! You have created an elegant repo! You can close this issue now.
Hi, I keep getting this error while running this on my own dataset:
My config-bert.json looks like this:
My training script looks like this:
Could you kindly tell me what I am doing wrong?