dsindex / ntagger

reference pytorch code for named entity tagging

RuntimeError: shape '[801, 2, 400]' is invalid for input of size 144000 #11

Closed appledora closed 2 years ago

appledora commented 2 years ago

Hi, I keep getting this error while running this on my own dataset:

  File "/content/drive/MyDrive/ntagger/model/model.py", line 1102, in forward
    lstm_out, _ = torch.nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True, total_length=self.seq_size)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/utils/rnn.py", line 315, in pad_packed_sequence
    sequence.data, sequence.batch_sizes, batch_first, padding_value, max_seq_length)
RuntimeError: shape '[801, 2, 400]' is invalid for input of size 144000

My config-bert.json looks like this :

{
    "emb_class": "bert",
    "enc_class": "bilstm",
    "n_ctx": 879,
    "lowercase": true,
    "token_emb_dim": 300,
    "pad_token": "<pad>",
    "pad_token_id": 0,
    "unk_token": "<unk>",
    "unk_token_id": 1,
    "dsa_num_attentions": 4,
    "dsa_dim": 300,
    "dsa_r": 2,
    "pos_emb_dim": 100,
    "pad_pos": "<pad>",
    "pad_pos_id": 0,
    "char_n_ctx": 50,
    "char_vocab_size": 262,
    "char_padding_idx": 261,
    "char_emb_dim": 25,
    "char_num_filters": 30,
    "char_kernel_sizes": [3, 9],
    "dropout": 0.1,
    "lstm_hidden_dim": 200,
    "lstm_num_layers": 2,
    "lstm_dropout": 0.0,
    "mha_num_attentions": 8,
    "pad_label": "<pad>",
    "pad_label_id": 0,
    "default_label": "O",
    "prev_context_size": 64
}

My training script looks like this :

CUDA_LAUNCH_BLOCKING=1 \
python train.py \
--config=configs/config-bert.json \
--data_dir=data/bangla \
--save_path=pytorch-model-bert.pt \
--bert_model_name_or_path=csebuetnlp/banglabert \
--bert_use_subword_pooling \
--batch_size=2 \
--eval_batch_size=8 \
--lr=5e-5 \
--epoch=10 \
--use_mha \
--use_crf \
--patience 4

Could you kindly tell me what I am doing wrong?

dsindex commented 2 years ago

n_ctx must be at most 512 for emb_class=bert. try running preprocess.py again.

input tensor of pad_packed_sequence() should be (batch_size, seq_len, emb_dim). could you check the shape?
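
The numbers in the error message can be checked with plain arithmetic. This is only a sketch: it assumes the trailing 400 is 2 * lstm_hidden_dim for a bidirectional LSTM and that the 2 is the batch size, both read off the config and command posted above. It does not explain where the 801 comes from.

```python
# Values copied from the RuntimeError and the settings posted above.
requested_shape = (801, 2, 400)  # shape pad_packed_sequence tried to build
actual_elements = 144000         # "input of size 144000"

batch_size = 2                   # --batch_size=2
lstm_out_dim = 2 * 200           # assumption: bidirectional, lstm_hidden_dim=200

# Number of elements the requested shape would need.
requested_elements = 1
for dim in requested_shape:
    requested_elements *= dim

# Number of time steps the LSTM output actually covers.
actual_steps = actual_elements // (batch_size * lstm_out_dim)

print(requested_elements)  # 640800
print(actual_steps)        # 180
```

Under those assumptions the LSTM output covers 180 time steps, far fewer than the requested shape needs, which is why comparing the real tensor shape against seq_size is a useful next step.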

appledora commented 2 years ago

Ah yes, I was thinking about the 512-token limit on BERT, but it seems my dataset contains a maximum sequence length of 879. Would you suggest removing those sentences, or is there a workaround? My batch_size is 2 and token_emb_dim is 300 (as per the config JSON; I set it assuming it meant the dimension of the GloVe embedding I used). I printed out the dimensions, and it gives me:

embed out torch.Size([2, 180, 768])
seq size 879

Additionally, @dsindex, could you point me to any reference for the config file keys? Thanks :smiley:

dsindex commented 2 years ago

yes, as you mentioned, 300 is the dim of the GloVe embeddings. i suggest setting n_ctx to 512 for training. for inference, you may need to split the input into pieces of up to 512 and combine the results in post-processing.
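
The split-and-combine idea for inference could look roughly like the sketch below. It is not code from ntagger: `tag_in_windows` and `dummy_tagger` are hypothetical names, `tag_fn` stands in for whatever runs the model on one window, and the windows do not overlap, so an entity that straddles a window boundary may be cut. Any special tokens the tokenizer adds would also reduce the usable budget below 512.

```python
from typing import Callable, List, Sequence


def tag_in_windows(
    tokens: Sequence[str],
    tag_fn: Callable[[Sequence[str]], List[str]],
    max_len: int = 512,
) -> List[str]:
    """Tag a long token sequence window by window and join the labels."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    labels: List[str] = []
    for start in range(0, len(tokens), max_len):
        window = tokens[start:start + max_len]
        window_labels = tag_fn(window)
        if len(window_labels) != len(window):
            raise ValueError("tag_fn must return one label per token")
        labels.extend(window_labels)
    return labels


# Dummy tagger standing in for real model inference: labels everything "O".
seen_window_sizes: List[int] = []


def dummy_tagger(window: Sequence[str]) -> List[str]:
    seen_window_sizes.append(len(window))
    return ["O"] * len(window)


long_sentence = [f"tok{i}" for i in range(879)]
labels = tag_in_windows(long_sentence, dummy_tagger)
print(seen_window_sizes)  # [512, 367]
print(len(labels))        # 879
```

Overlapping windows with a merge rule for the overlap would be gentler on entities near a boundary, at the cost of extra inference calls.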

another method you could try is using BigBird for long sequences.

appledora commented 2 years ago

Okay, so something interesting happened. I double-checked all the sentences in the train, test and validation files. None of them exceeds 182 tokens. However, I got this error:

Traceback (most recent call last):
  File "/content/drive/MyDrive/ntagger/train.py", line 759, in <module>
    main()
  File "/content/drive/MyDrive/ntagger/train.py", line 756, in main
    train(args)
  File "/content/drive/MyDrive/ntagger/train.py", line 591, in train
    eval_loss, eval_f1, best_eval_f1 = train_epoch(model, config, train_loader, valid_loader, epoch_i, best_eval_f1)
  File "/content/drive/MyDrive/ntagger/train.py", line 104, in train_epoch
    logits, prediction = model(x, freeze_bert=freeze_bert)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/drive/MyDrive/ntagger/model/model.py", line 1102, in forward
    lstm_out, _ = torch.nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True, total_length=self.seq_size)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/utils/rnn.py", line 312, in pad_packed_sequence
    .format(total_length, max_seq_length))
ValueError: Expected total_length to be at least the length of the longest sequence in input, but got total_length=183 and max sequence length being 879

appledora commented 2 years ago

For ModelClass I am using the BertLSTMCRF class from model.py. However I had to make some changes to it since my dataset doesn't contain any POS tags, so I had to remove all its mentions there. I wonder whether that is causing the problem?

dsindex commented 2 years ago

according to the error message, you got the error at https://github.com/dsindex/ntagger/blob/master/model/model.py#L784 and https://github.com/pytorch/pytorch/blob/master/torch/nn/utils/rnn.py#L308: "got total_length=183 and max sequence length being 879"

i guess n_ctx in your config file is still not changed, or you did not run preprocess.py again after changing n_ctx.
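
One way to rule out a stale config before re-running preprocess.py is a small check like the one below. It is a sketch, not part of ntagger: `n_ctx_problem` is a hypothetical helper, the 512 limit is the one quoted in this thread, and the JSON strings only mirror the two relevant keys from the config posted above.

```python
import json

BERT_MAX_N_CTX = 512  # limit quoted in this thread for emb_class=bert


def n_ctx_problem(config: dict):
    """Return a message if n_ctx is too large for a BERT embedder, else None."""
    if config.get("emb_class") == "bert" and config["n_ctx"] > BERT_MAX_N_CTX:
        return f"n_ctx={config['n_ctx']} exceeds {BERT_MAX_N_CTX}"
    return None


# Only the two keys that matter here; a real config has many more.
posted = json.loads('{"emb_class": "bert", "n_ctx": 879}')
fixed = json.loads('{"emb_class": "bert", "n_ctx": 512}')

print(n_ctx_problem(posted))  # n_ctx=879 exceeds 512
print(n_ctx_problem(fixed))   # None
```

In practice the dict would come from `json.load` on the actual config file passed via --config, so the check sees the same values the training run does.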

appledora commented 2 years ago

I double checked the config file this time, and then ran preprocess.py on it. Getting the same error for a different dimension now :3

File "/content/drive/MyDrive/ntagger/model/model.py", line 782, in forward
    lstm_out, _ = torch.nn.utils.rnn.pad_packed_sequence(lstm_out, batch_first=True, total_length=self.seq_size)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/utils/rnn.py", line 312, in pad_packed_sequence
    .format(total_length, max_seq_length))
ValueError: Expected total_length to be at least the length of the longest sequence in input, but got total_length=182 and max sequence length being 1815

Really confused about the arbitrarily large value of max sequence length. This time I used the default config-bert.json provided in the repo, only changing the n_ctx value to 182 and setting lowercase to false.

appledora commented 2 years ago

I have solved this error :'3 I was being dumb and wasn't using the --bert_use_word_embedding flag together with --bert_use_subword_pooling. For now this issue is resolved. However, I am now facing an issue in

File "/content/drive/MyDrive/ntagger/model/model.py", line 752, in forward
    token_ids = x[base_idx+2]
IndexError: list index out of range

It is probably because my data doesn't have any PoS tags in it. Do you have any suggestions regarding this? I previously modified util-bert.py to fix the PoS issue during preprocessing and it worked properly. But I am kind of at a loss in the model.py file :sweat_smile:

dsindex commented 2 years ago

i think if you remove the POS feature then you need to change base_idx to 4. https://github.com/dsindex/ntagger/blob/master/model/model.py#L726

original input index looks like:

        # x[0,1,2] : [batch_size, seq_size], input_ids / input_mask / segment_ids == input_ids / attention_mask / token_type_ids
        # x[3] :     [batch_size, seq_size], pos_ids
        # x[4] :     [batch_size, seq_size, char_n_ctx], char_ids

        # with --bert_use_doc_context
        # x[5] :     [batch_size, seq_size], doc2sent_idx
        # x[6] :     [batch_size, seq_size], doc2sent_mask
        # x[7] :     [batch_size, seq_size], word2token_idx  with --bert_use_subword_pooling
        # x[8] :     [batch_size, seq_size], word2token_mask with --bert_use_subword_pooling
        # x[9] :     [batch_size, seq_size], word_ids        with --bert_use_word_embedding

        # without --bert_use_doc_context
        # x[5] :     [batch_size, seq_size], word2token_idx  with --bert_use_subword_pooling
        # x[6] :     [batch_size, seq_size], word2token_mask with --bert_use_subword_pooling
        # x[7] :     [batch_size, seq_size], word_ids        with --bert_use_word_embedding
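
The effect of dropping pos_ids on the indices above can be sketched as follows. This is an illustration, not the repository's code: `input_layout` is a hypothetical helper, the feature names are taken from the comments above, and it assumes --bert_use_subword_pooling and --bert_use_word_embedding are both on and that preprocessing really omits pos_ids from each batch.

```python
from typing import List


def input_layout(use_pos: bool = True, use_doc_context: bool = False) -> List[str]:
    """List the feature stored at each position of x, in order."""
    names = ["input_ids", "input_mask", "segment_ids"]
    if use_pos:
        names.append("pos_ids")
    names.append("char_ids")
    if use_doc_context:
        names += ["doc2sent_idx", "doc2sent_mask"]
    # Assumption: subword pooling and word embedding are both enabled.
    names += ["word2token_idx", "word2token_mask", "word_ids"]
    return names


with_pos = input_layout(use_pos=True)
without_pos = input_layout(use_pos=False)

# base_idx is taken here to be the position of word2token_idx,
# so that word_ids sits at base_idx + 2.
print(with_pos.index("word2token_idx"))     # 5
print(without_pos.index("word2token_idx"))  # 4
print(without_pos.index("word_ids"))        # 6
```

Every feature after pos_ids moves down by one once it is removed, which is consistent with changing base_idx to 4 when --bert_use_doc_context is not used.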

appledora commented 2 years ago

Hello @dsindex!! Sorry for the late update. For the last week I was struggling to get hold of a large enough machine to run this repo faster. I finally ran it successfully after making the modifications you suggested. It works now!! Thank you for your amazing patience and cooperation!! You have created an elegant repo! You can close this issue now.