RUCAIBox / TextBox

TextBox 2.0 is a text generation library with pre-trained language models
https://github.com/RUCAIBox/TextBox
MIT License

Quick Start Error #182

Closed · Dicky35 closed this 2 years ago

Dicky35 commented 2 years ago

For the Quick Start in README.md, I tried python run_textbox.py; it works and returns a BLEU score. But when I tried python run_textbox.py --rnn_type=lstm --max_vocab_size=4000, it shows:

06 Apr 10:58    INFO epoch 38 training [time: 0.76s, train loss: 2.8981]
06 Apr 10:58    INFO epoch 38 evaluating [time: 0.23s, valid_loss: 4.281664]
06 Apr 10:58    INFO valid ppl: 72.36075204659402
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [00:00<00:00, 206.01it/s]
06 Apr 10:58    INFO epoch 39 training [time: 0.76s, train loss: 2.8782]
06 Apr 10:58    INFO epoch 39 evaluating [time: 0.23s, valid_loss: 4.282278]
06 Apr 10:58    INFO valid ppl: 72.40518443339685
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 156/156 [00:00<00:00, 190.98it/s]
06 Apr 10:58    INFO epoch 40 training [time: 0.82s, train loss: 2.8627]
06 Apr 10:58    INFO epoch 40 evaluating [time: 0.23s, valid_loss: 4.286773]
06 Apr 10:58    INFO valid ppl: 72.73138277179176
06 Apr 10:58    INFO Finished training, best eval result in epoch 37
06 Apr 10:58    INFO best valid loss: 4.267283218029218, best valid ppl: 71.32759063735446
06 Apr 10:58    INFO Loading model structure and parameters from saved/RNN-COCO-Apr-06-2022_10-57-51.pth
  0%|                                                                                                                                 | 0/157 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_textbox.py", line 18, in <module>
    run_textbox(model=args.model, dataset=args.dataset, config_file_list=config_file_list, config_dict={})
  File "/home/LAB/TextBox/textbox/quick_start/quick_start.py", line 90, in run_textbox
    test_result = trainer.evaluate(test_data, load_best_model=saved)
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/LAB/TextBox/textbox/trainer/trainer.py", line 446, in evaluate
    generated = self.model.generate(batch_data, eval_data)
  File "/home/LAB/TextBox/textbox/model/LM/rnn.py", line 64, in generate
    outputs, hidden_states = self.decoder(decoder_input, hidden_states)
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/LAB/TextBox/textbox/module/Decoder/rnn_decoder.py", line 79, in forward
    outputs, hidden_states = self.decoder(input_embeddings, hidden_states)
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 689, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 634, in check_forward_args
    'Expected hidden[0] size {}, got {}')
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 226, in check_hidden_size
    raise RuntimeError(msg.format(expected_hidden_size, list(hx.size())))
RuntimeError: Expected hidden[0] size (2, 1, 128), got [1, 128]

Can you tell me how to solve it?

Dicky35 commented 2 years ago

What's more, if I want to train a model like GPT-2 on a Chinese dataset, I need to specify a Chinese vocab and a tokenizer for Chinese. What parameters should I set in the command?

StevenTang1998 commented 2 years ago

For the second question,

  1. Download the GPT-2 model (or your own GPT-2 model) provided by Hugging Face (https://huggingface.co/gpt2/tree/main), including config.json, merges.txt, pytorch_model.bin, tokenizer.json and vocab.json. Then put them in a folder at the same level as textbox, such as pretrained_model/gpt2 (a programmatic alternative is sketched after the command below).

  2. After downloading, you just need to run the command:

python run_textbox.py --model=GPT2 --dataset=COCO \
                      --pretrained_model_path=pretrained_model/gpt2
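
A minimal sketch (assuming the transformers library is installed; this is not part of TextBox itself) that fetches the same files programmatically instead of downloading them by hand, saving into the pretrained_model/gpt2 path used in the command above:

import os
from transformers import GPT2LMHeadModel, GPT2Tokenizer

save_dir = "pretrained_model/gpt2"
os.makedirs(save_dir, exist_ok=True)
# Writes vocab.json, merges.txt, and the tokenizer config files
GPT2Tokenizer.from_pretrained("gpt2").save_pretrained(save_dir)
# Writes config.json and pytorch_model.bin
GPT2LMHeadModel.from_pretrained("gpt2").save_pretrained(save_dir)
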
StevenTang1998 commented 2 years ago

For the first question, we have fixed the bug. Thanks for your report.
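
For context, the error above comes from the LSTM hidden state: torch.nn.LSTM carries a (h_0, c_0) tuple of tensors, each shaped (num_layers * num_directions, batch, hidden_size), while vanilla RNN and GRU take a single tensor, so code that builds the state for generation has to branch on rnn_type. An illustrative sketch of the distinction (not the actual TextBox patch):

import torch
import torch.nn as nn

num_layers, batch_size, hidden_size = 2, 1, 128

def init_hidden(rnn_type):
    # Zero-initialize the state in the shape each RNN type expects.
    h_0 = torch.zeros(num_layers, batch_size, hidden_size)
    if rnn_type == 'lstm':
        # LSTM needs a second state tensor, the cell state c_0.
        return (h_0, torch.zeros(num_layers, batch_size, hidden_size))
    return h_0  # RNN / GRU take a single tensor

lstm = nn.LSTM(hidden_size, hidden_size, num_layers)
step_input = torch.zeros(1, batch_size, hidden_size)  # one decoding step
outputs, hidden_states = lstm(step_input, init_hidden('lstm'))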

Dicky35 commented 2 years ago

Thanks for your reply! But when I try GPT-2 on COCO, it goes wrong because there is no key named source_text in COCO.

Traceback (most recent call last):
  File "run_textbox.py", line 18, in <module>
    run_textbox(model=args.model, dataset=args.dataset, config_file_list=config_file_list, config_dict={})
  File "/home/LAB/TextBox/textbox/quick_start/quick_start.py", line 82, in run_textbox
    best_valid_score, best_valid_result = trainer.fit(train_data, valid_data, saved=saved)
  File "/home/LAB/TextBox/textbox/trainer/trainer.py", line 337, in fit
    train_loss = self._train_epoch(train_data, epoch_idx)
  File "/home/LAB/TextBox/textbox/trainer/trainer.py", line 181, in _train_epoch
    losses = self.model(data, epoch_idx=epoch_idx)
  File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/LAB/TextBox/textbox/model/Seq2Seq/transformers.py", line 422, in forward
    inputs = self._generate_default_inputs(corpus)
  File "/home/LAB/TextBox/textbox/model/Seq2Seq/transformers.py", line 298, in _generate_default_inputs
    source_text = corpus['source_text']
KeyError: 'source_text'

And I also noticed that there seem to be no experiments or examples on CCPC in README.md. Does TextBox currently support Chinese generation? I tried python run_textbox.py --model=GPT2 --dataset=CCPC --pretrained_model_path=pretrained_model/gpt2, and it returns BLEU scores of 0 across the board, because the GPT-2 vocab from Hugging Face is English. So how should I implement the text generation task with the CCPC dataset?

StevenTang1998 commented 2 years ago

The current TextBox is not stable; some bugs will be fixed in the next version. Could you tell me what task you want to do with GPT-2? Then we can offer you more direct help.

Dicky35 commented 2 years ago

Something similar to Chinese abstract generation (summarization). I would like to see whether I can run some baseline models such as BART, XLNet, and GPT-2 with TextBox.

StevenTang1998 commented 2 years ago

OK, this is a seq2seq task. You should prepare your data as six files: train.src/tgt, valid.src/tgt, and test.src/tgt. Then find a Chinese pretrained model, i.e., one pretrained on a Chinese corpus (so its tokenizer can tokenize Chinese text), and run the command (a sketch of the expected file layout follows it):

python run_textbox.py --model=XXX --dataset=YYYY --pretrained_model_path=pretrained_model/xxx
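
For the six-file layout, a minimal sketch (the dataset directory name and the sentence pairs here are made up; the point is that line N of each .src file must align with line N of its .tgt file):

import os

splits = {
    "train": [("今天天气很好,适合出门散步。", "好天气")],
    "valid": [("他昨天读完了一本小说。", "读完小说")],
    "test": [("这家餐厅的菜很受欢迎。", "餐厅受欢迎")],
}
os.makedirs("dataset/my_summ", exist_ok=True)
for split, pairs in splits.items():
    with open(f"dataset/my_summ/{split}.src", "w", encoding="utf-8") as fsrc, \
         open(f"dataset/my_summ/{split}.tgt", "w", encoding="utf-8") as ftgt:
        for src, tgt in pairs:
            fsrc.write(src + "\n")  # one source document per line
            ftgt.write(tgt + "\n")  # the aligned target summary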

By the way, I think XLNet and GPT-2 are not very suitable for seq2seq. Chinese BART or CPT may be more suitable for the seq2seq task, but I am not sure whether they are compatible with TextBox.
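
For example, a hypothetical invocation (fnlp/bart-base-chinese is one Chinese BART checkpoint on Hugging Face; whether this TextBox version runs it is exactly what would need checking):

python run_textbox.py --model=BART --dataset=YYYY \
                      --pretrained_model_path=pretrained_model/bart-base-chinese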

Dicky35 commented 2 years ago

Thanks! I will try it. By the way, I see that you provide a Chinese poetry generation dataset called CCPC. Have you trained any model on this dataset using the TextBox framework?

StevenTang1998 commented 2 years ago

We use CVAE for Chinese poetry generation, but we do not suggest you test it, because neither the model nor the dataset is suitable for your current task.

Dicky35 commented 2 years ago

OK, thank you for all your assistance!