What's more, if I want to train a model like GPT-2 on a Chinese dataset, I need to specify a Chinese vocab and tokenizer. So what parameters should I set in the command?
For the second question,
Download the GPT-2 model (or your own GPT-2 model) provided by Hugging Face (https://huggingface.co/gpt2/tree/main), including config.json, merges.txt, pytorch_model.bin, tokenizer.json, and vocab.json. Then put them in a folder at the same level as textbox, such as pretrained_model/gpt2.
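If you prefer to script the download, here is a minimal sketch using huggingface_hub's snapshot_download (this assumes a recent huggingface_hub is installed; downloading the files manually from the web page works just as well):

from huggingface_hub import snapshot_download

# Fetch only the five files listed above into pretrained_model/gpt2.
snapshot_download(
    repo_id="gpt2",
    local_dir="pretrained_model/gpt2",
    allow_patterns=["config.json", "merges.txt", "pytorch_model.bin",
                    "tokenizer.json", "vocab.json"],
)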
After downloading, you just need to run the command:
python run_textbox.py --model=GPT2 --dataset=COCO \
--pretrained_model_path=pretrained_model/gpt2
For the first question, we have fixed the bug. Thanks for your report.
Thanks for your reply! But if I try GPT-2 on COCO, it fails because there is no key named source_text in COCO:
Traceback (most recent call last):
File "run_textbox.py", line 18, in <module>
run_textbox(model=args.model, dataset=args.dataset, config_file_list=config_file_list, config_dict={})
File "/home/LAB/TextBox/textbox/quick_start/quick_start.py", line 82, in run_textbox
best_valid_score, best_valid_result = trainer.fit(train_data, valid_data, saved=saved)
File "/home/LAB/TextBox/textbox/trainer/trainer.py", line 337, in fit
train_loss = self._train_epoch(train_data, epoch_idx)
File "/home/LAB/TextBox/textbox/trainer/trainer.py", line 181, in _train_epoch
losses = self.model(data, epoch_idx=epoch_idx)
File "/home/LAB/anaconda3/envs/textbox/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/LAB/TextBox/textbox/model/Seq2Seq/transformers.py", line 422, in forward
inputs = self._generate_default_inputs(corpus)
File "/home/LAB/TextBox/textbox/model/Seq2Seq/transformers.py", line 298, in _generate_default_inputs
source_text = corpus['source_text']
KeyError: 'source_text'
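(For reference, the failing line is the direct indexing in _generate_default_inputs; a defensive variant, a hypothetical sketch rather than the project's actual fix, would make the cause explicit:)

# Hypothetical guard for textbox/model/Seq2Seq/transformers.py
source_text = corpus.get('source_text')
if source_text is None:
    raise ValueError(
        "This model is running in seq2seq mode, but the dataset provides "
        "no 'source_text'; unconditional datasets such as COCO only "
        "supply target text."
    )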
And I also noticed that there seem to be no experiments or examples on CCPC in README.md. Does TextBox currently support Chinese generation? I tried python run_textbox.py --model=GPT2 --dataset=CCPC --pretrained_model_path=pretrained_model/gpt2, but it returns BLEU scores of all 0, because the GPT-2 vocab from Hugging Face is English. So how should I implement the text generation task with the CCPC dataset?
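You can reproduce the tokenizer mismatch directly (an illustration only, using the local copy downloaded above):

from transformers import GPT2Tokenizer

# The English byte-level BPE splits Chinese characters into raw byte
# fragments rather than meaningful tokens, so generation quality (and
# BLEU against Chinese references) collapses.
tokenizer = GPT2Tokenizer.from_pretrained("pretrained_model/gpt2")
print(tokenizer.tokenize("床前明月光"))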
The current TextBox is not stable; some bugs will be fixed in the next version. So, could you tell us what task you want to do with GPT-2? We can offer you more direct help.
Something similar to Chinese abstract generation. I would like to see whether I can run some baseline models, such as BART, XLNet, and GPT-2, with TextBox.
OK, this is a seq2seq task. You should prepare your data as six files: train.src/tgt, valid.src/tgt, and test.src/tgt.
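If your raw data is, say, a tab-separated file of (source, target) pairs, a minimal conversion sketch could look like this (the input file name, column layout, and output directory are assumptions for illustration):

import os

def write_split(pairs, out_dir, split):
    # Write one parallel .src/.tgt file pair for a given split name.
    with open(os.path.join(out_dir, split + ".src"), "w", encoding="utf-8") as fs, \
         open(os.path.join(out_dir, split + ".tgt"), "w", encoding="utf-8") as ft:
        for src, tgt in pairs:
            fs.write(src.strip() + "\n")
            ft.write(tgt.strip() + "\n")

# Read tab-separated "source<TAB>target" lines (hypothetical file name).
with open("raw_data.tsv", encoding="utf-8") as f:
    pairs = [line.rstrip("\n").split("\t", 1) for line in f if "\t" in line]

# Simple 80/10/10 split into train/valid/test.
n = len(pairs)
os.makedirs("dataset/my_dataset", exist_ok=True)
write_split(pairs[: int(0.8 * n)], "dataset/my_dataset", "train")
write_split(pairs[int(0.8 * n): int(0.9 * n)], "dataset/my_dataset", "valid")
write_split(pairs[int(0.9 * n):], "dataset/my_dataset", "test")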
Then find a Chinese pretrained model, i.e., one pretrained on a Chinese corpus (so its tokenizer can tokenize Chinese text), and run the command:
python run_textbox.py --model=XXX --dataset=YYYY --pretrained_model_path=pretrained_model/xxx
By the way, I think XLNet and GPT-2 are not very suitable for seq2seq; Chinese BART or CPT may fit a seq2seq task better. But I am not sure whether they are compatible with TextBox.
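For example, with the fnlp/bart-base-chinese checkpoint from Hugging Face downloaded into pretrained_model/bart-base-chinese (whether TextBox loads this particular checkpoint out of the box is an assumption you would need to verify):
python run_textbox.py --model=BART --dataset=YYYY \
--pretrained_model_path=pretrained_model/bart-base-chinese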
Thanks! I will try it. By the way, I see that you provide a Chinese poetry generation dataset called CCPC. Have you trained any model on this dataset using the TextBox framework?
We use CVAE for Chinese poetry generation, but we do not suggest testing that, because neither the model nor the dataset is suitable for your current task.
OK, thank you for all your assistance!
For the Quick Start in README.md, I tried
python run_textbox.py
and it works and returns a BLEU score. But when I tried
python run_textbox.py --rnn_type=lstm --max_vocab_size=4000
it shows:
Can you tell me how to solve it?