OpenNMT / OpenNMT-tf

Neural machine translation and sequence learning using TensorFlow
https://opennmt.net/
MIT License

Error when inferencing with gpt. #943

Closed songxxzp closed 2 years ago

songxxzp commented 2 years ago

I encountered an error when trying to do text generation with a trained GPT2Small model. The command:

```bash
CUDA_VISIBLE_DEVICES=9 onmt-main --model_type GPT2Small --config AASC-sGPT-2.yml --auto_config infer --features_file front_translation/source.txt > front_translation/target.txt
```

The error:

```
2022-05-11 20:01:08.242000: I language_model.py:158] Initialized input layer:
2022-05-11 20:01:08.242000: I language_model.py:158]  - vocabulary size: 32001
2022-05-11 20:01:08.242000: I language_model.py:158]  - special tokens: BOS=no, EOS=no
2022-05-11 20:01:08.260000: I runner.py:427] Restored checkpoint run-gpt/ckpt-1315000
Traceback (most recent call last):
  File "/home/junli/anaconda3/envs/py37env-gpu/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/home/junli/anaconda3/envs/py37env-gpu/lib/python3.7/site-packages/opennmt/bin/main.py", line 337, in main
    log_time=args.log_prediction_time,
  File "/home/junli/anaconda3/envs/py37env-gpu/lib/python3.7/site-packages/opennmt/runner.py", line 434, in infer
    prefetch_buffer_size=infer_config.get("prefetch_buffer_size"),
  File "/home/junli/anaconda3/envs/py37env-gpu/lib/python3.7/site-packages/opennmt/inputters/inputter.py", line 136, in make_inference_dataset
    prefetch_buffer_size=prefetch_buffer_size,
  File "/home/junli/anaconda3/envs/py37env-gpu/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2075, in apply
    dataset = transformation_func(self)
  File "/home/junli/anaconda3/envs/py37env-gpu/lib/python3.7/site-packages/opennmt/data/dataset.py", line 731, in _pipeline
    "length functions" % (len(output_shapes), num_length_fn)
ValueError: The dataset outputs 2 parallel features, but got 1 length functions
```
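The mismatch comes from the length-bucketing step of the inference pipeline: with on-the-fly tokenization the inputter emits two parallel features, but only one length function is supplied. A minimal sketch of that consistency check (hypothetical function and parameter names; the actual logic lives in `opennmt/data/dataset.py`):

```python
def make_inference_pipeline(num_features, length_fns, length_bucket_width):
    """Hypothetical sketch of the check that raises the error above.

    Bucketing examples by length needs one length function per parallel
    feature; a nonzero bucket width with a mismatched count fails fast.
    """
    if length_bucket_width:  # bucketing enabled only when width > 0
        if num_features != len(length_fns):
            raise ValueError(
                "The dataset outputs %d parallel features, but got %d length functions"
                % (num_features, len(length_fns))
            )
    # ... batching / prefetching would follow here
    return "ok"
```

Setting `length_bucket_width` to 0 disables bucketing entirely, so the mismatched count is never checked, which is why the workaround below avoids the crash.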

AASC-sGPT-2.yml:

```yaml
model_dir: run-gpt/

data:
  train_features_file: ./AASC/AASC-train.txt
  vocabulary: ./AASC/GPT-2.AASC.sp.vocab
  source_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: ./data/GPT-2.AASC.sp.model
```

source.txt looks like:

```
Nowadays
People
As
Internet
For the new generations
Researchers noticed
It seems that
```

AASC-train.txt looks like:

```
Nowadays, people have become increasingly accustomed to expressing their opinions online, especially on social media such as Wechat, Tiktok and Weibo - the biggest Chinese social media network that was launched in 2009. People started to use internet slang as a new language with innovative and novel characteristics on social media. As the rapid growth of such platforms, people’s communicative behavior, language, and psychology have all been affected by the subtle influence of internet slang. Internet slang is omnipresent on the internet. For the new generations, internet slang becomes a daily communication. Researchers noticed that when young people use new words and pictograms, they tend to express a kind of humorous emotion which is difficult to understand with general language pattern. It seems that some emotions are used just for fun, self-mockery or jocosity which express an implicit humor which might be characteristic to Chinese culture.
```

GPT-2.AASC.sp.model is a SentencePiece model built by onmt-build-vocab, as is GPT-2.AASC.sp.vocab.

guillaumekln commented 2 years ago

I can reproduce this error. As a workaround, you can add the following parameter in the YAML configuration:

```yaml
infer:
  length_bucket_width: 0
```

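
For reference, `infer` is a top-level section of the run configuration, so applied to the file from the report it would sit alongside `model_dir` and `data` (paths copied from the original config):

```yaml
model_dir: run-gpt/

data:
  train_features_file: ./AASC/AASC-train.txt
  vocabulary: ./AASC/GPT-2.AASC.sp.vocab
  source_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: ./data/GPT-2.AASC.sp.model

infer:
  length_bucket_width: 0
```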
songxxzp commented 2 years ago

Thanks, that would work!

guillaumekln commented 2 years ago

I will keep this issue open, as we should fix this in the code.