import paddle
from paddlenlp.transformers import MBart50Tokenizer, MBartForConditionalGeneration
# model_name = "mbart-large-50-many-to-many-mmt"
model_name = "mbart-large-50-one-to-many-mmt"
tokenizer = MBart50Tokenizer.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)
model.eval()
def postprocess_response(seq, bos_idx, eos_idx):
    """Post-process the decoded sequence: cut at the first EOS and drop special tokens."""
    eos_pos = len(seq) - 1
    for i, idx in enumerate(seq):
        if idx == eos_idx:
            eos_pos = i
            break
    seq = [idx for idx in seq[: eos_pos + 1] if idx != bos_idx and idx != eos_idx]
    return tokenizer.convert_ids_to_string(seq)
bos_id = tokenizer.lang_code_to_id["zh_CN"]
eos_id = model.mbart.config["eos_token_id"]
inputs = """
The MBart model was presented in Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual corpora in many languages using the BART objective.
mBART is one of the first methods for pretraining a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text.
To generate using the mBART-50 multilingual translation models, eos_token_id is used as the decoder_start_token_id and the target language id is forced as the first generated token. To force the target language id as the first generated token, pass the forced_bos_token_id parameter to the generate method. The following example shows how to translate between Hindi to French and Arabic to English using the facebook/mbart-50-large-many-to-many checkpoint.
"""
input_ids = tokenizer(inputs)["input_ids"]
input_ids = paddle.to_tensor(input_ids, dtype="int32").unsqueeze(0)
outputs, _ = model.generate(
    input_ids=input_ids,
    forced_bos_token_id=bos_id,
    # decode_strategy="sampling",
    # temperature=1.0,
    # top_k=3,
    # top_p=0.9,
    decode_strategy="beam_search",
    num_beams=4,
    # decode_strategy="greedy_search",
    max_length=5000,
    use_fast=False,
)
result = postprocess_response(outputs[0].numpy().tolist(), bos_id, eos_id)
print("Model input:", inputs)
print("Result:", result)
Problem description
conda env:
Code (almost the same as https://github.com/PaddlePaddle/PaddleNLP/blob/v2.5.2/fast_generation/samples/mbart_sample.py, except that use_fast is disabled and a longer input text is used.)
Model input: The MBart model was presented in Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. To generate using the mBART-50 multilingual translation models, eos_token_id is used as the decoder_start_token_id and the target language id is forced as the first generated token. To force the target language id as the first generated token, pass the forced_bos_token_id parameter to the generate method. The following example shows how to translate between Hindi to French and Arabic to English using the facebook/mbart-50-large-many-to-many checkpoint.
Result: MBART模型是由Yinhan Liu、Jiatao Gu、Naman Goyal、Xian Li、Serghe Edunov、Marjan Ghazvininejad、Mike Lewis、Luke Zettlemoyer在神经机翻译的多语言预备训练中提出的。根据抽象,MBART是一个基于BART目标的大规模单语体序列序列符号化自动编码器。 MBART是通过在多个语言中标记完整的文本来预备完整的序列序列模型的第一个方法,而以前的方法只专注于编码器、解码器或文本部分的重建。
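Since reproducing the script requires downloading the mBart checkpoint, the tokenizer-free part of postprocess_response can be sanity-checked in isolation. The sketch below uses a hypothetical helper, trim_ids, that reproduces only the id-trimming step (cut at the first eos_idx, then drop bos_idx and eos_idx); the token ids are made up for illustration:

```python
def trim_ids(seq, bos_idx, eos_idx):
    """Keep ids up to and including the first EOS, then drop BOS/EOS markers."""
    eos_pos = len(seq) - 1  # if no EOS is found, keep the whole sequence
    for i, idx in enumerate(seq):
        if idx == eos_idx:
            eos_pos = i
            break
    return [idx for idx in seq[: eos_pos + 1] if idx not in (bos_idx, eos_idx)]

# Toy ids: 250025 stands in for the forced language code (BOS), 2 for EOS.
print(trim_ids([250025, 17, 42, 2, 99], bos_idx=250025, eos_idx=2))  # [17, 42]
```

Note that anything after the first EOS (the trailing 99 above) is discarded, which matches the behavior of postprocess_response before it hands the ids to tokenizer.convert_ids_to_string.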