PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 an awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂 Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

Translation with mBART is incomplete #5768

Open holyseven opened 1 year ago

holyseven commented 1 year ago

Problem description

conda env:

paddlenlp                 2.5.2                    pypi_0    pypi
paddlepaddle-gpu          2.3.2           py37_gpu_cuda10.2_many_linux    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle

Code (almost the same as https://github.com/PaddlePaddle/PaddleNLP/blob/v2.5.2/fast_generation/samples/mbart_sample.py, except that use_fast is disabled and a longer input text is used):


import paddle

from paddlenlp.transformers import MBart50Tokenizer, MBartForConditionalGeneration

# model_name = "mbart-large-50-many-to-many-mmt"
model_name = "mbart-large-50-one-to-many-mmt"

tokenizer = MBart50Tokenizer.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)
model.eval()

def postprocess_response(seq, bos_idx, eos_idx):
    """Post-process the decoded sequence."""
    eos_pos = len(seq) - 1
    for i, idx in enumerate(seq):
        if idx == eos_idx:
            eos_pos = i
            break
    seq = [idx for idx in seq[: eos_pos + 1] if idx != bos_idx and idx != eos_idx]
    res = tokenizer.convert_ids_to_string(seq)
    return res

bos_id = tokenizer.lang_code_to_id["zh_CN"]
eos_id = model.mbart.config["eos_token_id"]

inputs = """
The MBart model was presented in Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 
According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual corpora in many languages using the BART objective. 
mBART is one of the first methods for pretraining a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text.
To generate using the mBART-50 multilingual translation models, eos_token_id is used as the decoder_start_token_id and the target language id is forced as the first generated token. To force the target language id as the first generated token, pass the forced_bos_token_id parameter to the generate method. The following example shows how to translate between Hindi to French and Arabic to English using the facebook/mbart-50-large-many-to-many checkpoint.
"""

input_ids = tokenizer(inputs)["input_ids"]
input_ids = paddle.to_tensor(input_ids, dtype="int32").unsqueeze(0)

outputs, _ = model.generate(
    input_ids=input_ids,
    forced_bos_token_id=bos_id,
    # decode_strategy="sampling",
    # temperature=1.0,
    # top_k=3,
    # top_p=0.9,
    decode_strategy="beam_search",
    num_beams=4,
    # decode_strategy="greedy_search",
    max_length=5000,
    use_fast=False,
)

result = postprocess_response(outputs[0].numpy().tolist(), bos_id, eos_id)

print("Model input:", inputs)

print("Result:", result)

Model input: The MBart model was presented in Multilingual Denoising Pre-training for Neural Machine Translation by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. According to the abstract, MBART is a sequence-to-sequence denoising auto-encoder pretrained on large-scale monolingual corpora in many languages using the BART objective. mBART is one of the first methods for pretraining a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. To generate using the mBART-50 multilingual translation models, eos_token_id is used as the decoder_start_token_id and the target language id is forced as the first generated token. To force the target language id as the first generated token, pass the forced_bos_token_id parameter to the generate method. The following example shows how to translate between Hindi to French and Arabic to English using the facebook/mbart-50-large-many-to-many checkpoint.

Result: MBART模型是由Yinhan Liu、Jiatao Gu、Naman Goyal、Xian Li、Serghe Edunov、Marjan Ghazvininejad、Mike Lewis、Luke Zettlemoyer在神经机翻译的多语言预备训练中提出的。根据抽象,MBART是一个基于BART目标的大规模单语体序列序列符号化自动编码器。 MBART是通过在多个语言中标记完整的文本来预备完整的序列序列模型的第一个方法,而以前的方法只专注于编码器、解码器或文本部分的重建。
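
(Side note, not from the original thread: the output stops after roughly the first three sentences of the source paragraph. One quick way to check whether the truncation is input-length related is to translate the paragraph sentence by sentence and concatenate the pieces, reusing the exact tokenizer, model, generate arguments, and postprocess_response defined above; only the naive `sentences` split below is new.)

sentences = [s.strip() for s in inputs.split(". ") if s.strip()]  # naive sentence split

pieces = []
for sent in sentences:
    # Same preprocessing and generation settings as the full-paragraph run above.
    sent_ids = tokenizer(sent)["input_ids"]
    sent_ids = paddle.to_tensor(sent_ids, dtype="int32").unsqueeze(0)
    out, _ = model.generate(
        input_ids=sent_ids,
        forced_bos_token_id=bos_id,
        decode_strategy="beam_search",
        num_beams=4,
        max_length=512,
        use_fast=False,
    )
    pieces.append(postprocess_response(out[0].numpy().tolist(), bos_id, eos_id))

print("Sentence-by-sentence result:", " ".join(pieces))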

gongel commented 1 year ago

This is caused by the model itself; to get better results, you would need to fine-tune the model.

holyseven commented 1 year ago

Are there any other English-to-Chinese translation models you would recommend?

gongel commented 1 year ago

Have you tried "mbart-large-50-many-to-many-mmt"?

holyseven commented 1 year ago

Not yet, I'll give it a try.
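
(For reference, a minimal sketch of what trying the suggested checkpoint would look like: only the model name changes, and the rest of the script above, including postprocess_response, bos_id/eos_id, and the generate call, stays the same. The checkpoint name is taken from the commented-out line in the original script, not independently verified here.)

model_name = "mbart-large-50-many-to-many-mmt"

tokenizer = MBart50Tokenizer.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)
model.eval()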