ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0
4.12k stars 727 forks

T5 problem in japanese text #1374

Closed ruesei-koseki closed 2 years ago

ruesei-koseki commented 2 years ago

Describe the bug: When I train a T5 model on Japanese text, the predictions for Japanese input come out garbled.

To Reproduce: Run the following source code.

import logging

import pandas as pd
from simpletransformers.t5 import T5Model

# Logging setup
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Training data: [prefix, input_text, target_text]
train_data = [
    ["chat", "こんにちは!", "こんにちは、調子はどうですか?"],
    ["chat", "調子はいいですよ", "そうか"],
    ["convert", "one", "1"],
    ["chat", "How are you?", "I'm fun, and you?"],
]
train_df = pd.DataFrame(train_data, columns=["prefix", "input_text", "target_text"])

# Create the model
model_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 10,
    "train_batch_size": 2,
    "num_train_epochs": 5,
}
model = T5Model("t5", "outputs/", args=model_args, use_cuda=False)

# Train
model.train_model(train_df)

# Predict
print(model.predict(["convert: one"]))
print(model.predict(["chat: How are you?"]))
print(model.predict(["chat: こんにちは!"]))

outputs:
['1']
['I'm fun, and you?']
['?']

Expected behavior: The model returns a readable (non-garbled) Japanese string.

Additional context: Is this a mistake on my side, or is it a bug? Please help me. Thank you.

ThilinaRajapakse commented 2 years ago

T5 doesn't support Japanese, which is probably why you are getting this output. You can try using mT5 instead.
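
For reference, here is a minimal sketch of what switching the reproduction script to mT5 might look like. It assumes the `"mt5"` model type of `T5Model` and the public `google/mt5-small` checkpoint. The original script loaded its model from `outputs/`, so the checkpoint name here is an assumption, not something stated in this thread. `format_input` and `train_and_predict` are hypothetical helper names introduced only for illustration. With only four training rows the predictions may still be poor, but the mT5 tokenizer should at least be able to represent Japanese characters.

```python
# Sketch: the reproduction script from this issue, switched from T5 to mT5.
MODEL_TYPE = "mt5"
MODEL_NAME = "google/mt5-small"  # assumption: any public mT5 checkpoint could be used here

# Same toy training rows as in the original report: [prefix, input_text, target_text]
TRAIN_DATA = [
    ["chat", "こんにちは!", "こんにちは、調子はどうですか?"],
    ["chat", "調子はいいですよ", "そうか"],
    ["convert", "one", "1"],
    ["chat", "How are you?", "I'm fun, and you?"],
]

# Same arguments as in the original report
MODEL_ARGS = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "max_seq_length": 10,
    "train_batch_size": 2,
    "num_train_epochs": 5,
}


def format_input(prefix, text):
    """Hypothetical helper: build the "prefix: text" string passed to predict()."""
    return f"{prefix}: {text}"


def train_and_predict():
    """Hypothetical helper: train on the toy rows and print one prediction per input."""
    # Imported here so the heavy libraries are only loaded when this function is called.
    import pandas as pd
    from simpletransformers.t5 import T5Model

    train_df = pd.DataFrame(TRAIN_DATA, columns=["prefix", "input_text", "target_text"])
    model = T5Model(MODEL_TYPE, MODEL_NAME, args=MODEL_ARGS, use_cuda=False)
    model.train_model(train_df)

    for prefix, text in [("convert", "one"), ("chat", "How are you?"), ("chat", "こんにちは!")]:
        print(model.predict([format_input(prefix, text)]))
```

Calling `train_and_predict()` downloads the checkpoint on first use and then runs training and prediction on CPU. Nothing is executed until that call.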

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.