Shivanandroy / simpleT5

simpleT5 is built on top of PyTorch-lightning⚡️ and Transformers🤗 and lets you quickly train your T5 models.
MIT License

Unicode Character training issue #40

Open rahat10120141 opened 2 years ago

rahat10120141 commented 2 years ago

I tried to train my model to translate English to Bengali. After training, when I run the code, the output is not in Unicode Bengali characters.

I Eat Rice (eng)=> আমি ভাত খাই (Bn)

This is the kind of data I fed to the model during training. After training finished, I tested the model with the input "I Eat Rice" and expected "আমি ভাত খাই" as output. Instead, the model returned "Ich esse Reis." I don't know what language this is; it is not related to Bengali.

rahat10120141 commented 2 years ago

I checked the output: it is German. But why is it in German?

rahat10120141 commented 2 years ago

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from simplet5 import SimpleT5

    model = SimpleT5()
    model.from_pretrained(model_type="t5", model_name="t5-base")

    path = "D:\\Python\\Quilbot\\Dataset\\translation.csv"
    df = pd.read_csv(path, encoding="utf8", quotechar="'")

    # simpleT5 expects the columns "source_text" and "target_text"
    df = df.rename(columns={"headlines": "source_text", "text": "target_text"})
    df = df[["source_text", "target_text"]]

    # T5 expects a task prefix; "tn2bn: " marks the translation task here
    df["source_text"] = "tn2bn: " + df["source_text"]
    print(df)

    train_df, test_df = train_test_split(df, test_size=0.2)
    print(train_df.shape, test_df.shape)

    model.train(train_df=train_df,
                eval_df=test_df,
                source_max_token_len=128,
                target_max_token_len=50,
                batch_size=8,
                max_epochs=3,
                use_gpu=False
                )
    model.load_model("t5", "outputs/translate", use_gpu=False)

    # Note: this prefix differs from the "tn2bn: " prefix used in training
    text_to_translate = "translate: I eat rice."
    print(model.predict(text_to_translate))

rahat10120141 commented 2 years ago

I have also tested it with the task prefix "tn2bn".

Shivanandroy commented 2 years ago

@rahat10120141: What does your train_df look like before you feed it to the model?
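
One way to answer this question is to check whether the target column really contains Bengali script after the CSV is loaded. The sketch below is only an illustration: `bengali_ratio` is a hypothetical helper, not part of simpleT5, and the two rows are made-up examples in the `source_text` / `target_text` layout used in this thread. It assumes Bengali text falls in the Unicode block U+0980 to U+09FF.

```python
# Hypothetical helper (not part of simpleT5): share of non-whitespace
# characters in a string that belong to the Bengali Unicode block
# (U+0980 to U+09FF).
def bengali_ratio(text: str) -> float:
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    bengali = sum(1 for ch in chars if "\u0980" <= ch <= "\u09ff")
    return bengali / len(chars)


# Made-up rows in the source_text / target_text layout used in this thread.
rows = [
    {"source_text": "tn2bn: I eat rice", "target_text": "আমি ভাত খাই"},
    {"source_text": "tn2bn: I eat rice", "target_text": "Ich esse Reis."},
]

for row in rows:
    ratio = bengali_ratio(row["target_text"])
    print(row["source_text"], "->", round(ratio, 2))
```

With a pandas DataFrame, the same helper could be applied as `train_df["target_text"].map(bengali_ratio)`. Values well below 1.0 would suggest that the targets were garbled while reading the file, for example by a wrong encoding or quote character.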

rahat10120141 commented 2 years ago

T5 doesn't support English-to-Bengali translation. From the beginning, it was giving me German results.
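
A possible way to test this conclusion is to measure how many tokens of a Bengali sentence the tokenizer maps to its unknown token. The sketch below is an assumption-laden illustration, not something confirmed in this thread: `unknown_fraction` is a hypothetical helper, and it assumes a Hugging Face style tokenizer that can be called on a string, returns `input_ids`, and exposes `unk_token_id`.

```python
# Hypothetical helper: fraction of token ids that equal the tokenizer's
# unknown-token id. Assumes a Hugging Face style tokenizer object.
def unknown_fraction(tokenizer, text: str) -> float:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if not ids:
        return 0.0
    unknown = sum(1 for token_id in ids if token_id == tokenizer.unk_token_id)
    return unknown / len(ids)


# Possible usage (requires the transformers package and a model download):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("t5-base")
# print(unknown_fraction(tok, "আমি ভাত খাই"))
```

If most tokens come back as unknown, the `t5-base` vocabulary probably cannot represent Bengali, and fine-tuning alone is unlikely to produce Bengali output. In that case a multilingual checkpoint such as `google/mt5-small`, loaded with `model_type="mt5"`, may be a better starting point; this is a suggestion to try, not a verified fix for this issue.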