UBC-NLP / araT5

AraT5: Text-to-Text Transformers for Arabic Language Understanding

How to Fine Tune the translation task #5

Open · OWaheed opened this issue 2 years ago

OWaheed commented 2 years ago

I need to fine-tune the model for the translation task. Should I prepare the data in a specific format, and how do I fine-tune your model for that task?

salma-elshafey commented 2 years ago

You can put your parallel data in a tab-separated (TSV) file. Assuming the source-sentence column is called 'input_text' and the target-sentence column is called 'target_text', the fine-tuning command should look similar to this:

```bash
!python araT5/examples/run_trainier_seq2seq_huggingface.py \
  --learning_rate 5e-5 \
  --max_target_length 128 --max_source_length 128 \
  --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
  --model_name_or_path "UBC-NLP/AraT5-msa-small" \
  --source_lang "ar_AR" --target_lang "en_XX" \
  --output_dir "AraT5-translation" --overwrite_output_dir \
  --num_train_epochs 5 \
  --train_file "train.tsv" \
  --validation_file "valid.tsv" \
  --test_file "test.tsv" \
  --task "translation" --text_column "input_text" --summary_column "target_text" \
  --load_best_model_at_end --metric_for_best_model "eval_bleu" --greater_is_better True \
  --evaluation_strategy epoch --logging_strategy epoch --predict_with_generate \
  --do_train --do_eval
```
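For example, a minimal sketch of how such a TSV could be produced with pandas. The sentence pairs below are placeholders; the column names match the --text_column and --summary_column arguments above:

```python
import pandas as pd

# Placeholder parallel pairs; substitute your own corpus.
pairs = [
    {"input_text": "مرحبا بالعالم", "target_text": "Hello, world"},
    {"input_text": "شكرا جزيلا", "target_text": "Thank you very much"},
]

# Tab-separated file with a header row, as expected by the command above.
pd.DataFrame(pairs).to_csv("train.tsv", sep="\t", index=False)
```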

hust-kevin commented 2 years ago

@salma-elshafey @Nagoudi How do I use AraT5 to do machine translation? What is the script for inference?

BarahFazili commented 2 years ago

@hust-kevin You simply add the argument --do_predict to record the predictions (test_preds_seq2seq.txt) for the provided test_file.
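For instance, adapting the fine-tuning command above into a prediction-only run; this is a sketch, and the checkpoint path and output directory here are assumptions:

```bash
!python araT5/examples/run_trainier_seq2seq_huggingface.py \
  --model_name_or_path "AraT5-translation" \
  --source_lang "ar_AR" --target_lang "en_XX" \
  --test_file "test.tsv" \
  --task "translation" --text_column "input_text" --summary_column "target_text" \
  --per_device_eval_batch_size 8 \
  --max_source_length 128 --max_target_length 128 \
  --output_dir "AraT5-translation-preds" --overwrite_output_dir \
  --predict_with_generate \
  --do_predict
```

test_preds_seq2seq.txt should then be written under --output_dir.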

ss8319 commented 2 years ago

@hust-kevin @BarahFazili If we were to just run inference (e.g. provide a dialectal Arabic document to AraT5 and obtain an English translation) without fine-tuning, could we reuse this script, python araT5/examples/run_trainier_seq2seq_huggingface.py?

Do I need to specify --source_lang if I am using dialectal Arabic as input? For instance, arz for Egyptian Arabic, which was part of the training dataset? What should my source_lang tag be if I am running inference on another dialectal Arabic variety that was not part of the training set, e.g. acw (Hijazi Arabic)?

I am not sure if the way I am running inference is correct. I am using pipeline(). Is there an ideal way to run inference?

```python
import datasets
import pandas as pd
from datasets import Dataset
from tqdm.auto import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
from transformers.pipelines.pt_utils import KeyDataset

model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-msa-base")
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-msa-base")
pipe = pipeline("translation_arz_to_en", model=model, tokenizer=tokenizer, max_length=60)

# src_text: list of source sentences, loaded earlier
with open('/content/gdrive/Shareddrives/Gutenberg/MT/experiments/HuggingFace/AraT5-msa-base-acw-v2-en_JHN/acw_pred.txt', 'w', encoding='utf-8') as f:
    for line in src_text:
        for out in tqdm(pipe(line)):
            for value in out.values():  # out is a dict; write takes str
                f.write('%s\n' % value)
```
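For reference, the KeyDataset and Dataset imports above are intended for streaming a whole dataset through the pipeline in batches rather than calling it line by line. A minimal sketch of that pattern, assuming src_text is a list of source sentences and a hypothetical local output path acw_pred.txt:

```python
from datasets import Dataset
from tqdm.auto import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Wrap the sentences in a Dataset so the pipeline can batch them.
ds = Dataset.from_dict({"text": src_text})

with open("acw_pred.txt", "w", encoding="utf-8") as f:
    # Each yielded item is the result for one input: [{"translation_text": "..."}]
    for out in tqdm(pipe(KeyDataset(ds, "text"), batch_size=8), total=len(ds)):
        f.write("%s\n" % out[0]["translation_text"])
```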