Helsinki-NLP / Opus-MT

Open neural machine translation models and web services

Fine-tuning OPUS-MT ar-en using my own dataset #77

Open theamato opened 1 year ago

theamato commented 1 year ago

Hi,

I want to fine-tune the OPUS-MT ar-en model on my own dataset, but I'm not sure what format my training data files should be in. In the Hugging Face Marian tutorial (https://huggingface.co/docs/transformers/model_doc/marian) they just pass in lists of sentences, but I also read somewhere that I'm supposed to preprocess the data with SentencePiece first. Or is SentencePiece "built in" to the Marian tokenizer? Any help is much appreciated.
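From the tutorial it looks like raw sentence strings go straight into the tokenizer, without any explicit SentencePiece step. A minimal sketch of what I mean (assuming the Helsinki-NLP/opus-mt-ar-en checkpoint on the Hugging Face hub):

```python
# Sketch based on the Hugging Face Marian tutorial; requires
# `pip install transformers sentencepiece`.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-ar-en"  # assumed checkpoint name
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Raw sentences are passed in directly; the tokenizer seems to load
# the model's SentencePiece files (source.spm / target.spm) internally.
batch = tokenizer(["مرحبا بالعالم"], return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
```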

jorgtied commented 1 year ago

I do fine-tuning directly with MarianNMT. Maybe you could ask in the transformers repository how to do fine-tuning with their library? If you use OPUS-MT models with marian-nmt, then you need to apply the same subword tokenisation to the fine-tuning data as well.
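For the marian-nmt route, that preprocessing step could look roughly like this (a sketch using the sentencepiece Python package; `source.spm` and `train.ar` are placeholder names for the SentencePiece model and data file shipped with the OPUS-MT release you are fine-tuning):

```python
# Sketch: segment the fine-tuning data with the model's own
# SentencePiece model before training with marian-nmt.
# "source.spm" and "train.ar" are placeholders; repeat for the
# target side with the target-language .spm model.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="source.spm")

with open("train.ar", encoding="utf-8") as fin, \
        open("train.sp.ar", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")
```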