Open theamato opened 1 year ago
Hi,
I want to fine-tune the OPUS-MT ar-en model using my own dataset, but I'm not sure what format my training data should be in. In the Hugging Face Marian tutorial (https://huggingface.co/docs/transformers/model_doc/marian) they just pass in lists of sentences, but I also read somewhere that I'm supposed to preprocess the data with SentencePiece first. Or is SentencePiece "built in" to the Marian tokenizer? All help is much appreciated.

I do fine-tuning directly with MarianNMT. Maybe you could ask at the transformers git repository how to do fine-tuning with their library? If you use OPUS-MT models with marian-nmt itself, then you would need to apply the subword tokenisation to the fine-tuning data as well.