How are the Helsinki models (in the transformers library) trained?

Ahmath-Gadji commented 2 years ago

Hello @jorgtied

It seems to me that there is no model for translating from French to Wolof. I'm trying to build one myself by training it from scratch with the Hugging Face library. I want to use the same class (MarianMT) as you did for your translation models. I'm having difficulty with this model because I don't know how to initialize the tokenizer (MarianTokenizer). It requires SentencePiece files with a ".spm" extension, but SentencePiece models are generally stored in ".model" files, and I haven't seen a SentencePiece model saved as ".spm" anywhere. Could you please tell me how you initialized the tokenizer class for your models?

Also, I've seen tutorials on training translation models from scratch with Hugging Face, and apparently some people are struggling with it too. So code snippets or resources that you used to train the Helsinki models (in Hugging Face) would be welcome as well.

Thank you in advance.

jorgtied commented 2 years ago

I never trained any models using the HF transformers library. All models are trained with marian-nmt and then converted to pytorch to make them available from HF. You could do the same if you like and I can give you some more information about how to do that. What are the tutorials that you looked at and what are other people struggling with?
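For readers who want to see what "converted to pytorch" looks like from the transformers side, here is a minimal sketch of loading one of the published conversions. It assumes the `transformers`, `sentencepiece`, and `torch` packages are installed and that the checkpoint follows the `Helsinki-NLP/opus-mt-{src}-{tgt}` naming used on the Hugging Face Hub. `hub_id` and `translate` are illustrative helper names, not part of any library.

```python
def hub_id(src: str, tgt: str) -> str:
    """Build the Hub name for a language pair (assumed naming convention)."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def translate(texts, src="fr", tgt="en"):
    """Translate a list of sentences with a converted OPUS-MT checkpoint."""
    # Imported lazily so hub_id() works without the heavy dependencies.
    from transformers import MarianMTModel, MarianTokenizer

    name = hub_id(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)

    # Tokenize with padding so sentences of different lengths form one batch.
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `translate(["Bonjour le monde."])` downloads the fr-en checkpoint on first use. A language pair with no published model will fail at `from_pretrained`, which matches the gap for French to Wolof described in this thread.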

xyx361100238 commented 2 years ago

Hello all, I use Marian NMT and have the same question. What I did:

1. Trained an en-zh model following examples/transformer.
2. Pre-processed the data with jieba and BPE.
3. Finished training, and the test results are good.
4. Used convert_marian_to_pytorch.py to convert the model.

Questions:

1. I can't save the model as "*.spm" using SentencePiece.
2. How do I generate source.spm and target.spm? Or, what are the steps to train a model with Marian NMT so that it can be used the HF way (PyTorch)?

Thanks!

Ahmath-Gadji commented 2 years ago

Hello, sorry for my late reply. I think the struggle is that Hugging Face doesn't seem to be meant for training models from scratch. Also, the Hugging Face tutorials on translation models are based on fine-tuning pre-trained (Helsinki) models. Tips on how you did the training with the marian-nmt library are welcome!

Sincerely


jorgtied commented 2 years ago

I used recipes from https://github.com/Helsinki-NLP/OPUS-MT-train, but be aware that this is research code and might not work out of the box for you.
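On the ".spm" question raised earlier in the thread, a hedged sketch may help. As far as I can tell, MarianTokenizer loads `source.spm` and `target.spm` with the regular SentencePiece loader, so they are ordinary SentencePiece model files under a different name. If that holds, the ".model" files produced by SentencePiece training only need to be copied under the names the tokenizer looks for, next to a `vocab.json` that matches the vocabulary the Marian model was trained with. The helper names and the input paths below are hypothetical. This does not cover the jieba + BPE setup mentioned above, since BPE codes are not SentencePiece models.

```python
import shutil
from pathlib import Path


def stage_tokenizer_files(src_model, tgt_model, vocab_json, out_dir):
    """Copy SentencePiece models and the vocab under MarianTokenizer's file names.

    Assumption: the .model files are standard SentencePiece models and
    vocab_json maps tokens to the ids used by the trained Marian model.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    wanted = {
        "source.spm": Path(src_model),
        "target.spm": Path(tgt_model),
        "vocab.json": Path(vocab_json),
    }
    for name, path in wanted.items():
        if not path.is_file():
            raise FileNotFoundError(path)
        shutil.copyfile(path, out / name)
    # Return the staged file names so the caller can verify the layout.
    return sorted(p.name for p in out.iterdir())


def load_tokenizer(model_dir):
    """Load the staged files (requires transformers and sentencepiece)."""
    from transformers import MarianTokenizer

    return MarianTokenizer.from_pretrained(model_dir)
```

For example, `stage_tokenizer_files("fr.model", "wo.model", "vocab.json", "fr-wo")` followed by `load_tokenizer("fr-wo")` would be the intended flow. Whether the resulting ids line up with the converted model depends on how the vocabulary was built, so it is worth checking a few encoded sentences by hand.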

Helsinki-NLP / Opus-MT

How Helsinki models (in the transformers library) are trained ? #64