NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

May I just use T5 in dialogue without pretraining a language model #4285

Closed 520jefferson closed 2 years ago

520jefferson commented 2 years ago

I want to distill a big model into a t5-base model. The big model uses BPE codes rather than SentencePiece, so the tokenizer should load the BPE codes, and the vocab is different from the original t5-base model's. Therefore I am not sure about two things:

1. Can T5 be trained on dialogue without pretraining, treating it like a plain Transformer? I found this config (https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/dialogue/conf/dialogue_config.yaml); does it meet my needs?
2. How do I set the tokenizer to use BPE codes?

tanmoyio commented 2 years ago

@okuchaiev if you haven't started, would you mind assigning it to me? I want to give it a try. cc @520jefferson

520jefferson commented 2 years ago

@tanmoyio is there any progress?

MaximumEntropy commented 2 years ago

I'm not sure I understand this bit: "use bpe not sentencepiece". SentencePiece is a library while BPE is an algorithm; SentencePiece implements the BPE algorithm, among others. To train your own BPE tokenizer with SentencePiece, you can do:

spm_train --input=<input> --model_prefix=<model_name> --vocab_size=32000 --character_coverage=0.9995 --model_type=bpe

All of the pre-trained T5 models from Google use BPE and it's probably best to stick to their provided tokenizer instead of trying to re-train your own.
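For reference, a minimal sketch of loading a SentencePiece BPE model trained with the command above and comparing it against the stock t5-base tokenizer from Hugging Face transformers. The model file name follows the --model_prefix argument and is otherwise an assumption:

```python
# Minimal sketch: inspect a custom SentencePiece BPE model next to the stock
# t5-base tokenizer. Assumes `spm_train --model_prefix=model_name ...` was run
# and that the `sentencepiece` and `transformers` packages are installed.
import sentencepiece as spm
from transformers import T5Tokenizer

# Load the custom BPE model produced by spm_train (file name is an assumption).
sp = spm.SentencePieceProcessor()
sp.load("model_name.model")

# Load the tokenizer shipped with the pre-trained t5-base checkpoint.
t5_tok = T5Tokenizer.from_pretrained("t5-base")

text = "How can I help you today?"
print("custom BPE pieces:", sp.encode_as_pieces(text))
print("t5-base pieces:", t5_tok.tokenize(text))
```

If the custom model is to be used inside NeMo, it can typically be pointed to from the tokenizer section of the experiment config as a SentencePiece model file, but the exact config keys depend on the example script being run.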

520jefferson commented 2 years ago

Hey @MaximumEntropy

I want to distill a big model (PyTorch version) into a T5 model, considering that the FasterTransformer backend (https://github.com/triton-inference-server/fastertransformer_backend) provides Triton backend optimization for the original T5 (not T5 v1.1), and this inference optimization will help carry more online traffic. The reason I don't use a plain Transformer as the student model is that I haven't found a PyTorch Transformer with inference optimization that integrates with Triton.

And the big model uses BPE codes rather than SentencePiece, so the tokenizer should load the BPE codes, and the vocab is different from the original T5 model's. Therefore I want to distill the big model into a T5 model and use that vocab at the same time.

So I need to figure out three things:

1. Can T5 be trained on dialogue without pretraining, treating it like a plain Transformer? I haven't found a related example of finetuning from scratch.
2. How do I set the tokenizer to use only BPE codes, not SentencePiece or WordPiece? They are not the same thing.
3. If I don't want to use a tokenizer at all, then I just need the vocab, because I can preprocess the data with my BPE tokenizer offline (see the sketch below). How should I run training without a tokenizer?
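For the offline-preprocessing route in point 3, a minimal sketch with subword-nmt, assuming the BPE codes file came from subword-nmt-style training; the file name and sample sentence here are hypothetical:

```python
# Minimal sketch: apply existing BPE codes to raw text before training, so the
# model only ever sees pre-tokenized data plus a plain vocab file.
# Assumes the codes were learned with subword-nmt (pip install subword-nmt);
# "codes.bpe" is a hypothetical path.
from subword_nmt.apply_bpe import BPE

with open("codes.bpe", encoding="utf-8") as codes_file:
    bpe = BPE(codes_file)

line = "hello how are you doing today"
# Splits words into subword units marked with the "@@" continuation symbol.
segmented = bpe.process_line(line)
print(segmented)
```

The segmented files can then be paired with the corresponding vocab; whether a given NeMo training script accepts pre-tokenized input without a tokenizer object depends on that script, so this only covers the preprocessing step.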