csuhan / OneLLM

[CVPR 2024] OneLLM: One Framework to Align All Modalities with Language

Is the Tokenizer.py code converting modalities into tokens? #14

Closed Hassaan68 closed 6 months ago

Hassaan68 commented 6 months ago

It seems that tokenizer.model is the pretrained SentencePiece model. Can we have access to the training code for this model, so that we can train it to tokenize more modalities?

csuhan commented 6 months ago

tokenizer.model is just for text processing. For other modalities, we use a conv layer to transform input data into tokens.
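
To illustrate the general idea only (this is not the OneLLM code), here is a minimal dependency-free sketch of a strided 1-D convolution turning a raw signal into a sequence of tokens. The function name `conv1d_tokenize` and the kernel size, stride, and embedding dimension are all hypothetical, chosen for illustration. In PyTorch the same role would typically be played by a layer such as `nn.Conv1d` or `nn.Conv2d` with learned weights.

```python
import random

def conv1d_tokenize(signal, kernels, biases, stride):
    """Slide each kernel over a 1-D signal; every window becomes one token.

    signal  : list of floats, the raw input (e.g. an audio-like waveform)
    kernels : list of out_dim kernels, each a list of kernel_size weights
    biases  : list of out_dim bias values
    returns : list of tokens, each a list of out_dim floats
    """
    kernel_size = len(kernels[0])
    tokens = []
    for start in range(0, len(signal) - kernel_size + 1, stride):
        window = signal[start:start + kernel_size]
        # One output channel per kernel: dot product of kernel and window, plus bias.
        token = [
            sum(w * x for w, x in zip(kernel, window)) + b
            for kernel, b in zip(kernels, biases)
        ]
        tokens.append(token)
    return tokens

random.seed(0)
# Illustrative values only (assumptions, not OneLLM's settings).
KERNEL_SIZE, STRIDE, EMBED_DIM = 4, 4, 8
# Random weights stand in for parameters that would be learned in training.
kernels = [[random.uniform(-1, 1) for _ in range(KERNEL_SIZE)]
           for _ in range(EMBED_DIM)]
biases = [0.0] * EMBED_DIM
signal = [random.uniform(-1, 1) for _ in range(32)]

tokens = conv1d_tokenize(signal, kernels, biases, STRIDE)
print(len(tokens), len(tokens[0]))  # number of tokens, token dimension
```

With a stride equal to the kernel size, the windows do not overlap, so a 32-sample input yields 8 tokens here. Unlike tokenizer.model, nothing is looked up in a vocabulary: the tokens are continuous vectors produced by the convolution weights.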

Hassaan68 commented 6 months ago

@csuhan thank you for the clarification