About separation models & image to video

baaivision / Emu3

Next-Token Prediction is All You Need

Apache License 2.0

We currently release separate models for image generation and for vision-language understanding. The model shards (model-00001-of-00007 to model-00007-of-00007) are produced automatically by transformers, which splits the checkpoint into multiple files when it is saved.
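To see which tensors end up in which shard, one can inspect the index file that transformers writes next to a sharded checkpoint. The sketch below is a minimal illustration, assuming the shards are stored as safetensors files with the usual `weight_map` index layout; the parameter names and the shortened index are hypothetical and not taken from the actual Emu3 checkpoint.

```python
import json
from collections import defaultdict


def params_per_shard(index: dict) -> dict:
    """Group parameter names by the shard file that stores them."""
    shards = defaultdict(list)
    for param_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(param_name)
    # Sort for stable, readable output.
    return {name: sorted(params) for name, params in sorted(shards.items())}


# Hypothetical, heavily shortened index; a real one lists every tensor.
# The ".safetensors" extension is an assumption.
example_index = json.loads("""
{
  "metadata": {},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00007.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
    "lm_head.weight": "model-00007-of-00007.safetensors"
  }
}
""")

for shard, params in params_per_shard(example_index).items():
    print(shard, len(params))
```

With a real checkpoint, the same function could be applied to the parsed contents of the index JSON file in the model directory. The shards do not need to be merged by hand; they are loaded together as one model.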

Our model has 8B parameters and can run on a 40GB device in bfloat16 precision. The VL model can be deployed across multiple devices using the multi-GPU deployment feature provided by transformers. The image generation model, however, still runs into device mismatch issues under multi-device deployment because of its use of Classifier-Free Guidance (CFG); we are still working on this.
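As a rough sanity check on the memory figure, the weights alone can be estimated from the parameter count and the 2 bytes per parameter that bfloat16 uses. This is a back-of-the-envelope sketch with a hypothetical helper name; it treats "8B" as exactly 8e9 parameters and ignores activations and the key-value cache, which take up part of the remaining budget.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GiB.

    bytes_per_param defaults to 2, the size of one bfloat16 value.
    """
    return num_params * bytes_per_param / 1024**3


# Assumption: "8B parameters" is taken as exactly 8e9.
print(f"{weight_memory_gib(8e9):.1f} GiB")  # prints "14.9 GiB"
```

So the weights occupy roughly 16 GB, which leaves headroom on a 40GB device for activations and cached keys and values during generation.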

In addition, the Emu3 paradigm natively supports image-to-video tasks, since we predict video frame by frame and token by token. We will release the video generation model in the future. Please stay tuned for updates.
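The frame-by-frame, token-by-token idea can be illustrated with a toy loop. This is only a conceptual sketch and not the Emu3 implementation or API: `generate_video_tokens` and `toy_next_token` are hypothetical names, and the "model" is a trivial stand-in function. The point is that the image tokens simply form the start of the context, and each new video token is predicted from everything before it.

```python
from typing import Callable, List


def generate_video_tokens(
    image_tokens: List[int],
    next_token: Callable[[List[int]], int],
    num_frames: int,
    tokens_per_frame: int,
) -> List[List[int]]:
    """Autoregressively extend an image-token context into video frames."""
    context = list(image_tokens)  # the conditioning image is the prefix
    frames = []
    for _ in range(num_frames):
        frame = []
        for _ in range(tokens_per_frame):
            tok = next_token(context)  # predict from all previous tokens
            context.append(tok)        # feed the prediction back in
            frame.append(tok)
        frames.append(frame)
    return frames


def toy_next_token(context: List[int]) -> int:
    """Stand-in for a real model: last token plus one, modulo 10."""
    return (context[-1] + 1) % 10


frames = generate_video_tokens([3, 4], toy_next_token, num_frames=2, tokens_per_frame=3)
print(frames)  # prints [[5, 6, 7], [8, 9, 0]]
```

In a real system, `next_token` would be a sampled prediction from the trained model over its visual vocabulary, and the resulting token frames would be decoded back to pixels by the visual tokenizer.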

About separation models & image to video #12