baaivision / Emu3

Next-Token Prediction is All You Need
Apache License 2.0

About separation models & image to video #12

Open AA-Developer opened 1 month ago

AA-Developer commented 1 month ago

Hi, I have looked at the model, and it is really powerful.

The problem is that the models are merged together, which consumes a lot of GPU resources.

If possible, please separate the models from each other, for example the image generator on its own and the video generator on its own. Once separated, they could even run on consumer hardware such as an RTX 4090 with 24GB.

I also hope you will add a separate image-to-video model. Such a model is in high demand, but the ones available so far are fairly weak: the quality is not high, and the movement of people and physics is not stable.

ryanzhangfan commented 1 month ago

We currently release the models for image generation and vision-language understanding separately. The model shards (model-00001-of-00007 to model-00007-of-00007) are pieces of a single checkpoint that transformers splits automatically; they are not separate models.
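To illustrate how a sharded checkpoint hangs together, here is a minimal sketch. It assumes the usual transformers layout, in which an index file carries a `weight_map` from parameter names to shard files. The parameter names, the `.safetensors` extension, and the `group_by_shard` helper are hypothetical and are not taken from the Emu3 repository.

```python
from collections import defaultdict


def group_by_shard(weight_map):
    """Group parameter names by the shard file that stores them."""
    shards = defaultdict(list)
    for param_name, shard_file in weight_map.items():
        shards[shard_file].append(param_name)
    return dict(shards)


# Toy index with made-up parameter names (assumption, for illustration only).
toy_weight_map = {
    "model.embed_tokens.weight": "model-00001-of-00007.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00007.safetensors",
    "model.layers.31.mlp.down_proj.weight": "model-00007-of-00007.safetensors",
}

shards = group_by_shard(toy_weight_map)
for shard_file, params in sorted(shards.items()):
    print(shard_file, len(params))
```

Every shard holds a slice of the same network's weights, so loading one shard alone does not yield a smaller standalone model.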

Our model contains 8B parameters and can run on a 40GB device in bfloat16 precision. The VL model can be deployed across multiple devices using the multi-GPU deployment feature provided by transformers. However, the image generation model still has device mismatch issues in multi-device deployment because of its use of Classifier-Free Guidance (CFG), and we are still working on it.
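As a rough back-of-the-envelope check on these numbers, the sketch below estimates the memory taken by the weights alone. It assumes exactly 8e9 parameters and 2 bytes per bfloat16 value; the `weight_memory_gib` helper is hypothetical, and activations, the KV cache, and any extra CFG batch are not counted.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params, dtype="bfloat16"):
    """Memory for the weights only, in GiB (no activations or KV cache)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: "8B parameters" is taken as exactly 8e9.
emu3_bf16_gib = weight_memory_gib(8e9, "bfloat16")
print(f"weights in bfloat16: {emu3_bf16_gib:.1f} GiB")
```

Under these assumptions the weights come to about 15 GiB, so the rest of the memory footprint during generation comes from runtime state rather than the parameters.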

In addition, our Emu3 paradigm natively supports image-to-video tasks, since we predict the video frame by frame, token by token. We will release the video generation model in the future. Please stay tuned for updates.