Open lizhiqi49 opened 9 months ago
Here are some examples. Empirically, we found that joint training leads to better quality and text-image consistency.
The examples are "a bulldog wearing a black pirate hat" and "an astronaut riding a horse".

[Image comparison, two columns: "No 2D data" vs. "2D+3D Training"]
Thank you! The quality really did improve a lot. I have another question:
You mention in your paper that you sample a batch from the LAION image dataset with a 30% chance. When training with a multi-view batch, the batch size is 4096 (1024x4). What is the batch size for a 2D batch: 1024 or 4096?
We train the model with 32 A100 GPUs distributed on 4 nodes. Each node has a batch size of 256. So for each node:
The mode could be different for each node at the same step.
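To illustrate the point about nodes picking modes independently, here is a minimal sketch, not the authors' training code. It assumes each node draws its own mode at every step with the 30% 2D probability quoted from the paper. The names `sample_mode` and `simulate` are hypothetical, and how the 256 samples are composed in each mode is deliberately left out.

```python
import random

P_2D = 0.3            # chance of a 2D (LAION) batch, as quoted from the paper above
NUM_NODES = 4         # from the reply above
BATCH_PER_NODE = 256  # from the reply above; its per-mode composition is not modeled here


def sample_mode(rng):
    """Return '2d' with probability P_2D, otherwise '3d'."""
    return "2d" if rng.random() < P_2D else "3d"


def simulate(num_steps, seed=0):
    """Hypothetical helper: one independent RNG per node (an assumption),
    so different nodes can land on different modes at the same step."""
    rngs = [random.Random(seed + node) for node in range(NUM_NODES)]
    return [[sample_mode(rng) for rng in rngs] for _ in range(num_steps)]


if __name__ == "__main__":
    for step, modes in enumerate(simulate(num_steps=5)):
        print(f"step {step}: {modes}")
```

Over many steps, roughly 30% of the per-node batches come out as 2D, and some steps mix both modes across nodes.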
OK, thanks a lot.
Very nice work!
I have a question about joint 2D and 3D training. It seems intuitive that training only on the synthetic 3D dataset would degrade the quality of the generated images and make the model overfit to the synthetic 3D data, so introducing high-quality 2D data into training should help. But since you didn't show a comparison with and without 2D data in training, I'd like to know how much it improved generation quality in practice. Thanks.