bytedance / MVDream

Multi-view Diffusion for 3D Generation
MIT License

Does camera conditioning affect style of generated images? #23

Open pgn-dev opened 1 year ago

pgn-dev commented 1 year ago

I was doing a small experiment on MVDream to evaluate consistency across generations with this piece of code:

from PIL import Image
import numpy as np
import torch 

from mvdream.camera_utils import get_camera
from mvdream.ldm.models.diffusion.ddim import DDIMSampler
from mvdream.model_zoo import build_model

# Text-to-multi-view helper: encode the prompt, attach camera conditioning, run DDIM sampling, and decode to uint8 images.
def t2i(model, image_size, prompt, uc, sampler, step=20, scale=7.5, batch_size=8, ddim_eta=0., dtype=torch.float32, device="cuda", camera=None, num_frames=1, x_T=None):
    if type(prompt)!=list:
        prompt = [prompt]
    with torch.no_grad(), torch.autocast(device_type=device, dtype=dtype):
        c = model.get_learned_conditioning(prompt).to(device)
        c_ = {"context": c.repeat(batch_size,1,1)}
        uc_ = {"context": uc.repeat(batch_size,1,1)}
        if camera is not None:
            c_["camera"] = uc_["camera"] = camera
            c_["num_frames"] = uc_["num_frames"] = num_frames

        shape = [4, image_size // 8, image_size // 8]
        samples_ddim, _ = sampler.sample(S=step, conditioning=c_,
                                        batch_size=batch_size, shape=shape,
                                        verbose=False, 
                                        unconditional_guidance_scale=scale,
                                        unconditional_conditioning=uc_,
                                        eta=ddim_eta, x_T=x_T)
        x_sample = model.decode_first_stage(samples_ddim)
        x_sample = torch.clamp((x_sample + 1.0) / 2.0, min=0.0, max=1.0)
        x_sample = 255. * x_sample.permute(0,2,3,1).cpu().numpy()

    return list(x_sample.astype(np.uint8))

device = "cuda"
model = build_model("sd-v2.1-base-4view", ckpt_path=None)
model.device = device
model.to(device)
model.eval()

sampler = DDIMSampler(model)
uc = model.get_learned_conditioning( [""] ).to(device)

torch.manual_seed(12345)
torch.cuda.manual_seed_all(12345)

# Fix the initial DDIM noise so both camera sets start from identical latents (256 // 8 = 32 latent resolution).
fixed_noise = torch.randn([8,4,32,32], device=device)

cameras = []

# Two 4-view camera rigs at the same elevation, differing only in the starting azimuth (90 vs. 60 degrees);
# each rig is repeated twice to fill the batch of 8 (two groups of num_frames=4).
for azimuth_start in [90, 60]:
    camera = get_camera(4, elevation=15, azimuth_start=azimuth_start, azimuth_span=360)
    camera = camera.repeat(2,1).to(device)
    cameras.append(camera)

images = []

prompt = "gandalf smiling, 3D asset"

for camera in cameras:
     img = t2i(model, 256, prompt, uc, sampler, step=50, scale=10., batch_size=8, ddim_eta=0.0, 
                dtype=torch.float16, device=device, camera=camera, num_frames=4, x_T=fixed_noise)
     img = np.concatenate(img, 1)
     images.append(img)

images = np.concatenate(images, 0)

Image.fromarray(images).save("gandalf.png")

TL;DR: the code freezes the noise for two generations with different sets of camera angles.

The output looks like this:

(gandalf.png: two rows of generated views, one row per camera set)

Although the styles for the two sets of camera angles are similar, they are not identical. So I would not be able to generate additional views for one (prompt, start seed) pair in a separate, independent generation unless I use the exact same set of camera positions.

Is this expected? Does the MVDream training regime introduce camera dependent styles?

yanjk3 commented 1 year ago

I think you can set a larger batch size to achieve "create different views for one pair of (prompt, start seed)". That is, you have to get all the results in a single round of generation (with a batch size of 8). Otherwise, if you split that one round of generation into two rounds (batch size 4 for 2 rounds), it fails to be consistent, as the two separate generation processes do not share the 3D attention.

For example, do not use the for azimuth_start in [90, 60]: loop; simply set --num_frames in t2i.py to 8, 12, etc. This may address your problem. Or you can specify the azimuths you want.
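Something like this, a rough and untested sketch reusing t2i, model, sampler, uc, prompt, device, and fixed_noise from the snippet above (it assumes the public checkpoint behaves reasonably with 8 views; sampling all azimuths in one round lets every view share the 3D attention):

# Untested sketch: one 8-view camera rig, one sampling round.
num_views = 8
camera = get_camera(num_views, elevation=15, azimuth_start=90, azimuth_span=360).to(device)  # 8 evenly spaced azimuths

img = t2i(model, 256, prompt, uc, sampler, step=50, scale=10., batch_size=num_views,
          ddim_eta=0.0, dtype=torch.float16, device=device, camera=camera,
          num_frames=num_views, x_T=fixed_noise)  # fixed_noise is already [8,4,32,32]
Image.fromarray(np.concatenate(img, 1)).save("gandalf_8views.png")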

Besides, I think the MVDream training does not introduce camera-dependent styles. In my opinion, the camera pose only affects the consistency of the generated object, not the generation style.

pgn-dev commented 1 year ago

I think you can set a larger batch size to achieve "create different views for one pair of (prompt, start seed)". That is, you have to get all the results in a single round of generation (with a batch size of 8). Otherwise, if you split that one round of generation into two rounds (batch size 4 for 2 rounds), it fails to be consistent, as the two separate generation processes do not share the 3D attention.

For example, do not use the for azimuth_start in [90, 60]: loop; simply set --num_frames in t2i.py to 8, 12, etc. This may address your problem. Or you can specify the azimuths you want.

I believe the public model was only trained to generate 4 views at a time, so I'm not sure how consistent it would be for 8 or 12 views. Moreover, consistent generation across separate processes would be interesting for 3D reconstruction.

yanjk3 commented 1 year ago

I think you can set a larger batch size to achieve "create different views for one pair of (prompt, start seed)". That is, you have to get all the results in a single round of generation (with a batch size of 8). Otherwise, if you split that one round of generation into two rounds (batch size 4 for 2 rounds), it fails to be consistent, as the two separate generation processes do not share the 3D attention. For example, do not use the for azimuth_start in [90, 60]: loop; simply set --num_frames in t2i.py to 8, 12, etc. This may address your problem. Or you can specify the azimuths you want.

I believe the public model was only trained to generate 4 views at a time, so I'm not sure how consistent it would be for 8 or 12 views. Moreover, consistent generation across separate processes would be interesting for 3D reconstruction.

I have tried using MVDream to generate more than 4 views and it works. I think the model actually learns a strong prior from the camera pose. As for consistent generation across separate processes, I currently have no idea. But in 3D reconstruction or text-to-3D, that consistency seems to be provided naturally by the NeRF (I guess). That is, thanks to the NeRF, even if the generation results across multiple processes are not that consistent, the generated 3D content is still acceptable. This is based on the experiments I have done, so it may not be completely accurate. Looking forward to more discussion.

joshkiller commented 7 months ago

@pgn-dev, @yanjk3 Hi guys, I will be starting an exciting five-and-a-half-month internship on 3D generative AI, and I was wondering if you could share some of your experiences in that field with me. I'm just a beginner with 3D generation, so I'm looking forward to hearing from you. Best regards