ashawkey / stable-dreamfusion

Text-to-3D & Image-to-3D & Mesh Exportation with NeRF + Diffusion.
Apache License 2.0

Update for stable diffusion v2.0; Difference between encoders of stable diffusion & openclip #100


Junyi42 commented 1 year ago

Hey, I was trying the most recent Stable Diffusion v2 and found that only the changes below are needed to make it run well.

In sd.py, from:

    # 1. Load the autoencoder model which will be used to decode the latents into image space. 
    self.vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae", use_auth_token=self.token).to(self.device)

    # 2. Load the tokenizer and text encoder to tokenize and encode the text. 
    self.tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    self.text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(self.device)

    # 3. The UNet model for generating the latents.
    self.unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet", use_auth_token=self.token).to(self.device)

change to:

    # 1. Load the autoencoder model which will be used to decode the latents into image space.
    self.vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-2-base", subfolder="vae", use_auth_token=self.token).to(self.device)

    # 2. Load the tokenizer and text encoder to tokenize and encode the text.
    self.tokenizer = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2-base", subfolder="tokenizer", use_auth_token=self.token)
    self.text_encoder = CLIPTextModel.from_pretrained("stabilityai/stable-diffusion-2-base", subfolder="text_encoder", use_auth_token=self.token).to(self.device)

    # 3. The UNet model for generating the latents.
    self.unet = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-2-base", subfolder="unet", use_auth_token=self.token).to(self.device)
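
For reference, the traceback later in this thread shows that the repo's sd.py already routes these loads through a single model_key variable, so the version switch can live in one place. Below is a minimal self-contained sketch of that idea (the load_sd_modules helper and the version-to-repo mapping are my illustration, not necessarily the repo's exact code):

    # Sketch: pick the Hub repo once from the requested version, then load
    # every submodule from it (helper name and mapping are assumptions).
    from diffusers import AutoencoderKL, UNet2DConditionModel
    from transformers import CLIPTextModel, CLIPTokenizer

    def load_sd_modules(sd_version, device, token=True):
        if sd_version == '2.0':
            model_key = "stabilityai/stable-diffusion-2-base"
        else:
            model_key = "runwayml/stable-diffusion-v1-5"
        vae = AutoencoderKL.from_pretrained(model_key, subfolder="vae", use_auth_token=token).to(device)
        tokenizer = CLIPTokenizer.from_pretrained(model_key, subfolder="tokenizer", use_auth_token=token)
        text_encoder = CLIPTextModel.from_pretrained(model_key, subfolder="text_encoder", use_auth_token=token).to(device)
        unet = UNet2DConditionModel.from_pretrained(model_key, subfolder="unet", use_auth_token=token).to(device)
        return vae, tokenizer, text_encoder, unet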

Two points really confuse me:

  1. The results I get from 2.0 seem worse than those from 1.5 for some prompts.
  2. Stable Diffusion 2.0 switched to an OpenCLIP text encoder, so is loading the tokenizer and text encoder from the model's own subfolders (as above), rather than from openai/clip-vit-large-patch14, the right approach here?

Any help will be greatly appreciated!

ashawkey commented 1 year ago

@Junyi42 Hi, thanks for the effort!

  • I'm trying 2.0-base too, what prompts are you using that generate worse results compared to 1.5?
  • I think the submodule should work too, and for 2.0 this is the only choice.

flobotics commented 1 year ago

@ashawkey I tried the new version with Stable Diffusion 2.0, but I get this error. The previous version was running; I only did a `git pull`. What am I doing wrong?

 python main.py --text "a hamburger" --workspace trial2 -O
Namespace(text='a hamburger', negative='', O=True, O2=False, test=False, save_mesh=False, eval_interval=10, workspace='trial2', guidance='stable-diffusion', seed=0, iters=10000, lr=0.001, ckpt='latest', cuda_ray=True, max_steps=512, num_steps=64, upsample_steps=32, update_extra_interval=16, max_ray_batch=4096, albedo=False, albedo_iters=1000, uniform_sphere_rate=0.5, bg_radius=1.4, density_thresh=10, fp16=True, backbone='grid', sd_version='2.0', w=64, h=64, jitter_pose=False, bound=1, dt_gamma=0, min_near=0.1, radius_range=[1.0, 1.5], fovy_range=[40, 70], dir_text=True, suppress_face=False, angle_overhead=30, angle_front=60, lambda_entropy=0.0001, lambda_opacity=0, lambda_orient=0.01, lambda_smooth=0, gui=False, W=800, H=800, radius=3, fovy=60, light_theta=60, light_phi=0, max_spp=1)
NeRFNetwork(
  (encoder): GridEncoder: input_dim=3 num_levels=16 level_dim=2 resolution=16 -> 2048 per_level_scale=1.3819 params=(903480, 2) gridtype=tiled align_corners=False interpolation=linear
  (sigma_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=32, out_features=64, bias=True)
      (1): Linear(in_features=64, out_features=64, bias=True)
      (2): Linear(in_features=64, out_features=4, bias=True)
    )
  )
  (encoder_bg): FreqEncoder: input_dim=3 degree=4 output_dim=27
  (bg_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=27, out_features=64, bias=True)
      (1): Linear(in_features=64, out_features=3, bias=True)
    )
  )
)
[INFO] try to load hugging face access token from the default place, make sure you have run `huggingface-cli login`.
[INFO] loading stable diffusion...
The config attributes {'dual_cross_attention': False, 'use_linear_projection': True} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Traceback (most recent call last):
  File "C:\Users\SuperUserName\git\stable-dreamfusion\main.py", line 141, in <module>
    guidance = StableDiffusion(device, opt.sd_version)
  File "C:\Users\SuperUserName\git\stable-dreamfusion\nerf\sd.py", line 47, in __init__
    self.unet = UNet2DConditionModel.from_pretrained(model_key, subfolder="unet", use_auth_token=self.token).to(self.device)
  File "C:\Users\SuperUserName\anaconda3\lib\site-packages\diffusers\modeling_utils.py", line 412, in from_pretrained
    model, unused_kwargs = cls.from_config(
  File "C:\Users\SuperUserName\anaconda3\lib\site-packages\diffusers\configuration_utils.py", line 169, in from_config
    model = cls(**init_dict)
  File "C:\Users\SuperUserName\anaconda3\lib\site-packages\diffusers\configuration_utils.py", line 406, in inner_init
    init(self, *args, **init_kwargs)
  File "C:\Users\SuperUserName\anaconda3\lib\site-packages\diffusers\models\unet_2d_condition.py", line 135, in __init__
    down_block = get_down_block(
  File "C:\Users\SuperUserName\anaconda3\lib\site-packages\diffusers\models\unet_blocks.py", line 65, in get_down_block
    return CrossAttnDownBlock2D(
  File "C:\Users\SuperUserName\anaconda3\lib\site-packages\diffusers\models\unet_blocks.py", line 508, in __init__
    out_channels // attn_num_head_channels,
TypeError: unsupported operand type(s) for //: 'int' and 'list'
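
This first failure is most likely a diffusers version mismatch rather than a repo bug: the SD 2.0 UNet config describes attention heads as a per-block list (and sets use_linear_projection, per the warning above), which releases predating SD 2.0 support parse as a plain int, hence the int // list error. A minimal hedged guard, assuming v0.9.0 as the first release with SD 2.0 support:

    # Sketch: fail early if the installed diffusers predates Stable
    # Diffusion 2.x support (v0.9.0 is an assumed threshold).
    from packaging import version
    import diffusers

    if version.parse(diffusers.__version__) < version.parse("0.9.0"):
        raise RuntimeError(
            f"diffusers {diffusers.__version__} is too old for SD 2.x models; "
            "run: pip install --upgrade diffusers[torch]"
        )
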
flobotics commented 1 year ago

Inside the Anaconda prompt, I ran `pip install --upgrade diffusers[torch]`. Then it complained about missing tensorboard, which I installed with `pip install tensorboard`. Now it returns:

 python main.py --text "a hamburger" --workspace trial2 -O
Namespace(text='a hamburger', negative='', O=True, O2=False, test=False, save_mesh=False, eval_interval=10, workspace='trial2', guidance='stable-diffusion', seed=0, iters=10000, lr=0.001, ckpt='latest', cuda_ray=True, max_steps=512, num_steps=64, upsample_steps=32, update_extra_interval=16, max_ray_batch=4096, albedo=False, albedo_iters=1000, uniform_sphere_rate=0.5, bg_radius=1.4, density_thresh=10, fp16=True, backbone='grid', sd_version='2.0', w=64, h=64, jitter_pose=False, bound=1, dt_gamma=0, min_near=0.1, radius_range=[1.0, 1.5], fovy_range=[40, 70], dir_text=True, suppress_face=False, angle_overhead=30, angle_front=60, lambda_entropy=0.0001, lambda_opacity=0, lambda_orient=0.01, lambda_smooth=0, gui=False, W=800, H=800, radius=3, fovy=60, light_theta=60, light_phi=0, max_spp=1)
NeRFNetwork(
  (encoder): GridEncoder: input_dim=3 num_levels=16 level_dim=2 resolution=16 -> 2048 per_level_scale=1.3819 params=(903480, 2) gridtype=tiled align_corners=False interpolation=linear
  (sigma_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=32, out_features=64, bias=True)
      (1): Linear(in_features=64, out_features=64, bias=True)
      (2): Linear(in_features=64, out_features=4, bias=True)
    )
  )
  (encoder_bg): FreqEncoder: input_dim=3 degree=4 output_dim=27
  (bg_net): MLP(
    (net): ModuleList(
      (0): Linear(in_features=27, out_features=64, bias=True)
      (1): Linear(in_features=64, out_features=3, bias=True)
    )
  )
)
[INFO] try to load hugging face access token from the default place, make sure you have run `huggingface-cli login`.
[INFO] loading stable diffusion...
C:\Users\SuperUserName\anaconda3\lib\site-packages\diffusers\utils\deprecation_utils.py:35: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddim.DDIMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  warnings.warn(warning + message, FutureWarning)
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 308/308 [00:00<00:00, 309kB/s]
C:\Users\SuperUserName\anaconda3\lib\site-packages\huggingface_hub\file_download.py:123: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\SuperUserName\.cache\huggingface\diffusers. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
[INFO] loaded stable diffusion!
[INFO] Trainer: df | 2022-12-03_17-13-08 | cuda | fp16 | trial2
[INFO] #parameters: 1815479
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
==> Start Training trial2 Epoch 1, lr=0.010000 ...
  0% 0/100 [00:00<?, ?it/s]╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Users\SuperUserName\git\stable-dreamfusion\main.py:160 in <module>                                    │
│                                                                                                  │
│   157 │   │   │   valid_loader = NeRFDataset(opt, device=device, type='val', H=opt.H, W=opt.W,   │
│   158 │   │   │                                                                                  │
│   159 │   │   │   max_epoch = np.ceil(opt.iters / len(train_loader)).astype(np.int32)            │
│ ❱ 160 │   │   │   trainer.train(train_loader, valid_loader, max_epoch)                           │
│                                                                                                  │
│ C:\Users\SuperUserName\git\stable-dreamfusion\nerf\utils.py:486 in train                                 │
│                                                                                                  │
│   483 │   │   for epoch in range(self.epoch + 1, max_epochs + 1):                                │
│   484 │   │   │   self.epoch = epoch                                                             │
│   485 │   │   │                                                                                  │
│ ❱ 486 │   │   │   self.train_one_epoch(train_loader)                                             │
│   487 │   │   │                                                                                  │
│   488 │   │   │   if self.workspace is not None and self.local_rank == 0:                        │
│   489 │   │   │   │   self.save_checkpoint(full=True, best=False)                                │
│                                                                                                  │
│ C:\Users\SuperUserName\git\stable-dreamfusion\nerf\utils.py:698 in train_one_epoch                       │
│                                                                                                  │
│   695 │   │   │   # update grid every 16 steps                                                   │
│   696 │   │   │   if self.model.cuda_ray and self.global_step % self.opt.update_extra_interval   │
│   697 │   │   │   │   with torch.cuda.amp.autocast(enabled=self.fp16):                           │
│ ❱ 698 │   │   │   │   │   self.model.update_extra_state()                                        │
│   699 │   │   │                                                                                  │
│   700 │   │   │   self.local_step += 1                                                           │
│   701 │   │   │   self.global_step += 1                                                          │
│                                                                                                  │
│ C:\Users\SuperUserName\anaconda3\lib\site-packages\torch\autograd\grad_mode.py:27 in decorate_context    │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ C:\Users\SuperUserName\git\stable-dreamfusion\nerf\renderer.py:625 in update_extra_state                 │
│                                                                                                  │
│   622 │   │   │   │   │   │   # add noise in [-hgs, hgs]                                         │
│   623 │   │   │   │   │   │   cas_xyzs += (torch.rand_like(cas_xyzs) * 2 - 1) * half_grid_size   │
│   624 │   │   │   │   │   │   # query density                                                    │
│ ❱ 625 │   │   │   │   │   │   sigmas = self.density(cas_xyzs)['sigma'].reshape(-1).detach()      │
│   626 │   │   │   │   │   │   # assign                                                           │
│   627 │   │   │   │   │   │   tmp_grid[cas, indices] = sigmas                                    │
│   628                                                                                            │
│                                                                                                  │
│ C:\Users\SuperUserName\git\stable-dreamfusion\nerf\network_grid.py:150 in density                        │
│                                                                                                  │
│   147 │   def density(self, x):                                                                  │
│   148 │   │   # x: [N, 3], in [-bound, bound]                                                    │
│   149 │   │                                                                                      │
│ ❱ 150 │   │   sigma, albedo = self.common_forward(x)                                             │
│   151 │   │                                                                                      │
│   152 │   │   return {                                                                           │
│   153 │   │   │   'sigma': sigma,                                                                │
│                                                                                                  │
│ C:\Users\SuperUserName\git\stable-dreamfusion\nerf\network_grid.py:80 in common_forward                  │
│                                                                                                  │
│    77 │   │   # x: [N, 3], in [-bound, bound]                                                    │
│    78 │   │                                                                                      │
│    79 │   │   # sigma                                                                            │
│ ❱  80 │   │   h = self.encoder(x, bound=self.bound)                                              │
│    81 │   │                                                                                      │
│    82 │   │   h = self.sigma_net(h)                                                              │
│    83                                                                                            │
│                                                                                                  │
│ C:\Users\SuperUserName\anaconda3\lib\site-packages\torch\nn\modules\module.py:1130 in _call_impl         │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ C:\Users\SuperUserName\git\stable-dreamfusion\gridencoder\grid.py:156 in forward                         │
│                                                                                                  │
│   153 │   │   prefix_shape = list(inputs.shape[:-1])                                             │
│   154 │   │   inputs = inputs.view(-1, self.input_dim)                                           │
│   155 │   │                                                                                      │
│ ❱ 156 │   │   outputs = grid_encode(inputs, self.embeddings, self.offsets, self.per_level_scal   │
│   157 │   │   outputs = outputs.view(prefix_shape + [self.output_dim])                           │
│   158 │   │                                                                                      │
│   159 │   │   #print('outputs', outputs.shape, outputs.dtype, outputs.min().item(), outputs.ma   │
│                                                                                                  │
│ C:\Users\SuperUserName\anaconda3\lib\site-packages\torch\cuda\amp\autocast_mode.py:110 in decorate_fwd   │
│                                                                                                  │
│   107 │   def decorate_fwd(*args, **kwargs):                                                     │
│   108 │   │   if cast_inputs is None:                                                            │
│   109 │   │   │   args[0]._fwd_used_autocast = torch.is_autocast_enabled()                       │
│ ❱ 110 │   │   │   return fwd(*args, **kwargs)                                                    │
│   111 │   │   else:                                                                              │
│   112 │   │   │   autocast_context = torch.is_autocast_enabled()                                 │
│   113 │   │   │   args[0]._fwd_used_autocast = False                                             │
│                                                                                                  │
│ C:\Users\SuperUserName\git\stable-dreamfusion\gridencoder\grid.py:54 in forward                          │
│                                                                                                  │
│    51 │   │   else:                                                                              │
│    52 │   │   │   dy_dx = None                                                                   │
│    53 │   │                                                                                      │
│ ❱  54 │   │   _backend.grid_encode_forward(inputs, embeddings, offsets, outputs, B, D, C, L, S   │
│    55 │   │                                                                                      │
│    56 │   │   # permute back to [B, L * C]                                                       │
│    57 │   │   outputs = outputs.permute(1, 0, 2).reshape(B, L * C)                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: grid_encode_forward(): incompatible function arguments. The following argument types are supported:
    1. (arg0: at::Tensor, arg1: at::Tensor, arg2: at::Tensor, arg3: at::Tensor, arg4: int, arg5: int, arg6: int, arg7: int, arg8: float, arg9: int, arg10: Optional[at::Tensor], arg11: int, arg12: bool) -> None

Invoked with: tensor([[0.0062, 0.0011, 0.0017],
        [0.0064, 0.0054, 0.0135],
        [0.0018, 0.0071, 0.0187],
        ...,
        [0.9993, 0.9997, 0.9817],
        [0.9962, 0.9957, 0.9886],
        [0.9980, 0.9975, 0.9924]], device='cuda:0'), tensor([[-7.7486e-07,  5.3644e-05],
        [-8.2314e-05, -7.3612e-05],
        [-3.8505e-05,  2.6822e-05],
        ...,
        [-6.2644e-05, -2.3842e-06],
        [-7.7724e-05, -8.1122e-05],
        [-1.8597e-05, -7.2241e-05]], device='cuda:0', dtype=torch.float16), tensor([     0,   4920,  18744,  51512, 117048, 182584, 248120, 313656, 379192,
        444728, 510264, 575800, 641336, 706872, 772408, 837944, 903480],
       device='cuda:0', dtype=torch.int32), tensor([[[0., 0.],
         [0., 0.],
         [0., 0.],
         ...,
         [0., 0.],
         [0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.],
         [0., 0.],
         ...,
         [0., 0.],
         [0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.],
         [0., 0.],
         ...,
         [0., 0.],
         [0., 0.],
         [0., 0.]],

        ...,

        [[0., 0.],
         [0., 0.],
         [0., 0.],
         ...,
         [0., 0.],
         [0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.],
         [0., 0.],
         ...,
         [0., 0.],
         [0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.],
         [0., 0.],
         ...,
         [0., 0.],
         [0., 0.],
         [0., 0.]]], device='cuda:0', dtype=torch.float16), 2097152, 3, 2, 16, 0.46666666666666684, 16, None, 1, False, 0
  0% 0/100 [00:00<?, ?it/s]
ashawkey commented 1 year ago

@flobotics Hi, you should rebuild gridencoder too: `pip install ./gridencoder`.
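
For context: the error message above shows the compiled binding accepting 13 arguments (arg0 through arg12) while the call passes 14; the trailing 0 is most likely the interpolation mode that the updated Python wrapper now forwards (note interpolation=linear in the GridEncoder printout). A binary built before the git pull therefore rejects the new call signature, and rebuilding realigns the two:

    # rebuild the extension so the compiled binding matches the updated wrapper
    pip install ./gridencoder
    # assumption, not from this thread: rebuilding the repo's other CUDA
    # extensions (e.g. raymarching) after a git pull is a sensible precaution
    pip install ./raymarching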

flobotics commented 1 year ago

@ashawkey Thanks, it works.

Whether the results are better/faster, I don't know yet :) (still interested in cloud-GPU usage :))

Good work!

Junyi42 commented 1 year ago

@Junyi42 Hi, thanks for the effort!

  • I'm trying 2.0-base too, what prompts are you using that generate worse results compared to 1.5?
  • I think the submodule should work too, and for 2.0 this is the only choice.

Thanks for the reply!

  1. I tried "a doll", "a hotdog", and "a boy" with Stable Diffusion 2.0; all of them yield a very simple scene, while Stable Diffusion 1.5 produces plausible results. It's worth noting that all of the above trials used the vanilla NeRF backbone, with --albedo and --lambda_entropy 1e-5 set to avoid empty scenes (see the sketch after this list). These settings may affect the results, and I am trying the other backbone too (I'll update once I find something).
  2. Thanks, my confusion is resolved.
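
For concreteness, the kind of run described in point 1 might look like the following sketch, assembled from flags visible in the Namespace dumps above (the prompt and workspace name are illustrative, and --backbone vanilla is my assumption for selecting the vanilla NeRF backbone):

    python main.py --text "a doll" --workspace trial_doll --backbone vanilla --albedo --lambda_entropy 1e-5 --sd_version 2.0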

Thanks again for the wonderful work!