cassiePython / CLIPNeRF

CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields
Apache License 2.0

The input image of clip #11

Closed: LiquidAmmonia closed this issue 2 years ago

LiquidAmmonia commented 2 years ago

Hi, according to your code, in the original NeRF training process one randomly chooses a batch of rays (say 1024) and compares the rendered colors to the ground-truth pixel values sampled at the same ray locations. So the "image" sent to the CLIP loss would just be a batch of random pixels without any semantic information. Is my understanding correct? And if so, how can such an "image" be meaningfully compared to the input prompt?

This is the image sent to the CLIP loss during your training process: [attached screenshot]
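
For reference, this is roughly what vanilla NeRF's ray batching looks like (a minimal sketch with illustrative names, not the repo's exact code):

```python
import torch

# Minimal sketch of vanilla NeRF ray batching (illustrative, not exact repo code).
H, W, N_rand = 400, 400, 1024

# All pixel coordinates of one training image, flattened to (H*W, 2).
coords = torch.stack(
    torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij'), -1
).reshape(-1, 2)

# Random subset: 1024 scattered pixels with no spatial structure between them.
select_inds = torch.randperm(coords.shape[0])[:N_rand]
select_coords = coords[select_inds]  # (1024, 2); loss is per-pixel MSE vs. ground truth
```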

cassiePython commented 2 years ago

In the stylization (editing) process, the rays are not sampled at random; they are selected to form a coherent patch that carries semantics. Please see the 'get_select_inds' function in run_nerf_clip.py.
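
A patch-based selector along those lines looks roughly like this (a sketch of the assumed behavior, not the exact 'get_select_inds' code):

```python
import torch

def get_select_inds_sketch(H, W, sample_scale):
    """Pick one contiguous sample_scale x sample_scale window of pixel coords.

    Sketch of assumed patch-based selection (not the repo's exact code):
    the selected rays reassemble into one coherent image region, so the
    rendered result has semantic content that CLIP can score.
    """
    top = torch.randint(0, H - sample_scale + 1, (1,)).item()
    left = torch.randint(0, W - sample_scale + 1, (1,)).item()
    ys = torch.arange(top, top + sample_scale)
    xs = torch.arange(left, left + sample_scale)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')
    return torch.stack([gy, gx], dim=-1).reshape(-1, 2)  # (sample_scale**2, 2)
```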

You should also set a large value for 'sample_scale' to ensure a clear patch (a small value yields a sparse, low-resolution patch, while a larger one may cause OOM; it depends on your GPU).
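
The rendered colors at those patch coordinates can then be reassembled into an image and resized for CLIP, roughly like this (a sketch; the 224x224 CLIP input size and all names are assumptions, not exact repo code):

```python
import torch
import torch.nn.functional as F

# Sketch: prepare the rendered patch for CLIP (224x224 assumed for a
# standard CLIP ViT; names are illustrative, not the repo's).
sample_scale = 160                      # larger -> sharper patch, more memory
rgb = torch.rand(sample_scale ** 2, 3)  # stand-in for rays rendered at patch coords

patch = rgb.reshape(sample_scale, sample_scale, 3).permute(2, 0, 1).unsqueeze(0)
patch = F.interpolate(patch, size=(224, 224), mode='bilinear', align_corners=False)
# patch is now a (1, 3, 224, 224) image CLIP can encode and compare to the prompt.
```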

LiquidAmmonia commented 2 years ago

Thank you for your reply. I will try that.