How to evaluate CLIP Text-Image Direction Similarity for edit results? #95

Open umutyazgan opened 6 months ago

umutyazgan commented 6 months ago

Hi! I was trying to replicate these CLIP Text-Image Direction Similarity results from the paper:

[image]

Here is how I tried to do it:

  1. I trained a NeRF on the bear example (resolution: 497*369):
    ns-train nerfacto --data data/bear_resized/
  2. Edited the NeRF using in2n:
    ns-train in2n --data data/bear_resized/ --load-dir outputs/bear_resized/nerfacto/2024-03-07_111958/nerfstudio_models/ --pipeline.prompt "Turn the bear into a grizzly bear" --pipeline.guidance-scale 6.5 --pipeline.image-guidance-scale 1.5 --max-num-iterations 4000
  3. Exported 172 views from each, 86 from training view angles and 86 novel views. I did this by manually setting each training view as a keyframe and setting the FPS to 2 and the transition length to 1 s in the ns-viewer to generate camera paths. Then I ran these commands to render 172 view images from both unedited and edited NeRFs:
    ns-render camera-path --load-config outputs/bear_resized/nerfacto/2024-03-07_111958/config.yml --camera-path-filename data/bear_resized/camera_paths/2024-03-08-15-56-50.json --output-format images --output-path renders/bear_resized/images/2024-03-08-15-56-50-extra/
    ns-render camera-path --load-config outputs/bear_resized/in2n/2024-03-07_114733/config.yml --camera-path-filename data/bear_resized/camera_paths/2024-03-08-15-56-50.json --output-format images --output-path renders/grizzly_bear/images/2024-03-08-15-56-50/

    These exports are 1920*1080.

  4. Using the ClipSimilarity module you provided, I compared each of the unedited views to their corresponding edited views. I used these captions: "a statue of a bear", "a grizzly bear". Then I calculated the mean sim_direction over 172 views. My code looks like this:
    ## clip_metrics.py code above

    # Imports used below (torch and ClipSimilarity come from the clip_metrics.py code above)
    import os
    from argparse import ArgumentParser
    from pathlib import Path

    import numpy as np
    from PIL import Image


    def get_file_names(directory, extension):
        """Fetch all file names with a specific extension from the given directory."""
        return [
            file
            for file in os.listdir(directory)
            if file.endswith(extension) and os.path.isfile(os.path.join(directory, file))
        ]


    def read_images(images_dir, extension="png"):
        """Reads image files from given directory and converts them into tensors."""
        file_names = get_file_names(images_dir, extension)
        image_paths = [os.path.join(images_dir, file_name) for file_name in file_names]
        images = [Image.open(image_path).convert("RGB") for image_path in image_paths]
        # Changing the array shape from [h,w,c] to [1,c,w,h]
        images = [torch.Tensor(np.array(image).T[None,:,:,:]) for image in images]
        return images


    def main():
        # Read and parse arguments
        parser = ArgumentParser()
        parser.add_argument("--original-dir", required=True, type=str)
        parser.add_argument("--edited-dir", required=True, type=str)
        parser.add_argument("--original-caption", required=True, type=str)
        parser.add_argument("--edited-caption", required=True, type=str)
        parser.add_argument("--seed", default=42, type=int)
        args = parser.parse_args()
        original_dir = Path(args.original_dir)
        edited_dir = Path(args.edited_dir)
        original_caption = args.original_caption
        edited_caption = args.edited_caption
        # Load original and edited views as tensors
        original_views = read_images(original_dir, "jpg")
        edited_views = read_images(edited_dir, "jpg")
        clip_similarity = ClipSimilarity()
        sim_dirs = []
        # calculate CLIP Direction Similarity for each original/edited image pair
        for i in range(len(original_views)):
            sim_0, sim_1, sim_direction, sim_image = clip_similarity(
                original_views[i], edited_views[i], original_caption, edited_caption
            )
            sim_dirs.append(sim_direction.item())
        # Print mean directional similarity
        print(sum(sim_dirs) / len(sim_dirs))


    if __name__ == "__main__":
        main()

    I ran the script like this:

    python metrics/clip_metrics.py --original-dir renders/bear_resized/images/2024-03-08-15-56-50-extra/ --edited-dir renders/grizzly_bear/images/2024-03-08-15-56-50/ --original-caption "a statue of a bear" --edited-caption "a grizzly bear"

  5. Result: **0.04**, which is significantly lower than the 0.16 reported in the paper.
  6. I trained longer, up to 30k steps. Result: **0.0095**, which is even lower. The edited NeRF also looks worse for some reason, so it makes sense that the mean CLIP Direction Similarity is lower.
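For reference, here is how I understand the quantity averaged in step 4: the cosine similarity between the change in image embeddings (original view to edited view) and the change in text embeddings (original caption to edited caption). The sketch below is only an illustration in plain Python. The 3-dimensional vectors are made-up stand-ins for CLIP embeddings, and the function names are hypothetical, not taken from `ClipSimilarity`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def directional_similarity(img_0, img_1, txt_0, txt_1):
    """Compare the image-embedding change with the text-embedding change."""
    image_delta = [b - a for a, b in zip(img_0, img_1)]
    text_delta = [b - a for a, b in zip(txt_0, txt_1)]
    return cosine(image_delta, text_delta)


def mean_directional_similarity(view_pairs, txt_0, txt_1):
    """Average the per-view score over all (original, edited) embedding pairs."""
    scores = [directional_similarity(i0, i1, txt_0, txt_1) for i0, i1 in view_pairs]
    return sum(scores) / len(scores)


# Toy embeddings (assumption: real CLIP embeddings are much higher-dimensional).
txt_original = [0.0, 0.0, 1.0]
txt_edited = [0.0, 2.0, 1.0]
view_pairs = [
    ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0]),  # image change parallel to the text change
    ([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]),  # image change orthogonal to the text change
]
mean_score = mean_directional_similarity(view_pairs, txt_original, txt_edited)
```

Under this reading, a score near zero means the edit moved the image embedding in a direction largely unrelated to the caption change, which is why the low values above looked suspicious to me.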

I read in the paper that you made 10 edits across 2 scenes for the quantitative evaluation, so maybe the mean scores for the other examples were better. But before trying this on another scene with different edits, I wanted to ask whether I'm on the right track: is this how you calculated these scores, or am I doing something wrong?
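One thing that may be worth ruling out in a script like the one above: `os.listdir` returns entries in arbitrary order, so index `i` in the two lists is not guaranteed to refer to the same camera pose. A minimal sketch of pairing frames by file name is below. It assumes both render directories contain identically named frames for the same camera path, and the helper name `paired_frames` and the frame names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def paired_frames(original_dir, edited_dir, extension="jpg"):
    """Return (original, edited) path pairs matched by file name, in sorted order."""
    original = {p.name: p for p in Path(original_dir).glob(f"*.{extension}")}
    edited = {p.name: p for p in Path(edited_dir).glob(f"*.{extension}")}
    # Fail loudly if a frame exists in only one of the two directories.
    unmatched = set(original) ^ set(edited)
    if unmatched:
        raise ValueError(f"Frames without a counterpart: {sorted(unmatched)}")
    return [(original[name], edited[name]) for name in sorted(original)]


# Demo with empty placeholder files (frame names are made up for illustration).
with tempfile.TemporaryDirectory() as root:
    demo_original = Path(root) / "original"
    demo_edited = Path(root) / "edited"
    for directory in (demo_original, demo_edited):
        directory.mkdir()
        for name in ("00001.jpg", "00000.jpg"):
            (directory / name).touch()
    demo_pairs = [(a.name, b.name) for a, b in paired_frames(demo_original, demo_edited)]
```

Iterating over the returned pairs instead of two independently listed directories would at least guarantee that each original view is compared against the edited view rendered from the same pose.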