ayaanzhaque / instruct-nerf2nerf

Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions (ICCV 2023)
https://instruct-nerf2nerf.github.io/
MIT License

some question about CLIP Text-Image Direction Similarity #48

Closed cwwjyh closed 1 year ago

cwwjyh commented 1 year ago
  1. In the picture below, does the paper mean that two scenes were selected and each was edited ten times, or that a total of ten edits were made across the two scenes? Presumably the CLIP text-image direction similarity is then calculated case by case and the results averaged.
  2. When you calculated the CLIP text-image direction similarity, did you take the whole training set as input? I look forward to your detailed answer. Thank you!
ayaanzhaque commented 1 year ago
  1. It is 10 total edits split across two scenes. Yes, the CLIP metrics are calculated per-scene and then averaged when reported.
  2. We actually use the rendered images from the camera path shown in the renders on the website. We felt it would be fairer to compute the metrics on poses that were not in the capture sequence.

Hopefully these answer your questions!
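For context, the CLIP text-image direction similarity is the cosine similarity between the change in CLIP image embeddings (edited render minus original render) and the change in CLIP text embeddings (target caption minus source caption). A minimal numpy sketch, assuming the CLIP embeddings have already been computed elsewhere; the function names and the per-frame averaging are illustrative, not the authors' exact evaluation code:

```python
import numpy as np

def directional_similarity(img_orig, img_edit, txt_src, txt_tgt):
    """Cosine similarity between the image-embedding change and the
    text-embedding change. Each argument is one CLIP embedding vector."""
    d_img = img_edit - img_orig
    d_txt = txt_tgt - txt_src
    return float(np.dot(d_img, d_txt) /
                 (np.linalg.norm(d_img) * np.linalg.norm(d_txt) + 1e-8))

def scene_score(orig_embs, edit_embs, txt_src, txt_tgt):
    """Average the per-frame directional similarity over a scene's
    rendered camera path; scores are then averaged across scenes."""
    return float(np.mean([directional_similarity(o, e, txt_src, txt_tgt)
                          for o, e in zip(orig_embs, edit_embs)]))
```

When the image-embedding change is parallel to the text-embedding change, the score approaches 1; orthogonal changes score near 0.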

cwwjyh commented 1 year ago


  1. Were the two scenes selected at random?
  2. Were the 10 edits obtained by editing each scene 5 times?
ayaanzhaque commented 1 year ago

The two scenes we used were the face scene and the bear scene. We used all 3 edits of the bear scene and selected 7 edits of the face scene.