ayaanzhaque / instruct-nerf2nerf

Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions (ICCV 2023)
https://instruct-nerf2nerf.github.io/
MIT License

about the quality of the generated video #14

Closed flyingshan closed 1 year ago

flyingshan commented 1 year ago

I tried generating some edited NeRF scenes using in2n with the prompts described in the paper. Here are some examples:

https://user-images.githubusercontent.com/42313652/233237010-a9afe882-5671-4120-9be4-ba5b0e831a02.mp4 (Give him a cowboy hat)

https://user-images.githubusercontent.com/42313652/233237024-640b2366-ae0b-41c6-9e81-54c58aa926fa.mp4 (Make him bald)

https://user-images.githubusercontent.com/42313652/233237040-e9f2002e-81fb-4a51-819b-d0b0a1734cec.mp4 (as a bronze bust)

I found that the quality of the generated scenes is not as good as what the paper shows. I trained the NeRF with nerfacto and trained in2n with the default guidance scale, following the documentation. Can you give me some advice on this? Thank you!

ayaanzhaque commented 1 year ago

The first and second ones look like guidance-scale issues, but I am slightly surprised the results are not better. I will try running a few of these myself and see what the difference is. Are you using the standard in2n model?

flyingshan commented 1 year ago

Yes, the training command is as follows: ns-train in2n --data /workspace/face --load-dir /workspace/outputs/face/nerfacto/2023-04-17_090548/nerfstudio_models --pipeline.prompt "as a bronze bust" --pipeline.guidance-scale 7.5 --pipeline.image-guidance-scale 1.5 --pipeline.ip2p-device cuda:1 --viewer.websocket-port 7008

ayaanzhaque commented 1 year ago

Ok, thank you! There might be an issue in the code I've released; I'm going to take a closer look at it tonight to see what's wrong. Thank you for these results!

flyingshan commented 1 year ago

Thank you! I also found that for some scenes the result looks as good as the results in the paper, for example:

https://user-images.githubusercontent.com/42313652/233240681-0b71260f-9ec6-4e74-8605-b2fe0d4c0299.mp4 “Put him in a suit”

ayaanzhaque commented 1 year ago

I see, that looks pretty good. By the way, I provided the camera paths in those data folders, so you can render the same paths we show on the website. When I was testing the code I was mostly trying it on the farm scene, and it looked good on that end.
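Something like this should work for rendering along one of the provided paths (the config and camera-path filenames here are placeholders, and the flag names can differ slightly between nerfstudio versions):

ns-render --load-config outputs/face/in2n/<timestamp>/config.yml --traj filename --camera-path-filename data/face/camera_paths/<path>.json --output-path renders/face.mp4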

Also, may I ask what resolution you are training the face scene at? Are you using a downscale factor of 2?

flyingshan commented 1 year ago

I just trained it with ns-train nerfacto --data face, so I guess the input images were not downscaled?

ayaanzhaque commented 1 year ago

Oh, I see. Can you re-train both of these at a lower resolution?

Add the following to your nerfacto command, keep it for your in2n command as well, and load the new nerfacto weights:

nerfstudio-data --downscale-factor 2

The only reason I suggest this is that the person scene is already downscaled to 512, and that training went quite well for you. Let me know if it changes anything on the face!
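For reference, the full retraining sequence would look roughly like this, with the timestamp placeholder replaced by your new nerfacto output directory:

ns-train nerfacto --data /workspace/face nerfstudio-data --downscale-factor 2

ns-train in2n --data /workspace/face --load-dir /workspace/outputs/face/nerfacto/<new-timestamp>/nerfstudio_models --pipeline.prompt "as a bronze bust" --pipeline.guidance-scale 7.5 --pipeline.image-guidance-scale 1.5 --pipeline.ip2p-device cuda:1 nerfstudio-data --downscale-factor 2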

flyingshan commented 1 year ago

Ok, I'll try it!

ayaanzhaque commented 1 year ago

I'm trying the cowboy hat with this codebase right now, and it definitely doesn't look as bad as your result. Can you try it with a guidance scale of 6.5 and an image guidance scale of 1.3? These were the parameters used in the paper.
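In your command, that just means changing these two flags and leaving the rest as-is:

--pipeline.guidance-scale 6.5 --pipeline.image-guidance-scale 1.3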

However, there still is a difference between the result I am getting now and the result we have in the paper. I will take a closer look as to what the issue is.

Do you have the latest version of the code pulled?

ayaanzhaque commented 1 year ago

Ok, so I ran the cowboy hat edit on both this repo and my research repo. These are rendered at around 40k total iterations (30k from nerfacto, 10k from in2n). The files are labeled "new" and "old" respectively. The results look pretty similar in both cases, and I'd say both look pretty similar to what is in the paper.

https://user-images.githubusercontent.com/46949246/233277347-10939fb1-2d6a-4f25-834d-3f90a464e16c.mp4

https://user-images.githubusercontent.com/46949246/233277367-f445c40c-102a-4abe-8f8d-4a7fc1009a69.mp4

Here's the exact command that I ran: ns-train in2n --data data/nerfstudio/frederik/ --load-dir outputs/frederik/nerfacto/2023-04-05_193848/nerfstudio_models --pipeline.prompt "give him a cowboy hat" --pipeline.guidance-scale 6.5 --pipeline.image-guidance-scale 1.3 --pipeline.model.near-plane 0.2 nerfstudio-data --scene-scale 2.0 --downscale-factor 2

I guess the few parameters at the end that adjust the scene scale are making a difference? I could also imagine that InstructPix2Pix does not perform well on full-resolution images. Or maybe your results are over-optimized? Not sure what the delta is here.

ayaanzhaque commented 1 year ago

I pushed a minor change (a setting change on the InstructPix2Pix side to improve performance slightly) which may account for the difference. Go ahead and pull the most recent code and let me know what happens.

flyingshan commented 1 year ago

Thank you for your effort! I have run an experiment with the downscale factor set to 2, and it looks much better than the full-resolution result. I will try setting the other parameters as you suggest and using the latest repo, and I'll report the results when I have them.

ayaanzhaque commented 1 year ago

I see, then I suppose it is because InstructPix2Pix doesn't do well on high-resolution images. I'll make a clearer note of that in the readme.

Also, what are your GPU specs? I also forgot to add a note about this, but the model doesn't need to train for more than 10k iterations in most cases, so you could render the model at around 40k total iterations if you would like. Let me know what you find.

ayaanzhaque commented 1 year ago

Here's what the bronze statue edit looks like; I think it looks good. The command is below the video.

https://user-images.githubusercontent.com/46949246/233293892-424a1aac-e8aa-4450-994e-3d0a90bbecb4.mp4

ns-train in2n --data data/nerfstudio/frederik/ --load-dir outputs/frederik/nerfacto/2023-04-05_193848/nerfstudio_models/ --pipeline.prompt "as a bronze statue" --pipeline.guidance-scale 7.5 --pipeline.image-guidance-scale 1.3 --pipeline.model.near-plane 0.2 nerfstudio-data --scene-scale 2.0 --downscale-factor 2

This definitely leads me to believe that using high-resolution images with InstructPix2Pix is the problem. Let me know what your results look like.

flyingshan commented 1 year ago

I have tried training these two prompts using your commands and the latest repo, and the results look very similar to yours:

https://user-images.githubusercontent.com/42313652/233519621-cf87912d-24ea-467b-8434-689cbabff50e.mp4

https://user-images.githubusercontent.com/42313652/233519593-89f7b280-301d-4969-90b4-03a0d7cca4fa.mp4

Btw, I notice the readme gives tips on how to set the guidance-scale-related hyperparameters, but I also notice you set other hyperparameters like --pipeline.model.near-plane and --scene-scale in your commands. Can you give us some advice on how to set these?

ayaanzhaque commented 1 year ago

Great to see your results look good!

In terms of the near plane and scene scale, those parameters were set because nerfacto is primarily meant for unbounded, real-world scenes. This face scene, however, is a front-facing capture, so nerfacto creates a lot of floaters. I'd recommend checking the main nerfstudio issues for more tips on how to train on front-facing scenes!
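As a rough starting point for the face scene, something like the command below combines those settings (substitute your own data and checkpoint paths; the exact values may need tuning per scene):

ns-train in2n --data /workspace/face --load-dir /workspace/outputs/face/nerfacto/<timestamp>/nerfstudio_models --pipeline.prompt "as a bronze bust" --pipeline.guidance-scale 7.5 --pipeline.image-guidance-scale 1.3 --pipeline.model.near-plane 0.2 nerfstudio-data --scene-scale 2.0 --downscale-factor 2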