Open OrangeSodahub opened 3 months ago
@Tangshitao I'm a bit confused about whether this codebase can be trained correctly; I'm sure inference works fine. I wonder whether anyone else has trained it successfully.
I suspect your motion is too large, which could cause inconsistent generation. Can you try data with smaller motions?
@Tangshitao Thanks. But the issue still exists even with small motions:
Even if overly large motion causes inconsistent generation, the quality of the generations shouldn't degrade like this. Right now the generations are completely meaningless.
@Tangshitao I observe exactly the same problem. Very curious about the reason.
I suspect there might be issues with the extrinsics or intrinsics. Can you try training with ScanNet data?
I'm sure the cameras are fine. I found that the method of sampling frames to form a batch is crucial. How do you produce the key_frames_0.6.txt files?
I compute the overlap between every pair of frames within a video and record the frame pairs whose overlap is larger than 0.6.
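The overlap described above can be sketched as follows. This is a minimal, hypothetical reconstruction assuming posed RGB-D frames with shared intrinsics: each pixel of frame A is back-projected with its depth and reprojected into frame B, and the overlap is the fraction of pixels that land inside B's image bounds. `frame_overlap` and `overlapping_pairs` are illustrative names, not functions from the released code.

```python
import numpy as np

def frame_overlap(depth_a, K, pose_a, pose_b, h, w):
    """Fraction of frame A's pixels that reproject inside frame B.

    depth_a: (h, w) depth map of frame A
    K: (3, 3) shared camera intrinsics
    pose_a, pose_b: (4, 4) camera-to-world extrinsics
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Homogeneous pixel coordinates, shape (3, h*w), row-major to match depth_a.
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T.astype(np.float64)
    # Back-project into camera A, then lift to world coordinates.
    cam_a = (np.linalg.inv(K) @ pix) * depth_a.reshape(1, -1)
    world = pose_a @ np.vstack([cam_a, np.ones((1, cam_a.shape[1]))])
    # Transform into camera B and project onto its image plane.
    cam_b = (np.linalg.inv(pose_b) @ world)[:3]
    z = cam_b[2]
    uv = (K @ cam_b)[:2] / np.maximum(z, 1e-6)
    inside = (z > 1e-6) & (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h)
    return float(inside.mean())

def overlapping_pairs(depths, K, poses, h, w, thresh=0.6):
    """Record all ordered frame pairs whose overlap exceeds the threshold."""
    n = len(depths)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and frame_overlap(depths[i], K, poses[i], poses[j], h, w) > thresh]
```

A file like key_frames_0.6.txt would then just be the pair list written out, with 0.6 as the threshold. Occlusion handling (checking reprojected depth against frame B's depth map) is omitted here for brevity.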
@Tangshitao Hi, I'm quite confused. I tried your codebase, and I also integrated your training script (depth) into another diffusers pipeline and tried again; this issue always exists -- the content gradually disappears during training.
I really want to know whether anyone has trained the depth version successfully, because I can't figure out what the problem is. Big thanks!
Have you tried training the model with the released code instead of integrating it into another codebase?
Yes, I tried that first, then the integrated version. Both produce similar results: the outputs gradually disappear.
By the way, can I train your depth version model but replace SD-2-Depth with SD1.5? I'm not sure whether that would affect the performance of the correspondence-aware attention layer.
Hi, I'm trying to train the depth-conditioned model from scratch on custom data, and I'm confused by the results:
val at step 70: ![image](https://github.com/Tangshitao/MVDiffusion/assets/54439582/5135447c-7b23-426a-92da-61f43cf17782)
val at step 140: ![image](https://github.com/Tangshitao/MVDiffusion/assets/54439582/4ffc745d-ce1f-4f8f-b856-ba8c39cb18ed)
val at step 210: ![image](https://github.com/Tangshitao/MVDiffusion/assets/54439582/cca7dac4-8961-4ed7-b83f-9e3ef6285de4)
val at step 280: ![image](https://github.com/Tangshitao/MVDiffusion/assets/54439582/70fbbb9e-c551-49c2-90d2-c077f95cb989)
val at step 350: ![image](https://github.com/Tangshitao/MVDiffusion/assets/54439582/72c14df2-f4bf-44ba-a337-f12b50e94845)
val at step 420: ![image](https://github.com/Tangshitao/MVDiffusion/assets/54439582/a5e193f1-2c04-4f10-be7a-d844682a7f02)
As the example shows, the predicted outputs gradually become more blurred until there is no content at all. Is this expected behavior? Or have you encountered this issue before? Thanks.