Hello, nice work and I am following it.
In my experiments, I found that GPU memory usage grows as the number of viewpoints increases, and with too many viewpoints the results from some perspectives look strange. I guess this is because the SD base model is trained on 256×256 images, but the SyncMVD pipeline has a concat operation; for example, 8 viewpoints lead to generating 8×256×256 pixels in one time step. So when the number of viewpoints grows too large, the noise prediction may fail.
I don't know if my understanding is correct; any help would be appreciated. Thanks!
Processing all views in one batch will hit the memory issue you describe. It should be possible to group the views into smaller batches, since this method does not use fully-connected pairwise attention across all views.
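A minimal sketch of the batching idea above: instead of running noise prediction on all view latents at once, process them in fixed-size chunks so peak memory is bounded by the chunk size rather than the total number of viewpoints. The `predict_noise` callable and `chunk_size` parameter here are hypothetical stand-ins, not names from the SyncMVD codebase.

```python
def predict_noise_in_chunks(view_latents, predict_noise, chunk_size=4):
    """Run noise prediction over views in smaller sub-batches.

    Peak memory then scales with `chunk_size` instead of the total
    number of viewpoints. `predict_noise` is a hypothetical stand-in
    for the model's noise-prediction call.
    """
    outputs = []
    for start in range(0, len(view_latents), chunk_size):
        chunk = view_latents[start:start + chunk_size]
        outputs.extend(predict_noise(chunk))  # one small forward pass per chunk
    return outputs

# Toy usage with a stand-in "model" that negates each latent,
# just to show the chunked results match a single full-batch pass.
latents = list(range(8))  # pretend these are 8 view latents
result = predict_noise_in_chunks(latents, lambda b: [-x for x in b], chunk_size=3)
```

Since cross-view synchronization in this method does not require all-pairs attention, the per-chunk outputs can simply be concatenated back in view order.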