Reproducing results with multiple GPUs

RuijieZhu94 commented 3 months ago

Hi Yuedong, thank you for open-sourcing your great work!

When I trained the model on 3 Nvidia RTX 3090s (batch size 4 per GPU), I got significantly worse results on re10k:

psnr 22.12379274863242
ssim 0.7298626045353773
lpips 0.22073094525619313

Will a smaller batch size or multi-GPU training significantly affect the model's performance? By the way, with the official weights I get results consistent with the paper:

psnr 26.386906073201686
ssim 0.8690403559103327
lpips 0.12837660807718004

donydchen commented 3 months ago

Hi @RuijieZhu94, thanks for your interest in our work.

Yes, there was a small bug in the feature extraction, introduced during code cleaning. It is mainly related to the (batch, view) dimension conversion and does not affect testing, since testing keeps batch_size=1. We have already corrected it in our last commit (297338f54d74e7beb4ca5e0700dee22090b836a4). We have re-trained the model after fixing this bug, using both single-GPU and multi-GPU configurations, and both reproduced the results of the released model.
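To illustrate why a (batch, view) mix-up can go unnoticed at batch_size=1, here is a toy sketch. It is not taken from the mvsplat code: the function names are hypothetical, and the "buggy" variant is only one assumed way such a conversion could go wrong. Each feature is replaced by a (batch_index, view_index) label so that its final position is easy to check.

```python
# Toy illustration only; not the actual mvsplat feature extractor.
# Each "feature" is a (batch_index, view_index) label.

def make_batch(b, v):
    """Build a (b, v) nested list of labels."""
    return [[(i, j) for j in range(v)] for i in range(b)]

def flatten_bv(x):
    """(b, v) -> flat list of length b*v, batch-major order."""
    return [item for sample in x for item in sample]

def unflatten_correct(flat, b, v):
    """Split the batch-major flat list back into b samples of v views."""
    return [flat[i * v:(i + 1) * v] for i in range(b)]

def unflatten_buggy(flat, b, v):
    """Hypothetical bug: read the flat list as if it were view-major (v, b)."""
    return [[flat[j * b + i] for j in range(v)] for i in range(b)]

def roundtrip_ok(b, v, unflatten):
    """True if flatten followed by unflatten restores the original layout."""
    x = make_batch(b, v)
    return unflatten(flatten_bv(x), b, v) == x

if __name__ == "__main__":
    for b in (1, 4):
        print("batch", b,
              "correct:", roundtrip_ok(b, 2, unflatten_correct),
              "buggy:", roundtrip_ok(b, 2, unflatten_buggy))
```

With a single sample the two orderings coincide, so the buggy round trip still passes. With more than one sample, views from different samples get mixed, which would only show up during training with batch_size > 1.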

Would you mind updating the code to our last commit (297338f54d74e7beb4ca5e0700dee22090b836a4) and re-training the model? Let us keep this issue open for you to update the results. For quicker debugging: with the updated code your model should reach around PSNR=23 at step 10K, whereas it stays around PSNR=20 at step 10K if it still contains the aforementioned feature extraction bug.

By the way, we use batch_size=14 by default (a smaller batch_size might slightly harm performance, but not by that much). The LPIPS weight is 0.05 and the lr scheduler is 1cycle with lr=2e-4, as updated in this commit (660f49cb7127166af6221f1df2c0d09606b56270). Please make sure your code base (if you have made any changes) is synchronised with the aforementioned commits.
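For readers unfamiliar with a 1cycle schedule, the sketch below shows its general shape in plain Python. It is a simplified illustration, not the repository's scheduler: it assumes that lr=2e-4 is the peak learning rate, and the warm-up fraction and divisor values are assumptions patterned on the defaults of PyTorch's `OneCycleLR`, since the thread does not state them. The function name `one_cycle_lr` and the step counts are hypothetical.

```python
import math

def _cos_interp(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (1.0 + math.cos(math.pi * pct))

def one_cycle_lr(step, total_steps, max_lr=2e-4,
                 pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Simplified two-phase 1cycle schedule (all defaults are assumptions)."""
    initial_lr = max_lr / div_factor        # learning rate at step 0
    min_lr = initial_lr / final_div_factor  # learning rate at the last step
    peak_step = pct_start * total_steps     # step at which max_lr is reached
    if step <= peak_step:
        # Phase 1: warm up from initial_lr to max_lr.
        return _cos_interp(initial_lr, max_lr, step / peak_step)
    # Phase 2: anneal from max_lr down to min_lr.
    pct = (step - peak_step) / (total_steps - peak_step)
    return _cos_interp(max_lr, min_lr, pct)

if __name__ == "__main__":
    total = 1000  # hypothetical number of training steps
    for s in (0, 150, 300, 650, 1000):
        print(s, one_cycle_lr(s, total))
```

The learning rate rises to its peak during the first part of training and then decays towards a very small value. For actual training, the scheduler configured in the commit above should be used rather than this sketch.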

RuijieZhu94 commented 3 months ago

Hi Yuedong, thanks for your prompt reply. I will retrain the model in the next few days.

RuijieZhu94 commented 3 months ago

Hi Yuedong, I retrained the model with bs=12 and got the following result:

psnr 26.31555430801481
ssim 0.8676635705885196
lpips 0.12932708359573464

Thank you for your help.

boxuLibrary commented 1 month ago

@RuijieZhu94 Could you share the link to the training dataset? I reached out to the author of pixelsplat for the link, but I cannot open it.

RuijieZhu94 commented 1 month ago

Please contact me by email: ruijiezhu@mail.ustc.edu.cn.

donydchen / mvsplat