btsmart / splatt3r

Official repository for Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Training configs #20

Closed ShengyuH closed 1 day ago

ShengyuH commented 1 week ago

hi,

Thank you for open-sourcing this fantastic project! To fully reproduce your results, could you please share the number and type of GPUs you used for training the open-sourced model, as well as how long the training process took?

In the paper, I found the details: "We train our model for 2000 epochs (≈ 500,000 iterations) at a resolution of 512 × 512, using λ_MSE = 1.0 and λ_LPIPS = 0.25. We optimize using the Adam optimizer with a learning rate of 1.0×10⁻⁵, weight decay of 0.05, and gradient clipping at 0.5." However, this doesn't fully cover the hardware or timing specifics of your training setup.
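For concreteness, this is how I read those hyperparameters in a Lightning-style setup (just an illustrative sketch; the class body and the `compute_losses` helper below are my own placeholders, not the repo's actual code):

```python
# Illustrative sketch of the quoted optimization setup, assuming a
# Lightning-style training loop. `compute_losses` is a hypothetical helper.
import torch
import lightning as L

class Model(L.LightningModule):
    def training_step(self, batch, batch_idx):
        mse, lpips = self.compute_losses(batch)  # hypothetical helper
        return 1.0 * mse + 0.25 * lpips          # lambda_MSE = 1.0, lambda_LPIPS = 0.25

    def configure_optimizers(self):
        # Adam with lr 1.0e-5 and weight decay 0.05, as quoted from the paper
        return torch.optim.Adam(self.parameters(), lr=1.0e-5, weight_decay=0.05)

trainer = L.Trainer(
    max_epochs=2000,        # ~500,000 iterations per the paper
    gradient_clip_val=0.5,  # gradient clipping at 0.5
)
```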

Thanks in advance for your help!

Best,
Shengyu

ShengyuH commented 1 week ago

I also have a question regarding reproducing the results in Table 1. Is the released checkpoint the best-performing model? Could you also provide the command needed to fully reproduce the numbers?

I used the released model and ran the following section: https://github.com/btsmart/splatt3r/blob/5ab9f25bb07522424a5c934c4077c697d1546264/main.py#L367. However, I wasn't able to achieve the same results. Also, I noticed that here: https://github.com/btsmart/splatt3r/blob/5ab9f25bb07522424a5c934c4077c697d1546264/main.py#L377 you specified `use_every_n_sample=10`; is this really necessary?

Below is an example of the output I obtained:

```json
{
  "alpha: 0.9, beta: 0.9, apply_mask: True, average_over_mask: False": [
    {
      "test/loss": 0.08159127831459045,
      "test/mse": 0.01828351803123951,
      "test/psnr": 18.183143615722656,
      "test/lpips": 0.25323110818862915,
      "test/ssim": 0.7490078806877136
    }
  ]
}
```

Any insights would be appreciated!

btsmart commented 3 days ago

Hello,

Regarding the hardware, we performed our main experiments on 4 Nvidia RTX A6000 GPUs (each with 48 GB of memory), and I believe the model took around 9 hours to train, although I do not have the exact timings. I have been able to do small experimental training runs on a single RTX 2080 Ti, but I am unsure how long it would take to achieve comparable performance.

Regarding the model checkpoint, the model I previously shared was the one we use for the demo. I have now uploaded the model we used to calculate our metrics here. While this model achieves the best metrics, we did notice that it has a few more artifacts on the edges of the prediction. I was able to reproduce the metrics in the paper by modifying main.py to load the linked checkpoint using MAST3RGaussians.load_from_checkpoint(path_to_ckpt) and skipping the trainer.fit step (then running python main.py configs/main.yaml), as sketched below.
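In case it helps, here is a minimal sketch of that evaluation-only change (the checkpoint path and test dataloader are placeholders, and the import assumes the class is reachable from main.py; the actual script is structured differently):

```python
# Sketch of the evaluation-only run described above; not the literal main.py
# code. Supply your own checkpoint path and test dataloader.
import lightning as L
from main import MAST3RGaussians  # assumes the class is importable from main.py

path_to_ckpt = "checkpoints/metrics_model.ckpt"  # placeholder path
model = MAST3RGaussians.load_from_checkpoint(path_to_ckpt)

trainer = L.Trainer(devices=1)
# trainer.fit(...) is skipped entirely; we only evaluate the loaded weights
trainer.test(model, dataloaders=test_dataloader)  # test_dataloader: your loader
```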

Regarding `use_every_n_sample=10`: originally I was using testing sets of around 50,000 samples, but the code we used to render point clouds (see https://github.com/btsmart/splatt3r/issues/23) was slow and caused bottlenecks during evaluation, so `use_every_n_sample` is just used to reduce the testing sets to around 5,000 samples. A sketch of what it amounts to is below.
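For reference, this is the kind of subsampling `use_every_n_sample` performs (illustrative only; the repo's dataset code may implement it differently):

```python
# Keep every n-th sample to shrink an evaluation set, e.g. ~50,000 -> ~5,000
# when use_every_n_sample=10. Illustrative sketch, not the repo's code.
from torch.utils.data import Subset

def subsample(dataset, use_every_n_sample=10):
    indices = list(range(0, len(dataset), use_every_n_sample))
    return Subset(dataset, indices)
```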

Hope this helps! If you have any more trouble reproducing the results please let me know.

ShengyuH commented 1 day ago

Thanks for the clarification and the new checkpoint. Really appreciate it.