eriksandstroem / SenFuNet

Code for "Learning Online Multi-Sensor Depth Fusion" (SenFuNet). Paper available here: https://arxiv.org/abs/2204.03353

Results reproduction for Scene3D dataset #1

Open BobrG opened 1 month ago

BobrG commented 1 month ago

Hi!

First of all, I would like to thank you for the excellent work on the paper "Learning Online Multi-Sensor Depth Fusion" and for making the code available.

However, I encountered some problems when reproducing the results described in the repository on the Scene3D dataset. I followed the steps covered here to prepare the dataset, then trained the method for ~4 hours on the stonewall scene as described in the training section, and finally tested it on the copyroom scene as described in the test section. For training I used a Tesla P100 GPU.

The results I obtained look worse than those reported in the paper.

Visualization from the paper:

[Screenshot: reconstruction visualization from the paper]

My reproduction results:

[Screenshot: SenFuNet MVSD-34 reproduction results]


Note the hole in the floor of the room reconstruction and the higher error on the cylindrical box near the left wall of the room.

My validation metrics also looked a bit odd, but I did not have any reference values to compare them against:

[Screenshots: validation metric curves]

What do you think about these results? Is it expected for the method to perform like this, or did I do something wrong during training and/or evaluation?

eriksandstroem commented 3 weeks ago

Hi, sorry for the late reply, and thank you for taking an interest in our work! I think your metric curves look reasonable. I am, however, a bit confused about why you have a hole in the floor. This should not happen, since both sensors observe data there, which suggests something is off with the setup. Are you doing ToF + MVS stereo fusion?

BobrG commented 1 week ago

@eriksandstroem thanks a lot for the answer! I have checked the setup and everything looks in line with the description in the repository. The only thing that seems off: I ran the script that saves every 10th frame (mentioned here) as it is provided. As implemented, it works with the paths to the data from the .tar files that you provide for the MVS method (e.g. images/ and stonewall_png/depth/ for the stonewall scene). Eventually it leaves only ~30 images and depth maps for training, while there should probably be ~300 of them. Is this correct?
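
For reference, here is roughly what the subsampling I ran boils down to (a minimal sketch only; the directory names are the ones from the extracted MVS .tar files, and the actual script in the repo may handle paths differently):

```python
import shutil
from pathlib import Path

# Rough sketch of the "keep every 10th frame" step; the source directories
# are the ones extracted from the MVS .tar files (example names only) and
# the real script in the repo may differ.
SRC_DIRS = ["images", "stonewall_png/depth"]

for src in SRC_DIRS:
    src_dir = Path(src)
    dst_dir = Path(str(src_dir) + "_every_10th")
    dst_dir.mkdir(parents=True, exist_ok=True)
    # Sort by filename so the subsampling is deterministic.
    frames = sorted(p for p in src_dir.iterdir() if p.is_file())
    kept = frames[::10]  # keep every 10th frame
    for frame in kept:
        shutil.copy2(frame, dst_dir / frame.name)
    print(f"{src}: kept {len(kept)} of {len(frames)} frames")
```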

eriksandstroem commented 1 week ago

Hi again, only every 10th image should be used during training and inference (we train and test on different scenes). I don't remember how many frames these scenes contain in total, but if only 30 are saved when there should be around 300, it does sound like something is off, no?

BobrG commented 1 week ago

I do not know how many training images there should be either; it is just my assumption that the mistake could be here. What I would like to know is: should I run the save_every_10th_frame script on the data from these archives:

> Download the stonewall and copy room scenes of the Scene3D dataset available here.

Or on these:

> Next, download the MVS depth sensor for both scenes and ground truth SDF grid for the stonewall training scene. We were not able to construct a good ground truth SDF grid for the copy room scene thus only the F-score evaluation is accurate at test time. Download links: stonewall, copy room.

As provided, the script assumes the second layout.
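
For what it's worth, a quick file count over both candidate layouts would probably settle which archive the script is meant to run on (again just a sketch; the directory names are examples of what I see after extraction, so adjust the paths as needed):

```python
from pathlib import Path

# Sanity check: count the raw frames in each candidate directory, so it is
# clear which layout contains the full ~300-frame sequence before the
# every-10th-frame subsampling. Directory names are examples only.
CANDIDATES = [
    "stonewall/images",      # layout from the Scene3D archives (assumed)
    "images",                # layout from the MVS .tar files
    "stonewall_png/depth",
]

for d in CANDIDATES:
    p = Path(d)
    n = sum(1 for f in p.iterdir() if f.is_file()) if p.is_dir() else 0
    print(f"{d}: {n} files")
```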