EPFL-VILAB / omnidata

A Scalable Pipeline for Making Steerable Multi-Task Mid-Level Vision Datasets from 3D Scans [ICCV 2021]

Different results between using the pretrained model locally and using the web demo #19

Closed ryunuri closed 1 year ago

ryunuri commented 1 year ago

Thank you for sharing this wonderful work!

I am having trouble reproducing the depth results obtained from the web demo, even when I use the v2 weights of the pretrained depth estimator. Is there any difference between the web demo and the released checkpoint? It seems that the web demo produces a better result...

Ainaz99 commented 1 year ago

Hi @terryryu!

Have you tried using the code here to run the checkpoint? You need to do some post-processing on the depth outputs. Let me know if this doesn't resolve the issue.

ryunuri commented 1 year ago

Oh yes, I just followed the instructions in this link.

Is there any post-processing that needs to be done after running demo.py?

As an example, I attached the results I'm currently getting below:

| Input image | Depth from web demo | Depth from local |
| :---: | :---: | :---: |
| 000000 | 000000_depth | 000000_depth_local |
To-jak commented 1 year ago

Hello,

I was able to reproduce your results, @terryryu, by following the same instructions with your input image and the omnidata_dpt_depth_v2.ckpt model checkpoint.

I wanted to do another test with an image from the asset folder:

python demo.py --task depth --img_path assets/demo/test4.png --output_path assets/

I wonder if the post-processing code is complete, as there is quite a difference from the image used in the readme to demonstrate the results:

| Input image | Depth from web demo | Depth from local | Depth from readme |
| :---: | :---: | :---: | :---: |
| test4 | test4_web_demo | test4_local_demo | test4_readme |
saiedg commented 1 year ago

Sorry to ask this here, but is there any chance someone can show me how to run this locally? I have a Mac. Do you know if it's possible to run this on iOS? Thank you!

ryunuri commented 1 year ago

Hi @Ainaz99, sorry to bother you again, but could you give some help with the post-processing step?

I think the 'quality difference' may be just a result of a different clamp range.

In demo.py, lines 140–144:

        output = model(img_tensor).clamp(min=0, max=1)  # first clamp to [0, 1]

        if args.task == 'depth':
            output = F.interpolate(output.unsqueeze(0), (512, 512), mode='bicubic').squeeze(0)
            output = output.clamp(0, 1)  # second clamp to [0, 1]
            output = 1 - output

We can see that the final output is clamped to the range [0, 1] twice. I believe the first clamp is indeed needed to constrain the output to a valid range, but for the second clamp to have any effect, we may need to adjust its range, since the output values span a very small interval (around ~0.08?). I tested this idea on the image I posted here by reducing the clamp range, and I got a better result, although not identical to the web demo.
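For reference, here's a minimal sketch of what I tried (not the code in demo.py); it assumes `model` and `img_tensor` are set up as in demo.py, and the 0.08 upper bound is just a hand-picked value based on the range I observed for this image:

    import torch
    import torch.nn.functional as F

    with torch.no_grad():
        # First clamp, same as demo.py: keep the raw prediction in a valid range.
        output = model(img_tensor).clamp(min=0, max=1)
        output = F.interpolate(output.unsqueeze(0), (512, 512), mode='bicubic').squeeze(0)

        # Inspect the actual range of the prediction (roughly 0.0 .. 0.08 for my image).
        print(output.min().item(), output.max().item())

        # Clamp with a tighter, hand-picked upper bound and rescale to [0, 1]
        # instead of the second clamp(0, 1) in demo.py, then invert as before.
        upper = 0.08
        output = output.clamp(0, upper) / upper
        output = 1 - output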

I think @To-jak's result also supports this claim: the readme's depth has more detail in the foreground, while the web demo and local results have more detail in the background. This might be the result of a different clipping range.

So, I just wanted to ask whether my reasoning is correct, and how we should set the clamp range. Does this value need to be set by hand, or is there an automatic way to choose it?

Again, thank you for sharing this cool work, and thank you for your help!

Ainaz99 commented 1 year ago

Hi @terryryu!

Sorry for my late reply and for the confusion! I think you are right about the second clamp. What you can do instead is normalize the output to [0, 1] first.
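Something like this minimal sketch should work (assuming `output` is the depth tensor after the `F.interpolate` step in demo.py; this isn't the exact code we use):

    # Min-max normalize the prediction to [0, 1] instead of clamping with a
    # fixed range, then invert as in demo.py. The small epsilon avoids a
    # division by zero for a constant prediction.
    output = (output - output.min()) / (output.max() - output.min() + 1e-8)
    output = 1 - output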

And here's the reason for the difference between the GitHub demo and the checkpoint results: the GitHub demo results come from a model trained only on Omnidata, while the web demo and the final released checkpoints are trained on both Omnidata + MiDaS. The final models are generally more accurate (especially for outdoor scenes, since the MiDaS data is mostly outdoors while Omnidata is almost entirely indoors), but at the cost of losing some detail. Omnidata depths come from high-resolution meshes, while in MiDaS they are mostly computed with SfM methods, which don't capture fine detail very well. Hope this is clear enough!

ryunuri commented 1 year ago

Thank you so much for the explanation @Ainaz99, it made everything clear!

But, as a last question, I'd like to ask whether the final checkpoints will be released publicly. It seems that currently only the GitHub demo checkpoints are publicly released. Are there any plans for further checkpoint releases?

saiedg commented 1 year ago

@Ainaz99 Where can we download the Omnidata + MiDaS models?

alexsax commented 1 year ago

These are (and were!) available here: https://github.com/EPFL-VILAB/omnidata/tree/main/omnidata_tools/torch#pretrained-models