RozDavid / LanguageGroundedSemseg

Implementation for ECCV 2022 paper Language-Grounded Indoor 3D Semantic Segmentation in the Wild

Pre-training needs 3D annotations/supervision #2

Closed zhangjb416 closed 1 year ago

zhangjb416 commented 1 year ago

Hi! Amazing work! But I have a question about the pre-training setting.

This is a supervised pre-training method. It seems that heavy 3D annotations and supervision are needed, for example a semantic label for each point. With these annotations, a semantic segmentation model can already be trained well.

So I am wondering how much extra benefit introducing text information / CLIP can bring.

And is it entirely fair to compare with unsupervised pre-training methods, such as CSC (CVPR 2021)?

RozDavid commented 1 year ago

Hey @zhangjb416,

First of all, thanks a lot for the comment and the appreciation. You are right that this is indeed a supervised pretraining method, in contrast (pun intended) with CSC. In many situations that require unsupervised downstream tasks it wouldn't be a fair comparison, but we were focusing on the (at least partially) supervised semantic and instance segmentation challenges. In this sense, we argue that our method is still a fair comparison with CSC, as we only use the supervision that is already available for both pretraining and finetuning.

And the benefit that the text representation gives is a better-structured representation space - one where clusters of similar categories lie close to each other - which wouldn't be possible to learn from only the limited 3D data available in ScanNet or other 3D datasets.
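For illustration, here is a minimal sketch of what such a text-anchored objective can look like in PyTorch. This is not the repository's actual implementation; the function and argument names are made up, and it only shows the general idea of pulling per-point 3D features toward the frozen text embedding of their ground-truth class while pushing them away from the other class anchors.

```python
import torch
import torch.nn.functional as F

def text_anchored_loss(point_feats, labels, text_anchors, temperature=0.07):
    """Pull per-point 3D features toward the text embedding of their class.

    point_feats  : (N, D) features from the 3D backbone for N points
    labels       : (N,)   ground-truth semantic label per point
    text_anchors : (C, D) frozen text embeddings (e.g. from CLIP) of the C class names
    """
    # Cosine similarity between every point feature and every class anchor
    point_feats = F.normalize(point_feats, dim=-1)
    text_anchors = F.normalize(text_anchors, dim=-1)
    logits = point_feats @ text_anchors.t() / temperature  # (N, C)

    # Cross-entropy over the anchors attracts each point to its own class
    # anchor and repels it from the rest; since text embeddings of related
    # categories are already close, the 3D feature space inherits that structure.
    return F.cross_entropy(logits, labels)
```

The supervision used here is the same per-point labels a plain segmentation head would consume; the difference is that the class targets live in a shared language-feature space rather than being independent one-hot logits.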

Let me know if this answered your question. I am happy to continue the discussion if you have follow-up concerns.

Cheers, David

zhangjb416 commented 1 year ago

Thanks for the detailed response!

Now I understand that you pre-train the model only with the annotations that downstream tasks need (which is also mentioned in your paper). And I agree that the comparison is fair.

Still, I have a very small concern: for a pre-training method, it seems a little bit weird that the pre-training setting (such as which annotations are needed) is determined by the downstream tasks' setting :)

RozDavid commented 1 year ago

Right, I get your point, and you are right - in this setup we cannot be as general as fully unsupervised methods. On the other hand, that wasn't the top priority for this project; squeezing better results out of the available data for supervised scene understanding was.

Closing the issue now as the main concern seems to be discussed. Feel free to reopen for follow-ups though!