Chat-3D / Chat-Scene

A multi-modal large language model for 3D scene understanding, excelling in tasks such as 3D grounding, captioning, and question answering.

About the epochs in Training & Inference. #34

Open KaKa-101 opened 4 months ago

KaKa-101 commented 4 months ago

Hi, thanks for your great work. I have the following two questions:

  1. Why do you set epochs=3 for training and inference? Would you suggest setting it to a higher value (like 10, 20, etc.), and would that help improve the LLM's performance on tasks like grounding, Q&A, etc.?
  2. Could you provide the code to visualize the bounding boxes in the images? Thanks a lot again.
ZzZZCHS commented 4 months ago
  1. The epoch number is only set for training. We didn't have much time or resources to test different epoch counts. However, most recent MLLMs fine-tune for only one or two epochs (e.g., LLaVA 1.5), so I'd guess a higher epoch count wouldn't help much. Our practice is to set the epoch number to 3 and early-stop at the second epoch (a sketch of this pattern follows after this list). You can experiment with different epoch numbers to observe any differences. I'm also curious about this, so if you find anything noteworthy, please share it with us!
  2. For visualization, you can refer to this issue. We use MeshLab to visualize the generated PLY files (a sketch of exporting a box as a PLY also follows below).
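
On point 1, here is a minimal sketch of the "train for 3 epochs, early-stop after the second" pattern in plain PyTorch. It is illustrative only, not our actual training script; the toy model, data, and hyperparameters are made-up placeholders:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the real model and dataset (hypothetical placeholders).
model = nn.Linear(16, 4)
loader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,))),
    batch_size=8,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

NUM_EPOCHS = 3        # epoch count from the config
EARLY_STOP_EPOCH = 2  # in practice we stop after the second epoch

for epoch in range(NUM_EPOCHS):
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    # Save a checkpoint each epoch so the early-stopped one is kept.
    torch.save(model.state_dict(), f"ckpt_epoch{epoch + 1}.pt")
    if epoch + 1 >= EARLY_STOP_EPOCH:
        break  # early stop: keep the epoch-2 checkpoint
```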
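On point 2, since the linked issue isn't reproduced here, below is a minimal sketch of exporting a predicted box as a PLY file that MeshLab can open. It assumes Open3D and an axis-aligned box given as center + size; our own visualization code may differ, and the numbers are made-up placeholders:

```python
import numpy as np
import open3d as o3d

# Hypothetical predicted box: center (x, y, z) and size (w, h, d) in meters.
center = np.array([1.0, 2.0, 0.5])
size = np.array([0.8, 0.6, 1.2])

# create_box places the box's minimum corner at the origin, so shift it
# so that its center lands on the predicted center.
box = o3d.geometry.TriangleMesh.create_box(
    width=size[0], height=size[1], depth=size[2]
)
box.translate(center - size / 2.0)
box.paint_uniform_color([1.0, 0.0, 0.0])  # red, easy to spot in MeshLab

# Open bbox.ply together with the scene's PLY in MeshLab to inspect.
o3d.io.write_triangle_mesh("bbox.ply", box)
```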
KaKa-101 commented 3 months ago

Thanks for your reply. I see you evaluated the model on the Nr3D/Sr3D datasets (which originate from the ReferIt3D benchmark) in your paper, but the preprocessing code doesn't seem to cover Nr3D/Sr3D.

ZzZZCHS commented 3 months ago

We haven't evaluated the v2.1 model on Nr3D/Sr3D. We will add this in the future.

KaKa-101 commented 3 months ago

Thanks. Looking forward to your release~