RVT-2 model fails to converge with real-world data for a simple task

XiaohanLei commented 3 weeks ago

Problem Description

I'm attempting to train an RVT-2 model for a simple task: "lift the block". I've collected 10 real-world demonstrations for training, but the model shows no signs of convergence at all.

Environment

Task: Lift the block
Model: RVT-2 (Robotic View Transformer 2)
Data: 10 real-world demonstration samples

Attempts

So far, I've only tried training with the 10 collected samples.

Questions

Is this issue primarily due to insufficient data?
What other potential reasons could be causing the model to fail to converge?
For such a simple task, approximately how many samples might be needed to see convergence?
Are there any suggestions to improve the training process or data collection method?

Additional Information

Of the two attached images, the former is the point cloud and the latter is the rendered result.

Any help or advice would be greatly appreciated!

imankgoyal commented 3 weeks ago

Hi,

Thanks for your interest in our work. It seems like you are unable to fit the training data.

Is this issue primarily due to insufficient data? I don't think so.
What other potential reasons could be causing the model to fail to converge? Can you share the loss curve? I would start by exploring hyperparameters such as the learning rate and by disabling any augmentation and regularization. Also, are the rendered images the same as RVT's virtual images? Note that RVT has 5 virtual images, while RVT-2 has 3.
For such a simple task, approximately how many samples might be needed to see convergence? In our experiments, we found 10 to be enough for generalization. A lower number of samples should make convergence easier, not harder. More samples only help generalization, not train-time convergence.
Are there any suggestions to improve the training process or data collection method? Can you share some examples of collected data, i.e., the point cloud and the ground-truth robot pose?
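
As a rough illustration of the learning-rate advice above, the sketch below sweeps the learning rate on a toy problem with 10 samples and reports the final training loss. It is not RVT-2 code: the linear model, the data, the function name `final_train_loss`, and the step count are all assumptions made for illustration. It only shows that a learning rate that is too low can leave the training loss almost unchanged even on a tiny dataset.

```python
def final_train_loss(lr, steps=200):
    """Fit y = w * x on 10 fixed toy samples with plain gradient descent.

    Returns the final mean squared error on the training set.
    Hypothetical toy setup, unrelated to the RVT-2 training code.
    """
    xs = [i / 10 for i in range(1, 11)]   # 10 toy training inputs
    ys = [3.0 * x for x in xs]            # targets generated with slope 3
    n = len(xs)
    w = 0.0                               # single trainable parameter
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n


if __name__ == "__main__":
    # No augmentation or regularization: the only question is whether
    # the model can fit its own training data at each learning rate.
    for lr in (1e-5, 1e-3, 1e-1):
        print(f"lr={lr:g}  final train loss={final_train_loss(lr):.6f}")
```

If the loss barely moves at the smallest rate but collapses at the largest, the optimizer settings rather than the amount of data are the more likely culprit.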

XiaohanLei commented 3 weeks ago

I discovered that the problem was caused by my dataset being too small: with so few samples, training completed before the cosine learning-rate schedule had raised the learning rate much. In other words, the learning rate was too low. Thank you for your kind response.
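
The effect described here can be sketched with a generic schedule that warms up linearly and then follows a cosine decay. This is a minimal sketch, not the scheduler used in the RVT repository: the function names and the numbers (peak rate `1e-4`, 2000 warmup steps, 100 or 20000 total steps) are hypothetical values chosen for illustration. When the total number of training steps is smaller than the warmup length, the learning rate never gets close to its intended peak.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then cosine decay to zero.

    Illustrative schedule only; the real training code may differ.
    """
    if step < warmup_steps:
        # Warmup phase: learning rate grows linearly with the step index
        return peak_lr * (step + 1) / warmup_steps
    # Cosine phase: progress runs from 0 to 1 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


def peak_lr_seen(peak_lr, warmup_steps, total_steps):
    """Largest learning rate actually used during a run of total_steps steps."""
    return max(
        lr_at_step(s, peak_lr, warmup_steps, total_steps)
        for s in range(total_steps)
    )


if __name__ == "__main__":
    # Hypothetical numbers: a tiny dataset yields few total steps.
    short_run = peak_lr_seen(1e-4, warmup_steps=2000, total_steps=100)
    long_run = peak_lr_seen(1e-4, warmup_steps=2000, total_steps=20000)
    print(f"short run peak lr: {short_run:.2e}")
    print(f"long run peak lr:  {long_run:.2e}")
```

Under these assumptions, possible remedies are to shorten the warmup, train for more steps, or raise the base learning rate so that the short run still reaches a useful value.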

NVlabs / RVT