Hi,
Great work, thanks for sharing.
I can see that the camera extrinsics are recorded in the raw data, but beyond that I can't tell how they are used by the processed data or the model. Is that information used explicitly during training, or is training agnostic to it, with the model learning the different camera PoVs implicitly?
Thanks.
We do not feed camera information to our current models. Since we have co-training data from our target scene and viewpoint, this is not a problem: the model learns to "understand" the target viewpoint. That being said, see my comment on your question in the Octo repo: for DROID the viewpoint diversity is large, so a model that wants to fit the dataset well either needs to implicitly learn to register the camera with the base, or you need to provide the camera information as part of the model input (but then you need to calibrate the cameras for model evals). Again, this wasn't necessary for our tests, since we did not try to zero-shot a new viewpoint but had training data from our target scene + viewpoint.
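For concreteness, here is a minimal sketch of what providing the camera information as part of the model input could look like. This is purely illustrative: the `make_observation` helper and the flattened 3x4 pose encoding are assumptions for the example, not what our models or the DROID pipeline actually do.

```python
import numpy as np

def make_observation(rgb, proprio, extrinsics=None):
    """Assemble a policy observation (hypothetical helper).

    rgb:        (H, W, 3) uint8 image from the camera
    proprio:    (D,) proprioceptive robot state
    extrinsics: optional (4, 4) camera-to-base homogeneous transform;
                if None, the policy must register the viewpoint implicitly.
    """
    obs = {"image": rgb, "proprio": proprio}
    if extrinsics is not None:
        # Encode the pose as the flattened top 3x4 block of the transform
        # (rotation + translation) and feed it as an extra input vector.
        obs["camera_pose"] = extrinsics[:3, :].reshape(-1).astype(np.float32)
    return obs

# Example: a camera mounted 0.5 m in front of and 0.4 m above the base.
T_cam_base = np.eye(4)
T_cam_base[:3, 3] = [0.5, 0.0, 0.4]
obs = make_observation(
    rgb=np.zeros((224, 224, 3), dtype=np.uint8),
    proprio=np.zeros(7, dtype=np.float32),
    extrinsics=T_cam_base,
)
```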
Thanks, Karl. Understood. But in that case, aren't we losing valuable information that could transfer between the different viewpoint variations (not only in the zero-shot case), instead of making the model learn it from scratch? Don't you think so?
Another thought: it could potentially also boost training across the multiple viewpoints within the same setup.
Including extrinsics as a policy input can definitely make sense. The two tradeoffs are that (1) you'll need to calibrate your cameras at test time (and re-calibrate if they ever get bumped), and (2) some of the extrinsics data in the dataset may be noisy, so your policy will need to learn to deal with this noisy input.
But overall, agreed that this is definitely a reasonable thing to try!
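If you do go this route, one common way to address tradeoff (2) is to jitter the extrinsics during training so the policy learns to tolerate calibration-scale noise. A rough sketch (the noise magnitudes below are made-up defaults; tune them to your dataset):

```python
import numpy as np

def perturb_extrinsics(T, rot_std_rad=0.02, trans_std_m=0.01, rng=None):
    """Return a copy of the 4x4 extrinsic matrix T with a small random
    rotation and translation applied, mimicking calibration noise."""
    rng = np.random.default_rng() if rng is None else rng
    # Random small-angle rotation via Rodrigues' formula.
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.normal(scale=rot_std_rad)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R_noise = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    T_noisy = T.copy()
    T_noisy[:3, :3] = R_noise @ T[:3, :3]
    T_noisy[:3, 3] += rng.normal(scale=trans_std_m, size=3)
    return T_noisy
```

Applied per training sample, this also softens the test-time calibration requirement somewhat, since the policy no longer expects perfectly accurate extrinsics.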
Makes total sense. I appreciate your input.