Closed: sriharshakoribille closed this issue 1 year ago
@zephirefaith can comment on this. We kept most parameters the same; are you referring to the Stretch or the Franka experiments? For Stretch we used the head camera. For Franka we used a camera on the end of the arm and just had it move to a couple of predefined "look" poses on one side of a table.
I was curious about the experiments on Stretch (Table 3 in the paper), since they involved mobile manipulation. Since the Stretch robot would be moving, would PerAct's input voxel grid be static w.r.t. the world, or would it be dynamic and move with the robot? And were the voxel grid dimensions increased to cover the mobile manipulation scene?
It's static w.r.t. the robot base, so it will always be a fixed volume near the robot. Unfortunately we didn't end up doing whole scenes, although that's part of the plan. If you're interested in that direction, I think 3D-LLM has a similar architecture with a Perceiver backbone: https://vis-www.cs.umass.edu/3dllm/
So this seems to indicate the approach will scale.
If you wanted to cover whole scenes, you would apply this to a voxel grid in world coordinates.
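As a sketch of what a world-frame grid would involve: transform the observed points from the robot's base frame into the world frame using the base pose before voxelizing, so the grid stays fixed as the robot drives around. This is a minimal illustration, not code from the repo; the planar (x, y, yaw) base-pose parameterization and the helper names are assumptions.

```python
import numpy as np

def make_base_to_world(base_xy, base_yaw):
    """Homogeneous transform from robot-base frame to world frame.
    Hypothetical helper: assumes a planar base pose (x, y, yaw)."""
    c, s = np.cos(base_yaw), np.sin(base_yaw)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]  # rotation about the vertical axis
    T[:2, 3] = base_xy             # base position in the world
    return T

def points_to_world(points_base, T_base_to_world):
    """Map an (N, 3) point cloud from base frame into world frame."""
    homog = np.hstack([points_base, np.ones((len(points_base), 1))])
    return (T_base_to_world @ homog.T).T[:, :3]
```

With points expressed in the world frame, the voxel grid origin no longer needs to track the robot, at the cost of a much larger volume to cover.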
Ah okay. Thank you for the information and additional directions! @cpaxton
@sriharshakoribille Glad to know you found our research useful! Thanks for responding @cpaxton, but to add some more detail: on Stretch we changed PerAct's input voxel volume to 1.5 m x 1.5 m x 1.5 m, which is coarser than the one used for Franka. This is due both to the different embodiment and to the unconstrained scene geometry w.r.t. the robot's base.
Our experiments use PerAct in an open-loop manner, where every prediction is made with respect to the first scene and the first end-effector position w.r.t. the base frame. Hope that clarifies things somewhat. Please do not hesitate to reach out again and tag me for details.
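A minimal sketch of that fixed-volume input, assuming a simple occupancy-style voxelization of a 1.5 m cube attached to the base frame. The grid resolution and cube origin below are illustrative assumptions, not the actual SLAP/PerAct values (only the 1.5 m edge length comes from the discussion above).

```python
import numpy as np

def voxelize(points, volume_size=1.5, grid_dim=100, origin=None):
    """Scatter an (N, 3) point cloud (in the robot's base frame) into a
    boolean occupancy grid covering a fixed volume_size^3 cube.
    grid_dim and the default origin are illustrative choices."""
    if origin is None:
        # Assumption: center the cube on the base in x/y, floor at z = 0.
        origin = np.array([-volume_size / 2, -volume_size / 2, 0.0])
    voxel_size = volume_size / grid_dim
    idx = np.floor((points - origin) / voxel_size).astype(int)
    # Discard points that fall outside the fixed volume.
    inside = np.all((idx >= 0) & (idx < grid_dim), axis=1)
    grid = np.zeros((grid_dim, grid_dim, grid_dim), dtype=bool)
    grid[tuple(idx[inside].T)] = True
    return grid
```

Because the volume is fixed w.r.t. the base and predictions are open-loop, every input is voxelized the same way from the first observation; nothing in the grid tracks the robot's subsequent motion.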
Hello,
Thank you for sharing your amazing research! Great work!
Regarding your paper on SLAP, could you please elaborate on how PerAct and SLAP were deployed for mobile manipulation? Specifically for PerAct, was the voxel grid size increased to cover the entire environment? And what camera views were used to collect the data?