This is the current design, containing the major decisions for the project. Additional future work and improvements that are not part of this design are listed in https://github.com/AndrejOrsula/drl_grasping/issues/51 and might eventually be included in the implementation if time allows.
Setup
Reproducible setup in simulation and a test setup in real life. Ignition Gazebo was selected as the robotics simulator and is used for training the RL agent.
Task
Grasping in its simplest form can be conceptually decoupled into the following sub-routines. The RL agent should aspire to learn steps 1-3, with the 4th step being determined by the surrounding application, e.g. success or a maximum number of steps. The agent can also learn additional steps through exploration, e.g. pushing and pulling objects in order to provide better grasping conditions.
1. Move end effector (gripper) to pre-grasp pose (the pose must be determined from sensory observations)
2. Close gripper
3. Lift object above the supporting surface (make sure the grasp is secure; a minimal check is sketched after this list)
4. Terminate and allow other tasks/processes to execute (outside the scope of the agent's policy, but it needs to be determined both in simulation and in real life)
These stages slightly inspired the stages used in curriculum learning, see https://github.com/AndrejOrsula/drl_grasping/issues/62.
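As a grounding example for steps 3-4, the snippet below sketches how a lift-based success check and the resulting termination could look; the height threshold, step budget and function names are illustrative assumptions, not the project's actual API.

```python
# Illustrative check for steps 3-4; names and values are hypothetical.
LIFT_HEIGHT = 0.125      # metres above the supporting surface considered a secure lift (assumed value)
MAX_EPISODE_STEPS = 100  # assumed step budget per episode

def is_grasp_successful(object_z: float, surface_z: float, gripper_closed: bool) -> bool:
    """True once the object is held and lifted clear of the supporting surface."""
    return gripper_closed and (object_z - surface_z) > LIFT_HEIGHT

def should_terminate(success: bool, step: int) -> bool:
    """Terminate on success or when the maximum number of steps is reached."""
    return success or step >= MAX_EPISODE_STEPS
```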
Control loop
The main control loop of the agent runs at a relatively low frequency (~2.5 Hz); a minimal sketch of this loop follows the step list below.
Low-level controllers, e.g. joint PIDs, and sensors, e.g. the RGB-D camera, run at a higher update rate than the agent (~200 Hz for control, >=15 Hz for sensing).
Get observations
Predict actions
Execute actions (simultaneously)
Move arm to the new configuration
Execute gripper action
This action might be executed much faster than arm movement
Repeat until termination (success or max steps)
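A minimal sketch of this closed-loop scheme, assuming a Gym-style environment whose step() internally drives the faster low-level controllers and sensors; all interfaces and the step budget are illustrative.

```python
import time

AGENT_PERIOD = 1.0 / 2.5  # the agent's control loop runs at roughly 2.5 Hz

def run_episode(env, agent, max_steps: int = 100):
    """Closed-loop grasping episode; env and agent interfaces are hypothetical."""
    observation = env.reset()                              # get initial observations
    for step in range(max_steps):
        start = time.time()
        action, _ = agent.predict(observation)             # predict actions
        observation, reward, done, info = env.step(action) # execute actions: move arm + gripper
        if done:                                           # repeat until success or max steps
            break
        # Joint controllers (~200 Hz) and the RGB-D camera (>=15 Hz) keep running
        # between agent steps; here we simply wait out the remainder of the period.
        time.sleep(max(0.0, AGENT_PERIOD - (time.time() - start)))
```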
Another approach would be to decompose the task into sensing, planning and execution (e.g. the robot action would consist only of a grasp pose and everything else would be performed outside the agent's policy), or to remove the gripper action from the control loop and perform the grasp once the episode terminates or a certain Z position is reached, e.g. https://arxiv.org/pdf/1802.10264.pdf.
However, 'dynamic' closed-loop control was selected because it more closely resembles what humans do.
RL Algorithm
Decided to use Truncated Quantile Critics (TQC), which is derived from Soft Actor-Critic (SAC)
Using the stable-baselines3 implementation (TQC is currently in sb3-contrib)
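For reference, a minimal sketch of training TQC through sb3-contrib; the environment and hyperparameters below are placeholders rather than the ones used in drl_grasping.

```python
import gym
from sb3_contrib import TQC  # TQC lives in sb3-contrib, not in core stable-baselines3

# Placeholder continuous-control environment; drl_grasping registers its own grasping envs.
env = gym.make("Pendulum-v1")

model = TQC("MlpPolicy", env, top_quantiles_to_drop_per_net=2, verbose=1)
model.learn(total_timesteps=10_000)
```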
Actions
List of actions that the agent is allowed to take; together they must provide the ability to accomplish the task successfully. All of these are part of a single action-space vector (a decoding sketch is given at the end of this section).
End effector pose
Position
Absolute/Relative
[ ] Absolute (direct target, in world frame)
[x] Relative (relative target, in end effector frame)
Selected because it is much more popular in the literature and can be specified with normalised limits
Orientation
Absolute/Relative
[ ] Absolute
[x] Relative
Number of DOF - 1D (around Z) || full 3D
Use orientation only around Z at first (more popular in the literature)
Then try to use full 3D (Note: Implemented, but not used yet... might try later)
Representation (3D)
[ ] Quaternion
[ ] Rotation matrix
[x] "6D representation"
Position (Relative)
Dimension
(x, y, z)
Limits
Normalised [-1, 1]
Scaled into smaller metric steps before use, e.g. [-0.1 m, 0.1 m] (see the sketch below)
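A minimal sketch of the scaling described above; the 0.1 m step size is the example value from this list, and the function name is illustrative.

```python
import numpy as np

MAX_POSITION_STEP = 0.1  # metres; maps the normalised range [-1, 1] to [-0.1 m, 0.1 m]

def scale_relative_position(action_xyz: np.ndarray) -> np.ndarray:
    """Clip the (x, y, z) part of the action to [-1, 1] and scale it to a metric step."""
    return np.clip(action_xyz, -1.0, 1.0) * MAX_POSITION_STEP
```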
Orientation (Relative)
Dimension
z || (x, y, z, w) || R[3x3] || v1(3x1), v2(3x1)
Limits
Normalised [-1, 1]
Converted into a normalised quaternion before use, regardless of the original representation (see the sketch below)
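A sketch of converting the orientation action into a normalised quaternion for both the 1D (around Z) and the "6D representation" cases, using SciPy for the final conversion; the yaw step size is an assumed value.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def yaw_action_to_quaternion(z: float, max_yaw_step: float = 0.2) -> np.ndarray:
    """1D case: a normalised action in [-1, 1] becomes a small rotation around Z
    (max_yaw_step, in radians, is an assumed scaling, not the project's value)."""
    yaw = np.clip(z, -1.0, 1.0) * max_yaw_step
    return Rotation.from_euler("z", yaw).as_quat()  # (x, y, z, w), unit length

def sixd_action_to_quaternion(v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
    """3D case: Gram-Schmidt orthonormalisation of the two 3x1 vectors of the
    "6D representation" into a rotation matrix, then conversion to a quaternion."""
    b1 = v1 / np.linalg.norm(v1)
    b2 = v2 - np.dot(b1, v2) * b1
    b2 /= np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return Rotation.from_matrix(np.column_stack((b1, b2, b3))).as_quat()
```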
Gripper
Absolute/Relative
[x] Absolute (desired gripper state)
This seems to be preferred in the literature, i.e. a binary signal for open/close
Gripper (Absolute)
Force (optional, not needed in simplified case - use max; added to Future Works (FW) https://github.com/AndrejOrsula/drl_grasping/issues/51)
Width (optional, not needed in simplified case - use min/max)
Gripper (Relative)
Force (optional, not needed in simplified case - use max)
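To tie the Actions section together, a sketch of decoding the single normalised action vector in the simplified case (relative position, 1D orientation around Z, absolute binary gripper); the element ordering, step size and gripper threshold are assumptions for illustration.

```python
import numpy as np

def decode_action(action: np.ndarray):
    """Split a 5-element normalised action vector into its components (assumed layout)."""
    position = np.clip(action[0:3], -1.0, 1.0) * 0.1  # relative position step in metres
    yaw = np.clip(action[3], -1.0, 1.0)               # relative rotation around Z, still normalised
    gripper_closed = bool(action[4] < 0.0)            # absolute gripper state: close when negative
    return position, yaw, gripper_closed
```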
Observations
Octree of the scene
End effector pose
Position
Orientation
Gripper state
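A sketch of how the observations listed above could be grouped into a single structure handed to the agent; the keys and value formats are illustrative, and the actual octree encoding is produced by drl_grasping's own pipeline.

```python
import numpy as np

def compose_observation(octree: np.ndarray, ee_position: np.ndarray,
                        ee_orientation: np.ndarray, gripper_open: bool) -> dict:
    """Group the scene octree with proprioceptive state (illustrative keys and formats)."""
    return {
        "octree": octree,                                     # serialised octree of the scene
        "ee_position": ee_position.astype(np.float32),        # (x, y, z)
        "ee_orientation": ee_orientation.astype(np.float32),  # quaternion (x, y, z, w)
        "gripper_state": np.float32(1.0 if gripper_open else 0.0),
    }
```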
Reward function
Ongoing epic: https://github.com/AndrejOrsula/drl_grasping/issues/41
Sparse (shaped)
Reward multiplier r (currently r = 4.0)
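A heavily simplified sketch of a sparse (shaped) reward scaled by the multiplier r; the actual events and weights are part of the ongoing epic linked above and are not reproduced here.

```python
REWARD_MULTIPLIER = 4.0  # the r = 4.0 mentioned above

def compute_reward(grasped: bool, lifted: bool) -> float:
    """Sparse reward: pay out only on key grasping events (the events are assumed here)."""
    reward = 0.0
    if grasped:
        reward += 1.0
    if lifted:
        reward += 1.0
    return REWARD_MULTIPLIER * reward
```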
Policy (network architecture)
Currently, using depth=4 and full_depth=2
Feature Extractor (shared between actor and critics): OctreeCnnFeatureExtractor
See OctreeCnnFeatureExtractor for more details, e.g. number of channels.
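Stable-Baselines3 allows a custom feature extractor to be plugged into the policy via policy_kwargs; the sketch below shows how the octree extractor could be wired in, but the import path and the exact extractor arguments are assumptions.

```python
from sb3_contrib import TQC
from drl_grasping import OctreeCnnFeatureExtractor  # hypothetical import path, for illustration only

policy_kwargs = dict(
    features_extractor_class=OctreeCnnFeatureExtractor,
    # Whether depth/full_depth are passed like this is an assumption based on the values above.
    features_extractor_kwargs=dict(depth=4, full_depth=2),
)
model = TQC("MlpPolicy", env, policy_kwargs=policy_kwargs)  # env: the grasping environment
```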
Domain randomisation
Currently, the following domain randomisation can be applied in the simulation