This is the current design, containing the major decisions for the project. Additional future work and improvements that are not part of this design are listed in https://github.com/AndrejOrsula/drl_grasping/issues/51 and might eventually be included in the implementation if time allows.
Setup
Reproducible setup in simulation and a test setup in real life. Ignition Gazebo was selected as the robotics simulator and is used for training the RL agent.
Task
Grasping in its simplest form can be conceptually decoupled into the following sub-routines. The RL agent should aspire to learn steps 1-3, with the 4th step being determined by the surrounding application, e.g. success or a maximum number of steps. The agent can also learn additional steps through exploration, e.g. pushing and pulling objects in order to provide better grasping conditions.
1. Move end effector (gripper) to pre-grasp pose (the pose must be determined from sensory observations)
2. Close gripper
3. Lift object above the supporting surface (make sure the grasp is secure; a minimal check is sketched after this list)
4. Terminate and allow other tasks/processes to execute (outside the scope of the agent's policy, but it needs to be determined both in simulation and in real life)
These stages slightly inspired the stages used in curriculum learning, see https://github.com/AndrejOrsula/drl_grasping/issues/62.
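As a grounding example for steps 3-4, the snippet below sketches how a lift-based success check and the resulting termination could look; the height threshold, step budget and function names are illustrative assumptions, not the project's actual API.

```python
# Illustrative check for steps 3-4; names and values are hypothetical.
LIFT_HEIGHT = 0.125      # metres above the supporting surface considered a secure lift (assumed value)
MAX_EPISODE_STEPS = 100  # assumed step budget per episode

def is_grasp_successful(object_z: float, surface_z: float, gripper_closed: bool) -> bool:
    """True once the object is held and lifted clear of the supporting surface."""
    return gripper_closed and (object_z - surface_z) > LIFT_HEIGHT

def should_terminate(success: bool, step: int) -> bool:
    """Terminate on success or when the maximum number of steps is reached."""
    return success or step >= MAX_EPISODE_STEPS
```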
Control loop
The main control loop of the agent runs at a relatively low frequency (~2.5 Hz); a minimal sketch of this loop follows the step list below.
Low-level controllers, e.g. joint PIDs, and sensors, e.g. the RGB-D camera, run at a higher update rate than the agent (~200 Hz for control, >=15 Hz for sensing).
Get observations
Predict actions
Execute actions (simultaneously)
Move arm to the new configuration
Execute gripper action
This action might be executed much faster than arm movement
Repeat until termination (success or max steps)
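A minimal sketch of this closed-loop scheme, assuming a Gym-style environment whose step() internally drives the faster low-level controllers and sensors; all interfaces and the step budget are illustrative.

```python
import time

AGENT_PERIOD = 1.0 / 2.5  # the agent's control loop runs at roughly 2.5 Hz

def run_episode(env, agent, max_steps: int = 100):
    """Closed-loop grasping episode; env and agent interfaces are hypothetical."""
    observation = env.reset()                              # get initial observations
    for step in range(max_steps):
        start = time.time()
        action, _ = agent.predict(observation)             # predict actions
        observation, reward, done, info = env.step(action) # execute actions: move arm + gripper
        if done:                                           # repeat until success or max steps
            break
        # Joint controllers (~200 Hz) and the RGB-D camera (>=15 Hz) keep running
        # between agent steps; here we simply wait out the remainder of the period.
        time.sleep(max(0.0, AGENT_PERIOD - (time.time() - start)))
```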
Another approach would be to decompose the task into sensing, planning and execution (e.g. the robot action would consist only of a grasp pose and everything else would be performed outside the agent's policy), or to remove the gripper action from the control loop and perform the grasp once the episode terminates or a certain Z position is reached, e.g. https://arxiv.org/pdf/1802.10264.pdf.
However, 'dynamic' closed-loop control was selected because it more closely resembles what humans do.
RL Algorithm
Decided to use Truncated Quantile Critics (TQC), which is derived from Soft Actor-Critic (SAC)
Using the stable-baselines3 implementation (TQC is currently in sb3-contrib)
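For reference, a minimal sketch of training TQC through sb3-contrib; the environment and hyperparameters below are placeholders rather than the ones used in drl_grasping.

```python
import gym
from sb3_contrib import TQC  # TQC lives in sb3-contrib, not in core stable-baselines3

# Placeholder continuous-control environment; drl_grasping registers its own grasping envs.
env = gym.make("Pendulum-v1")

model = TQC("MlpPolicy", env, top_quantiles_to_drop_per_net=2, verbose=1)
model.learn(total_timesteps=10_000)
```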
Actions
List of actions that the agent is allowed to take; together they must provide the ability to accomplish the task successfully. All of these are part of a single action-space vector (a decoding sketch is given at the end of this section).
End effector pose
Position
Absolute/Relative
[ ] Absolute (direct target, in world frame)
[x] Relative (relative target, in end effector frame)
Selected because it is much more popular in the literature and can be specified with normalised limits
Orientation
Absolute/Relative
[ ] Absolute
[x] Relative
Number of DOF - 1D (around Z) || full 3D
Use orientation only around Z at first (more popular in the literature)
Then try to use full 3D (Note: Implemented, but not used yet... might try later)
Representation (3D)
[ ] Quaternion
[ ] Rotation matrix
[x] "6D representation"
Position (Relative)
Dimension
(x, y, z)
Limits
Normalised [-1, 1]
Scaled into smaller metric steps before use, e.g. [-0.1 m, 0.1 m] (see the sketch below)
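A minimal sketch of the scaling described above; the 0.1 m step size is the example value from this list, and the function name is illustrative.

```python
import numpy as np

MAX_POSITION_STEP = 0.1  # metres; maps the normalised range [-1, 1] to [-0.1 m, 0.1 m]

def scale_relative_position(action_xyz: np.ndarray) -> np.ndarray:
    """Clip the (x, y, z) part of the action to [-1, 1] and scale it to a metric step."""
    return np.clip(action_xyz, -1.0, 1.0) * MAX_POSITION_STEP
```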
Orientation (Relative)
Dimension
z || (x, y, z, w) || R[3x3] || v1(3x1), v2(3x1)
Limits
Normalised [-1, 1]
Converted into a normalised quaternion before use, regardless of the original representation (see the sketch below)
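A sketch of converting the orientation action into a normalised quaternion for both the 1D (around Z) and the "6D representation" cases, using SciPy for the final conversion; the yaw step size is an assumed value.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def yaw_action_to_quaternion(z: float, max_yaw_step: float = 0.2) -> np.ndarray:
    """1D case: a normalised action in [-1, 1] becomes a small rotation around Z
    (max_yaw_step, in radians, is an assumed scaling, not the project's value)."""
    yaw = np.clip(z, -1.0, 1.0) * max_yaw_step
    return Rotation.from_euler("z", yaw).as_quat()  # (x, y, z, w), unit length

def sixd_action_to_quaternion(v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
    """3D case: Gram-Schmidt orthonormalisation of the two 3x1 vectors of the
    "6D representation" into a rotation matrix, then conversion to a quaternion."""
    b1 = v1 / np.linalg.norm(v1)
    b2 = v2 - np.dot(b1, v2) * b1
    b2 /= np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return Rotation.from_matrix(np.column_stack((b1, b2, b3))).as_quat()
```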
Gripper
Absolute/Relative
[x] Absolute (desired gripper state)
This seems to be preferred in the literature, i.e. a binary signal for open/close
Gripper (Absolute)
Force (optional, not needed in simplified case - use max; added to Future Works (FW) https://github.com/AndrejOrsula/drl_grasping/issues/51)
Width (optional, not needed in simplified case - use min/max)
Gripper (Relative)
Force (optional, not needed in simplified case - use max)
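To tie the Actions section together, a sketch of decoding the single normalised action vector in the simplified case (relative position, 1D orientation around Z, absolute binary gripper); the element ordering, step size and gripper threshold are assumptions for illustration.

```python
import numpy as np

def decode_action(action: np.ndarray):
    """Split a 5-element normalised action vector into its components (assumed layout)."""
    position = np.clip(action[0:3], -1.0, 1.0) * 0.1  # relative position step in metres
    yaw = np.clip(action[3], -1.0, 1.0)               # relative rotation around Z, still normalised
    gripper_closed = bool(action[4] < 0.0)            # absolute gripper state: close when negative
    return position, yaw, gripper_closed
```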
Observations
Octree of the scene
End effector pose
Position
Orientation
Gripper state
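A sketch of how the observations listed above could be grouped into a single structure handed to the agent; the keys and value formats are illustrative, and the actual octree encoding is produced by drl_grasping's own pipeline.

```python
import numpy as np

def compose_observation(octree: np.ndarray, ee_position: np.ndarray,
                        ee_orientation: np.ndarray, gripper_open: bool) -> dict:
    """Group the scene octree with proprioceptive state (illustrative keys and formats)."""
    return {
        "octree": octree,                                     # serialised octree of the scene
        "ee_position": ee_position.astype(np.float32),        # (x, y, z)
        "ee_orientation": ee_orientation.astype(np.float32),  # quaternion (x, y, z, w)
        "gripper_state": np.float32(1.0 if gripper_open else 0.0),
    }
```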
Reward function
Ongoing epic: https://github.com/AndrejOrsula/drl_grasping/issues/41
Sparse (shaped)
Reward multiplier r (currently r = 4.0)
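A heavily simplified sketch of a sparse (shaped) reward scaled by the multiplier r; the actual events and weights are part of the ongoing epic linked above and are not reproduced here.

```python
REWARD_MULTIPLIER = 4.0  # the r = 4.0 mentioned above

def compute_reward(grasped: bool, lifted: bool) -> float:
    """Sparse reward: pay out only on key grasping events (the events are assumed here)."""
    reward = 0.0
    if grasped:
        reward += 1.0
    if lifted:
        reward += 1.0
    return REWARD_MULTIPLIER * reward
```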
Policy (network architecture)
Currently, using depth=4 and full_depth=2
Feature Extractor (shared between actor and critics): OctreeCnnFeatureExtractor
See OctreeCnnFeatureExtractor for more details, e.g. number of channels.
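Stable-Baselines3 allows a custom feature extractor to be plugged into the policy via policy_kwargs; the sketch below shows how the octree extractor could be wired in, but the import path and the exact extractor arguments are assumptions.

```python
from sb3_contrib import TQC
from drl_grasping import OctreeCnnFeatureExtractor  # hypothetical import path, for illustration only

policy_kwargs = dict(
    features_extractor_class=OctreeCnnFeatureExtractor,
    # Whether depth/full_depth are passed like this is an assumption based on the values above.
    features_extractor_kwargs=dict(depth=4, full_depth=2),
)
model = TQC("MlpPolicy", env, policy_kwargs=policy_kwargs)  # env: the grasping environment
```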
Domain randomisation
Currently, the following domain randomisation can be applied in the simulation