Open bkpcoding opened 3 months ago
Location in document: undefined
Selected HTML:
We model the learning task as a Markov Process (MP) defined by of states , actions , and the transition probability . Note that, unlike RL, we do not have access to an environment reward. We have access to human expert demonstrations in the training environment consisting of a set of trajectories, , of states and actions at every time step . The underlying expert policy, , is unknown. The goal is to learn the agent's policy , that best approximates the expert policy . Each expert trajectory , consists of states and action pairs:
(1)
The human decision-making process of the expert is unknown and likely non-Markovian [32]; thus, imitation learning performance can deteriorate with human trajectories [6].
The Behavior Transformer (BeT) processes the trajectory as a sequence of 2 types of inputs: states and actions. The original BeT implementation [11] employed a mixture of Gaussians to model a dataset with multimodal behavior. For simplicity and to reduce the computational burden, we instead use a unimodal BeT that uses a deterministic similar to the one originally used by the Decision Transformer [7]. However, since residual policies can be added to black-box policies [14], BeTAIL's residual policy could be easily added to the k modes present in the original BeT implementation.
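The quoted passage says that a residual policy can be layered on top of any black-box policy. The idea can be shown with a minimal Python sketch. It assumes that the base policy (for example, a unimodal BeT) is a frozen, deterministic function from state to action, and that the residual correction is squashed with `tanh` and scaled by a bound `alpha`. The names `make_residual_policy`, `base_policy`, `residual_net`, and `alpha` are hypothetical. This is an illustration, not the BeTAIL implementation.

```python
import math
from typing import Callable, List, Sequence

Action = List[float]


def make_residual_policy(
    base_policy: Callable[[Sequence[float]], Action],
    residual_net: Callable[[Sequence[float], Action], Action],
    alpha: float,
) -> Callable[[Sequence[float]], Action]:
    """Wrap a frozen black-box base policy with a bounded residual correction.

    Hypothetical sketch: final action = base action + alpha * tanh(residual output),
    so each action dimension moves at most `alpha` away from the base action.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")

    def policy(state: Sequence[float]) -> Action:
        # The base policy is only queried, never modified (black-box access).
        base_action = base_policy(state)
        # The residual sees both the state and the base action it corrects.
        raw = residual_net(state, base_action)
        return [a + alpha * math.tanh(r) for a, r in zip(base_action, raw)]

    return policy
```

With a multimodal base policy, the same wrapper could in principle be applied to the action proposed by each of the k modes. Because the base policy is treated as a plain callable, the residual does not depend on how that policy is parameterized.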
Hello @bkpcoding, thanks for the issue report! We are reviewing your report and will address it as soon as possible.
Description
Black font on a black background makes the text impossible to read.
(Optional:) Please add any files, screenshots, or other information here.
No response
(Required) What is this issue most closely related to? Select one.
Choose One
Internal issue ID
2f14f2ae-697f-4aa2-bb7c-90f4ecbb9ea9
Paper URL
https://arxiv.org/html/2402.14194v1
Browser
Chrome/126.0.0.0
Device Type
Desktop