Appealing recipe: Pre-train via self-supervised or unsupervised objectives on large and diverse datasets without ground truth labels. Adapt to downstream tasks via prompting, few-shot learning, or fine-tuning. The pre-trained generalist policies are called foundation policies.
Open question in RL: finding the best policy pre-training objective from data.
Prior works: BC (requires expert demonstrations), goal-conditioned RL (GCRL), unsupervised skill discovery
Related work:
Offline RL
Unlabeled trajectory data $\mathcal{D}$, which consists of state-action trajectories.
Pretrain $\pi(a \vert s, z)$, where $z \in \mathcal{Z}$ denotes a latent vector (task or skill).
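As a rough illustration of what such a latent-conditioned policy looks like, here is a minimal PyTorch sketch. The network sizes, the Gaussian head, and all dimensions are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a latent-conditioned policy pi(a | s, z).
# All shapes and the Gaussian head are illustrative assumptions.
import torch
import torch.nn as nn

class LatentConditionedPolicy(nn.Module):
    def __init__(self, state_dim: int, latent_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),  # mean of a Gaussian action distribution
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor, z: torch.Tensor) -> torch.distributions.Normal:
        mean = self.net(torch.cat([state, z], dim=-1))
        return torch.distributions.Normal(mean, self.log_std.exp())

# Usage: sample actions for a batch of states under a fixed skill/task vector z.
policy = LatentConditionedPolicy(state_dim=17, latent_dim=8, action_dim=6)
s, z = torch.randn(32, 17), torch.randn(32, 8)
actions = policy(s, z).sample()  # shape: (32, 6)
```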
A linear space with an inner product is called an inner product space or pre-Hilbert space. Any inner product space is a normed linear space.
A Hilbert space $\mathcal{Z}$ is a complete vector space equipped with an inner product $\langle x, y \rangle$, the induced norm $\Vert x \Vert = \sqrt{\langle x, x \rangle}$, and the induced metric $d(x, y) = \Vert x - y \Vert$ for $x, y \in \mathcal{Z}$.
In particular, every Hilbert space is a Banach space with respect to the norm. A Banach space is a normed linear space that is a complete metric space with respect to the metric derived from its norm.
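To make the induced norm and metric concrete, here is a tiny numerical illustration on $\mathbb{R}^n$ with the standard dot product (itself a Hilbert space); it is only an example, not code from the paper.

```python
# The inner product induces a norm and a metric, mirroring the definitions above.
import numpy as np

def inner(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.dot(x, y))      # <x, y>

def norm(x: np.ndarray) -> float:
    return np.sqrt(inner(x, x))     # ||x|| = sqrt(<x, x>)

def metric(x: np.ndarray, y: np.ndarray) -> float:
    return norm(x - y)              # d(x, y) = ||x - y||

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(inner(x, y), norm(x), metric(x, y))  # 1.0, ~2.236, ~3.606
```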
Train a representation $\phi : \mathcal{S} \rightarrow \mathcal{Z}$ so that the latent-space distance matches the optimal temporal distance $d^*(s, g)$ (the minimal number of steps from $s$ to $g$):
$$ d^*(s,g)=\Vert \phi(s) - \phi(g) \Vert $$
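One way to fit such a $\phi$ from offline data is to parameterize a goal-conditioned value function $V(s, g) = -\Vert \phi(s) - \phi(g) \Vert$ and regress it toward a temporal-difference target with reward $-1$ per step until the goal is reached, so that $-V(s, g)$ approximates $d^*(s, g)$. The sketch below assumes this plain TD recipe with a target network; it is a simplified stand-in, not the exact objective of the referenced paper.

```python
# Sketch: train phi so that ||phi(s) - phi(g)|| approximates the temporal distance d*(s, g).
# Assumption: V(s, g) = -||phi(s) - phi(g)|| is regressed toward a TD target with reward -1
# per step until the goal is reached; this is a simplified recipe, not the paper's exact loss.
import torch
import torch.nn as nn

class Phi(nn.Module):
    def __init__(self, state_dim: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def value(phi: Phi, s: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    return -torch.norm(phi(s) - phi(g), dim=-1)   # V(s, g) = -||phi(s) - phi(g)||

def td_loss(phi: Phi, target_phi: Phi, s, s_next, g, gamma: float = 0.99) -> torch.Tensor:
    reached = (s_next == g).all(dim=-1).float()   # 1 if the sampled goal is reached
    with torch.no_grad():
        # reward -1 and bootstrap only while the goal has not been reached
        target = -(1.0 - reached) + gamma * (1.0 - reached) * value(target_phi, s_next, g)
    return ((value(phi, s, g) - target) ** 2).mean()
```

In practice goals $g$ are typically sampled from future states of the same trajectory (hindsight relabeling), and `target_phi` is a slowly updated copy of `phi`.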
https://arxiv.org/abs/2402.15567