Appealing recipe: Pre-train via self-supervised or unsupervised objectives on large and diverse datasets without ground truth labels. Adapt to downstream tasks via prompting, few-shot learning, or fine-tuning. The pre-trained generalist policies are called foundation policies.
Open question in RL: finding the best policy pre-training objective from data.
Prior works: BC (requires expert demonstrations), goal-conditioned RL (GCRL), unsupervised skill discovery
Related work:
Offline RL
Unlabeled trajectory data $\mathcal{D}$, which consists of state-action trajectories.
Pretrain $\pi(a \vert s, z)$, where $z \in \mathcal{Z}$ denotes a latent vector (task or skill).
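As a rough illustration of what such a latent-conditioned policy looks like, here is a minimal PyTorch sketch. The network sizes, the Gaussian head, and all dimensions are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a latent-conditioned policy pi(a | s, z).
# All shapes and the Gaussian head are illustrative assumptions.
import torch
import torch.nn as nn

class LatentConditionedPolicy(nn.Module):
    def __init__(self, state_dim: int, latent_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),  # mean of a Gaussian action distribution
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor, z: torch.Tensor) -> torch.distributions.Normal:
        mean = self.net(torch.cat([state, z], dim=-1))
        return torch.distributions.Normal(mean, self.log_std.exp())

# Usage: sample actions for a batch of states under a fixed skill/task vector z.
policy = LatentConditionedPolicy(state_dim=17, latent_dim=8, action_dim=6)
s, z = torch.randn(32, 17), torch.randn(32, 8)
actions = policy(s, z).sample()  # shape: (32, 6)
```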
A linear space with an inner product is called an inner product space or pre-Hilbert space. Any inner product space is a normed linear space.
A Hilbert space $\mathcal{Z}$ is a complete vector space equipped with an inner product $\langle x, y \rangle$, the induced norm $\Vert x \Vert = \sqrt{\langle x, x \rangle}$, and the induced metric $d(x, y) = \Vert x - y \Vert$ for $x, y \in \mathcal{Z}$.
In particular, every Hilbert space is a Banach space with respect to the norm. A Banach space is a normed linear space that is a complete metric space with respect to the metric derived from its norm.
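To make the induced norm and metric concrete, here is a tiny numerical illustration on $\mathbb{R}^n$ with the standard dot product (itself a Hilbert space); it is only an example, not code from the paper.

```python
# The inner product induces a norm and a metric, mirroring the definitions above.
import numpy as np

def inner(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.dot(x, y))      # <x, y>

def norm(x: np.ndarray) -> float:
    return np.sqrt(inner(x, x))     # ||x|| = sqrt(<x, x>)

def metric(x: np.ndarray, y: np.ndarray) -> float:
    return norm(x - y)              # d(x, y) = ||x - y||

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(inner(x, y), norm(x), metric(x, y))  # 1.0, ~2.236, ~3.606
```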
Train a representation $\phi : \mathcal{S} \rightarrow \mathcal{Z}$ so that the latent-space distance matches the optimal temporal distance $d^*(s, g)$ (the minimal number of steps from $s$ to $g$):
$$ d^*(s,g)=\Vert \phi(s) - \phi(g) \Vert $$
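One way to fit such a $\phi$ from offline data is to parameterize a goal-conditioned value function $V(s, g) = -\Vert \phi(s) - \phi(g) \Vert$ and regress it toward a temporal-difference target with reward $-1$ per step until the goal is reached, so that $-V(s, g)$ approximates $d^*(s, g)$. The sketch below assumes this plain TD recipe with a target network; it is a simplified stand-in, not the exact objective of the referenced paper.

```python
# Sketch: train phi so that ||phi(s) - phi(g)|| approximates the temporal distance d*(s, g).
# Assumption: V(s, g) = -||phi(s) - phi(g)|| is regressed toward a TD target with reward -1
# per step until the goal is reached; this is a simplified recipe, not the paper's exact loss.
import torch
import torch.nn as nn

class Phi(nn.Module):
    def __init__(self, state_dim: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def value(phi: Phi, s: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    return -torch.norm(phi(s) - phi(g), dim=-1)   # V(s, g) = -||phi(s) - phi(g)||

def td_loss(phi: Phi, target_phi: Phi, s, s_next, g, gamma: float = 0.99) -> torch.Tensor:
    reached = (s_next == g).all(dim=-1).float()   # 1 if the sampled goal is reached
    with torch.no_grad():
        # reward -1 and bootstrap only while the goal has not been reached
        target = -(1.0 - reached) + gamma * (1.0 - reached) * value(target_phi, s_next, g)
    return ((value(phi, s, g) - target) ** 2).mean()
```

In practice goals $g$ are typically sampled from future states of the same trajectory (hindsight relabeling), and `target_phi` is a slowly updated copy of `phi`.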
https://arxiv.org/abs/2402.15567