General-purpose pre-trained models ("foundation models") have enabled practitioners to produce generalizable solutions for individual machine learning problems with datasets that are significantly smaller than those required for learning from scratch. Such models are typically trained on large and diverse datasets with weak supervision, consuming much more training data than is available for any individual downstream application. In this paper, we describe the Visual Navigation Transformer (ViNT), a foundation model that aims to bring the success of general-purpose pre-trained models to vision-based robotic navigation. ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset, and employs a flexible Transformer-based architecture to learn navigational affordances and enable efficient adaptation to a variety of downstream navigational tasks. ViNT is trained on a number of existing navigation datasets, comprising hundreds of hours of robotic navigation from a variety of different robotic platforms, and exhibits positive transfer, outperforming specialist models trained on singular datasets. ViNT can be augmented with diffusion-based subgoal proposals to explore novel environments, and can solve kilometer-scale navigation problems when equipped with long-range heuristics. ViNT can also be adapted to novel task specifications with a technique inspired by prompt-tuning, where the goal encoder is replaced by an encoding of another task modality (e.g., GPS waypoints or routing commands) embedded into the same space of goal tokens. This flexibility and ability to accommodate a variety of downstream problem domains establishes ViNT as an effective foundation model for mobile robotics. For videos, code, and model checkpoints, see our project page at https://visualnav-transformer.github.io.
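To make the prompt-tuning-inspired adaptation concrete, the following is a minimal PyTorch sketch of the idea described above: a small learned encoder for a new task modality (here, a relative GPS waypoint) is trained to emit tokens in the same embedding space the frozen Transformer expects for image goals. All names and dimensions here (TOKEN_DIM, WaypointEncoder, vint_backbone) are hypothetical placeholders for illustration, not the released ViNT API.

```python
# Hypothetical sketch: swap ViNT's visual goal encoder for a waypoint encoder
# whose output lives in the same goal-token space, keeping the backbone frozen.
import torch
import torch.nn as nn

TOKEN_DIM = 512  # assumed width of the observation/goal tokens


class WaypointEncoder(nn.Module):
    """Maps a relative GPS waypoint (dx, dy) into the goal-token space."""

    def __init__(self, token_dim: int = TOKEN_DIM):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, 128),
            nn.ReLU(),
            nn.Linear(128, token_dim),
        )

    def forward(self, waypoint: torch.Tensor) -> torch.Tensor:
        # (batch, 2) -> (batch, 1, token_dim): one substitute "goal token"
        return self.mlp(waypoint).unsqueeze(1)


def adapt_for_waypoints(vint_backbone: nn.Module) -> WaypointEncoder:
    """Freeze the pre-trained backbone; only the new encoder is trained."""
    for p in vint_backbone.parameters():
        p.requires_grad = False
    return WaypointEncoder()


# Usage sketch: concatenate the substitute goal token with the observation
# tokens and decode actions with the frozen Transformer backbone.
# obs_tokens: (batch, n_obs, TOKEN_DIM) from the frozen observation encoder
# goal_token = waypoint_encoder(gps_waypoint)          # (batch, 1, TOKEN_DIM)
# actions = vint_backbone(torch.cat([obs_tokens, goal_token], dim=1))
```

Freezing the backbone here mirrors the prompt-tuning analogy: the pre-trained navigational affordances are preserved, and only the lightweight modality encoder is optimized for the new task specification.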