Leverage Ray from Berkeley for Distributed Training

DEMOCRATIZING PRODUCTION-SCALE DISTRIBUTED DEEP LEARNING

https://arxiv.org/pdf/1811.00143.pdf

To address the above challenges, we discuss a system we built at Apple known as Alchemist. Alchemist adopts a cloud-native architecture and is portable among private and public clouds. It supports multiple training frameworks like Tensorflow or PyTorch and multiple distributed training paradigms. The compute cluster is managed by, but not limited to, Kubernetes. We chose a containerized workflow to ensure uniformity and repeatability of the software environment. In the following sections, we refer to engineers, researchers, and data scientists using Alchemist as users.
