PaddlePaddle / Paddle

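PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle (飞桨) core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)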
http://www.paddlepaddle.org/
Apache License 2.0
22.32k stars 5.63k forks

Dataset on cloud - allow user to pre-partition #1914

Closed: helinwang closed this issue 7 years ago

helinwang commented 7 years ago

For datasets on the cloud, we currently only allow the user to upload the dataset and an index file; we then partition the dataset for the user (according to https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/cluster_train/data_dispatch.md#上传训练文件 ).

@emailweixu suggested that users may want to customize how the data is partitioned. This way, the user doesn't have to change their reader: before, the reader took a filename as input, and on the cloud it still takes a filename as input. So I think we need to allow the user to upload pre-partitioned data (partitioned into multiple files), and the master process will dispatch data using each file as the minimal unit.
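To make the idea concrete, here is a minimal sketch of file-level dispatch. It is only an illustration under assumptions: `FileDispatcher`, `next_file`, and `trainer_records` are hypothetical names, not part of Paddle or the design doc, and the in-process queue stands in for whatever RPC the real master would expose.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, Optional


class FileDispatcher:
    """Hypothetical stand-in for the master process.

    It holds the list of pre-partitioned files uploaded by the user and
    hands them out one at a time, so a whole file is the minimal unit.
    """

    def __init__(self, files: Iterable[str]) -> None:
        self._pending: Deque[str] = deque(files)

    def next_file(self) -> Optional[str]:
        # Return the next undispatched filename, or None when all are taken.
        return self._pending.popleft() if self._pending else None


def trainer_records(
    dispatcher: FileDispatcher,
    reader: Callable[[str], Iterable[str]],
) -> Iterator[str]:
    """Hypothetical trainer loop.

    The user's reader is unchanged: it still takes a filename as input.
    Only the source of the filename differs (the dispatcher).
    """
    while True:
        filename = dispatcher.next_file()
        if filename is None:
            return
        yield from reader(filename)
```

The point of the sketch is that the reader's signature is the same locally and on the cloud; the master only decides which trainer receives which file.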

helinwang commented 7 years ago

This is already in the design doc: https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/cluster_train/data_dispatch.md