For datasets on the cloud, we currently only allow the user to upload the dataset and an index file; we do the dataset partitioning for the user (according to https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/cluster_train/data_dispatch.md#上传训练文件 ).
@emailweixu suggested that users may want to customize how data is partitioned. This way, the user doesn't have to change their reader (locally the reader takes a filename as input, and on the cloud it still takes a filename as input). So I think we will need to allow the user to upload pre-partitioned data (split into multiple files), and the master process will dispatch data using each file as the minimal unit.
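A minimal sketch of the idea above, assuming a hypothetical `Master` that treats each uploaded partition file as an indivisible task (names and API here are illustrative, not Paddle's actual interface):

```python
from queue import Queue

class Master:
    """Dispatches pre-partitioned data; each file is the minimal unit."""

    def __init__(self, partition_files):
        # The master never re-partitions; it only queues whole files.
        self.todo = Queue()
        for path in partition_files:
            self.todo.put(path)

    def fetch_task(self):
        # A trainer calls this and passes the returned filename to its
        # unchanged reader, exactly as it would when running locally.
        if self.todo.empty():
            return None
        return self.todo.get()

master = Master(["part-00000", "part-00001"])
task = master.fetch_task()  # a trainer receives one whole file
```

The point is that the trainer-side reader interface (filename in, records out) is identical on the cloud and locally; only the source of filenames changes.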