NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep-learning-based recommender systems.
Apache License 2.0

[REA] Benchmark Data Loader training from Cloud Storage #1087

Open bschifferer opened 3 years ago

bschifferer commented 3 years ago

What questions are you trying to answer? Please describe. The NVTabular data loader accelerates training of TensorFlow and PyTorch models, but existing benchmarks are based on local disks. In production systems, data is often stored in cloud storage, such as AWS S3 or Google Cloud Storage. The NVTabular data loader can stream data asynchronously, and NVTabular supports AWS S3 and Google Cloud Storage.
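The asynchronous streaming mentioned above can be sketched generically: a background thread fetches batches into a bounded prefetch queue while the trainer consumes them, so I/O and compute overlap. This is an illustrative sketch, not NVTabular's actual loader code; the function names and sleep-based timings are made up.

```python
# Sketch of the overlap that asynchronous streaming enables: a background
# thread "downloads" batches into a bounded queue while the trainer
# consumes them. Names and timings are illustrative, not NVTabular's API.
import queue
import threading
import time

def stream_batches(n_batches, out_q, fetch_time=0.01):
    """Producer: simulates fetching batches from remote storage."""
    for i in range(n_batches):
        time.sleep(fetch_time)          # stand-in for network I/O
        out_q.put(i)
    out_q.put(None)                     # sentinel: no more data

def train(in_q, step_time=0.01):
    """Consumer: simulates one training step per batch."""
    steps = 0
    while in_q.get() is not None:
        time.sleep(step_time)           # stand-in for GPU compute
        steps += 1
    return steps

q = queue.Queue(maxsize=4)              # bounded prefetch buffer
producer = threading.Thread(target=stream_batches, args=(20, q))
start = time.perf_counter()
producer.start()
steps = train(q)
producer.join()
elapsed = time.perf_counter() - start
# With overlap, wall time approaches max(fetch, train) per batch,
# not fetch + train.
print(steps, round(elapsed, 2))
```

Because fetch and train overlap, the wall time stays close to the slower of the two phases rather than their sum.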

Question:

Follow Up Question:

karlhigley commented 3 years ago

I don't have exact numbers, but IIRC it's much faster to download the data to local disk and train from there. If the data fits on disk on a single machine, that's the recommended approach.

We don't have a good solution for datasets that are too large to fit on one machine. Ideally, the files from cloud storage would be partitioned across machines, so that each machine could download a subset of the data to local storage and train on that subset. We don't yet support that in the multi-GPU functionality of the dataloaders, though. (We've only handled the multi-GPU, single-machine case so far.)
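The partitioning described above could be as simple as a deterministic strided split of the file list by worker rank, so each machine computes its own disjoint shard without coordination. This is a sketch of the idea, not existing dataloader functionality; the function and file names are hypothetical.

```python
def partition_files(files, rank, world_size):
    """Assign each worker a disjoint, strided subset of the file list.

    A deterministic split means every worker can compute its own shard
    independently; together the shards cover all files exactly once.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return files[rank::world_size]

# Example: 10 parquet files spread across 3 machines.
files = [f"part_{i:02d}.parquet" for i in range(10)]
shards = [partition_files(files, r, 3) for r in range(3)]
print(shards)
```

Each machine would then download only its shard to local disk and train on it, which keeps the single-machine fast path while covering the full dataset.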

bschifferer commented 3 years ago

Since the data loader can load data asynchronously, could it not be faster to train directly from storage than to copy all the data first and only then start the training process? For large datasets (100 GB to 500 GB), streaming should be more efficient, since the training process does not idle during the copy (at least when training for only one epoch).
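The trade-off above can be made concrete with back-of-envelope arithmetic: copy-then-train pays for download and training sequentially, while streaming with prefetch is bounded by the slower of the two. The throughput numbers below are made-up assumptions for illustration, not measurements.

```python
# Back-of-envelope comparison for a single-epoch run.
# All numbers are illustrative assumptions, not benchmark results.
dataset_gb = 500
download_gbps = 1.0      # assumed cloud-storage read throughput, GB/s
train_gbps = 0.8         # assumed rate at which training consumes data, GB/s

download_s = dataset_gb / download_gbps
train_s = dataset_gb / train_gbps

# Copy everything first, then train: the two phases run sequentially.
copy_then_train_s = download_s + train_s

# Stream with async prefetch: phases overlap, bounded by the slower one.
streaming_s = max(download_s, train_s)

print(f"copy-then-train: {copy_then_train_s:.0f}s")
print(f"streaming:       {streaming_s:.0f}s")
```

Under these assumptions streaming wins for a single epoch; for multiple epochs the one-time download cost amortizes, which is why copy-then-train is usually recommended when the data fits on local disk.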

This question has been asked multiple times. I don't have any experiments or experience with it; it would be great to provide numbers.

karlhigley commented 3 years ago

That might be possible, depending on how the data is split into files in cloud storage. I don't think the dataloaders support it yet, though.