databricks / spark-deep-learning

Deep Learning Pipelines for Apache Spark
https://databricks.github.io/spark-deep-learning
Apache License 2.0

PyTorch dataframe batching and loading #175

Open andompesta opened 5 years ago

andompesta commented 5 years ago

Dear all,

I really appreciate your effort in bridging the gap between Spark and automatic-differentiation libraries. I'm currently working with Spark and PyTorch, where Spark is used to process the dataset and PyTorch is used to build a neural model.

In my configuration, the model is present only on the driver node, while the dataset is distributed across the workers. I'm wondering what the best practice is for converting a Spark dataframe into a PyTorch tensor or a numpy array in order to generate training batches. Does it make sense to generate the batches in a distributed way, or is it better to collect everything on the driver node?

Best regards, Sandro

thunterdb commented 5 years ago

Hello Sandro, thanks for trying out our library. The focus has been primarily on Keras and TensorFlow for now, but we are looking at supporting PyTorch eventually. In the meantime, you can use Spark's vectorized UDFs, which let you write pandas code and distribute it with Spark: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
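For illustration, a scalar vectorized (pandas) UDF looks roughly like this. This is a minimal sketch assuming Spark 2.3+ with PyArrow installed; the column names are made up for the example:

```python
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.SCALAR)
def log1p_bytes(v: pd.Series) -> pd.Series:
    # Ordinary pandas/numpy code; Spark ships the column to the workers
    # in Arrow batches and applies this function to each batch.
    return pd.Series(np.log1p(v))

# "bytes_sent" is a hypothetical column name, not from this thread.
df = df.withColumn("bytes_sent_log", log1p_bytes(df["bytes_sent"]))
```

The same mechanism works for any element-wise pandas/numpy transformation, so preprocessing written against pandas can be reused on a distributed dataframe.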

Can you share more about your dataset? What is the current format that you are using to pass it to pytorch? numpy arrays?

Tim

andompesta commented 5 years ago

So, first of all, a small disclaimer: I'm new to the Spark and PySpark environment, so I'm looking for best practices and guidance to follow while coding.

Anyway, I'm currently working on a network-connection dataset for intrusion detection (each data point represents a connection between a client and a server). The scenario looks as follows:

1. During training, the entire dataset is collected on the driver node to train the model with traditional batching.
2. During testing, the data comes in from a real-time stream. PySpark provides great interfaces for computing rolling-window statistics over previous micro-batches (a rough sketch below), so the same preprocessing code used for the training dataset can be applied to the testing stream.
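For reference, the rolling-window part might look like this in Structured Streaming. This is purely illustrative: the source path, schema, column names, and window/watermark sizes are all assumptions, not details from my actual pipeline:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, avg, count

spark = SparkSession.builder.getOrCreate()

# Assumed: a streaming source of connection records with this schema;
# "connections/" is a hypothetical input directory.
conn_stream = (spark.readStream
    .schema("event_time timestamp, client_ip string, bytes_sent double")
    .json("connections/"))

# Per-client statistics over 10-minute windows sliding every 1 minute;
# the watermark drops events arriving more than 5 minutes late.
stats = (conn_stream
    .withWatermark("event_time", "5 minutes")
    .groupBy(window("event_time", "10 minutes", "1 minute"), "client_ip")
    .agg(count("*").alias("n_connections"),
         avg("bytes_sent").alias("avg_bytes")))

query = (stats.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("conn_stats")
    .start())
```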

However, bridging PySpark and PyTorch is not that easy. Currently, I'm using the following approach (a consolidated sketch follows the list):

1. I use a PySpark Pipeline to extract the needed features.
2. A VectorAssembler is used to generate a feature vector. Unfortunately, due to the previous step, I obtain sparse feature vectors as output.
3. I therefore convert the SparseVectors to DenseVectors. I'm not sure why, but the same UDF that maps a SparseVector to a DenseVector does not work on a dataframe, yet works correctly on an RDD.
4. At this point, I convert the RDD into a pandas dataframe containing the labels and the extracted features. This step is straightforward but could be improved: instead of native numpy arrays, the resulting pandas dataframe contains the DenseVector objects generated by the VectorAssembler. Converting an object array to numpy.float32 can be time-consuming and results in ugly code. In future releases, I hope to see a direct mapping between Spark Vectors and numpy types. However, the strong typing imposed by PyTorch comes to our aid: if we convert the pandas Series directly into a tensor, we obtain an array of floats rather than the original objects: `torch.tensor(pd_df.features.apply(lambda x: x.toArray()), dtype=torch.float)`
5. Having obtained the needed tensors, we can train a PyTorch model using the provided DataLoader class.
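Putting steps 2-5 together, here is a hedged end-to-end sketch of what I mean. Input column names ("f1", "f2", "label") and the batch size are illustrative assumptions; note that `Vector.toArray()` densifies both sparse and dense vectors, so this variant sidesteps the separate SparseVector-to-DenseVector UDF from step 3:

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader
from pyspark.ml.feature import VectorAssembler

# Step 2: assemble raw columns into a single (possibly sparse) vector.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
assembled = assembler.transform(df)

# Step 4: collect to the driver and densify each vector via toArray().
pd_df = assembled.select("features", "label").toPandas()
X = np.stack(pd_df["features"].apply(lambda v: v.toArray()).values).astype(np.float32)
y = pd_df["label"].values.astype(np.float32)

# Step 5: wrap the tensors in a DataLoader for standard mini-batch training.
dataset = TensorDataset(torch.from_numpy(X), torch.from_numpy(y))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for features, labels in loader:
    ...  # training step
```

The `np.stack` over the `toArray()` results yields a contiguous float32 matrix in one go, which avoids the slow object-dtype conversion complained about in step 4.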

Thanks for the help. Sandro