
Scaling with Spark (PySpark) #57

Open · Sarkutz opened this issue 4 years ago

Title

Scaling with Apache Spark (PySpark)

Summary

Running data analysis and machine learning on big data requires tools that can scale across a cluster of machines. Apache Spark is a popular framework that is widely adopted in industry, and PySpark is its Python API. In this talk, we will see how Spark crunches large data sets by scaling out to many machines, and examine how Spark's architecture and execution model make this possible. We'll also cover key Spark concepts (such as lazy evaluation and immutability) along the way, and close with a brief discussion of the performance implications of using PySpark instead of the native Scala API.
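
As a taste of the lazy-evaluation idea mentioned above, here is a minimal PySpark sketch (not material from the talk itself); the file name `events.csv` and the `status`/`host` columns are placeholder assumptions for illustration:

```python
# Minimal sketch of lazy evaluation in PySpark: transformations only
# build a logical plan; nothing executes until an action is called.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations: recorded in the plan, not executed yet.
# "events.csv" and the "status"/"host" columns are hypothetical.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
errors = df.filter(F.col("status") == "ERROR")
by_host = errors.groupBy("host").count()

# Action: only now does Spark optimize the plan and ship the work
# out to executors across the cluster.
by_host.show()

spark.stop()
```

Because the plan is built lazily, Spark can optimize the whole pipeline (e.g. pushing the filter down before the shuffle) before any data is read.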

Proposed contents

I'm open to, and looking forward to, suggestions on anything else to include.

Proposed duration

~40 mins (tentative)

Talk Scheduling