
Scaling with Spark (PySpark) #57

Open · Sarkutz opened this issue 4 years ago

Title

Scaling with Apache Spark (PySpark)

Summary

Running data analysis and machine learning on big data requires tools that can scale across a cluster of machines. Apache Spark is a popular framework that is widely adopted in industry, and PySpark is its Python API. In this talk, we will see how Spark crunches large data sets by scaling out to many machines, and examine how Spark's architecture and execution model make this possible. We'll also cover key Spark concepts (such as lazy evaluation and immutability) along the way, and close with a brief discussion of the performance implications of using PySpark instead of the native Scala API.
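
As a taste of the lazy-evaluation idea mentioned above, here is a minimal PySpark sketch (not material from the talk itself); the file name `events.csv` and the `status`/`host` columns are placeholder assumptions for illustration:

```python
# Minimal sketch of lazy evaluation in PySpark: transformations only
# build a logical plan; nothing executes until an action is called.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Transformations: recorded in the plan, not executed yet.
# "events.csv" and the "status"/"host" columns are hypothetical.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
errors = df.filter(F.col("status") == "ERROR")
by_host = errors.groupBy("host").count()

# Action: only now does Spark optimize the plan and ship the work
# out to executors across the cluster.
by_host.show()

spark.stop()
```

Because the plan is built lazily, Spark can optimize the whole pipeline (e.g. pushing the filter down before the shuffle) before any data is read.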

Proposed contents

I'm open to, and looking forward to, suggestions on anything else to include.

Proposed duration

~40 mins (tentative)

Talk Scheduling