CorrelAid / spark_workshop

MIT License

Lag local setup #1

Closed LAG1819 closed 5 months ago

jbao commented 5 months ago

I like the general flow of the exercise very much, and have a couple of additional suggestions.

Introduce partitions

We can show how to check and change the number of partitions, e.g.

from pyspark.sql.functions import spark_partition_id

df.rdd.getNumPartitions()   # check the current number of partitions
df = df.repartition(5)      # shuffle the data into 5 partitions
df.rdd.getNumPartitions()   # now 5
df.withColumn("partitionId", spark_partition_id()).groupBy("partitionId").count().show()  # rows per partition
df.count()

Demonstrate lazy evaluation

We can show (e.g. in Task 8) that all the transformation steps initially do nothing and return immediately. The actual computation only happens when an action such as count or show is called.

Show one use case of the Spark UI

Ask the participants to navigate through the Spark UI while some transformation job is running (e.g. Task 8), count the number of stages, and note what each one does and how long it takes.

Let me know your thoughts on the above. I can also try to add them to the notebook if we agree. In general, I think the number of exercises will most likely exceed the workshop duration, but they're well organized with increasing difficulty, so we can set the expectation that participants should work through them in the defined order and do as many as time permits.

LAG1819 commented 5 months ago

I like the suggestions. I would recommend adding these tasks in a new branch, as this one is currently working and also contains the basic setup.