With the new exercises, we're not covering some of the more interesting Spark functions.
We'll create a new exercise for "Additional Spark Functions" (in the small-exercises repo) to cover the following:
DataFrame Cleaning
[ ] na.drop
[ ] na.fill
[ ] replace
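[ ] coalesce (the column function for picking the first non-null value, not DataFrame.coalesce, which reduces partitions)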
DataFrame Queries
[ ] select + array_contains
Aggregations
[ ] stddev
[ ] variance
[ ] mean
String Operations
[ ] regexp_replace
[ ] regexp_extract
For everything else
[ ] UDF
CFRs
[ ] All functions should have a solution (in a separate Solutions notebook)
[ ] All functions should have a link to the documentation for PySpark
Notes
We might be able to reuse some of the examples from the Wrangling with Spark exercise, but improve on them. Using our domain data would be best, though we might need to dirty up some of it and save it in the repo (as a CSV, for example) so the exercise can load it.
Open Questions
Are these valuable?
cache
unpersist
createOrReplaceGlobalTempView
createOrReplaceTempView
Should all functions have a test? Perhaps we can add tests later.