datastackacademy / deb-archive

TuraLabs FREE Data Engineering Bootcamp
Apache License 2.0

spark intro #11

Closed: tura-david closed this 3 years ago

tura-david commented 3 years ago

Wrote the first two Spark intro notebooks (took me long enough, eh?). @jendevelops can you give them a review to see if they make any sense?

xdatanomad commented 3 years ago

@tura-david Can you copy the install_jupyter.sh script over from ch1/pandas-intro? It could help users install jupyter if they don't have it already.

xdatanomad commented 3 years ago

Could you please also add a requirements.txt listing the pip packages needed for this episode?

tura-david commented 3 years ago

> @tura-david Can you copy the install_jupyter.sh script over from ch1/pandas-intro? It could help users install jupyter if they don't have it already.

Good point! I'll get it added.

tura-david commented 3 years ago

> Could you please also add a requirements.txt listing the pip packages needed for this episode?

Just double-checked, and the only import I'm using is pyspark, which I cover installing in the first section of the intro. As such, I'm not going to add a separate requirements.txt unless another library comes into use.

xdatanomad commented 3 years ago

@tura-david I think the scope for the intro notebook is perfect.

The RDDDataFrames notebook could expand its scope with things like:

  1. You already have column data-type transformations; let's add a transformation of a single column's value, like stripping the leading 'N' off tail numbers.
  2. Add a transformation that uses more than one column. For example, compute the average flight speed as a new column: flight distance / flight time.
  3. It would be great to add a join or lookup. You could export the BigQuery airports table and use the IATA (3-letter) airport code to look up the source and destination city and state names, then add them to your flights RDD.
  4. Throw in a couple of more complex aggregates, like average flight time/distance per airline, or average flight time/distance per destination, etc.

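The four suggestions above all boil down to standard row-level and grouped operations. Here is a Spark-free sketch of that logic in plain Python on a toy dataset (the field names `tail_number`, `airline`, `distance`, `air_time` and the tiny airports lookup are illustrative assumptions, not the notebook's actual schema); in the notebook each step would map to an RDD `map`, a DataFrame `withColumn`, a `join`, and a `groupBy().agg()` respectively:

```python
from collections import defaultdict

# Toy flight rows; field names are illustrative, not the real schema.
flights = [
    {"tail_number": "N12345", "airline": "AA", "dest": "JFK",
     "distance": 2475.0, "air_time": 5.5},
    {"tail_number": "N67890", "airline": "AA", "dest": "JFK",
     "distance": 2586.0, "air_time": 5.75},
    {"tail_number": "N24680", "airline": "DL", "dest": "ATL",
     "distance": 1947.0, "air_time": 4.25},
]

# 1. Single-column transformation: strip the leading 'N' off tail numbers.
#    (In Spark: withColumn with regexp_replace, or an RDD map.)
for row in flights:
    row["tail_number"] = row["tail_number"].lstrip("N")

# 2. Multi-column transformation: average speed = distance / flight time.
for row in flights:
    row["speed_mph"] = row["distance"] / row["air_time"]

# 3. Join/lookup: IATA code -> (city, state), a tiny stand-in for the
#    exported BigQuery airports table.  (In Spark: a DataFrame join.)
airports = {"JFK": ("New York", "NY"), "ATL": ("Atlanta", "GA")}
for row in flights:
    row["dest_city"], row["dest_state"] = airports[row["dest"]]

# 4. Complex aggregate: average flight time per airline.
#    (In Spark: groupBy("airline").agg(avg("air_time")).)
times = defaultdict(list)
for row in flights:
    times[row["airline"]].append(row["air_time"])
avg_time = {airline: sum(t) / len(t) for airline, t in times.items()}
```

Each loop here is one pass over the rows purely for clarity; in Spark the same steps would be expressed declaratively and run distributed.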
xdatanomad commented 3 years ago

> Could you please also add a requirements.txt listing the pip packages needed for this episode?

> Just double-checked, and the only import I'm using is pyspark, which I cover installing in the first section of the intro. As such, I'm not going to add a separate requirements.txt unless another library comes into use.

You already have jupyter, and you'll probably end up installing others later (for the BigQuery material). It mainly makes life easier for those of us who switch between developing different lessons.

tura-david commented 3 years ago

> Could you please also add a requirements.txt listing the pip packages needed for this episode?

> Just double-checked, and the only import I'm using is pyspark, which I cover installing in the first section of the intro. As such, I'm not going to add a separate requirements.txt unless another library comes into use.

> You already have jupyter, and you'll probably end up installing others later (for the BigQuery material). It mainly makes life easier for those of us who switch between developing different lessons.

I've already added install_jupyter.sh; how about I add an "install pyspark" line alongside the "install pandas" line it already has? That way we have the necessary modules for easy switching, as you say, but follow the same model as the pandas intro.

xdatanomad commented 3 years ago

> Could you please also add a requirements.txt listing the pip packages needed for this episode?

> Just double-checked, and the only import I'm using is pyspark, which I cover installing in the first section of the intro. As such, I'm not going to add a separate requirements.txt unless another library comes into use.

> You already have jupyter, and you'll probably end up installing others later (for the BigQuery material). It mainly makes life easier for those of us who switch between developing different lessons.

> I've already added install_jupyter.sh; how about I add an "install pyspark" line alongside the "install pandas" line it already has? That way we have the necessary modules for easy switching, as you say, but follow the same model as the pandas intro.

Sounds good.
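Assuming install_jupyter.sh follows the ch1/pandas-intro pattern of plain pip install lines (a guess at its contents, not the actual script), the agreed change would amount to something like:

```sh
# Hypothetical contents, mirroring the ch1/pandas-intro install script;
# the exact existing lines are an assumption.
python3 -m pip install jupyter pandas   # already present
python3 -m pip install pyspark          # new line for this episode
```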

tura-david commented 3 years ago

> Overall you do a really good job of explaining what Spark is and how to use it, but I feel like the "why" is missing from the introduction to Spark. I can't really answer the question "why am I learning Spark?" Also, I made some typo edits.

Good point; this will be handled in a separate PR by @parvister.