datacamp / data-cleaning-with-pyspark-live-training

Live Training Session: Cleaning Data with Pyspark

Notebook Review #3

adelnehme opened this issue 4 years ago

adelnehme commented 4 years ago


Hi @mmetzger :wave: Please read the key below to understand how to respond to the feedback provided. Some items will require you to take action while others only need some thought. During this round of feedback, each item with an associated checkbox is an action item that should be implemented before you submit your content for review.


Key

:wrench: This must be fixed during this round of review. This change is necessary to create a good DataCamp live training.

:mag: This wasn't exactly clear. This will need rephrasing or elaborating to fully and clearly convey your meaning.

📣 This is something you should look out for during the session and verbally explain once you arrive at this point.

:star: Excellent work!


General Feedback

This is looking like a great notebook! I want to caveat that I'm not a PySpark expert - so a lot of my questions/comments may come from that place and I'll be relying on your expertise for guidance 😄


Notebook Feedback

Getting Started

I think this code section should remain populated for the empty session (or student) notebook.

Brief description of what we're doing and that we will be using shell commands to get the data (a sketch of these commands follows the list below):

  • wget does..
  • ls does..
  • gunzip does..
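For illustration, a notebook cell along these lines could accompany that description (the download URL below is a placeholder rather than the session's actual link; the file names match those used later in this thread):

# These are shell commands, run from the notebook via the leading "!".
# <DATA_URL> is a placeholder -- substitute the session's actual data location.
!wget -P /tmp <DATA_URL>/netflix_titles_dirty_01.csv.gz         # download a compressed data file into /tmp
!ls /tmp/netflix_titles_dirty*.csv.gz                           # list the downloaded files; * expands to every matching name
!gunzip -c /tmp/netflix_titles_dirty_03.csv.gz | head -20       # decompress to stdout and show the first 20 lines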

Loading our initial dataframe

Let's take a look at what Spark does with our data and see if it can properly parse the output. To do this, we'll first load the content into a DataFrame using the spark.read.csv() method. We'll pass in three arguments:

  • The path to the file(s)
  • The header argument which takes in True or False. Since our files do not have a header row, we must specify header = False or risk a data row being interpreted as a header.
  • The sep argument which specifies the field separator. Often in CSV files this is a comma (,), but in our files it's a \t or tab character.

Initial analysis

  • Identifying corrupt rows that start with comments and keeping in mind the field count as well
  • Identifying that the column names need to be changed
  • Identifying incorrect data types

With that in mind, I think it's worth using the "Observation #:" format, where after each dirty data problem diagnosis we make, we add "Observation #: Diagnosis". For example, after having run titles_df.show(150, truncate=False), we can insert the following in markdown directly afterward:

Observation 1: Taking a look at the first column, most entries are numeric IDs. If we scroll through the data we see at least one random text entry. Something must be out of the ordinary.

The same can be done after each diagnosis step you take.

Bypassing the CSV interpreter

"Let's count correct instances of the first column by casting it as an integer and trying to find all values that are not null. To do this, we will use the functions submodule from the pyspark.sql module to filter on incorrect data. We can apply changes to a particular column by using the .col() method, change its type chaining it with .cast() - and find not null values by chaining with .isNotNull() -- All of this is couched inside of the DataFrame .filter() method. Here's how it would look like on an example of filtering on a PySpark dataframe named df_1 with a column named _c0:

from pyspark.sql import functions as F

# keep only the rows whose first column casts cleanly to an integer
df_1.filter(F.col("_c0").cast("int").isNotNull()).show()

Cleaning up our data

Note: We're doing things in a more difficult fashion than is absolutely necessary in order to illustrate options. The Spark CSV reader has an option for a comment property, which actually defaults to skipping all rows starting with a # character. That said, it only supports a single character - consider if you were looking for multi-character options (i.e., a // or /* from C-style syntax). This feature is also only available in newer versions of Spark, whereas our method works in any of the 2.x Spark releases.

Checking column counts

Creating typed columns

... The .getItem() method returns the value at the specified index of the list column (i.e., of the split column). For example, here is this applied to an example DataFrame df_1 with a list column named list_column containing 3 elements, as seen in the table below:

[image: example df_1 table]

df_1 = df_1.withColumn('id', df_1.list_column.getItem(0).cast(IntegerType()))

Even more cleanup

UDF

Saving data for analysis / further processing

mmetzger commented 4 years ago

Hi @adelnehme,

Thank you for the extensive feedback - I think I've implemented everything accordingly - further comments inline.

General Feedback

This is looking like a great notebook! I want to caveat that I'm not a PySpark expert - so a lot of my questions/comments may come from that place and I'll be relying on your expertise for guidance 😄

I've specifically added some portions here that I hope will address some of the oddities of Spark. Originally this class was pitched as something learners would take after my Cleaning Data with PySpark course, so I assumed the learners would have the appropriate level of Spark knowledge. That said, I think it's a better notebook for adding it, though there are likely places that are still confusing to someone unfamiliar with Spark. Please let me know if anything seems odd.

  • [x] 🔧 A. Make sure to bold all sections, sub-sections and sub-sub sections.
  • [x] 🔧 B. In many sections, I gave the feedback of using "Observation #: " and "Data Problem #:" - the reasoning behind this is that it will help you connect problems back to observations when going about your data cleaning. Check out how this is done in the Data Cleaning with Python solution notebook.

I think I implemented this ok, though it does not flow as well as the example above. The primary reason is it's difficult to determine all the problems in one session. Please let me know if my version works, or what should be changed.

Notebook Feedback

Getting Started

I think this code section should remain populated for the empty session (or student) notebook.

  • [x] 🔧 1. It could be worthwhile here adding a brief markdown description of wget, ls and gunzip explaining that these are shell commands that we usually perform on the command line. Potentially adding a screenshot of a command line here would be great. I find a bullet point format like this is clearer:

Brief description on what we're doing and that we will be using shell commands to get the data

  • wget does..
  • ls does..
  • gunzip does..
  • [x] 🔧 2. Recommend removing the sub-headers for verifying the 7 files and taking a look at the first 20 rows as all of these steps are part of the getting started section.
  • [x] 📣 3. It would be cool to point out what the * is doing when using ls and that this is the only use of shell expected from us in the session.
  • [x] 📣 4. Keep a lookout to point out that the separator in the output of !gunzip -c /tmp/netflix_titles_dirty_03.csv.gz | head -20 is not a , but a tab.

Done for all

Loading our initial dataframe

  • [x] 🔧 5. Format the header here as a sub-section (##), and sub-sub-sections within it as sub-sub-sections (###).
  • [x] 🔧 6. I find it slightly clearer and less cognitive overload on students to use bullet points to outline arguments. For example:

Let's take a look at what Spark does with our data and see if it can properly parse the output. To do this, we'll first load the content into a DataFrame using the spark.read.csv() method. We'll pass in three arguments:

  • The path to the file(s)
  • The header argument which takes in True or False. Since our files do not have a header row, we must specify header = False or risk a data row being interpreted as a header.
  • The sep argument which specifies the field separator. Often in CSV files this is a comma (,), but in our files it's a \t or tab character.

Done

  • [x] 🔍 📣 6. Is the command titles_df = spark.read.csv('/tmp/netflix_titles_dirty*.csv.gz', header=False, sep='\t') reading all the Netflix movie files? If so, I recommend highlighting that these CSV files are being coalesced together through the use of *.

Added, but with the initial explanation about how Spark deals with files. Basically, nothing happens here beyond Spark checking that the files exist - i.e., they're not read until some action is performed. It's a bit hard to illustrate with this size of dataset and in this environment, but hopefully the explanation works. Otherwise I should be able to elaborate in the Q&A.
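For anyone reading along, a minimal sketch of that lazy behaviour, using the path and options quoted above and the notebook's existing spark session (the .count() call is just one example of an action that forces Spark to actually read the files):

# Returns immediately: Spark only checks that matching files exist, it does not read them yet
titles_df = spark.read.csv('/tmp/netflix_titles_dirty*.csv.gz', header=False, sep='\t')

# An action: Spark now reads every file matched by the * glob and returns the total row count
print(titles_df.count())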

Initial analysis

  • [x] 🔧 7. Use inline formatting around False.
  • [x] 🔍 8. It strikes me that the initial analysis phase is about diagnosing some dirty data problems, which are:
  • Identifying corrupt rows that start with comments and keeping in mind the field count as well
  • Identifying that the column names need to be changed
  • Identifying incorrect data types

With that in mind, I think it's worth using the "Observation #:" format, where after each dirty data problem diagnosis we make, we add "Observation #: Diagnosis". For example, after having run titles_df.show(150, truncate=False), we can insert the following in markdown directly afterward:

Observation 1: Taking a look at the first column, most entries are numeric IDs. If we scroll through the data we see at least one random text entry. Something must be out of the ordinary.

The same can be done after each diagnosis step you take.

Bypassing the CSV interpreter

  • [x] 🔧 9. Consider adding a cell where you just count the entire dataframe and see the total rows - preceded by a markdown cell briefly touching upon the .count() method (1 sentence really) - at the beginning of this section, not at the end.
  • [x] 🔧 10. After having counted the dataframe, it could be worth adding a markdown cell stating:

"Let's count correct instances of the first column by casting it as an integer and trying to find all values that are not null. To do this, we will use the functions submodule from the pyspark.sql module to filter on incorrect data. We can apply changes to a particular column by using the .col() method, change its type chaining it with .cast() - and find not null values by chaining with .isNotNull() -- All of this is couched inside of the DataFrame .filter() method. Here's how it would look like on an example of filtering on a PySpark dataframe named df_1 with a column named _c0:

from pyspark.sql import functions as F

# keep only the rows whose first column casts cleanly to an integer
df_1.filter(F.col("_c0").cast("int").isNotNull()).show()
  • [x] 🔧 11. After having exposed the problem, consider adding the "Data Problem #:" format to diagnosed data problems. This will make it easier to connect it back to "Observation #:" and to solutions down the line.

Cleaning up our data

  • [x] 🔧 12. Consider setting this section in Italic:

Note: We're doing things in a more difficult fashion than is absolutely necessary in order to illustrate options. The Spark CSV reader has an option for a comment property, which actually defaults to skipping all rows starting with a # character. That said, it only supports a single character - consider if you were looking for multi-character options (i.e., a // or /* from C-style syntax). This feature is also only available in newer versions of Spark, whereas our method works in any of the 2.x Spark releases.
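For reference, a sketch of what leaning on that built-in option might look like, assuming a Spark version that supports it (path and separator as quoted elsewhere in this thread):

# Ask the CSV reader itself to skip rows beginning with '#' instead of filtering them out afterwards
titles_df = spark.read.csv('/tmp/netflix_titles_dirty*.csv.gz', header=False, sep='\t', comment='#')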

Checking column counts

  • [x] 🔧 13. Consider using bullet points when outlining the functions used here and more accurate inline formatting for functions like split().

Creating typed columns

  • [x] 🔧 14. Consider adding more markdown here where you introduce an example, for example:

... The .getItem() method returns the value at the specified index of the list column (i.e., of the split column). For example, here is this applied to an example DataFrame df_1 with a list column named list_column containing 3 elements, as seen in the table below:

[image: example df_1 table]

df_1 = df_1.withColumn('id', df_1.list_column.getItem(0).cast(IntegerType()))

I added a markdown table here to illustrate the before / after scenarios.
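For readers following along, a self-contained sketch of the split-then-getItem pattern being described; the column names raw, list_column and id are illustrative rather than the notebook's actual names:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Split the raw tab-separated string into a list column, then pull out the first element and type it
df_1 = df_1.withColumn('list_column', F.split(df_1['raw'], '\t'))
df_1 = df_1.withColumn('id', df_1.list_column.getItem(0).cast(IntegerType()))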

  • [x] 🔧 15. I get an error here stating "NameError: name 'IntegerType' is not defined"

Fixed - this was a missing import.

Even more cleanup

  • [x] 🔧 📣 16. Worth noting here that this is a categorical data column that should have a set number of categories, and that collapsing categorical values is a common data cleaning problem (a small illustration follows below).
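As one possible illustration of that point (the column name and category values are made up for the example, not taken from the notebook), collapsing stray variants of a categorical value can be done with a conditional rewrite of the column:

from pyspark.sql import functions as F

# Map stray spellings of a hypothetical 'type' column onto the expected categories, leaving other values untouched
df_1 = df_1.withColumn(
    'type',
    F.when(F.col('type').isin('movie', 'MOVIE', 'Movie'), 'Movie')
     .when(F.col('type').isin('tv show', 'TV Show'), 'TV Show')
     .otherwise(F.col('type'))
)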

UDF

  • [x] 🔧 17. I get another error similar to 15 but about StringType
  • [x] 🔧 18. In the markdown introducing the UDF - consider adding a "dummy" example similar to what I've been proposing with df_1 in 14.

Done - the example may be a bit long. Please let me know if I should shorten it.
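For reference, a minimal dummy UDF along the lines being discussed (a sketch, not the notebook's actual example; the title column is illustrative, and the StringType import is the one mentioned in 17):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# A trivial UDF that trims whitespace and title-cases a string column
def clean_title(value):
    return value.strip().title() if value is not None else None

clean_title_udf = F.udf(clean_title, StringType())
df_1 = df_1.withColumn('title', clean_title_udf(F.col('title')))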

Saving data for analysis / further processing

  • [x] 🔧 19. Would it be possible to have some code that lets students download their files immediately? I tried using this but it doesn't work.

Heh, this was tricky on a couple of fronts. First, when you save a "file" in Spark, it actually creates a directory with that name and saves all the partitions of data into their own separate files, along with a couple of metadata files. These internal files have names like part-00000-2ae69860-b62a-4086-966b-d59fe32777fe-c000.snappy.parquet. Beyond that, Parquet files aren't readable in normal desktop tools, so even if students could download them, it wouldn't be much use.
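To illustrate that behaviour (a sketch with made-up paths, not the notebook's actual code): writing a DataFrame produces a directory of part files, and coalescing to a single partition is one common way to at least end up with a single data file inside that directory, at the cost of parallelism:

# Creates a directory named /tmp/netflix_titles_clean.parquet containing part-*.snappy.parquet files plus metadata
titles_df.write.parquet('/tmp/netflix_titles_clean.parquet', mode='overwrite')

# Coalescing to one partition yields a single part file inside the output (still a directory, not a plain file)
titles_df.coalesce(1).write.parquet('/tmp/netflix_titles_clean_single.parquet', mode='overwrite')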

As such, I did the following:

Thanks again for all the feedback and for making this a much better notebook!

Mike

adelnehme commented 4 years ago

Hi @mmetzger :wave:

Thanks for implementing the feedback -- it's looking really, really solid! I've taken the liberty of taking the solution notebook and creating the student notebook out of it -- here is a list of relevant links for you:

I recommend making one small change to both of these notebooks:

mmetzger commented 4 years ago

Hi @adelnehme,

A clarification on this one - are you ok with the layout of the tasks being added to as the notebook progresses / the tasks being solved somewhat out of order? The modifications I've done to this dataset are as much raw formatting issues as they are internal data cleaning steps, which isn't perfectly conducive to finding all the problems at first, then solving them one by one.

Thanks!

Mike

adelnehme commented 4 years ago

Hi @mmetzger :wave:

It's fine by me if the layout of the tasks is somewhat out of order - especially since we are discovering problems in the dataset and labeling them Problem # immediately as we go through the session. I wanted to use the Problem # sub-sub-sub section simply as a clearer way to connect for students which exact problem we're trying to solve. I think the fact that it's a bit out of order is a great opportunity for you to explain how cleaning data using PySpark can be a bit messy sometimes, and that discovering more data problems often happens while we solve for one data problem.

Let me know once the feedback is implemented into the final notebooks - and I'll generate the links 😄

Cheers,

Adel

mmetzger commented 4 years ago

Hi @adelnehme,

I think I've gotten the primary solution notebook updated in the method you're describing - please let me know if this meets the requirement or if I'm still missing something. Once verified, I'll update the student version.

Thanks!

Mike

adelnehme commented 4 years ago

Hi @mmetzger - I don't seem to see any changes regarding the Problem notation. However, given it's an extremely minor issue and the notebook already looks pretty spectacular - I suggest keeping it as is 😄 I would just make sure to verbally connect back a problem you are solving during the session to when it was diagnosed whenever you see fit. For example, you can mention something along the lines of "So if you remember while taking a look at the data in this section we found that there was X problem - now we're going to solve it by doing Y". I took the liberty of creating the final links for the session notebooks here - I will also be handing them over to Kelsey.