Open adelnehme opened 4 years ago
Hi @adelnehme,
Thank you for the extensive feedback - I think I've implemented everything accordingly - further comments inline.
General Feedback
This is looking like a great notebook! I want to caveat that I'm not a PySpark expert - so a lot of my questions/comments may come from that place and I'll be relying on your expertise for guidance 😄
I've specifically added some portions here that I hope will address some of the oddities of Spark. Originally this class was pitched as a sort of follow-up to my Cleaning Data with PySpark course, so I assumed the learners would have the appropriate level of Spark knowledge. That said, I think the notebook is better for adding it, though there are likely places that are still confusing to someone unfamiliar with Spark. Please let me know if anything seems odd.
- [x] 🔧 A. Make sure to bold all sections, sub-sections and sub-sub sections.
- [x] 🔧 B. In many sections, I gave the feedback of using "Observation #: " and "Data Problem #: " - the reasoning is that it will help you connect problems back to observations when going about your data cleaning. Check out how this is done in the Data Cleaning with Python solution notebook.
I think I implemented this ok, though it does not flow as well as the example above. The primary reason is it's difficult to determine all the problems in one session. Please let me know if my version works, or what should be changed.
Notebook Feedback
Getting Started
I think this code section should remain populated for the empty session (or student) notebook.
- [x] 🔧 1. It could be worthwhile here adding a brief markdown description on `wget`, `ls` and `gunzip` explaining that these are shell commands that we usually perform on the command line. Potentially adding a screenshot of a command line here would be great. I find using bullet point formats is clearer, like this:
  - Brief description of what we're doing and that we will be using shell commands to get the data
  - `wget` does...
  - `ls` does...
  - `gunzip` does...
- [x] 🔧 2. Recommend removing the sub-headers for verifying the 7 files and taking a look at the first 20 rows as all of these steps are part of the getting started section.
- [x] 📣 3. It would be cool to point out what the `*` is doing when using `ls` and that this is the only use of shell expected from us in the session.
- [x] 📣 4. Keep a lookout to point out that the separator in the output of `!gunzip -c /tmp/netflix_titles_dirty_03.csv.gz | head -20` is not a `,` but a tab.
Done for all
Loading our initial dataframe
- [x] 🔧 5. Format the header here as a sub-section (##), and sub-sub-sections within it as sub-sub-sections (###).
- [x] 🔧 6. I find it slightly clearer and less cognitive overload on students to use bullet points to outline arguments. For example:
Let's take a look at what Spark does with our data and see if it can properly parse the output. To do this, we'll first load the content into a DataFrame using the `spark.read.csv()` method. We'll pass in three arguments:
- The path to the file(s)
- The `header` argument, which takes in `True` or `False`. Since our files do not have a header row, we must specify `header=False` or risk a data row being interpreted as a header.
- The `sep` argument, which specifies the field separator. Often in CSV files this is a comma `,`, but in our files it's a `\t` or tab character.
Done
- [x] 🔍 📣 6. Is the command `titles_df = spark.read.csv('/tmp/netflix_titles_dirty*.csv.gz', header=False, sep='\t')` reading all the Netflix movie files? If so, I recommend highlighting that these csv files are being coalesced together through the use of `*`.
Added, but with an initial explanation of how Spark deals with files. Basically, nothing happens here beyond Spark checking that the files exist - i.e., they're not read until some action is performed. It's a bit hard to illustrate with this size of dataset and in this environment, but hopefully the explanation works. Otherwise I should be able to elaborate in the Q&A.
Initial analysis
- [x] 🔧 7. Use inline formatting around `False`.
- [x] 🔍 8. It strikes me that the initial analysis phase is about diagnosing some dirty data problems, which are:
- Identifying corrupt rows that start with comments and keeping in mind the field count as well
- Identifying that the column names need to be changed
- Identifying incorrect data types
With that in mind, I think we should use the "Observation #: " format, where after each dirty data problem we diagnose, we add "Observation #: Diagnosis". For example, after having run `titles_df.show(150, truncate=False)`, we can insert in markdown directly afterward:
Observation 1: Taking a look at the first column, most entries are numeric IDs. If we scroll through the data we see at least one random text entry. Something must be out of the ordinary.
The same can be done after each diagnosis step you take.
Bypassing CSV interpreter
- [x] 🔧 9. Consider adding a cell where you just count the entire dataframe and see the total rows - preceded by a markdown cell slightly touching upon the `.count()` method (1 sentence really) at the beginning of this section, not at the end.
- [x] 🔧 10. After having counted the dataframe, it could be worth adding a markdown cell stating:
"Let's count correct instances of the first column by casting it as an integer and trying to find all values that are not null. To do this, we will use the `functions` submodule from the `pyspark.sql` module to filter on incorrect data. We can reference a particular column using the `col()` function, change its type by chaining it with `.cast()`, and find non-null values by chaining with `.isNotNull()` - all of this is couched inside the DataFrame `.filter()` method. Here's how it would look when filtering a PySpark dataframe named `df_1` with a column named `_c0`:"

```python
from pyspark.sql import functions as F

df_1.filter(F.col("_c0").cast("int").isNotNull()).show()
```
- [x] 🔧 11. After having exposed the problem, consider using the "Data Problem #: " format for diagnosed data problems. This will make it easier to connect them back to "Observation #:" entries and to solutions down the line.
Cleaning up our data
- [x] 🔧 12. Consider setting this section in italics:
*Note: We're doing things in a more difficult fashion than is absolutely necessary to illustrate options. The Spark CSV reader has an option for a comment property, which actually defaults to skipping all rows starting with a `#` character. That said, it only supports a single character - consider if you were looking for multi-character options (ie, a `//` or `/*` from C-style syntax). This feature is also only available in newer versions of Spark, where our method works in any of the `2.x` Spark releases.*

Checking column counts
- [x] 🔧 13. Consider using bullet points when outlining the functions used here and more accurate inline formatting for functions like `split()`.

Creating typed columns
- [x] 🔧 14. Consider adding more markdown here where you introduce an example, for example:
... The `.getItem()` method returns the value at the specified index of the list column (ie, of `splitcolumn`). For example, here is this applied on an example DataFrame `df_1` with a list column named `list_column` containing 3 elements, as seen in the table below:

```python
df_1 = df_1.withColumn('id', df_1.list_column.getItem(0).cast(IntegerType()))
```
I added a markdown table here to illustrate the before / after scenarios.
- [x] 🔧 15. I get an error here stating "NameError: name 'IntegerType' is not defined"
Fixed - this was a missing import.
Even more cleanup
- [x] 🔧 📣 16. Worth noting here that this is a categorical data column that should have a set number of categories, and that collapsing categorical values is a common data cleaning problem.
UDF
- [x] 🔧 17. I get another error similar to 15, but about `StringType`.
- [x] 🔧 18. In the markdown introducing UDFs - consider adding a "dummy" example similar to how I've been proposing with `df_1` in 14, for example.
Done - the example may be a bit long. Please let me know if I should shorten it.
Saving data for analysis / further processing
- [x] 🔧 19. Would it be possible to have some code that lets student download their files immediately? I tried using this but it doesn't work.
Heh, this was tricky on a couple of fronts. First, when you save a "file" in Spark, it actually creates a directory with that name and saves all the partitions of data into their own separate files, along with a couple of metadata files. These internal files have names like `part-00000-2ae69860-b62a-4086-966b-d59fe32777fe-c000.snappy.parquet`. Beyond that, Parquet files aren't readable in normal desktop tools, so even if students could download them, it wouldn't be much use.
As such, I did the following:
- Saved the data as Parquet (`netflix_titles_cleaned.parquet`)
- Used the `!ls` command to see the contents, noting that it's a directory containing separate files
- Wrote the data back out as a single CSV file, then used `!ls` on the contents to see the single file now
- Used `!mv` to rename the file to something meaningful (`/tmp/netflix_titles_cleaned_final.csv`)
- Used the `google.colab.files.download()` function. This seems to work properly on my test system.

Thanks again for all the feedback and for making this a much better notebook!
Mike
Hi @mmetzger :wave:
Thanks for implementing the feedback - it's looking really, really solid! I've taken the liberty of taking the solution notebook and creating the student notebook out of it - here are the relevant links for you:
I recommend doing 1 small change to both of these notebooks:
Hi @adelnehme,
A clarification on this one - are you ok with the layout of the tasks being added to as the notebook progresses / the tasks being solved somewhat out of order? The modifications I've done to this dataset are as much raw formatting issues as they are internal data cleaning steps, which isn't perfectly conducive to finding all the problems at first, then solving them one by one.
Thanks!
Mike
Hi @mmetzger :wave:
It's fine by me if the layout of the tasks is somewhat out of order - especially since we are discovering problems in the dataset and labeling them `Problem #` immediately as we go through the session. I wanted to use the `Problem #` sub-sub-sub section simply as a clearer way to connect each exact problem we're trying to solve for students. I think the fact that it's a bit out of order is a great opportunity for you to explain how cleaning data with PySpark can be a bit messy sometimes, and discovering more data problems often happens while we solve another.
Let me know once the feedback is implemented into the final notebooks - and I'll generate the links 😄
Cheers,
Adel
Hi @adelnehme,
I think I've gotten the primary solution notebook updated in the method you're describing - please let me know if this meets the requirement or if I'm still missing something. Once verified, I'll update the student version.
Thanks!
Mike
Hi @mmetzger - I don't seem to see any changes regarding the Problem notation. However, given it's an extremely minor issue and the notebook already looks pretty spectacular - I suggest keeping it as is 😄 I would just make sure to verbally connect back a problem you are solving during the session to when it was diagnosed whenever you see fit. For example, you can mention something along the lines of "So if you remember while taking a look at the data in this section we found that there was X problem - now we're going to solve it by doing Y". I took the liberty of creating the final links for the session notebooks here - I will also be handing them over to Kelsey.
Hi @mmetzger :wave: Please read the key below to understand how to respond to the feedback provided. Some items will require you to take action while others only need some thought. During this round of feedback, each item with an associated checkbox is an action item that should be implemented before you submit your content for review.
Key
:wrench: This must be fixed during this round of review. This change is necessary to create a good DataCamp live training.
:mag: This wasn't exactly clear. This will need rephrasing or elaborating to fully and clearly convey your meaning.
📣 This is something you should keep a lookout for during the session and verbally explain once you arrive at this point.
:star: Excellent work!