Closed hfboyce closed 4 years ago
FYI I have plans to review this on Thursday.
pandas
functions for reshaping data:"pivot
. Also, I'm quite confused, because the dataframe shown looks like something violating Criterion 1 but you talk about Criterion 2. What we want here is 1. Show long df. 2. Talk about how/why it is long and violates Criterion 1. 3. Show code and result of pivot. 4. Talk about it.pivot
more so than from all the preceding slides. This slide is where the magic happens.nutrition
-> Nutrition
and same with calories and protein. Oh, actually it's nutrition
in the real data. So maybe you can fix the figures to match the data (lower case) ?pivot
is for Criterion 1. Reword this?melt
, but this syntax seems really sloppy. Is there no better way? If the df had 100 columns you really wouldn't want to list them all out as id_vars
.set_index
, it was probably brief as I've forgotten.Candy
-> candy
merge
merge
has overlapping functionality with concat(axis=0)
but not with concat(axis=1)
? If so, should we make this more explicit?
)right_index=True
. Have they learned this?store_inventory_details
Overall, I know you put a ton of work into this, and it shows, so you might not be happy to hear this, but I feel this module needs more work than the previous ones I reviewed (see comments above). In particular, I feel Exercises 5 and 9 need quite a bit of work, and that we might need a new Exercise on indexes. That being said, it's a great start and we are making progress. Don't be discouraged!
I'm confused by the notion of stacked/unstacked vs. long/wide. Let's discuss this.
Pivot/melt and stack/unstack can do the exact same thing. Some people prefer stack/unstack over pivot and melt. I included this in the module because I generally tried to include everything that the Python part of DSCI 523 had (and more, of course!). Tom talked about stacking/unstacking for ~10 mins in one of his lectures, and I know that multi-indexing came up in Imbellus's take-home assignment. I think it's important to include because of the preference some companies have. That being said, I also think we should still teach pivot/melt since it's a bit clearer for beginners, so now I am conflicted about where we should go with it.
I have a great YouTube video resource that explains this well in a Jupyter notebook.
We can discuss this in our meeting tomorrow.
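To make the claim above concrete, here is a minimal sketch of how pivot/melt and stack/unstack can reach the same result. The column names (name, nutrition, value) follow the ones mentioned in this thread, but the dataframe and its numbers are made up for illustration and are not the module's actual data.

```python
import pandas as pd

# Hypothetical long-format cereal data (values invented for illustration).
cereal_long = pd.DataFrame({
    "name": ["Cheerios", "Cheerios", "Trix", "Trix"],
    "nutrition": ["calories", "protein", "calories", "protein"],
    "value": [110, 6, 110, 1],
})

# Long -> wide with pivot: one row per name, one column per nutrition label.
wide_pivot = cereal_long.pivot(index="name", columns="nutrition", values="value")

# Long -> wide with set_index + unstack: move the inner index level to columns.
wide_unstack = cereal_long.set_index(["name", "nutrition"])["value"].unstack()

# Wide -> long with melt: the former column headers become a "nutrition" column.
long_melt = wide_pivot.reset_index().melt(
    id_vars="name", var_name="nutrition", value_name="value"
)

# Wide -> long with stack: move the column level back into the index.
long_stack = wide_pivot.stack().rename("value").reset_index()
```

The two wide results hold the same table, and the two long results hold the same rows (possibly in a different order), which is why the choice between the two families is mostly a matter of preference and of whether learners have met multi-indexes yet.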
@mgelbart OK! Buckle up! I edited and made the changes you suggested and revamped 4 sections.
They are quite different but luckily the exercises could stay relatively constant.
Hopefully this works a lot better. I made some new viz for melt and pivot and removed the ones you did not like. I also fixed the gifs for concat and merge.
There are now 20 exercises.
Nutrition
-> nutrition
NA
a thing in Python or is that from R?special_attack_defense
that I don't see in the data. But I do see it in 4. I think the dataframes are swapped.pivot()
can be used to covert a long dataframe into a wide dataframe.name
and nutrition
and value
. I got a bit confused with the current version, because the argument names and the column names both appear in code font
and it's a bit ambiguous what is what, at least without thinking carefully. Also, if possible, I would love to show both dataframes here. The problem is that the code cereal_long.pivot(index='name', columns='nutrition', values='value')
is referring to column names in the original df, but we can't see it. We need to be able to connect the code to the df on the same slide. This isn't reproducible, but maybe an image would be better, and you can circle those 3 column names? Update: see my comment for 5.8.tidy_pivot
to cereal_wide
or cereal_tidy
?reset_index()
. Here's what it does:" before the code). Remember, start with an example! Human attention spans are very short, people may glaze over reading 2 sentences about reset_index
before they see it in action.nutrition
label when that label is not visible on the current slide. Slides make everything so much harder; I'm used to just flowing through a notebook. Also a typo "tosomething". pivot_table
as its own Exercise and adding some interactive stuff in between. I am calling it a day - will do Exercise 6 onwards at a later time.
5.3: This slide makes it seem like long is more tidy than wide. But that's not true. In Exercise 1 we have the cereal data where the long version is untidy and the wide version is tidy. So, I think we need to make this a bit clearer. The most amazing thing would be if you can come up with a single example and 3 formats: too long, just right, and too wide. Is that doable? I think it also depends on the application. Because, for this chocolate bar dataset, I'd actually prefer the "too wide" format if I was doing supervised learning. It really depends what you're doing. So maybe an alternative to my 3 formats suggestion is to have 2 formats and 2 questions, one question where the wide format is tidy and one where the long format is tidy? 🤔 Also, I don't love the detour from cereals to chocolate bars, but I can live with it if needed.
I think I have an idea for this I would like to show you.
5.6: When you explain each argument, I think it would be more useful to explain what it does in general, without making specific reference to name and nutrition and value. I got a bit confused with the current version, because the argument names and the column names both appear in code font and it's a bit ambiguous what is what, at least without thinking carefully. Also, if possible, I would love to show both dataframes here. The problem is that the code cereal_long.pivot(index='name', columns='nutrition', values='value') is referring to column names in the original df, but we can't see it. We need to be able to connect the code to the df on the same slide. This isn't reproducible, but maybe an image would be better, and you can circle those 3 column names? Update: see my comment for 5.8.
Can we discuss this further? I had a slide that did exactly this, but I amended it based on your earlier comment, quoted below. I don't think I understand correctly what you are suggesting.
"5.4: again, going back to the same teaching strategy: be concrete, not abstract. Here, show a dataframe. Then, for each argument, give an example that corresponds to the df we're looking at. People hardly ever understand anything abstract like this unless it comes after the concrete."
5.4 This is what it was before :
5.8 I like this! What if you also showed an image of the line of code here, and had arrows between the column names in the code and the column names on the left dataframe? Or you could have one slide with just what you have, and then another slide where the code and arrows are added in or something? Update: see my comments for 5.9.
Does this mean I can leave slides 5.8-5.9 as is for now (besides making more room for text)?
5.22: row -> rows? Also, did we learn drop for rows? I mainly remember it for columns.
Adding it in !!!
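For reference on the 5.22 question, a minimal sketch of drop used for columns and for rows. The candy dataframe here is hypothetical and only illustrates the two call forms.

```python
import pandas as pd

# Hypothetical dataframe for illustration only.
candy = pd.DataFrame(
    {"weight": [50, 45, 60], "calories": [250, 210, 280]},
    index=["Snickers", "Twix", "Mars"],
)

# Drop a column: name it with columns=.
no_calories = candy.drop(columns="calories")

# Drop rows: pass index labels with index= (a list drops several at once).
fewer_rows = candy.drop(index=["Twix", "Mars"])
```

Both calls return a new dataframe and leave candy unchanged.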
Exercise 5 is really long - I suggest putting pivot_table as its own Exercise and adding some interactive stuff in between.
I'll make changes on Saturday for this.
I've addressed the majority of the issues and will push them all tomorrow. My 2 biggest things I want to confirm are 5.3 and 5.6.
I'll just keep going for now.
melt
makes the data tidier. This relates to my earlier comments. Or maybe that's coming, let's see...~
before?~
)group_by
-> groupby
Discuss:
5.3: This slide makes it seem like long is more tidy than wide. But that's not true. In Exercise 1 we have the cereal data where the long version is untidy and the wide version is tidy. So, I think we need to make this a bit clearer. The most amazing thing would be if you can come up with a single example and 3 formats: too long, just right, and too wide. Is that doable? I think it also depends on the application. Because, for this chocolate bar dataset, I'd actually prefer the "too wide" format if I was doing supervised learning. It really depends what you're doing. So maybe an alternative to my 3 formats suggestion is to have 2 formats and 2 questions, one question where the wide format is tidy and one where the long format is tidy? 🤔 Also, I don't love the detour from cereals to chocolate bars, but I can live with it if needed.
Made images. Don't know if they will work. Will show in meeting
5.6: When you explain each argument, I think it would be more useful to explain what it does in general, without making specific reference to name and nutrition and value. I got a bit confused with the current version, because the argument names and the column names both appear in code font and it's a bit ambiguous what is what, at least without thinking carefully. Also, if possible, I would love to show both dataframes here. The problem is that the code cereal_long.pivot(index='name', columns='nutrition', values='value') is referring to column names in the original df, but we can't see it. We need to be able to connect the code to the df on the same slide. This isn't reproducible, but maybe an image would be better, and you can circle those 3 column names? Update: see my comment for 5.8.
See above comment
I wonder, though, if we could come up with a compelling use case where
melt
makes the data tidier. This relates to my earlier comments. Or maybe that's coming, let's see...
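One possible use case, sketched under the assumption that "year" is a variable in its own right: when the years sit in the column headers, melt moves them into a single column. The dataframe, its column names, and its numbers are all hypothetical. The sketch also touches on the earlier id_vars concern: the id columns are computed from the columns that should be unpivoted, so they do not have to be listed by hand.

```python
import pandas as pd

# Hypothetical wide data: one column per year, so the variable "year"
# is stored in the column headers instead of in a column of its own.
sales_wide = pd.DataFrame({
    "store": ["A", "B"],
    "city": ["Vancouver", "Victoria"],
    "2018": [10, 7],
    "2019": [12, 9],
    "2020": [15, 8],
})

# Name only the columns to unpivot; every other column becomes an id column.
year_cols = ["2018", "2019", "2020"]
id_cols = [col for col in sales_wide.columns if col not in year_cols]

# Wide -> long: one row per (store, year) pair.
sales_long = sales_wide.melt(
    id_vars=id_cols, value_vars=year_cols, var_name="year", value_name="sales"
)
```

Whether the long result counts as tidier still depends on the question being asked, as discussed for 5.3.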
Discussion regarding wording.
I'm very confused by the true/false. Isn't it less tidy now?
Not if opacity is considered a single variable, which I've amended now.
Moved Tilde to Module 2
The binder experience isn't very smooth here in general, hmm, oh well.
😭
"Ah, it appears we have multiple rows for some of the same sets." -> that is true, but are they asked to do something which would lead them to this conclusion?
I wrote something to make this a little clearer.
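As a rough illustration of why repeated rows matter here: pivot raises an error when an index/column pair occurs more than once, while pivot_table aggregates the repeats. The data below is hypothetical and is not the exercise's dataset.

```python
import pandas as pd

# Hypothetical long data where the ("Castle", "pieces") pair appears twice.
sets_long = pd.DataFrame({
    "set": ["Castle", "Castle", "Castle", "Ship"],
    "measure": ["pieces", "pieces", "price", "pieces"],
    "value": [100, 120, 30, 80],
})

# pivot cannot decide which of the duplicate values to keep, so it raises.
try:
    sets_long.pivot(index="set", columns="measure", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves duplicates with an aggregation function (mean here),
# so the two "pieces" values for Castle are averaged to 110.
sets_wide = sets_long.pivot_table(
    index="set", columns="measure", values="value", aggfunc="mean"
)
```

Asking learners to try pivot first and read the error is one way to lead them to the "multiple rows" conclusion before introducing pivot_table.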
I didn't review this one that thoroughly.
Should I remove it if you were not engaged?
I ~may do another read through Monday morning but Module 3 is done~ have done another read through and I think I am ready for round 1 feedback (note: I will be implementing round 2 Module 2 feedback first) :
Link
It has 25 exercises.