Feedback Module 3: Round 1

I ~may do another read through Monday morning but Module 3 is done~ have done another read-through and I think I am ready for round 1 feedback (note: I will be implementing round 2 Module 2 feedback first):

Link
It has 25 exercises.

FYI I have plans to review this on Thursday.

1

[x] 1.2: They probably don't know who Hadley Wickham is. Maybe "Renowned data scientist and software developer Hadley Wickham", or something?
[x] "A tidy data is one that is satisfied by these three criteria:" -> "Tidy data satisfies the following three criteria:"
[x] "Source" -> "Image Source" ?
[x] "data frame" -> "dataframe"
[x] 1.3 "be structured" -> "is structured"
[x] "This standard now sets precedent for the input arguments" -> "This approach allows us to standardize input arguments"
[x] 1.5-1.7 and again later on: move the criterion definition from italics up to being part of the heading, so it's harder to miss
[x] 1.8: change period to colon
[x] 1.10: let's give a concrete example of why it would be problematic to work with. Give a task like finding the average calories or maybe something more compelling.

2

[x] Bold not in Q2

3

[x] Can you change True/False to Yes/No?
[x] The "error message" has a grammar issue - needs to be split into 2 sentences.
[x] Can you make it something more obvious? I actually got this one wrong. Could they be strings "a=122, d=120" or something?

4

[x] Got this one wrong as well 😢 Can you make it more obvious by having the repeated pokemon names next to each other?

5

[x] 5.2: they provide more than 2 functions. change it to "We'll explore two useful pandas functions for reshaping data:"
[x] I think it would be useful to define "wide" and "long" earlier, in Exercise 1, so they are ready with these concepts. It's too much cognitive load to understand melt/pivot without already knowing wide/long.
[ ] I don't find the animation helpful, personally. Do you? It would be much easier for me if it used data we are familiar with, e.g. cereal, but at this point I don't think that would be a good use of time. Or, at least if you want to keep the current animation, move it much later. It could be useful as a summary, but is way too hard to understand before the person knows about melt/pivot.
[ ] 5.3: needs reordering. First show the dataframe. That sets the context in the learner's mind. Then start talking about pivot. Also, I'm quite confused, because the dataframe shown looks like something violating Criterion 1 but you talk about Criterion 2. What we want here is 1. Show long df. 2. Talk about how/why it is long and violates Criterion 1. 3. Show code and result of pivot. 4. Talk about it.
[ ] 5.4: again, going back to the same teaching strategy: be concrete, not abstract. Here, show a dataframe. Then, for each argument, give an example that corresponds to the df we're looking at. People hardly ever understand anything abstract like this unless it comes after the concrete.
[ ] 5.6: The text at the bottom is very very slightly cut off. Can you remove some of the whitespace under the animation? Again, I'm really fairly opposed to these animations, but it might just be me. I feel like they only make sense to someone who already knows what we are trying to teach. It's too fast and I can't follow. If I were to use one of these, I'd first show the input and output and give the reader as long as they want to try to match these up in their heads. Then on the next slide I'd show the animation, which they can use to check if their mental model matches up with the correct one. So back to the earlier point, I'd use these for summarization or sanity checking but not learning.
[ ] 5.7: this is exactly what I was suggesting! And furthermore, this is way better because it's a dataset they are used to seeing, which is very important. The student will stare at this slide for several minutes and by the end they will understand pivot more so than from all the preceding slides. This slide is where the magic happens.
[x] nutrition -> Nutrition and same with calories and protein. Oh, actually it's nutrition in the real data. So maybe you can fix the figures to match the data (lower case) ?
[ ] 5.10: showing them both on the same slide: Yes!
[ ] OK, so slides 5.7-5.10 are so much more effective than 5.2-5.6. I would throw away as much of 5.2-5.6 as you can (some of it feels redundant, like the multiple animations showing roughly the same thing), and move the rest to after 5.7-5.10.
[ ] The reset_index thing is indeed pretty confusing, and I feel like it's one thing I still don't understand by 5.11 (some of this is new to me so I can be a genuine learner here). I'd definitely put all the reset_index stuff after the main points. Can you explain it more though? What happens if you don't do it? Why? Are you sure we need it? Actually, I think what we need is for reset_index to be explicitly covered in detail separately from, and before, any of this pivot stuff. The problem is trying to learn them at the same time. I don't remember covering this earlier - or if so, not in detail.
[ ] 5.11: "wil" -> "will"
[ ] 5.12: as usual, I'd prefer to show the before df as well, not just after. Supposedly you can only keep ~5 things in your mind at once. So we want to rely on memory sparingly.
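
For reference on the reset_index question above, here is a minimal sketch of what happens with and without it after a pivot. The dataframe is a made-up toy (the names and values are illustrative assumptions, not the course's cereal file).

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
cereal_long = pd.DataFrame({
    "name": ["Special K", "Special K", "Apple Jacks", "Apple Jacks"],
    "nutrition": ["calories", "protein", "calories", "protein"],
    "value": [110, 6, 110, 2],
})

# index   -> column whose values become the row labels
# columns -> column whose values become the new column headers
# values  -> column whose values fill the cells
cereal_wide = cereal_long.pivot(index="name", columns="nutrition", values="value")

# Without reset_index, 'name' is the index rather than a regular column
print(cereal_wide.index.name)          # name
print("name" in cereal_wide.columns)   # False

# reset_index moves 'name' back into an ordinary column
# and restores the default 0, 1, ... row labels
cereal_wide_reset = cereal_wide.reset_index()
print("name" in cereal_wide_reset.columns)  # True
```

So skipping reset_index is not an error; it only means that later code selecting the name column with square brackets would fail, because name is a row label at that point.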

7

[x] We should show them the lego df before asking them to do something with it
[x] For this, and/or Exercise 8, could we have a "part 2" where they do something with the tidy data? It would be more motivating to see that the tidy data is easier to work with than the untidy data, rather than just doing it because you asked.

9

[ ] 9.2: you already know what I'm going to say here!
[ ] 9.3: the "similarly" might be confusing because pivot is for Criterion 1. Reword this?
[ ] Any particular reason to use a different dataset here and in Ex 5? I think there would be a benefit to using the same one. In general we seem to be jumping around a lot with datasets - I see there is cereal, chocolate, candy, and more.
[ ] A bunch of my comments from Exercise 5 apply here, I won't re-write them.
[ ] 9.7: mention chocolate before white_chocolate to be consistent with the order in the df
[ ] 9.7-9.8: I would first show the goal (what you want the end result to look like) before going into these details.
[ ] 9.8: I've never used melt, but this syntax seems really sloppy. Is there no better way? If the df had 100 columns you really wouldn't want to list them all out as id_vars.
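
On the 100-columns concern, one possible workaround is to build id_vars from the column list instead of typing it out. This is only a sketch: the dataframe, its column names, and the var_name/value_name labels are hypothetical.

```python
import pandas as pd

# Hypothetical wide dataframe; column names are illustrative only
candy = pd.DataFrame({
    "name": ["Bar A", "Bar B"],
    "weight": [40, 52],
    "chocolate": [1, 0],
    "white_chocolate": [0, 1],
})

# The few columns we actually want to melt
value_vars = ["chocolate", "white_chocolate"]

# Every other column is kept as an identifier, however many there are
id_vars = [col for col in candy.columns if col not in value_vars]

candy_long = candy.melt(
    id_vars=id_vars,
    value_vars=value_vars,
    var_name="chocolate_type",
    value_name="present",
)
print(candy_long)
```

The reverse case is simpler: when only id_vars is passed, melt treats every remaining column as a value column, so nothing else has to be listed.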

10

[x] Q3: "transforms" -> "transform"

11

[x] Again, can we show the original df somehow?

12

[ ] 12.3: need a semicolon or period before "however". also "dataframe" -> "dataframes"
[ ] 12.5: again, if we covered set_index, it was probably brief as I've forgotten.
[ ] Given that we're really getting into indexes, I believe even more strongly now that a serious section on indexes would be useful somewhere, perhaps at/near the start of Module 3.
[ ] I'm a bit confused by the difference between what I'm seeing in 12.6 and what I saw in 12.4. These are both presented as a multi-index situation but they look quite different.
[ ] 12.7: change period to comma before "One"
[ ] 12.8: heh, what I was wondering is what the point of this is. It would be good to show the motivation before too long into the lesson.
[ ] 12.11: this is getting pretty intense. In the meeting I'd like to discuss whether we need to cover this level of detail in the course. I feel like we might be getting into the territory of diminishing returns.

13

[x] Q1: "in" -> "by" ?
[ ] Q2: I'm confused by the notion of stacked/unstacked vs. long/wide. Let's discuss this.

15

[ ] The usual: would like to see the original df. I wonder how this issue didn't come up in Modules 1 and 2?

17

[x] 17.2 I'd remove this sentence: " Single dataframes can be great to see all your data in one convenient place, however, this is less convenient when it comes to storage space"
[x] I'd remove the mention of companies. Organizations?
[x] 17.3: "where" -> "which"
[x] 17.4: remove the first "our"
[x] 17.5 "identical as" -> "identical to"
[x] 17.6 maybe clarify that axis=0 refers to rows and axis=1 refers to columns. Have they seen this before?
[x] 17.9: Kinder Bueno 😋
[ ] 17.11: "only the rows" - I guess the wording here assumes axis=1. That is fine but not sure if it should be clarified? This actually comes up in Exercise 18 as well.
[x] I like what's happening here, it is pretty clear. But I wonder if it might be more clear if we use even smaller dataframes. Like, just a couple columns, and just a few rows each. I'm not sure if this is worth doing, just have a feeling it would be even more clear that way.
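
To make the axis=0 versus axis=1 point concrete with the kind of very small dataframes suggested above, here is a sketch. All names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical tiny dataframes
top = pd.DataFrame({"name": ["Bar A", "Bar B"], "weight": [40, 52]})
bottom = pd.DataFrame({"name": ["Bar C"], "weight": [45]})
extra = pd.DataFrame({"calories": [200, 250]})

# axis=0 stacks vertically: more rows, same columns
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1 glues side by side: same rows, more columns
side_by_side = pd.concat([top, extra], axis=1)

print(stacked.shape)       # (3, 2)
print(side_by_side.shape)  # (2, 3)
```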

19

Hint that they should use .shape?

20

[x] 21.3: "act as connect" -> "act as the connection"
[x] Candy -> candy
[ ] 21.4: need a period after 1st sentence
[ ] period or semicolon before "however"
[ ] 21.5: merge -> `merge`
[ ] 21.6: again, I think this is unnecessary jumping between datasets. If we're doing candy bars, let's stick with that.
[ ] 21.6: I'd show the line of code here (currently on next slide) for reference
[x] 21.8: as mentioned earlier, this might be easier to digest with smaller dataframes.
[ ] Am I correct in saying that merge has overlapping functionality with concat(axis=0) but not with concat(axis=1) ? If so, should we make this more explicit?
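
One way to check the overlap question is to try it on toy data. In the sketch below (invented values), an index-based merge gives the same table as concat with axis=1, since both add columns and line rows up by label. Vertical concat (axis=0) appends rows, which merge does not do. The left_index/right_index arguments are the same ones the Exercise 24 solution relies on.

```python
import pandas as pd

# Hypothetical dataframes sharing row labels, deliberately in a different order
left = pd.DataFrame({"weight": [40, 52]}, index=["Bar A", "Bar B"])
right = pd.DataFrame({"calories": [250, 200]}, index=["Bar B", "Bar A"])

# Horizontal concat aligns rows on the index labels
via_concat = pd.concat([left, right], axis=1)

# Merging on both indexes matches rows by the same labels
via_merge = left.merge(right, left_index=True, right_index=True)

# Sort the rows first so the comparison does not depend on row order
same = via_concat.sort_index().equals(via_merge.sort_index())
print(same)
```

The two differ once the labels do not fully match: concat keeps unmatched rows by default, while merge drops them unless how is changed from its default of "inner".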

23

[ ] Did not review the MC, the code was taking a super long time to load/run from Binder (specifically, it was stuck on Loading...)

24

[ ] The solution uses right_index=True. Have they learned this?
[ ] in the comment, thats -> that's
[ ] formatting issue: store_inventory_details should be `store_inventory_details`

Overall, I know you put a ton of work into this, and it shows, so you might not be happy to hear this, but I feel this module needs more work than the previous ones I reviewed (see comments above). In particular, I feel Exercises 5 and 9 need quite a bit of work, and that we might need a new Exercise on indexes. That being said, it's a great start and we are making progress. Don't be discouraged!

I'm confused by the notion of stacked/unstacked vs. long/wide. Let's discuss this.

Pivot/melt and stack/unstack can do the exact same thing. Some people prefer stack/unstack over pivot and melt. I included this in the module because I generally tried to include everything that the Python part of DSCI 523 had (except, of course, with more!). Tom talked about stacking/unstacking for ~10 mins in one of his lectures, and I know that multi-indexing came up in Imbellus’s take-home assignment. I think it’s important to include because of the preference some companies have. That being said, I also think we should still teach pivot/melt since it’s a bit clearer for beginners, so now I am conflicted on where we should go with it.
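
A small sketch of the equivalence, using a made-up long dataframe (the values are illustrative assumptions): pivot in one call versus set_index followed by unstack.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
cereal_long = pd.DataFrame({
    "name": ["Special K", "Special K", "Apple Jacks", "Apple Jacks"],
    "nutrition": ["calories", "protein", "calories", "protein"],
    "value": [110, 6, 110, 2],
})

# Route 1: pivot goes from long to wide in one call
wide_pivot = cereal_long.pivot(index="name", columns="nutrition", values="value")

# Route 2: build a two-level index, then unstack the inner level into columns
wide_unstack = cereal_long.set_index(["name", "nutrition"])["value"].unstack()

same = wide_pivot.equals(wide_unstack)
print(same)
```

The unstack route is where the multi-index shows up explicitly, which may be why it feels heavier for beginners.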

I have a great YouTube video resource that explains this well in a Jupyter notebook.

We can discuss this in our meeting tomorrow.

@mgelbart OK! Buckle up! I edited and made the changes you suggested and revamped 4 sections.

pivot
melt
concat
merge

They are quite different but luckily the exercises could stay relatively constant.

Hopefully this works a lot better. I made some new viz for melt and pivot and removed the ones you did not like. I also fixed the gifs for concat and merge.

There are now 20 exercises.

1

[x] 1.12 Nutrition -> nutrition

2

[x] 2.2 is NA a thing in Python or is that from R?
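
On the NA question: as far as I know, core Python has no value spelled NA (that spelling is R's). Python has None, and pandas marks missing entries as NaN, with pd.NA also available in pandas 1.0 and later. A quick sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Both np.nan and None end up as missing values in a numeric Series
scores = pd.Series([1.5, np.nan, None])

print(scores.isna().tolist())  # [False, True, True]

# pd.NA is pandas' own missing-value marker
print(pd.isna(pd.NA))          # True
```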

3, 4

[x] I think these two got mixed up. The solution to 3 refers to a column special_attack_defense that I don't see in the data. But I do see it in 4. I think the dataframes are swapped.

5

[ ] 5.3: This slide makes it seem like long is more tidy than wide. But that's not true. In Exercise 1 we have the cereal data where the long version is untidy and the wide version is tidy. So, I think we need to make this a bit clearer. The most amazing thing would be if you can come up with a single example and 3 formats: too long, just right, and too wide. Is that doable? I think it also depends on the application. Because, for this chocolate bar dataset, I'd actually prefer the "too wide" format if I was doing supervised learning. It really depends what you're doing. So maybe an alternative to my 3 formats suggestion is to have 2 formats and 2 questions, one question where the wide format is tidy and one where the long format is tidy? 🤔 Also, I don't love the detour from cereals to chocolate bars, but I can live with it if needed.
[x] 5.4: going along the same lines, the wording here makes it seem like wide is tidy and long is not. Let's just say pivot() can be used to convert a long dataframe into a wide dataframe.
[ ] 5.6: When you explain each argument, I think it would be more useful to explain what it does in general, without making specific reference to name and nutrition and value. I got a bit confused with the current version, because the argument names and the column names both appear in code font and it's a bit ambiguous what is what, at least without thinking carefully. Also, if possible, I would love to show both dataframes here. The problem is that the code cereal_long.pivot(index='name', columns='nutrition', values='value') is referring to column names in the original df, but we can't see it. We need to be able to connect the code to the df on the same slide. This isn't reproducible, but maybe an image would be better, and you can circle those 3 column names? Update: see my comment for 5.8.
[x] 5.7 just FYI the lower df is a bit cut off for me. Also, could we rename tidy_pivot to cereal_wide or cereal_tidy ?
[x] 5.8 I like this! What if you also showed an image of the line of code here, and had arrows between the column names in the code and the column names on the left dataframe? Or you could have one slide with just what you have, and then another slide where the code and arrows are added in or something? Update: see my comments for 5.9.
[x] 5.9: Oh there we go! This is great. I can live without the arrows. The text is a bit cut off for me. But I suspect some of that text will get moved into the script rather than staying on the slide as text.
[x] Again, text may go to script. But as it stands, I'd move the text after the code (and just keep something like, "Let's take a brief detour to discuss reset_index(). Here's what it does:" before the code). Remember, start with an example! Human attention spans are very short, people may glaze over reading 2 sentences about reset_index before they see it in action.
[x] 5.11: I can tell I'm being very picky here, but I'll just say what's on my mind. I don't love the idea of referring to the nutrition label when that label is not visible on the current slide. Slides make everything so much harder; I'm used to just flowing through a notebook. Also a typo "tosomething".
[x] The variable name change I suggested earlier should carry through here. For teaching them good habits all these variable names should contain "cereal".
[x] 5.13: I am loving these.
[x] 5.19 "Atribution" -> "Attribution"
[x] 5.22: row -> rows? Also, did we learn drop for rows? I mainly remember it for columns.
[x] Exercise 5 is really long - I suggest putting pivot_table as its own Exercise and adding some interactive stuff in between.
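
If pivot_table does become its own Exercise, one contrast that could motivate it is sketched below. The data is invented and this is an assumption about what the Exercise might cover: pivot refuses duplicate index/column pairs, while pivot_table aggregates them (with the mean, by default).

```python
import pandas as pd

# Hypothetical data with a repeated (name, nutrition) pair
cereal_long = pd.DataFrame({
    "name": ["Special K", "Special K", "Special K"],
    "nutrition": ["calories", "calories", "protein"],
    "value": [110, 120, 6],
})

# pivot cannot decide which of the two calories values to keep
try:
    cereal_long.pivot(index="name", columns="nutrition", values="value")
    pivot_failed = False
except ValueError:
    pivot_failed = True
print(pivot_failed)

# pivot_table aggregates the duplicates instead of raising
cereal_wide = cereal_long.pivot_table(index="name", columns="nutrition", values="value")
print(cereal_wide)
```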

I am calling it a day - will do Exercise 6 onwards at a later time.

5.3: This slide makes it seem like long is more tidy than wide. But that's not true. In Exercise 1 we have the cereal data where the long version is untidy and the wide version is tidy. So, I think we need to make this a bit clearer. The most amazing thing would be if you can come up with a single example and 3 formats: too long, just right, and too wide. Is that doable? I think it also depends on the application. Because, for this chocolate bar dataset, I'd actually prefer the "too wide" format if I was doing supervised learning. It really depends what you're doing. So maybe an alternative to my 3 formats suggestion is to have 2 formats and 2 questions, one question where the wide format is tidy and one where the long format is tidy? 🤔 Also, I don't love the detour from cereals to chocolate bars, but I can live with it if needed.

I think I have an idea for this I would like to show you.

5.6: When you explain each argument, I think it would be more useful to explain what it does in general, without making specific reference to name and nutrition and value. I got a bit confused with the current version, because the argument names and the column names both appear in code font and it's a bit ambiguous what is what, at least without thinking carefully. Also, if possible, I would love to show both dataframes here. The problem is that the code cereal_long.pivot(index='name', columns='nutrition', values='value') is referring to column names in the original df, but we can't see it. We need to be able to connect the code to the df on the same slide. This isn't reproducible, but maybe an image would be better, and you can circle those 3 column names? Update: see my comment for 5.8.

Can we discuss this further? I had a slide that did exactly this, but I amended it based on the comment you made below. I think I am not correctly understanding what you are suggesting.

"5.4: again, going back to the same teaching strategy: be concrete, not abstract. Here, show a dataframe. Then, for each argument, give an example that corresponds to the df we're looking at. People hardly ever understand anything abstract like this unless it comes after the concrete."

5.4 This is what it was before:

5.8 I like this! What if you also showed an image of the line of code here, and had arrows between the column names in the code and the column names on the left dataframe? Or you could have one slide with just what you have, and then another slide where the code and arrows are added in or something? Update: see my comments for 5.9.

Does this mean I can leave slides 5.8-5.9 as is for now (besides making more room for text)?

5.22: row -> rows? Also, did we learn drop for rows? I mainly remember it for columns.

Adding it in!!!

Exercise 5 is really long - I suggest putting pivot_table as its own Exercise and adding some interactive stuff in between.

I'll make changes on Saturday for this.

I've addressed the majority of the issues and will push them all tomorrow. My 2 biggest things I want to confirm are 5.3 and 5.6.

I'll just keep going for now.

6

[x] 6.2: bold not ?

7

[x] LGTM

8

[x] Great! The end result is almost begging for a plot...

9

[x] 🏅 ⭐ !!
[ ] I wonder, though, if we could come up with a compelling use case where melt makes the data tidier. This relates to my earlier comments. Or maybe that's coming, let's see...

11

[ ] I'm very confused by the true/false. Isn't it less tidy now?

12

[x] 12.2: "we use" -> "we will use" (because there are also others)
[x] 12.2: love the analogies
[x] 12.3: I like the animation
[x] 12.7: have they seen ~ before?
[x] 12.7: text is cut off for me
[x] 12.8:
- [x] oh, now we have the tilde. Maybe we should move this to the filtering part of Module 2?
- [ ] start with Tilde (~)
- [x] compliment -> complement
- [x] can't see the result of the last line of code
[x] 12.11: output is cut off
[x] 12.13: mention that we're going back to horizontal concatenation?
[x] 12.14: I think two slides got merged?
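
For the tilde item in 12.8, a minimal sketch of ~ as the complement of a boolean filter. The dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
candy = pd.DataFrame({
    "name": ["Bar A", "Bar B", "Bar C"],
    "chocolate": [1, 0, 1],
})

# Boolean mask: True for the rows that contain chocolate
has_chocolate = candy["chocolate"] == 1

# ~ flips every True/False, selecting the complement of the filter
no_chocolate = candy[~has_chocolate]

print(no_chocolate["name"].tolist())  # ['Bar B']
```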

16

[x] 16.2: dataframe -> dataframes (in 2 places)
[x] 16.4: androws -> and rows
[x] 16.4: at this point it'd be good to give a high-level overview of what type of merging we're going to be working towards - is it horizontal, vertical, something else entirely?
[x] 16.5: this is really well done
[x] 16.6: I think it's better to stick with the candy bars, the cereal was a bit jarring
[x] 16.7: again, really clear and well done
[x] 16.8: start a new sentence before "in the future"; we -> We
[x] 16.10: this could be an opportunity for a non-reproducible figure of this same df, where you circle the 3 parts: present in left only, present in right only, present in both
[x] I really like 16
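
Related to the 16.10 idea of marking the three parts (left only, right only, both): merge can label them itself through its indicator argument. A sketch on invented candy bar data:

```python
import pandas as pd

# Hypothetical dataframes with one shared key value ("Bar B")
candy_left = pd.DataFrame({"name": ["Bar A", "Bar B"], "weight": [40, 52]})
candy_right = pd.DataFrame({"name": ["Bar B", "Bar C"], "calories": [250, 200]})

# how="outer" keeps every row from both sides;
# indicator=True adds a _merge column saying where each row came from
merged = candy_left.merge(candy_right, on="name", how="outer", indicator=True)

print(merged[["name", "_merge"]])
```

The _merge column takes the values left_only, right_only, and both, which map directly onto the three regions a circled figure would show.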

18

[ ] The binder experience isn't very smooth here in general, hmm, oh well.
[x] 18.1: don't -> doesn't
[x] 18.2: group_by -> groupby

19

[x] "Ah, it appears we have multiple rows for some of the same sets." -> that is true, but are they asked to do something which would lead them to this conclusion?
[ ] I didn't review this one that thoroughly.

Discuss:

5

5.3: This slide makes it seem like long is more tidy than wide. But that's not true. In Exercise 1 we have the cereal data where the long version is untidy and the wide version is tidy. So, I think we need to make this a bit clearer. The most amazing thing would be if you can come up with a single example and 3 formats: too long, just right, and too wide. Is that doable? I think it also depends on the application. Because, for this chocolate bar dataset, I'd actually prefer the "too wide" format if I was doing supervised learning. It really depends what you're doing. So maybe an alternative to my 3 formats suggestion is to have 2 formats and 2 questions, one question where the wide format is tidy and one where the long format is tidy? 🤔 Also, I don't love the detour from cereals to chocolate bars, but I can live with it if needed.

Made images. Don't know if they will work. Will show in meeting.

5.6: When you explain each argument, I think it would be more useful to explain what it does in general, without making specific reference to name and nutrition and value. I got a bit confused with the current version, because the argument names and the column names both appear in code font and it's a bit ambiguous what is what, at least without thinking carefully. Also, if possible, I would love to show both dataframes here. The problem is that the code cereal_long.pivot(index='name', columns='nutrition', values='value') is referring to column names in the original df, but we can't see it. We need to be able to connect the code to the df on the same slide. This isn't reproducible, but maybe an image would be better, and you can circle those 3 column names? Update: see my comment for 5.8.

See above comment

9

I wonder, though, if we could come up with a compelling use case where melt makes the data tidier. This relates to my earlier comments. Or maybe that's coming, let's see...

Discussion regarding wording.

11

I'm very confused by the true/false. Isn't it less tidy now?

Not if opacity is considered a single variable, which I've amended now.

12

Moved Tilde to Module 2

18

The binder experience isn't very smooth here in general, hmm, oh well.

😭

19

"Ah, it appears we have multiple rows for some of the same sets." -> that is true, but are they asked to do something which would lead them to this conclusion?

I wrote something to make this a little clearer.

I didn't review this one that thoroughly.

Should I remove it if you were not engaged?

UBC-MDS / programming-in-python-for-data-science

Feedback Module 3: Round 1 #22
