DS4PS / cpp-526-sum-2020

Course shell for CPP 526 Foundations of Data Science I for Summer 2020.
http://ds4ps.org/cpp-526-sum-2020/

Lab 02 - Question 3 - Proportion of Commercial built after 1980 #8

Open pharri14 opened 4 years ago

pharri14 commented 4 years ago

I have been staring at this line of code for over an hour trying to get the correct output and feel as though I am missing something simple.

I know that there are 16 commercial buildings that have been built since 1980 with the code:

these.3.3 <- downtown$yearbuilt > "1980" & downtown$landuse == "Commercial"

sum( these.3.3, na.rm = T)

I also know that there are 209 commercial properties in the dataset with the code:

these.3.2 <- downtown$landuse == "Commercial"

sum(these.3.2, na.rm = F)

The issue I am having is showing the calculation of the proportion of commercial buildings that have been built since 1980 without having my code seem so derivative:

these.3.2 <- downtown$landuse == "Commercial"
these.3.3 <- downtown$yearbuilt > "1980" & downtown$landuse == "Commercial"

total.commercial <- sum(these.3.2, na.rm = F)

commercial.1980 <- sum( these.3.3, na.rm = T)

(commercial.1980 / total.commercial) * 100

gzbib commented 4 years ago

Hey,

I am not sure you need to code all of this to get the proportion. I only used the mean() function, and I got the same map as illustrated in the question, so I think that's it. Maybe we should wait for Dr. Jamison to give us his feedback.

jamisoncrawford commented 4 years ago

Hi @pharri14 and @gzbib - @pharri14, I believe you have the answer you seek. It looks as though you've fleshed out the logic and rationale very well and, indeed, this does take three arithmetic calculations:

  1. sum() of commercial & yearbuilt greater than or equal to 1980
  2. sum() of commercial
  3. / (division) using (1) as the numerator and (2) as the denominator

In fact, this is quite succinct! I would only recommend that you reconsider your naming conventions for the code to be even more decipherable. Rather than naming objects after the components of a question, it could appear less derivative if it is something like:

(com_80 / com_all) * 100

Alternatively, you can do it in one fell swoop, if I may use your code:

sum(downtown$yearbuilt > "1980" & downtown$landuse == "Commercial", na.rm = TRUE) /
sum(downtown$landuse == "Commercial", na.rm = TRUE) * 100
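To see the three steps listed above on something small, here is a minimal sketch. It assumes a made-up data frame, toy, standing in for downtown; the object names (toy, n_com_80, n_com, pct_com_80) and the values are hypothetical and only there for illustration.

```r
# Hypothetical toy data standing in for the real `downtown` dataset
toy <- data.frame(
  landuse   = c("Commercial", "Commercial", "Commercial", "Commercial",
                "Residential", "Residential"),
  yearbuilt = c(1975, 1980, 1995, 2005, 1990, 1960)
)

# Step 1: count commercial properties built in 1980 or later
n_com_80 <- sum(toy$landuse == "Commercial" & toy$yearbuilt >= 1980, na.rm = TRUE)

# Step 2: count all commercial properties
n_com <- sum(toy$landuse == "Commercial", na.rm = TRUE)

# Step 3: divide (1) by (2) and express the result as a percentage
pct_com_80 <- (n_com_80 / n_com) * 100
pct_com_80  # 75
```

Here 3 of the 4 commercial rows were built in 1980 or later, so the result is 75.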

Another approach might be similar to @gzbib's with mean(), but you may want to isolate "Commercial" properties, first, from the rest of the downtown dataset.

dt_com <- downtown[downtown$landuse == "Commercial", ]
mean(dt_com$yearbuilt >= 1980, na.rm = TRUE)
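As a rough check that mean() on the subset matches the count-based calculation, here is a small sketch. It assumes a made-up data frame, toy, rather than the real downtown data; toy, toy_com, prop_mean, and prop_sum are hypothetical names.

```r
# Hypothetical toy data standing in for the real `downtown` dataset
toy <- data.frame(
  landuse   = c("Commercial", "Commercial", "Commercial", "Commercial",
                "Residential", "Residential"),
  yearbuilt = c(1975, 1980, 1995, 2005, 1990, 1960)
)

# Keep only the commercial rows
toy_com <- toy[toy$landuse == "Commercial", ]

# The mean of a logical vector is the share of TRUE values
prop_mean <- mean(toy_com$yearbuilt >= 1980, na.rm = TRUE)

# The same proportion computed from two counts on the full data
prop_sum <- sum(toy$landuse == "Commercial" & toy$yearbuilt >= 1980) /
  sum(toy$landuse == "Commercial")

prop_mean  # 0.75
prop_sum   # 0.75
```

Both give 0.75 because, once the data are limited to commercial rows, the denominator is the number of commercial properties either way.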

Is this helpful? (I think it might be a good thing if you believe this should be more difficult than it is!).

gzbib commented 4 years ago

Hi Sir,

Can you explain further why I need to isolate the Commercial properties and use the subset?

jamisoncrawford commented 4 years ago

Sure, @gzbib! That's just one approach.

Technically, both approaches are isolating "Commercial" properties. Let's look at the question:

What proportion of commercial properties are built since 1980?

Regardless of whether we use relational operators like >=, >, or == to determine which properties are both "Commercial" and yearbuilt (during or) later than 1980, we're only looking at "Commercial" properties. Hence, we're using a subset (downtown$landuse == "Commercial") of a larger dataset (downtown).

Using the technique I showed above, i.e.

dt_com <- downtown[downtown$landuse == "Commercial", ]

...we're explicitly making the dataset smaller (subsetting) and storing that subset in an object (dt_com).

The other methods above look at the same subset of data but don't explicitly reduce the size of the dataset to only include "Commercial" properties. Is that helpful?

jamisoncrawford commented 4 years ago

Put another way, imagine we had the records of all students at ASU. We want to know which students enrolled in CPP 526 during Summer, 2020 are over the age of 30.

In this case, we take a subset of ASU's total student population, i.e. only students enrolled in CPP 526. Then, we determine the proportion of this course's students who are over 30 years old. We're not looking at all ASU students over 30, just the ones in CPP 526 - so it's ultimately a proportion of a smaller dataset (subset).
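The student analogy can be sketched in a few lines of R. Everything here is assumed for illustration: the students data frame, its course and age columns, and the values are made up. The sketch contrasts the proportion within the course with the proportion taken over all students.

```r
# Hypothetical student records; course labels and ages are made up
students <- data.frame(
  course = c("CPP 526", "CPP 526", "CPP 526", "CPP 526",
             "Other", "Other", "Other", "Other"),
  age    = c(24, 31, 35, 42, 22, 45, 19, 28)
)

# Proportion of CPP 526 students over 30 (denominator: the 4 enrolled students)
cpp <- students[students$course == "CPP 526", ]
prop_in_course <- mean(cpp$age > 30)

# Share of ALL students who are in CPP 526 and over 30 (denominator: all 8 students)
prop_of_all <- mean(students$course == "CPP 526" & students$age > 30)

prop_in_course  # 0.75
prop_of_all     # 0.375
```

The numerator is the same three students in both cases; only the denominator changes, which is why the two numbers differ.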

gzbib commented 4 years ago

Thank you Sir for your fast reply!

But what confuses me is this: if both ways are correct, why did I get different answers?

The one I used earlier is the following:

these <- (downtown$landuse == "Commercial" & downtown$yearbuilt >= 1980)
mean(these, na.rm = T)

However, when I partitioned the dataset into a smaller one, the result was different.

I thought we might subset the data to make things faster, so that we would not have to go through the whole dataset to check both conditions at the same time. However, I didn't think it would make a difference in the answers.

Sorry for the inconvenience.