DS4PS / cpp-526-spr-2021

Course shell for Foundations of Data Science I
https://ds4ps.org/cpp-526-spr-2021/

Lab 2 - Q. 4 #4

Open sjone128 opened 3 years ago

sjone128 commented 3 years ago

4 Question: What proportion of commercial properties are built since 1980?

My code returns NA for the "Commercial" land use category. However, if I select a different category, for example "Parking", it works. See the example below.

> result <- downtown$landuse == "Commercial" & downtown$yearbuilt > "1980"
> mean(result)
[1] NA

Versus, replacing commercial with Parking

> result <- downtown$landuse == "Parking" & downtown$yearbuilt > "1980"
> mean(result)
[1] 0.01542416

*Important to note: changing the capitalization of "Commercial" (for example, to "commercial") returns [1] 0

Any help would be really appreciated! Also, I looked for a label for this post and couldn't find it, sorry for not following the format. I hope that I described the problem correctly.

kpalmer7113 commented 3 years ago

@sjone128 The downtown$yearbuilt vector has NA values, so there are probably commercial properties with no data on the year they were built, which is what gives you the NA result. I added na.rm = TRUE as an argument to mean() to remove the NAs and get the proportion. Your code also excludes commercial properties that were built in 1980. I used >= to include properties built in 1980 and after. Hope this helps!
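
A minimal sketch of that fix, assuming a made-up stand-in for the downtown data. Only the column names landuse and yearbuilt come from this thread; the values below are invented for illustration.

```r
# Toy stand-in for the lab's `downtown` data frame (values are made up).
downtown <- data.frame(
  landuse   = c("Commercial", "Commercial", "Commercial", "Parking", "Parking"),
  yearbuilt = c(1975, 1980, NA, 1990, 1960)
)

# TRUE only for commercial properties built in 1980 or later.
# The commercial row with a missing year yields NA, because TRUE & NA is NA.
result <- downtown$landuse == "Commercial" & downtown$yearbuilt >= 1980

mean(result)                # NA, because one element of `result` is NA
mean(result, na.rm = TRUE)  # 0.25: the NA is dropped before averaging
```

Note that FALSE & NA evaluates to FALSE, so a missing year only produces an NA on rows where the land use matches. That is consistent with "Parking" returning a number (presumably no parking rows lack a year) and with the lowercase "commercial" returning 0 (no rows match at all).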

jamisoncrawford commented 3 years ago

@kpalmer7113 thanks for jumping in to help! Top advice! 🔥

@sjone128 this particular problem is addressed in issue #2 and you may find that helpful as you may want to approach this issue differently.

It's absolutely correct that calling mean() on a vector containing NA values returns NA unless you pass na.rm = TRUE as an argument to mean().

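Lastly, putting 1980 in quotes tells R that it is a character string (a different class entirely!). The comparison still runs because, when a relational operator like > or == gets one numeric and one character argument, R "coerces" the numeric side to character and compares the two as strings. That happens to give the expected answer for four-digit years, but it is an alphabetical comparison rather than a numeric one, so it's safer to drop the quotes. Just a heads up!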
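
A small sketch of what that coercion looks like in practice. The variable names are just for illustration, and the numbers are arbitrary.

```r
# Mixed numeric/character comparison: the number is converted to a string first.
four_digit  <- 2021 > "1980"  # TRUE: "2021" sorts after "1980"
three_digit <- 500 > "1980"   # TRUE: "500" sorts after "1980" because "5" comes after "1"
as_numbers  <- 500 > 1980     # FALSE: a true numeric comparison

c(four_digit, three_digit, as_numbers)
```

Writing the year without quotes (yearbuilt >= 1980) avoids the string comparison entirely, provided yearbuilt itself is stored as a number.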

jamisoncrawford commented 3 years ago

@sjone128 please let us know if you crack this!

sjone128 commented 3 years ago

@kpalmer7113 @jamisoncrawford Brilliant advice, thank you both so much! Incorporating both pieces of advice I was able to crack it on my lunch break! What a relief. I also went back to look at the other questions asking for proportions and reworked them to create a subset.
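
For anyone finding this later, here is a rough sketch of one way the subset approach could look. It reuses an invented toy data frame; the names commercial and prop_since_1980 are hypothetical and this is not necessarily the code used in the lab solution.

```r
# Toy stand-in for the lab's `downtown` data frame (values are made up).
downtown <- data.frame(
  landuse   = c("Commercial", "Commercial", "Commercial", "Parking", "Parking"),
  yearbuilt = c(1975, 1980, NA, 1990, 1960)
)

# Keep only the commercial rows, then ask what share were built since 1980.
commercial <- downtown[downtown$landuse == "Commercial", ]
prop_since_1980 <- mean(commercial$yearbuilt >= 1980, na.rm = TRUE)
prop_since_1980  # 0.5: one of the two commercial rows with a known year
```

The difference from the combined condition is the denominator: subsetting first divides by the commercial properties with a known year, while mean() over the combined logical vector divides by every property in the data.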

lecy commented 3 years ago

Fun fact: comparisons between strings (characters) use alphabetical order instead of magnitude.

When you alphabetize things, something is greater if it occurs later in the alphabet. Thus you get these weird behaviors if you are not paying attention to data types:

5 > 100
[1] FALSE
"5" > "100"
[1] TRUE

jamisoncrawford commented 3 years ago

I was getting some interesting behavior with NULL values today @lecy - good reminder that R treats things in very unexpected (but ultimately sensible) ways!