DS4PS / cpp-526-sum-2021

Course shell for CPP 526.
https://ds4ps.org/cpp-526-sum-2021/
MIT License

Lab 02 Solutions - Some clarifications, please! #17

Open Sanaz-27 opened 3 years ago

Sanaz-27 commented 3 years ago

Hello, Dr. @jamisoncrawford & Dr. @lecy, I'm looking at the Lab 02 solutions, and a couple of questions confused me. Can you please clarify?

Q 3.2: What proportion of commercial properties are built since 1980? Your answer: [screenshot: Q3.2 solution]

But isn't this counting ALL buildings since 1980 (commercial & non-commercial)?

Qs about taxes! Q5, Q6... etc: [screenshot: tax questions]

I know it doesn't apply to this dataset, but sometimes taxes can be negative (when you pay extra and keep it in your tax account; I have seen this in lots of cases here in Jordan). So is using (> 0) better in this case than (!= 0), to get the needed results without including those who paid extra? Or is this not something that happens everywhere (paying the extra, I mean), so we should treat each case as different?

Q 7.1: What proportion of commercial properties are delinquent on taxes? I guess I did what most of the students did and used the & operator: [screenshot: my Q7.1 answer]

And I'm a little confused about the answer! So we don't need the properties that are both commercial & delinquent on taxes? we need to identify the commercial first then get all the delinquent on taxes from those only? I guess this is what I got from the answer posted !! but not getting the idea! [screenshot: Q7.1 solution]

Thanks in advance for any clarifications, Best, Sana

jamisoncrawford commented 3 years ago

@Sanaz-27 excellent work parsing out the solutions.

Q3.2

Good catch - yes, this should subset commercial properties first:

dt_coms <- downtown[downtown$landuse == "Commercial", ]
mean(dt_coms$yearbuilt >= 1980, na.rm = TRUE)

Q5 & Q6

There are different ways to approach this depending on your data and circumstances, but checking for "membership" or out-of-range values should be one of your first steps in exploratory data analysis. Since we know there are no negative values here, != 0 works just fine, but what you describe would certainly require some nuance and a simple change to > 0.
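A minimal sketch of that range check, using a small made-up vector in place of downtown$amtdelinqt (the values are illustrative only and do not come from the lab data):

```r
# hypothetical tax balances; the negative value stands for an overpayment
amtdelinqt <- c(0, 250, 0, -40, 1200, NA)

range(amtdelinqt, na.rm = TRUE)     # smallest and largest values
sum(amtdelinqt < 0, na.rm = TRUE)   # how many negative balances?

# the two tests disagree only where values are negative
owes.any  <- amtdelinqt != 0   # counts the overpayment as TRUE
owes.more <- amtdelinqt > 0    # excludes the overpayment
mean(owes.any, na.rm = TRUE)
mean(owes.more, na.rm = TRUE)
```

If the count of negative values is zero, as in this lab, both tests return the same proportion.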

Q7.1

we need to identify the commercial first then get all the delinquent on taxes from those only? I guess this is what I got from the answer posted !! but not getting the idea!

You totally get the idea! You caught my mistake on Q3.2, and this is the same logic: Creating a subset and determining a proportion from that subset!

Hope this helps!

lecy commented 2 years ago

A couple of observations:

Q7.1

Note the nuance in how the questions can be asked and how they translate to logical statements:

Q: What proportion of downtown properties are commercial?

commercial <- downtown$landuse == "Commercial"
mean( commercial ) # proportion of all downtown parcels

Q: What proportion of properties are commercial AND delinquent on taxes?

delinquent <- downtown$amtdelinqt > 0
commercial.and.delinquent <- commercial & delinquent

# one step
commercial.and.delinquent <- downtown$landuse == "Commercial" & downtown$amtdelinqt > 0

# proportion of all downtown parcels
mean( commercial.and.delinquent )

Q: What proportion of commercial properties are delinquent on taxes?

# denominator is not all properties 
sum( commercial.and.delinquent ) / sum( commercial )
# or subset approach
dt_coms <- downtown[ downtown$landuse == "Commercial" , ]
mean( dt_coms$amtdelinqt > 0 )

Pay attention to what should be in the denominator based upon which proportion is requested.
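To make the denominator point concrete, here is a small sketch on a made-up data frame (toy is hypothetical and only borrows the column names from the lab data):

```r
# made-up data: 5 parcels, 3 commercial, 1 of those delinquent
toy <- data.frame(
  landuse    = c("Commercial", "Commercial", "Commercial",
                 "Residential", "Residential"),
  amtdelinqt = c(500, 0, 0, 300, 0)
)

commercial <- toy$landuse == "Commercial"
delinquent <- toy$amtdelinqt > 0

# commercial AND delinquent, out of all 5 parcels: 1/5
p.all <- mean( commercial & delinquent )

# delinquent, out of the 3 commercial parcels only: 1/3
p.com <- sum( commercial & delinquent ) / sum( commercial )

# subset approach gives the same 1/3
p.sub <- mean( toy$amtdelinqt[ commercial ] > 0 )
```

The numerator is the same single parcel in every case; only the denominator changes.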

Delinquent Status

I agree that negative values should not be counted as delinquent, so the greater-than expression might be more appropriate.

downtown$amtdelinqt > 0

It depends on the nature of your data, though. This expression would work with both numeric and character vectors:

downtown$amtdelinqt != "0"

The first, by contrast, gives meaningful results only with numeric vectors. If you have noisy data, where things like dollar signs in the database cause your numeric vector to be converted to a character vector (implicit casting), then the second expression could be slightly less precise but slightly more robust.

x <- c(0,0,100,0)
x
[1]   0   0 100   0
x > 0
[1] FALSE FALSE  TRUE FALSE
x != "0"
[1] FALSE FALSE  TRUE FALSE

x <- c(x,"$50")  # add new value
x 
[1] "0"   "0"   "100" "0"   "$50"
x > 0
[1] FALSE FALSE  TRUE FALSE FALSE  # $50 is not > 0 when using the operator with text !!! 
x != "0"
[1] FALSE FALSE  TRUE FALSE  TRUE  # this is correct 
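One reason the > comparison is unreliable on text: when either side is a character value, R converts the other side to text and compares the strings character by character instead of comparing amounts. A short illustration, separate from the lab data:

```r
# strings are compared character by character, not by numeric size
text.cmp    <- "9" > "10"    # TRUE: "9" sorts after "1"
numeric.cmp <- 9 > 10        # FALSE: a true numeric comparison

# the number 20 is converted to "20" before comparing;
# "1" sorts before "2", so this is FALSE even though 100 > 20
mixed.cmp <- "100" > 20

c( text.cmp, numeric.cmp, mixed.cmp )
```

So a TRUE or FALSE from > on a character vector may match the numeric answer only by coincidence.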

Your intuition is good. There are many ways to answer a question in R. And which is "best" depends upon your goal and your data.

Sanaz-27 commented 2 years ago

You totally get the idea! You caught my mistake on Q3.2, and this is the same logic: Creating a subset and determining a proportion from that subset!

Hope this helps!

Thank you, Dr. @jamisoncrawford. Yes, it helps a lot. I guess I was thinking about it from a different point of view, but it kinda makes sense now.

Many thanks again, Sana

Sanaz-27 commented 2 years ago

Q7.1

Note the nuance in how the questions can be asked and how it translated to logical statements:

Pay attention to what should be in the denominator based upon which proportion is requested.

Thank you, Dr. @lecy. I guess this is what confused me. I thought about it as the second option with AND, so we need to be more careful of what is needed so we can write the best code for it.

Delinquent Status

I agree that negative values should not be counted as delinquent, so the greater-than expression might be more appropriate.

downtown$amtdelinqt > 0

It depends on the nature of your data, though. This expression would work with both numeric and character vectors:

downtown$amtdelinqt != "0"

Hmmm, so after all, it depends on the data itself. Neither one is better than the other, so if I have both negative values and character values in my data, should I use this code?

 downtown$amtdelinqt > 0 & downtown$amtdelinqt != "0" 

Your intuition is good. There are many ways to answer a question in R. And which is "best" depends upon your goal and your data.

Thank you, yes I figured that through the labs, especially Lab 03 & Lab 04, there are many ways to do things in R, depends also on what kind of results you want to show and how you want to show it.

Many thanks again, Best, Sana

lecy commented 2 years ago

You are asking the right questions.

Making sure you understand the human question before translating to code is important. I think you will find that the human often doesn't fully understand the human question, so it might be the analyst's job to translate some vague goal into very precise terms.

For example, in the data wrangling exercise the mayor wants to know how to address traffic accidents, which is a vague goal. You need to know whether the issue is property damage (number of accidents), injury (which accidents cause harm), or loss of life (which accidents cause deaths).

These simple measurement decisions can have a profound effect on how you think about the problem and design interventions. Or in the evaluation context, they relate to the idea of meaningful versus unrealistic and misleading counterfactuals.

These early labs focus less on which approach is most correct, and more on whether you can manipulate vectors and construct logical statements. But definitely pay attention to nuances in how the questions are asked.


If you combine two logical statements with the AND operator, the result is TRUE only when both statements evaluate to TRUE. Not so for the OR operator, which returns TRUE when at least one of them is TRUE.

x1 & x2
T & T = T
T & F = F
F & T = F
F & F = F

x1 | x2
T | T = T
T | F = T
F | T = T
F | F = F
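The same tables can be reproduced in R, since & and | work element by element on logical vectors (x1 and x2 here are just the four possible combinations):

```r
x1 <- c(TRUE, TRUE, FALSE, FALSE)
x2 <- c(TRUE, FALSE, TRUE, FALSE)

and.result <- x1 & x2   # TRUE only where both are TRUE
or.result  <- x1 | x2   # TRUE where at least one is TRUE

and.result
# [1]  TRUE FALSE FALSE FALSE
or.result
# [1]  TRUE  TRUE  TRUE FALSE
```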

So combining the two statements with AND would not help if you don't know the data type: '$50' would evaluate as FALSE in the first comparison because of implicit casting, which makes the whole statement FALSE:

downtown$amtdelinqt > 0 & downtown$amtdelinqt != "0" 

This is better:

# EITHER OR IS TRUE
downtown$amtdelinqt > 0 |  downtown$amtdelinqt != "0" 

But it would still fail with something like '$0' or '-10'. So data refinement would be better than a compound statement.
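A rough sketch of what that data refinement could look like, assuming the only noise is dollar signs and thousands separators (raw is a made-up vector, not the lab data; real data may need more cleaning steps):

```r
# hypothetical messy column stored as text
raw <- c("0", "100", "$50", "$0", "-10", "1,250")

# drop dollar signs and commas, then convert the text to numbers
clean <- as.numeric( gsub( "[$,]", "", raw ) )
clean
# [1]    0  100   50    0  -10 1250

# after cleaning, the simple numeric test behaves as intended
clean > 0
# [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE
```

Once the column is numeric, '$0' and '-10' are both correctly treated as not delinquent, which neither compound statement manages.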

In 527 you will learn about control structures and you could also do something like this:

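# is.numeric() is TRUE for both integer and double columns;
# the result is saved so it can be used after the if statements
if( is.numeric( downtown$amtdelinqt ) )
{
   delinquent <- downtown$amtdelinqt > 0
}
if( is.character( downtown$amtdelinqt ) )
{
   delinquent <- downtown$amtdelinqt != "0"
}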