DS4PS / cpp-526-sum-2020

Course shell for CPP 526 Foundations of Data Science I for Summer 2020.
http://ds4ps.org/cpp-526-sum-2020/
MIT License

Lab 2 - Question 7 #11

Open malmufre opened 4 years ago

malmufre commented 4 years ago

Hello, I have seen that question 7 has 2 parts, but I couldn't understand the difference between the two. These are the questions:

What proportion of commercial properties are delinquent on taxes?
What proportion of delinquent tax bills are owed by commercial parcels?

I wrote this code to get the proportion of commercial properties that are delinquent on taxes:

mean(downtown$landuse == "Commercial" & downtown$amtdelinqt != 0)

However, the proportion of delinquent tax bills owed by commercial parcels seems to have the same meaning. Could you please explain this question further?

Thanks!

jamisoncrawford commented 4 years ago

Hi @malmufre, sure thing! Looks like you nailed the first part.

This is a matter of finding the sum() of all delinquent taxes owed in Syracuse, NY. It's probably a big number!

Then, find the sum() of all delinquent taxes owed by "Commercial" properties.

After that, you can just use division / to determine the proportion - is that helpful and do you see the difference?
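A minimal sketch of those three steps, assuming a toy data frame: the column names `landuse` and `amtdelinqt` come from this thread, but the rows and dollar values below are made up for illustration.

```r
# Hypothetical toy data; only the column names come from the lab's dataset
downtown <- data.frame(
  landuse    = c("Commercial", "Commercial", "Single Family", "Parking", "Commercial"),
  amtdelinqt = c(1000, 0, 500, NA, 2500)
)

# Step 1: total delinquent dollars owed across all properties
total_owed <- sum(downtown$amtdelinqt, na.rm = TRUE)

# Step 2: delinquent dollars owed by "Commercial" properties only
is_commercial   <- downtown$landuse == "Commercial"
commercial_owed <- sum(downtown$amtdelinqt[is_commercial], na.rm = TRUE)

# Step 3: divide to get the proportion of delinquent dollars
prop_dollars <- commercial_owed / total_owed
prop_dollars
```

With the real data, the same three lines should work unchanged once `downtown` is loaded.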

malmufre commented 4 years ago

Worked just fine. Thanks! However, I did not get a very big number when I ran this code for the sum of all the delinquent taxes owed in Syracuse:

these.deliqnt <- (downtown$amtdelinqt != 0)
sum(these.deliqnt, na.rm = TRUE)

jamisoncrawford commented 4 years ago

You're welcome! So here's the somewhat big number:

> sum(downtown$amtdelinqt, na.rm = TRUE)
[1] 5045969

So ~$5m in taxes owed! This should be the denominator for the second question. The numerator should be the amount of tax dollars delinquent on "Commercial" landuse properties!
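The smaller number comes from summing a logical vector, which counts TRUE values instead of adding dollars. A small sketch with made-up amounts (the vector below is hypothetical, not the lab data):

```r
# Hypothetical delinquent amounts, including a zero and a missing value
amtdelinqt <- c(1000, 0, 500, NA, 2500)

# Comparing to 0 gives TRUE/FALSE/NA, one value per property
is_delinquent <- amtdelinqt != 0

# Summing a logical vector counts the TRUEs: the number of delinquent properties
n_delinquent <- sum(is_delinquent, na.rm = TRUE)

# Summing the amounts themselves gives dollars owed
dollars_owed <- sum(amtdelinqt, na.rm = TRUE)

c(properties = n_delinquent, dollars = dollars_owed)
```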

malmufre commented 4 years ago

I didn't quite get why we removed != from the code since we want all the delinquent tax bills. Or does that mean that we need to include even tax bills that are equal to 0? On another note, in question 5, when we were asked about the number of properties delinquent on taxes, would that also mean that our code should not have != in it, since we need all properties, even the ones that are delinquent on taxes with a value of 0?

Thank you

jamisoncrawford commented 4 years ago

> I didn't quite get why we removed != from the code since we want all the delinquent tax bills

Hm, I suppose this all depends on how you interpret the second question: is it the number of properties, or is it the amount in tax dollars, that is delinquent? Suppose these instructions came from an executive who was relatively precise with some requests but somewhat vague in a couple of them. What would you do? (This isn't the purpose of the assignment, but you'll experience this quite often in your data analytics career.)

> What proportion of delinquent tax bills are owed by commercial parcels?

This could be interpreted as how many bills are owed by "Commercial" properties out of all bills owed? However, we don't have a variable that lists the number of bills, only the dollar amount of delinquent taxes.

My interpretation would be that, for the first question, we're asked to find the proportion of delinquent "Commercial" properties out of all "Commercial" properties, so this is:

delinquent_commercial_properties / all_commercial_properties

Hence, != applies because it disqualifies the property from being part of the delinquent properties in the numerator of the above equation.
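One way to compute the ratio above, sketched on made-up rows (only the column names `landuse` and `amtdelinqt` are from the thread). Note that under this reading the denominator is commercial properties only, whereas taking `mean()` of the combined condition over every row divides by all properties, which is a different quantity; which one the lab intends is a matter of interpretation.

```r
# Hypothetical toy data; values are invented for illustration
downtown <- data.frame(
  landuse    = c("Commercial", "Commercial", "Single Family", "Parking", "Commercial"),
  amtdelinqt = c(1000, 0, 500, 0, 2500)
)

is_commercial <- downtown$landuse == "Commercial"
is_delinquent <- downtown$amtdelinqt != 0

# Delinquent commercial properties / all commercial properties
prop_of_commercial <- mean(is_delinquent[is_commercial], na.rm = TRUE)

# Delinquent commercial properties / all properties (a different denominator)
prop_of_all <- mean(is_commercial & is_delinquent, na.rm = TRUE)

c(of_commercial = prop_of_commercial, of_all = prop_of_all)
```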

In the second question, since we don't know the number of individual bills, I interpret this as a proportion of taxes owed by commercial properties / all taxes owed - hence, it's a proportion of a dollar amount, not properties, so we don't need != to disqualify non-delinquent properties anymore.

> or does that mean that we need to include even tax bills that equal to 0?

That's the tricky part - we don't know that any tax bill equals 0, we just know that 0 delinquent taxes are owed on that property. A 0 adds nothing to a sum(), so we don't need to filter those rows out for the second question, which is only concerned with delinquent taxes owed.

Does that make sense? (Seriously, if not I'm happy to meet and talk it through!).

> On another note, even in question 5, when we were asked about the number of properties delinquent on taxes, would that also mean that our code should not have != in it since we need all properties even the ones that are delinquent on taxes with the value of 0?

Since this deals with properties that either are or are not delinquent, != 0 or > 0 will result in TRUE if the property is delinquent, or FALSE if it isn't. Since the unit of analysis in Q5 is "properties" rather than "dollars", the comparison still applies.
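The two comparisons give the same count as long as no amount is negative. The sketch below uses invented values, and the negative entry (say, a credit on an account) is purely hypothetical; nothing in this thread says the lab data contains one.

```r
# Hypothetical amounts with no negative values
amounts <- c(1000, 0, 500, 2500)
n_not_zero <- sum(amounts != 0)
n_positive <- sum(amounts > 0)

# Hypothetical amounts where one property carries a negative balance
amounts_with_credit <- c(1000, 0, -50, 2500)
n_not_zero_credit <- sum(amounts_with_credit != 0)
n_positive_credit <- sum(amounts_with_credit > 0)

# Same count without negatives; different counts once a negative appears
c(n_not_zero, n_positive, n_not_zero_credit, n_positive_credit)
```

If negative amounts were possible, `> 0` would be the safer test for "delinquent".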

Let me know if I'm making this worse! (Haha).

malmufre commented 4 years ago

It's clear, haha. Thanks very much! For the second question, what I understood is that we removed the != or > because we need to include all tax amounts owed, regardless of the fact that some of the properties will have a tax delinquency of 0. I tried it and got a very small percentage, so I hope I am doing this right. I would also like to ask whether {r}length(downtown$amtdelinqt) would be a variable that lists the number of bills. As for Question 5, I understood that we need the properties that are delinquent, so we disregard properties that are not delinquent on taxes by adding the !=.

Thanks for the thorough explanation.

jamisoncrawford commented 4 years ago

> I just would like to ask if {r}length(downtown$amtdelinqt) would be a variable that lists the number of bills.

Do you mean in-line variables outside of code chunks? If so, it should look like this:

`r variable`

...rather than...

{r} variable

Another note: length(downtown$amtdelinqt) will only get you the number of "elements" in the "vector" (that is, the total number of values in the variable, including zeros and missing values). In other words, this will just tell you the number of rows/properties. Technically, it's not the number of bills, unless each property has exactly one bill - then that's correct.

To find the total amount of delinquent taxes, in dollars, you can add the values with sum() (rather than length()). If you wanted to do this in in-line code, you can type the following outside of code chunks:

`r sum(downtown$amtdelinqt, na.rm = TRUE)` 

To take it further, you can format it as currency using the scales package. In the RStudio console (not in the .Rmd script), run the following once:

install.packages("scales")

In the .Rmd script, include this in a code chunk at the beginning:

library(scales)

Then, you can format a single value or an entire variable using the dollar() function. For in-line variables, outside of code chunks, it looks like this:

`r dollar(sum(downtown$amtdelinqt, na.rm = TRUE))` 
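If scales isn't installed, a rough base R equivalent is sketched below. It is an alternative, not what dollar() does internally, and the total is simply the figure reported earlier in this thread.

```r
# Total delinquent dollars, as reported earlier in the thread
total <- 5045969

# Base R fallback: add thousands separators, then prepend a dollar sign
formatted <- paste0("$", format(total, big.mark = ","))
formatted
# [1] "$5,045,969"

# If scales is available, dollar() can be called on the same value
if (requireNamespace("scales", quietly = TRUE)) {
  print(scales::dollar(total))
}
```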

Hope this helps!