DS4PS / cpp-526-sum-2020

Course shell for CPP 526 Foundations of Data Science I for Summer 2020.
http://ds4ps.org/cpp-526-sum-2020/
MIT License

Lab 01 - Question 6 #6

Open gzbib opened 4 years ago

gzbib commented 4 years ago

Hello,

I was working on Question 6 and got confused about land_use and Vacant Land. If we print land_use, we get Single Family and Vacant Land. For this question, I think we are interested in Vacant Land only, right? So, if I want to access Vacant Land only to get the max number of vacant lots, how can I do that? I mean, there is no such variable as vacant land. Otherwise, I will be manually searching for the number by cross-referencing neighborhood and vacant land, and I am wondering if there is a more efficient way to do that.

Thanks a million.

jamisoncrawford commented 4 years ago

Great question, @gzbib!

The instructions are pretty straightforward for creating the "crosstabs" between two variables: specify both the dataset and the variables (dat$land_use and dat$neighborhood) and plug those into function table(). That's perfectly acceptable (for others who are reading this), since you can search the output for the highest value in the correct land_use column.

So variable land_use is made up of class "character" (i.e. text) values. We can see all the unique ways that land is used with function unique():

> unique(dat$land_use)
 [1] "Vacant Land"        "Single Family"      "Commercial"        
 [4] "Parking"            "Two Family"         "Three Family"      
 [7] "Apartment"          "Schools"            "Parks"             
[10] "Multiple Residence" "Cemetery"           "Religious"         
[13] "Recreation"         "Community Services" "Utilities"         
[16] "Industrial" 

This is extremely helpful for exploring your data. It's especially important for this question, since we can quickly find the text value that explains "the most vacant lots". Function table() takes it a step further by showing us the output of unique() and the number of occurrences in the data:

> table(dat$land_use)

         Apartment           Cemetery         Commercial 
              1228                 35               2601 
Community Services         Industrial Multiple Residence 
               138                102                217 
           Parking              Parks         Recreation 
               437                 98                 55 
         Religious            Schools      Single Family 
               174                106              24392 
      Three Family         Two Family          Utilities 
               825               7259                103 
       Vacant Land 
              3732 
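For anyone who wants to try the difference between unique() and table() without the full dataset, here is a minimal sketch. The vector land_use_toy and its values are made up for illustration (they are not the real dat$land_use), but the two functions behave the same way on the real variable:

```r
# Hypothetical toy vector standing in for dat$land_use (not the real data)
land_use_toy <- c("Vacant Land", "Single Family", "Vacant Land",
                  "Cemetery", "Single Family", "Vacant Land")

# Distinct values only, in order of first appearance
unique(land_use_toy)

# Distinct values plus how many times each one occurs
counts <- table(land_use_toy)
counts

# A single count can be pulled out of the table by its name
counts["Vacant Land"]
```

Pulling a count out by name is handy when the table has many categories and you only care about one of them.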

How do we get the count of only the values in variable land_use for a specific way that the land is used, like "Cemetery" or "Community Services"? In this case, we can use table() and the dataset/variable, dat$land_use, combined with the relational operator ==, or "exactly equal to". Then, we specify what we want the variable to equal. Here's what that looks like when looking for land_use and "Cemetery":

> table(dat$land_use == "Cemetery")

FALSE  TRUE 
41467    35 

The output is a quick tally of TRUE (land_use is exactly equal to "Cemetery") and FALSE (land_use is not equal to "Cemetery").
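To see what == is doing under the hood, here is a small sketch on a made-up vector (land_use_toy is hypothetical, not the real data). The comparison returns one TRUE or FALSE per value, and table() simply tallies them. Since R treats TRUE as 1 and FALSE as 0, sum() gives the TRUE count directly:

```r
# Hypothetical toy vector standing in for dat$land_use (not the real data)
land_use_toy <- c("Vacant Land", "Cemetery", "Single Family",
                  "Cemetery", "Parks")

# One TRUE/FALSE per element: is this value exactly "Cemetery"?
is_cemetery <- land_use_toy == "Cemetery"
is_cemetery

# Tally of FALSE and TRUE, as in the output above
table(is_cemetery)

# TRUE counts as 1 and FALSE as 0, so sum() returns the number of TRUEs
n_cemetery <- sum(is_cemetery)
n_cemetery
```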

See if you can combine this technique by adding the second variable in table(), dat$neighborhood, and specify the correct value for dat$land_use. That should get you a tighter answer.

There's a bit more we could do, but it might be a bit overwhelming at this point in the course, since we get into how to index specific rows and columns! I'll add a bit more in a second comment with the caveat that we're getting into potentially intimidating territory! Does this help you get a more precise answer, @gzbib?

gzbib commented 4 years ago

Awesome !! That's exactly what I was looking for. I think when we learn how to index rows and columns, things will get clearer.

Thank you so much for your time Sir @jamisoncrawford

jamisoncrawford commented 4 years ago

Warning: Not necessary to know at this point; potentially overwhelming code

Okay, @gzbib, here's an example using variable land_use exactly equal to (==) "Schools", with a cross tab by the age range of the property, variable age_range:

> table(dat$age_range, dat$land_use == "Schools")

          FALSE TRUE
  1-10      289    6
  101-110  2926    1
  11-20     226    1
  111-120  3924    1
  121-130  1299    1
  131-140   591    0
  141-150   283    0
  151-160   210    0
  161-170   131    0
  171-180    51    0
  181-190    22    0
  191-200    13    0
  201-210    15    0
  21-30     753    3
  211-220     9    0
  221-230     3    0
  31-40     444   12
  41-50     987   10
  51-60    3721   31
  61-70    3689   14
  71-80    2878    8
  81-90    7126    6
  91-100   7085    4

So this data is tabular: each row is an age_range, and the two columns are FALSE and TRUE.

Alright, how do we just show the second column, TRUE values, where the property is in a certain age_range and specifically used for "Schools" in land_use?

The notation to specify rows and columns is in brackets, separated by a comma: [ , ]. We enter this right after the object (in this case, our table()).

table(dat$age_range, dat$land_use == "Schools")[ , ]

This would return the entire table, since [ , ] doesn't specify anything. To select a specific row, like row 5, we insert that in the brackets, to the left of the comma:

> table(dat$age_range, dat$land_use == "Schools")[5, ]
FALSE  TRUE 
 1299     1 

If we want to specify a particular column, like column 2, we insert that to the right of the comma:

> table(dat$age_range, dat$land_use == "Schools")[ , 2]
   1-10 101-110   11-20 111-120 121-130 131-140 
      6       1       1       1       1       0 
141-150 151-160 161-170 171-180 181-190 191-200 
      0       0       0       0       0       0 
201-210   21-30 211-220 221-230   31-40   41-50 
      0       3       0       0      12      10 
  51-60   61-70   71-80   81-90  91-100 
     31      14       8       6       4 
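Rows and columns can also be selected by name instead of by position, which is often easier to read. Here is a minimal sketch; the data frame dat_toy and its values are invented for illustration and only mimic the shape of the real dat:

```r
# Hypothetical toy data frame (not the real dataset)
dat_toy <- data.frame(
  age_range = c("1-10", "1-10", "11-20", "11-20", "11-20", "21-30"),
  land_use  = c("Schools", "Parks", "Schools", "Schools", "Parks", "Parks"),
  stringsAsFactors = FALSE
)

# Cross tab: rows are age ranges, columns are FALSE and TRUE
tab <- table(dat_toy$age_range, dat_toy$land_use == "Schools")
tab

# Column by name (same idea as [ , 2], but using the label "TRUE")
tab[ , "TRUE"]

# Row by name
tab["11-20", ]

# A single cell: row name to the left of the comma, column name to the right
tab["11-20", "TRUE"]
```

Note that the column labels are the text "TRUE" and "FALSE", so they need quotes when used as names.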

We can assign this table to make it easier to use in code:

x <- table(dat$age_range, dat$land_use == "Schools")[ , 2]

Then, max() would tell us the greatest number of "Schools" properties in any age_range, and which.max() would tell us which age_range that is (along with its position in x):

> max(x)
[1] 31
> which.max(x)
51-60 
   19 
> x[19]
51-60 
   31 
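If you only want the label rather than the position, names() and sort() can help. This is a small sketch that rebuilds a shortened version of x by hand, using a few of the counts printed above (x_short is a made-up name, and it holds only five of the age ranges):

```r
# A shortened, hand-typed version of x, using counts from the output above
x_short <- c("1-10" = 6, "31-40" = 12, "41-50" = 10, "51-60" = 31, "61-70" = 14)

# Just the name of the largest entry, without the position
top_name <- names(which.max(x_short))
top_name

# Sort from largest to smallest and look at the top three
top <- sort(x_short, decreasing = TRUE)
head(top, 3)
```

One thing to keep in mind: which.max() returns only the first maximum, so if two categories tie, sorting and looking at the top few rows is the safer check.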

Sheesh! That's a lot. You weren't meant to see this :satisfied:. But if you apply these same principles to the variables in question, you'll get to the exact answer. Cheers!

gzbib commented 4 years ago

Wow, I mean it's good to know more about how we can handle data. I totally got your point and you probably answered a question I was going to ask in the coming few days.

Thanks a million

jamisoncrawford commented 4 years ago

My pleasure! Honestly, this will all make sense by the time Lab 2 comes around.