DS4PS / cpp-528-spr-2020

Course shell for CPP 528 Foundations of Data Science III for Spring 2020.
http://ds4ps.org/cpp-528-spr-2020/

LAB-03 #18

Open sunaynagoel opened 4 years ago

sunaynagoel commented 4 years ago

I have a few questions related to Lab 03.

PART 1 Q.1. In the Excel sheet, the category for rows 2-16 is either "id" or "tract attribute". When I filter by category, these are picked up as well, along with a few NA values. Should I be worried about that?

unique(data_dictionary$category)
[1] "id"                     NA                       "tract attribute"       
 [4] "age-race"               "age"                    "ethnicity"             
 [7] "race"                   "ses"                    "race-ses"              
[10] "demographics"           "age-ses"                "housing"               
[13] "housing-age"            "marital status"         "housing-marital status"
[16] "demographics-ses"      

Q2. What is the difference between race and ethnicity in this case?

Q3. I could not find the definition for rows 183-186.

pop | popsf3 | demographics
pop | popsf4 | demographics
pop | popsp1 | demographics
pop | popsp2 | demographics

PART II Q.1. To filter the data frame by a given category or group, should I create a vector containing all the unique categories and sample one at a time to return the filtered data frame?

cenuno commented 4 years ago

Hi Nina,

Part 1.

Q1. This is most likely a side effect of creating empty rows while making your Excel file. You can either recreate the file without empty rows or filter out records with NA values as you read in the CSV.
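
As a minimal sketch of the second option, the snippet below reads a small made-up CSV and drops rows whose category is NA. The file contents and the column names `root` and `definition` are assumptions for illustration; only `data_dictionary` and `category` come from this thread.

```r
# Hypothetical stand-in for the data dictionary CSV. In the lab you would
# pass your own file path to read.csv() instead of a text string.
csv_text <- "root,category,definition
tractid,id,tract identifier
,,
pop,demographics,total population
,,
hinc,ses,median household income"

# Treat empty cells as NA so blank spreadsheet rows are easy to spot
data_dictionary <- read.csv(text = csv_text,
                            stringsAsFactors = FALSE,
                            na.strings = c("", "NA"))

# Keep only rows that actually have a category
data_dictionary <- data_dictionary[!is.na(data_dictionary$category), ]

unique(data_dictionary$category)
```

After this step, `unique()` should no longer list NA among the categories.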

Part 2.

Q1. You could add an if/else statement that stops the function if the given string is not found in your group or category column. That way you don't need to supply the names manually ahead of time; the check happens when the function runs.
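
One way this check might look is sketched below. The function name `filter_by_category` and the toy dictionary are hypothetical, not part of the lab materials.

```r
# Hypothetical dictionary using the category column from this thread
data_dictionary <- data.frame(
  root     = c("pop", "hinc", "mar"),
  category = c("demographics", "ses", "marital status"),
  stringsAsFactors = FALSE
)

# Return the rows for one category, or stop with a helpful message
filter_by_category <- function(dd, category) {
  # Look up the valid categories at call time, ignoring NA
  available <- unique(dd$category[!is.na(dd$category)])

  if (!(category %in% available)) {
    stop("Category '", category, "' not found. Available: ",
         paste(available, collapse = ", "))
  }

  dd[!is.na(dd$category) & dd$category == category, ]
}

filter_by_category(data_dictionary, "ses")
```

Calling `filter_by_category(data_dictionary, "income")` would raise an error that lists the categories present in the dictionary.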

— Cristian E. Nuno



lecy commented 4 years ago

Q2. What is the difference between Race and ethnicity in this case?

https://www.census.gov/mso/www/training/pdf/race-ethnicity-onepager.pdf

Q3. I could not find the definition for rows 183-186.

v1 v2 description
pop popsf3 demographics
pop popsf4 demographics
pop popsp1 demographics
pop popsp2 demographics

These are different full population estimates from specific tables (SF3, SF4, SP1, SP2).

For PART II, the use case would be that a person wants to add variables from a specific class like housing, demographics, employment, etc. The function should allow them to preview the available variables and their coverage, and make it easy to incorporate them (so return things like the actual variable names they would need).
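
A rough sketch of such a preview function follows. Everything here is an assumption for illustration: the function name `preview_variables`, the columns `root`, `definition`, and `years`, and the placeholder values. It matches on a keyword rather than an exact category so that compound categories such as "housing-age" are also returned.

```r
# Hypothetical data dictionary; values are placeholders, not real LTDB entries
data_dictionary <- data.frame(
  root       = c("var_a", "var_b", "var_c", "var_d", "var_e"),
  category   = c("housing", "housing-age", "housing-marital status",
                 "demographics", NA),
  definition = c("definition a", "definition b", "definition c",
                 "definition d", NA),
  years      = c("1970-2010", "1990-2010", "2000-2010", "1970-2010", NA),
  stringsAsFactors = FALSE
)

# Preview every variable whose category contains the keyword
preview_variables <- function(dd, keyword) {
  # fixed = TRUE treats the keyword as plain text, not a regular expression;
  # grepl() returns FALSE for NA categories, so those rows are dropped
  hits <- grepl(keyword, dd$category, fixed = TRUE)
  dd[hits, c("root", "category", "definition", "years")]
}

preview_variables(data_dictionary, "housing")
```

The `root` column in the result gives the names a user would then select from the dataset.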

lecy commented 4 years ago

Race and ethnicity might not be labeled correctly in the example file.

There is also sometimes a distinction between race, ethnicity, and national origin, e.g. plain white versus European immigrant white, or Hispanic from Mexico vs. Hispanic from the Caribbean.

I found these categories the most confusing!

castower commented 4 years ago

Hello @cenuno @lecy,

Is there a particular reason why there are two cbsa, metdiv, placefp, and ccflag rows? Should I keep both or delete one of the rows?

Thanks! Courtney

lecy commented 4 years ago

@castower They are redundant. I remember adding those because I thought they were missing, but it looks like I just duplicated rows.

I think these four IDs at the end are all duplicates:

You see variables like mar repeated three times. I believe the variable names in the original sets were just misspelled (see the "root2" column, which was the original). You will want to reduce these to a single row, but double-check that the variable names have been updated in the dataset.
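
A minimal sketch of that cleanup, assuming the dictionary has a `root` column holding the variable name. The toy dictionary `dd` and the toy `dataset` are made up for illustration.

```r
# Hypothetical dictionary with the repeated rows mentioned in this thread
dd <- data.frame(
  root = c("cbsa", "metdiv", "placefp", "ccflag", "mar", "mar", "mar",
           "cbsa", "metdiv", "placefp", "ccflag"),
  stringsAsFactors = FALSE
)

# Which variable names appear more than once?
counts <- table(dd$root)
names(counts[counts > 1])

# Keep the first occurrence of each variable name
dd_unique <- dd[!duplicated(dd$root), , drop = FALSE]

# Hypothetical dataset: check that every dictionary name exists as a column
dataset <- data.frame(cbsa = 1, metdiv = 1, placefp = 1, ccflag = 1)
setdiff(dd_unique$root, names(dataset))
```

Any names returned by `setdiff()` are in the dictionary but missing from the dataset, which is where a misspelled or outdated variable name would show up.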

Note the one that went from mar12 to x.12.mar. That is likely an Excel auto-correct error: Excel looks for patterns that resemble dates and tries to convert them to a date format when it finds them. It is the equivalent of an implicit casting error in R. One more reason to avoid data cleaning in Excel if you can!

https://www.sciencemag.org/news/2016/08/one-five-genetics-papers-contains-errors-thanks-microsoft-excel
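
For comparison, here is a small illustration of implicit casting in base R. The values are made up; the point is only that mixing types silently changes the type of the whole vector.

```r
# Mixing a number and a string silently turns everything into character
x <- c(12, "mar12")
class(x)

# Converting back loses the value that does not look like a number
# (R warns that NAs were introduced by coercion)
y <- suppressWarnings(as.numeric(x))
y
```

Here `x` ends up as a character vector, and the second element of `y` becomes NA.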