Watts-College / cpp-528-spr-2022

https://watts-college.github.io/cpp-528-spr-2022/

Data Wrangling #3

Open yukicruz opened 2 years ago

There is no one perfect way to structure data. Instead, ask and answer questions to provide clarity to your discussions and to form the reasoning behind your eventual conclusions. Include these in the README.md files so your team can refer to them in the future.

Ask Yourself Questions As You Structure

An example question to ask is, "What will the differences be between the categories our team decides on?" For instance, there is likely definitional overlap between "ethnicity" and "race" within the raw dictionary, as well as similar overlap between "Ethnicity and Immigration" and "Race or Age by Race" in the LTDB Codebook. There may also be overlap between "Socioeconomic Status" and "Housing, Age, and Marital Status" or other categories. Word usage has long been debated, and the terms used to describe ideas often change over time. A human determined the categories within the LTDB Codebook and, given another opportunity to restructure the categories and wording, would likely do so differently today. Considering why and how categories separate items will give you better clarity as you work with data.

Another useful question is, "How does the total number of items in each category compare to the other categories?" Pretend you have a bedroom dresser with eight drawers. Seven drawers are nearly empty, but one drawer can't close because it is overflowing with a mess of items. There must be a better way to organize your items within that dresser. Similarly, if your dictionary has eight categories and 85% of the items fall within one category, then you should ask "Why is that?" or "Is this really the best way to split items by category?" or a similar question that leads you toward a cleaner, more balanced structure.
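The dresser check can be sketched in a few lines of base R. This is only an illustration: the `dd` data frame, its `root` and `category` columns, the category labels, and the 50% cutoff are all assumptions made for the example, not part of the lab.

```r
# Hypothetical miniature data dictionary; a real one has many more rows.
dd <- data.frame(
  root     = c("pop", "nhwht", "nhblk", "hisp", "hinc", "mhmval"),
  category = c("race", "race", "race", "race", "ses", "housing"),
  stringsAsFactors = FALSE
)

# Count the items in each category, largest first.
counts <- sort(table(dd$category), decreasing = TRUE)

# Convert the counts to each category's share of all items.
shares <- prop.table(counts)
print(round(shares, 2))

# Flag any category holding more than half of the items.
# The 0.5 cutoff is arbitrary; pick whatever your team agrees on.
overfull <- names(shares)[as.vector(shares) > 0.5]
print(overfull)
```

If `overfull` is not empty, that is a prompt to ask whether the category should be split, not proof that the structure is wrong.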

Simple Is Often Better

The LTDB Codebook provides a reasonable starting point for category labels. Feel free to use it as a guide, but don't consider it carved in stone. Don't overcomplicate the structure: you don't need 214 categories when you have 214 rows, nor do you need multiple columns of categories and even more columns for sub-categories. Simple is often better.

If you get an itch to sort data, R provides flexibility without permanently impacting your dataset, since sorting in R changes only the copy held in memory unless you write the result back to a file. You can play with R on branches and virtual machines to your heart's content and throw it out the window if you can't Ctrl+Z your way out of it.
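As a minimal sketch of sorting without touching the original, the base R snippet below stores the sorted rows under a new name. The `dd` data frame and its column names are made up for illustration.

```r
# Hypothetical miniature data dictionary.
dd <- data.frame(
  root       = c("mhmval", "hinc", "pop"),
  definition = c("Median home value",
                 "Median household income",
                 "Total population"),
  stringsAsFactors = FALSE
)

# order() returns row positions; indexing with them builds a new,
# sorted data frame and leaves dd itself alone.
dd_sorted <- dd[order(dd$root), ]

# dd keeps its original row order because it was never reassigned.
print(dd$root)
print(dd_sorted$root)
```

Keeping the sorted result in a separate object such as `dd_sorted` means you can discard the experiment by simply removing that object.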

Do your best to ensure your data is structured and clean. Lab 02 will have you sort by keywords within the Definition column and by years. Watch for words that can be standardized (e.g., "pct" vs "percentage", "PI" vs "Pacific Islander", etc.) within the Definition column. Don't overstandardize, either: "1970.f" is different from "1970.s". Clean, standardized data will help you avoid missing items during future analyses.
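One way to standardize wording without overstandardizing is sketched below in base R. The data frame, the `definition` and `year_label` column names, and the sample rows are assumptions for illustration; the actual Lab 02 dictionary may be laid out differently.

```r
# Hypothetical rows; column names are illustrative only.
dd <- data.frame(
  definition = c("pct Asian and PI",
                 "percentage owner-occupied units",
                 "Median household income"),
  year_label = c("1970.f", "1970.s", "1970.s"),
  stringsAsFactors = FALSE
)

# Standardize wording in the definition text only.
# \\b is a word boundary, so "pct" or "PI" inside a longer word is left alone.
clean <- dd$definition
clean <- gsub("\\bpct\\b", "percentage", clean, perl = TRUE)
clean <- gsub("\\bPI\\b", "Pacific Islander", clean, perl = TRUE)

# Store the result in a new column so the raw text stays available.
dd$definition_clean <- clean

# A keyword filter on the standardized text now catches both percentage rows.
pct_rows <- dd[grepl("percentage", dd$definition_clean, ignore.case = TRUE), ]
print(pct_rows$definition_clean)

# The year labels were never touched, so "1970.f" and "1970.s" stay distinct.
print(unique(dd$year_label))
```

Writing to a separate `definition_clean` column is a design choice that makes it easy to compare the cleaned text against the raw text and spot a replacement that went too far.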