dlkirkland / Education_Outcomes

This repository contains files which are part of a data science project which explores education outcomes in the United States.

Final Review #2

Open capktkirk opened 4 years ago

capktkirk commented 4 years ago
# Summary 

The project attempts to find patterns or factors that distinguish those who navigated higher education successfully from those who were unable to do so, based on years of educational data gathered from College Scorecard. Unfortunately, modeling or graphing was not present, and no conjectures on the meaning of the data were presented. The data was tidy, and the explanation of what was being done with the data was solid.

# Data Preparation 

The yearly data from the 96-97 school year through 17-18 are each gathered as a tibble into their respective datasets (CY96, CY97, etc.). A dataset dictionary was also created for lookup and relational purposes, given the size of the dataset. The dataset was tidy by the "Tidy Data" standards: each observation represented a single thing, and it was relatively easy to navigate despite its large size. The outcomes, the measurable data we want to view, were also created for later plotting and model making. The general language of the document was geared towards those familiar with statistics and math, or perhaps with a database background. As a CS major I found it easy to parse from a technical standpoint, but it could be made less technical for a broader audience.
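As a rough illustration of the one-row-per-observation layout described above, here is a minimal base R sketch that stacks yearly tables into a single long table with a `year` column. The toy values and the column names `UNITID` and `COST` are assumptions made for illustration; the project itself works with tibbles and far larger College Scorecard files.

```r
# Toy stand-ins for the yearly tables. The real CY96, CY97, ... hold
# College Scorecard data; these values and column names are made up.
CY96 <- data.frame(UNITID = c(1, 2), COST = c(9000, 12000))
CY97 <- data.frame(UNITID = c(1, 2), COST = c(9500, 12500))

yearly <- list(CY96 = CY96, CY97 = CY97)

# Tag every row with the table it came from, then stack the tables so that
# each row is one institution in one year.
tagged <- Map(function(df, name) cbind(year = name, df), yearly, names(yearly))
all_years <- do.call(rbind, tagged)
rownames(all_years) <- NULL

print(all_years)
```

Keeping the year as a column rather than in the table name means later grouping or filtering by year needs no special handling per dataset.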

# Modeling 

The graphing was well done and showed the logic behind the decision to focus on CY04, although no linear models were available. The dataset does offer an interesting subset to draw from: institution and cost of tuition, broken down by type (public, private, or For-Profit), which could produce some interesting models once it is fully modeled and projected.
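To make the suggestion concrete, a linear model of tuition by institution type might look like the sketch below. The data frame, its column names (`type`, `tuition`), and every value in it are hypothetical; nothing here comes from the actual CY04 data.

```r
# Hypothetical data: six institutions across three types. Values are made up.
schools <- data.frame(
  type    = factor(c("public", "public", "private", "private",
                     "for_profit", "for_profit")),
  tuition = c(8000, 10000, 30000, 34000, 15000, 17000)
)

# Fit tuition as a function of institution type.
fit <- lm(tuition ~ type, data = schools)

# With a single categorical predictor, the fitted values are the group means,
# so the two summaries below describe the same thing in different forms.
group_means <- tapply(schools$tuition, schools$type, mean)

print(coef(fit))
print(group_means)
```

A real model would likely add further predictors (year, region, enrollment) once the yearly datasets are joined.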

# Validation 

The model had no cross-validation, but given the size of the dataset there could very well be a few data subsets that would work well with this methodology, especially with the broad range of years and institutions this dataset covers.
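For reference, a k-fold cross-validation in base R can be sketched as follows. The data is simulated and the model `y ~ x` is a placeholder; it stands in for whatever model the project eventually fits.

```r
set.seed(1)

# Simulated data: y depends linearly on x plus noise. Purely illustrative.
n <- 100
toy <- data.frame(x = runif(n))
toy$y <- 2 + 3 * toy$x + rnorm(n, sd = 0.5)

k <- 5
# Randomly assign each row to one of k equally sized folds.
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, predict the held-out fold,
# and record the root mean squared error.
rmse <- sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$y - pred)^2))
})

print(mean(rmse))
```

With many years of data, holding out entire years instead of random rows would be another reasonable way to split.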

# R Proficiency 

The R code was very well written, although there was a for loop that could have been replaced with an sapply() call. The anonymous function on line 280 was interesting: it sets up a quick way to reassign the data type of a column in a dataset and return the updated column. I am not sure it was strictly needed, but the solution worked with the for loop on line 293 regardless and seems to produce the desired effect.
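The project's actual loop is not reproduced here, so the following is only a sketch of the general idea on a made-up data frame: converting several columns to a new type with a for loop, and the same conversion with an apply-family call. It uses lapply() rather than sapply(), because lapply() always returns a list of columns that can be assigned straight back into the data frame.

```r
# Hypothetical table whose numeric columns arrived as character strings.
df <- data.frame(id = c("a", "b"), cost = c("100", "250"),
                 rate = c("0.5", "0.75"), stringsAsFactors = FALSE)
to_convert <- c("cost", "rate")

# Loop version: convert one column at a time.
loop_df <- df
for (col in to_convert) {
  loop_df[[col]] <- as.numeric(loop_df[[col]])
}

# Apply version: convert all selected columns and assign them back at once.
apply_df <- df
apply_df[to_convert] <- lapply(apply_df[to_convert], as.numeric)

# Check that both approaches give the same result.
print(all.equal(loop_df, apply_df))
```

Both versions are correct; the apply form is mostly a matter of brevity and R idiom.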

# Communication 

The portfolio is very well described, but in very technical terms. Language exclusive to those familiar with databases and R might be a hurdle to giving this portfolio wider appeal. I was able to follow everything, and the logical steps made in R were communicated very succinctly. The best change would be to elaborate on the thinking behind the "conclusions" drawn at each step. For example, line 219 calls summary(mean_outcomes) to produce data, and line 223 says that the data tells us quite a lot, but there is no information on what conclusion to draw or, for the lay person, how the data is being interpreted. Including the author's thoughts would help a lay person follow more readily.

# Critical Thinking 

While the project does not contain any operationalization, the data does lend itself to questions about income, school cost, private vs. public institutions, the validity of "For-Profit" institutions, and the access to education that the less fortunate have. I think this project can probe those deeply once the data is sufficiently wrangled and/or another dataset is brought in to make predictions based on income inequality or some other cohort factor. Taking the earlier years' data as a sample and then creating a predictive model would be a fascinating way to see whether there is a link between increased education preparedness and things such as:
* Family situation
* Familial income
* Parental collegiate level
* Zip code

dlkirkland commented 4 years ago

Data Preparation and Modeling (your score out of 20%)

 15%

Rationale for score

I feel that I tried extensively to import and work with my datasets but faced a lot of issues when attempting to join them.  Though I didn't get past the preparation phase, I learned a great deal about R and data manipulation.

Validation and Operationalization (your score out of 20%)

 0%

Rationale for score

 I did not perform these tasks. 

R Proficiency (your score out of 20%)

 15%

Rationale for score

 I knew nothing about R before beginning this course.  It took a lot of researching and trial and error to obtain the results provided by my code.  I feel that the code I wrote was concise, efficient, and effective.  

Communication (your score out of 20%)

 15%

Rationale for score

 I tried my best to guide the viewer of my project through the process so that they'd be able to understand what I was doing and why I was doing it.  That way they'd get an idea of the specific data project I was working on, and novice R users would be able to learn very easily how to perform the tasks I performed.  

Critical Thinking (your score out of 20%)

 5%

Rationale for score

 I had high hopes for what I wanted to do with my datasets but was halted by an inability to get past the data preparation phase and was therefore not able to display a lot of the critical thinking that I had planned.  However, I did use a LOT of critical thinking when trying to determine which dataset observations would be valid for use, how to join the datasets so that the maximum number of observations would be preserved across the greatest number of years, and how to design the algorithms I developed to work with the data in various forms.  

OVERALL, I KNOW THAT I DID NOT PERFORM MY DUE DILIGENCE FOR THIS PROJECT AND DID NOT DEVOTE THE TIME TO IT THAT I SHOULD HAVE. UNFORTUNATELY, I DIDN'T REALIZE HOW LENGTHY THE PROCESS OF WORKING WITH THE DATA WOULD BE UNTIL IT WAS TOO LATE, AND BY THAT POINT I HAD ALREADY GOTTEN BEHIND. BUT I LEARNED A LOT OF LESSONS THAT I'LL TAKE WITH ME (BESIDES WORKING IN THE R ENVIRONMENT), INCLUDING: MORE EFFECTIVE TIME-MANAGEMENT TECHNIQUES; TASK PRIORITIZATION; DESIGNING AND IMPLEMENTING A PLAN OF ACTION; BETTER APPROACHES TO ALGORITHM DESIGN; AND MORE EFFECTIVE METHODS OF IDENTIFYING THE STRENGTHS, LIMITATIONS, AND SYNTAX OF THE PROGRAMMING ENVIRONMENT.