Daniel Lin (danielglin) and Krish Andivel (Gopsathvik)
Income inequality in the United States is increasing.
We want to find the strongest predictors of whether a US adult has a high or low income, to gain insight into the drivers of income inequality.
What are the strongest predictors of whether a US adult has an income of more than $50,000 or less than $50,000?
The data we used comes from the 1994 US Census and was prepared by Barry Becker. The data is hosted here. It contains demographic, education, and employment information.
We used a decision tree from the Python `scikit-learn` package to answer the problem statement. We first cleaned the data to handle missing values. The exploratory analysis was done on the cleaned data, before the categorical features were encoded. To model the features with a decision tree, the categorical features were then encoded as dummy variables.
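The clean-up and encoding steps can be sketched as follows. This is a minimal illustration, not the project's actual `load_data.py`: the sample rows are made up, and it assumes that missing values are marked with `?` and that incomplete rows are simply dropped.

```python
import numpy as np
import pandas as pd

# Made-up sample that mimics a few census columns.
# Assumption: "?" marks a missing value.
raw = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "sex": ["Male", "Male", "Female", "Male"],
    "income": [">50K", "<=50K", "<=50K", ">50K"],
})

# Assumption: treat "?" as missing and drop incomplete rows.
clean = raw.replace("?", np.nan).dropna().reset_index(drop=True)

# Binary target: 1 for income above $50,000, 0 otherwise.
y = (clean["income"] == ">50K").astype(int)

# Encode the categorical features as dummy (one-hot) variables.
X = pd.get_dummies(clean.drop(columns="income"), columns=["workclass", "sex"])

print(X.columns.tolist())
```

Each category becomes its own indicator column (for example `workclass_Private`), which is the form a scikit-learn decision tree can consume.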
We tuned the model over a range of tree depths and performed cross-validation for each depth. The final decision tree was trained at the depth with the highest cross-validation score. We used the `feature_importances_` attribute in the Python `scikit-learn` package to determine the best features for predicting income level.
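A minimal sketch of this tuning loop is shown below. It uses synthetic data in place of the encoded census features, and the depth range, the 5-fold setting, and the feature names are illustrative assumptions rather than the values used in the project.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features: only column 0 drives the label.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

# Assumption: candidate depths 1 to 5, each scored with 5-fold cross-validation.
depths = range(1, 6)
cv_scores = {
    d: cross_val_score(
        DecisionTreeClassifier(max_depth=d, random_state=0), X, y, cv=5
    ).mean()
    for d in depths
}
best_depth = max(cv_scores, key=cv_scores.get)

# Refit on all data at the best depth, then rank features by importance.
tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X, y)
feature_names = ["feature_0", "feature_1", "feature_2"]  # hypothetical names
ranked = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print(best_depth, ranked[0][0])
```

The importances sum to 1 across features, so they are best read as relative rankings rather than as effect sizes.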
The full report is here. The load_data.py script in the src folder loads the dataset, which is saved in census_data.csv in the data folder.
The Docker repository for this project is here
Variable | Description |
---|---|
workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 69.4% of values are Private |
education | Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool |
education-num | continuous |
fnlwgt | continuous |
marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse |
occupation | Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces |
relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried |
race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black |
capital-gain | continuous |
capital-loss | continuous |
hours-per-week | continuous |
sex | Female/Male |
native-country | United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Colombia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinidad and Tobago, Peru, Hong, Holland-Netherlands |
Clone this repo and, using the command line, navigate to the root of this project.
Run the following command to produce the report:
make all
Run the following command to remove previous results:
make clean
The report is generated under the `report/` directory.
To use the Docker image for this repository, navigate to the root of this project on your computer.
Then run `docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/income dglin123/income-predictors-for-us-adults make -C '/home/income' all`, replacing PATH_ON_YOUR_COMPUTER with the absolute path to the root of this project.
To clear out the files associated with the analysis, run `docker run --rm -v PATH_ON_YOUR_COMPUTER:/home/income dglin123/income-predictors-for-us-adults make -C '/home/income' clean`.
We performed our analysis as per the workflow below:
- `seaborn` to perform EDA graphs

To reproduce our findings, the `Makefile` follows the workflow above to generate the report. It runs `load_data.py` to generate the cleaned data. `EDA_Census.py` performs the exploratory data analysis and produces the result data. `census_decision_tree.py` uses the cleaned data to perform the machine learning, and `summary_viz.py` plots the feature importances from the result. Report generation uses `knitr`: `Summary_Report.Rmd` generates our final report in Markdown format.
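Since the scripts are chained by the `Makefile` and `argparse` is among the dependencies, each step presumably takes its file paths from the command line. The sketch below shows one way such a step could parse them. The argument names and the `results/clean_data.csv` path are hypothetical and are not taken from the project's scripts.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output paths for a single pipeline step."""
    parser = argparse.ArgumentParser(description="Hypothetical pipeline step")
    # Hypothetical argument names, chosen for illustration only.
    parser.add_argument("input_file", help="path to the file this step reads")
    parser.add_argument("output_file", help="path to the file this step writes")
    return parser.parse_args(argv)


# Passing a list simulates a command line such as:
#   python step.py data/census_data.csv results/clean_data.csv
args = parse_args(["data/census_data.csv", "results/clean_data.csv"])
print(args.input_file, "->", args.output_file)
```

Keeping paths out of the script body lets a Makefile rule name its inputs and outputs explicitly, so `make` can rebuild only the steps whose inputs changed.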
The dependency graph of outputs, scripts and inputs used in the `Makefile`.
R & R libraries:
- `R`, version 3.5.1
- `rmarkdown`, version 1.10
- `knitr`, version 1.20

Python & Python libraries:
- `Python`, version 3.7.0
- `matplotlib`, version 2.2.3
- `numpy`, version 1.15.1
- `seaborn`, version 0.9.0
- `pandas`, version 0.23.4
- `scikit-learn`, version 0.19.2
- `argparse`, part of the Python standard library