Closed parkerrogers closed 7 years ago
On Thu, 24 Aug 2017, Parker Rogers wrote:
This pull request should be considered after WIC, WC, and UI correction PRs.
I've fixed the Random Forest pruning in all of the Rf_probs scripts. Allowing deeper trees lets the Random Forests predict positive program participation; when pruned too aggressively, a Random Forest can miss all positive participation entirely and predict no participation at all.
Correcting the pruning improved the accuracy of the RFC predictions to around 85-90% (depending on the program) for test-set records that currently participate in the referenced program. Overall accuracy remains around 98-99%.
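The over-pruning failure mode is easy to see with a toy example: a tree pruned all the way down effectively predicts the majority class, which under class imbalance scores high accuracy while predicting no participation at all. A minimal plain-Python sketch (the counts are assumptions for illustration, not the actual CPS data or the Rf_probs code):

```python
# Toy illustration of over-pruning under class imbalance (assumed numbers).
# A tree pruned down to its root just predicts the majority class.
y_true = [1] * 20 + [0] * 980                    # assume 2% of 1,000 records participate
majority = max(set(y_true), key=y_true.count)    # 0: non-participation dominates
y_pred = [majority] * len(y_true)                # over-pruned model predicts all zeros

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
positives_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
print(accuracy, positives_found)                 # 0.98 0
```

High overall accuracy, yet every participant is missed, which is why accuracy alone can hide the problem the deeper trees fix.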
What is the meaning of "accuracy at around 98%"? It is a binary outcome, right? So there are false positives and false negatives; what is the rate of each separately?
dan
You can view, comment on, or merge this pull request online at:
https://github.com/open-source-economics/C-TAM/pull/47
Commit Summary
pushing admin data cleaning
finishing create_admin.py
pushing WIC imputation script
adding imputation script and Random Forest script
adding imputation report
adding README file
Merge remote-tracking branch 'upstream/master'
Merge remote-tracking branch 'upstream/master'
adding Worker's Compensation imputation script, report
adding README file and making small docstring corrections to WC_impute.py
correcting README errors
correcting UI documentation
using better method to create administrative data using continued weeks claimed from ETA 539
updating README.md file for improved UI administrative data
removing useless line of code in create_admin.py
making changes to random forest for better positive program participation prediction
File Changes
M Housing/Rf_probs.ipynb (2)
M UI/README.md (9)
M UI/Rf_probs.py (2)
M UI/UI_Imputation_Report.pdf (0)
M UI/UI_impute.py (1)
M UI/create_admin.py (52)
A WC/README.md (86)
A WC/Rf_probs.py (196)
A WC/WC_Imputation_Report.pdf (0)
A WC/WC_impute.py (295)
A WC/arma.ipynb (414)
A WC/claims_projected (1)
A WIC/README.md (102)
A WIC/Rf_probs.py (342)
A WIC/WIC_Imputation_Report.pdf (0)
A WIC/WIC_impute.py (906)
A WIC/create_admin.py (96)
Patch Links:
https://github.com/open-source-economics/C-TAM/pull/47.patch
@feenberg Thanks for the feedback. "Accuracy of 98%" means that 98% of the binary outcomes in the held-out set were predicted correctly and 2% were not. For example, with `y_predicted = [1, 1, 1, 0]` and `y_true = [0, 1, 1, 1]`, `accuracy_score(y_true, y_predicted)` returns `0.5`, so the accuracy would be 50%, since 50% of the observation outcomes were predicted correctly. You are correct, these are binary outcomes, and yes, there are possible false positives and false negatives. The 85-90% accuracy above (depending on the program considered) refers to the rate of true positives (i.e., a 10-15% false negative rate), and we got a 98-99.99% accuracy rate when considering true negatives (i.e., a <1-2% false positive rate).
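The distinction between overall accuracy and the separate positive/negative rates can be sketched in plain Python (a stand-in for scikit-learn's `accuracy_score`, using the toy vectors above; `binary_rates` is a hypothetical helper, not part of the Rf_probs code):

```python
def binary_rates(y_true, y_pred):
    """Return (overall accuracy, true positive rate, true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Accuracy among true participants; 1 - tpr is the false negative rate.
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    # Accuracy among true non-participants; 1 - tnr is the false positive rate.
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, tpr, tnr

acc, tpr, tnr = binary_rates([0, 1, 1, 1], [1, 1, 1, 0])
print(acc, tpr, tnr)  # -> 0.5 0.6666666666666666 0.0
```

Reporting `tpr` and `tnr` separately is what answers the question above: under class imbalance, overall accuracy is dominated by the majority class.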
On Thu, 24 Aug 2017, Parker Rogers wrote:
So the false positive rate is very low; what is the false negative rate?
dan
@feenberg there was a 10-15% false negative rate. However, since most transfer programs in the CPS are underreported, our hope is that these false negatives are actually due to underreporting, since the misclassified individuals closely match the demographics and eligibility criteria of participants. Also, the share of individuals who participate is usually much smaller than the share who don't, so a 10-15% false negative rate corresponds to a relatively small number of individuals incorrectly identified.
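The imbalance point can be made concrete with some back-of-the-envelope arithmetic (all counts and rates here are assumptions for illustration, not actual CPS or program figures):

```python
# Hypothetical illustration of how few individuals a 10-15% false negative
# rate touches when participation itself is rare (assumed numbers).
n_records = 100_000
participation_rate = 0.02          # assume 2% of records truly participate
fn_rate = 0.12                     # 12% false negative rate, mid-range of 10-15%
fp_rate = 0.01                     # 1% false positive rate

participants = n_records * participation_rate      # 2,000 true participants
non_participants = n_records - participants        # 98,000 non-participants

false_negatives = participants * fn_rate           # only 240 missed participants
false_positives = non_participants * fp_rate       # 980 spurious participants

correct = n_records - false_negatives - false_positives
overall_accuracy = correct / n_records
print(false_negatives, false_positives, overall_accuracy)  # 240.0 980.0 0.9878
```

Because the positive class is small, a double-digit false negative *rate* still misclassifies far fewer individuals than even a 1% false positive rate applied to the much larger negative class.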
Great! Thanks Parker!