Closed parkerrogers closed 7 years ago
On Thu, 24 Aug 2017, Parker Rogers wrote:
This pull request should be considered after WIC, WC, and UI correction PRs.
I've fixed the Random Forest pruning in all of the Rf_probs scripts. Allowing deeper trees lets the Random Forests predict positive program participation; when pruned too aggressively, a Random Forest can miss all positive participation entirely and predict no participation at all.
Correcting the pruning improved the accuracy of the RFC predictions to around 85-90% (depending on the program) for test-set records that currently participate in the referenced program. Overall accuracy remains around 98-99%.
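The over-pruning failure mode is easy to see with a toy example: a tree pruned all the way down effectively predicts the majority class, which under class imbalance scores high accuracy while predicting no participation at all. A minimal plain-Python sketch (the counts are assumptions for illustration, not the actual CPS data or the Rf_probs code):

```python
# Toy illustration of over-pruning under class imbalance (assumed numbers).
# A tree pruned down to its root just predicts the majority class.
y_true = [1] * 20 + [0] * 980                    # assume 2% of 1,000 records participate
majority = max(set(y_true), key=y_true.count)    # 0: non-participation dominates
y_pred = [majority] * len(y_true)                # over-pruned model predicts all zeros

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
positives_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
print(accuracy, positives_found)                 # 0.98 0
```

High overall accuracy, yet every participant is missed, which is why accuracy alone can hide the problem the deeper trees fix.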
What is the meaning of "accuracy at around 98%"? It is a binary outcome, right? So there are false positives and false negatives; what is the rate of each separately?
dan
You can view, comment on, or merge this pull request online at:
https://github.com/open-source-economics/C-TAM/pull/47
Commit Summary
pushing admin data cleaning
finishing create_admin.py
pushing WIC imputation script
adding imputation script and Random Forest script
adding imputation report
adding README file
Merge remote-tracking branch 'upstream/master'
Merge remote-tracking branch 'upstream/master'
adding Worker's Compensation imputation script, report
adding README file and making small docstring corrections to WC_impute.py
correcting README errors
correcting UI documentation
using better method to create administrative data using continued weeks claimed from ETA 539
updating README.md file for improved UI administrative data
removing useless line of code in create_admin.py
making changes to random forest for better positive program participation prediction
File Changes
M Housing/Rf_probs.ipynb (2)
M UI/README.md (9)
M UI/Rf_probs.py (2)
M UI/UI_Imputation_Report.pdf (0)
M UI/UI_impute.py (1)
M UI/create_admin.py (52)
A WC/README.md (86)
A WC/Rf_probs.py (196)
A WC/WC_Imputation_Report.pdf (0)
A WC/WC_impute.py (295)
A WC/arma.ipynb (414)
A WC/claims_projected (1)
A WIC/README.md (102)
A WIC/Rf_probs.py (342)
A WIC/WIC_Imputation_Report.pdf (0)
A WIC/WIC_impute.py (906)
A WIC/create_admin.py (96)
Patch Links:
https://github.com/open-source-economics/C-TAM/pull/47.patch
@feenberg Thanks for the feedback. "Accuracy of 98%" means that 98% of the binary outcomes in the held-out set were predicted correctly and 2% were not. For example, with `y_predicted = [1, 1, 1, 0]` and `y_true = [0, 1, 1, 1]`, `accuracy_score(y_true, y_predicted)` returns `0.5`, so the accuracy would be 50%, since 50% of the observation outcomes were predicted correctly. You are correct, these are binary outcomes, and yes, there are possible false positives and false negatives. The 85-90% accuracy above (depending on the program considered) refers to the rate of true positives (i.e., a 10-15% false negative rate), and we got a 98-99.99% accuracy rate when considering true negatives (i.e., a <1-2% false positive rate).
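The distinction between overall accuracy and the separate positive/negative rates can be sketched in plain Python (a stand-in for scikit-learn's `accuracy_score`, using the toy vectors above; `binary_rates` is a hypothetical helper, not part of the Rf_probs code):

```python
def binary_rates(y_true, y_pred):
    """Return (overall accuracy, true positive rate, true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Accuracy among true participants; 1 - tpr is the false negative rate.
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    # Accuracy among true non-participants; 1 - tnr is the false positive rate.
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, tpr, tnr

acc, tpr, tnr = binary_rates([0, 1, 1, 1], [1, 1, 1, 0])
print(acc, tpr, tnr)  # -> 0.5 0.6666666666666666 0.0
```

Reporting `tpr` and `tnr` separately is what answers the question above: under class imbalance, overall accuracy is dominated by the majority class.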
On Thu, 24 Aug 2017, Parker Rogers wrote:
So the false positive rate is very low; what is the false negative rate?
dan
@feenberg there was a 10-15% false negative rate. However, since most transfer programs in the CPS are underreported, our hope is that these false negatives are actually due to underreporting, since the misclassified individuals closely match the demographics and eligibility criteria of participants. Also, the share of individuals who participate is usually much smaller than the share who don't, so a 10-15% false negative rate corresponds to a relatively small number of individuals incorrectly identified.
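The imbalance point can be made concrete with some back-of-the-envelope arithmetic (all counts and rates here are assumptions for illustration, not actual CPS or program figures):

```python
# Hypothetical illustration of how few individuals a 10-15% false negative
# rate touches when participation itself is rare (assumed numbers).
n_records = 100_000
participation_rate = 0.02          # assume 2% of records truly participate
fn_rate = 0.12                     # 12% false negative rate, mid-range of 10-15%
fp_rate = 0.01                     # 1% false positive rate

participants = n_records * participation_rate      # 2,000 true participants
non_participants = n_records - participants        # 98,000 non-participants

false_negatives = participants * fn_rate           # only 240 missed participants
false_positives = non_participants * fp_rate       # 980 spurious participants

correct = n_records - false_negatives - false_positives
overall_accuracy = correct / n_records
print(false_negatives, false_positives, overall_accuracy)  # 240.0 980.0 0.9878
```

Because the positive class is small, a double-digit false negative *rate* still misclassifies far fewer individuals than even a 1% false positive rate applied to the much larger negative class.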
Great! Thanks Parker!