UBC-MDS / group29

Project Repo for Group 29 for DSCI 522
MIT License

Milestone #2 #28

Closed rachelywong closed 3 years ago

rachelywong commented 3 years ago

Milestone #2 Tasks:

rachelywong commented 3 years ago

From discussion in class today:

rachelywong commented 3 years ago

Machine Learning Plan:

  1. Split the data into training and testing sets
  2. Label our features (categorical, numerical, binary)
  3. Create transformers for our features
  4. Create a models dictionary to test our candidate models (baseline DummyClassifier, RBF SVM, and logistic regression) - 571 LAB 4 3.2
  5. Carry forward the best-scoring model, chosen by f1 score or mean CV score (TBD)
  6. Hyperparameter optimization via randomized search on the best model - 571 LAB 4 3.3
  7. Hyperparameter optimization results: confusion matrix, precision-recall curve, AUC? 573 LAB 1 2.7
  8. Apply the best model with its tuned hyperparameters to the test set
  9. Use the coefficients to get the top indicators of readmission 571 LAB 4 4.1
    • extra: find the test examples predicted most strongly as readmission vs. not 571 LAB 4 5.2

Also, any functions we write need documentation and sensible tests.
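The first five steps above could be sketched roughly like this with scikit-learn. The dataframe, column names, and target here are placeholders, not our real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# 1. split data into training and testing (toy stand-in dataframe)
df = pd.DataFrame({
    "age": [25, 40, 33, 51, 62, 29, 45, 38, 57, 31, 48, 66],
    "sex": ["F", "M", "F", "M", "F", "M", "M", "F", "M", "F", "F", "M"],
    "readmitted": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0],
})
train_df, test_df = train_test_split(
    df, test_size=0.25, random_state=123, stratify=df["readmitted"]
)
X_train = train_df.drop(columns=["readmitted"])
y_train = train_df["readmitted"]

# 2./3. label features and build transformers for them
numeric_features = ["age"]
categorical_features = ["sex"]
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# 4. models dictionary: baseline dummy, RBF SVM, logistic regression
models = {
    "dummy": DummyClassifier(),
    "rbf_svm": SVC(),
    "logreg": LogisticRegression(max_iter=1000),
}

# 5. compare mean CV f1 scores and carry the best model forward
results = {}
for name, model in models.items():
    pipe = make_pipeline(preprocessor, model)
    results[name] = cross_val_score(
        pipe, X_train, y_train, cv=3, scoring="f1"
    ).mean()
best_name = max(results, key=results.get)
```

Whether we rank by f1 or mean CV accuracy is just a matter of changing `scoring`, so we can decide that later.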

jraza19 commented 3 years ago


Thanks for this, Rachel! I checked in with Varada regarding this work, specifically the correlated features. If we decide to go ahead with logistic regression, it will make the weight of one of the correlated features larger than the other's, which will make the coefficients in step 9 hard to interpret; prediction will still be okay, though. I can't remember whether the same issue applies to the RBF SVM. I will double-check.
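A quick toy demonstration of this point (made-up data, not ours): with two identical copies of a feature, L2-regularized logistic regression shares the weight between the copies, so neither coefficient alone reflects the feature's full effect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
# noisy binary target loosely driven by x
y = (x[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

solo = LogisticRegression().fit(x, y)                 # one copy of the feature
duo = LogisticRegression().fit(np.hstack([x, x]), y)  # two identical copies

# the two duplicated coefficients match each other, and each is
# smaller than the single-feature coefficient
print(solo.coef_, duo.coef_)
```

With real correlated (rather than duplicated) features the split is uneven rather than exactly equal, which is the interpretability problem Varada flagged.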

rachelywong commented 3 years ago

As we discussed in lab:

Scripts and other docs:

Analysis plan:

  1. Split the data into training and testing sets @rachelywong
  2. Label our features (categorical, numerical, binary) @rachelywong
  3. Create transformers for our features @rachelywong
  4. Create a models dictionary to test our candidate models (baseline DummyClassifier, RBF SVM, and logistic regression) - 571 LAB 4 3.2 @rachelywong
  5. Carry forward the best-scoring model, chosen by f1 score or mean CV score (TBD) @rachelywong
  6. Hyperparameter optimization via randomized search on the best model - 571 LAB 4 3.3 @sukh2929
  7. Hyperparameter optimization results: confusion matrix, precision-recall curve, AUC? 573 LAB 1 2.7 @sukh2929
  8. Apply the best model with its tuned hyperparameters to the test set @sukh2929
  9. Use the coefficients to get the top indicators of readmission 571 LAB 4 4.1 @wiwang
  10. extra --> find the test examples predicted most strongly as readmission vs. not 571 LAB 4 5.2 @wiwang
  11. store_results function --> write documentation and function tests @rachelywong (maybe make this its own script in the future)
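Steps 6-9 could look roughly like this; the synthetic data stands in for our real train/test split, and the `C` search space is just a placeholder until we pick the real distributions:

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score

# synthetic stand-in for our preprocessed data
X, y = make_classification(n_samples=300, n_features=5, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# 6. randomized search over hyperparameters, scored on f1
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": loguniform(1e-3, 1e3)},  # placeholder search space
    n_iter=10,
    scoring="f1",
    random_state=123,
)
search.fit(X_train, y_train)

# 7./8. evaluate the tuned model on the test set
y_pred = search.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("test f1:", f1_score(y_test, y_pred))

# 9. rank coefficient magnitudes to find the strongest indicators
best = search.best_estimator_
top = np.argsort(-np.abs(best.coef_[0]))
print("features, strongest first:", top)
```

The precision-recall curve and AUC from step 7 would come from `sklearn.metrics` the same way, using the tuned model's scores instead of hard predictions.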

Submission: @wiwang

rachelywong commented 3 years ago

@wiwang Please close this issue once we have all confirmed via Slack that we are good to go! Then create version 0.1.0 and submit both links to Canvas (repo link and version link). Thank you!