sumedhasingla commented 6 years ago

- [x] Plot the columns in file: /pghbio/dbmi/batmanlab/Data/COPDGene/ClinicalData/phase 1 Final 10K/phase 1 Pheno/Final10000_Phase1_Rev_28oct16.txt against each other to see which columns hold important information.
  - For FinalGold: Color by Gold score
  - Important columns: FEV1, FEV1_FVC
  - Randomly shuffle the data and plot a subset
- [x] Figure out which columns in the file we can predict with image features. As of now we have:
  - Exacerbation
  - Mortality
  - Change in FEV1
  - Change in FEV1/FVC
- [x] Compare image features with:
  - Only Age, gender and pack of smoking
  - Age + gender + pack of smoking + image features
  - Age + gender + pack of smoking + FEV1
  - image features
- [x] Come up with a plot/diagram/figure to show the above comparison in the most informative manner
- [x] Finalize the models for predictions
  - Binary / unbalanced multi-label classification
  - Regression model
- [x] Finalize the statistics we are going to report for each model. Example: AUC, R-squared, confusion matrix

jrahimik commented 6 years ago

https://github.com/sumedhasingla/COPD_Project/blob/master/Results-For_Sumedha.ipynb

kayhan-batmanghelich commented 6 years ago

@jrahimik @sumedhasingla please update incrementally here

jrahimik commented 6 years ago

This is the latest confusion matrix on exacerbation. The method still needs to be updated; so far I am using random forest.

kayhan-batmanghelich commented 6 years ago

@jrahimik, first of all, explain the experiment, the model, the labels (0, 1, 2, ...), etc. The explanation should be sufficient. Second, you should start with a simple experiment and then move to the complex one.

You need to show the results for exacerbation vs. no exacerbation. That makes the data more balanced.

Third, I need a test of (Confounder, Img) vs. (Confounder).

Also, what is the model applied to: the reduced-dimension features or the original 128-dim features?

jrahimik commented 6 years ago

The result for exacerbation vs. no exacerbation is in the IPython notebook:

The result above is one level after that.

We are going to report the results as follows:

Only Age, gender and pack of smoking

Age + gender + pack of smoking + image features

Age + gender + pack of smoking + FEV1

image features

This will give us a better picture of the quality of the features.

jrahimik commented 6 years ago

This result is with random forest. I will update it with logistic regression shortly. Each of these results includes the confounders.

kayhan-batmanghelich commented 6 years ago

@jrahimik I need the following:

1. Age (A) + gender (G) + pack of smoking (P)
2. A + G + P + FEV1
3. A + G + P + different image features

I need these for both the linear model and RF. You need to plot the ROC and PR curves and report AUC-ROC and AUC-PR. Also, I need the same results for one-class SVM and one-class RF, as I discussed here.

jrahimik commented 6 years ago

The results above are for random forest for 1, 2, and 3. I will add the other models and the plots you mentioned above.

jrahimik commented 6 years ago

We can define a new baseline as Cs + FEV1 + pctEmp and compare it with Cs + FEV1 + PCs_image. This will tell us how much value the image features add.

kayhan-batmanghelich commented 6 years ago

@jrahimik there is something else suspicious about this plot: it is too sharp, which indicates that the outputs are very discrete! Did you plot this on the classification output (which is discrete)? If so, this is wrong. It should be done on the continuous decision function. In the linear case, this can be done on $w^T x + b$; for RF, it can be done on the probability output of the RF (unless RF already produces a continuous output). Stop by tomorrow.
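A minimal sketch of why a ROC curve built from hard class labels looks sharp. The labels and scores are toy values made up for illustration, and `roc_auc` and `roc_points` are hypothetical helpers, not the code from the notebook. In practice a library routine such as scikit-learn's `roc_curve`, fed with `decision_function` or `predict_proba` output, would play the same role.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def roc_points(y_true, scores):
    """(FPR, TPR) operating points: the origin plus one point per distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data (assumption): binary labels and a continuous model score per subject.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90]
hard_labels = [1 if s >= 0.5 else 0 for s in scores]  # thresholded predictions

curve_continuous = roc_points(y_true, scores)       # many operating points
curve_discrete = roc_points(y_true, hard_labels)    # only origin + two points

print(len(curve_continuous), roc_auc(y_true, scores))
print(len(curve_discrete), roc_auc(y_true, hard_labels))
```

With hard labels the curve collapses to a single interior operating point joined to the corners by straight lines, which is the sharp shape described above. The continuous scores give one point per distinct threshold.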

kayhan-batmanghelich commented 6 years ago

@sumedhasingla @jrahimik these experiments should be finished and finalized by 10/3/2018.

jrahimik commented 6 years ago

The Results for linear model

jrahimik commented 6 years ago

The Results for Random Forest

jrahimik commented 6 years ago

The Results for Gradient Boosting:

jrahimik commented 6 years ago

@kayhan-batmanghelich The linear model performs almost the same as the non-linear models, which indicates a mostly linear relationship between the features and exacerbation. Moreover, the AUCs of percent emphysema and the new features are close, which indicates that these features perform similarly in predicting exacerbation.

kayhan-batmanghelich commented 6 years ago

@jrahimik these are more reasonable plots. I expected something like this. Also plot the PR curves and post them here.

jrahimik commented 6 years ago

Precision Recall plot Linear Model

Precision Recall plot Gradient Boosting

Precision Recall plot Random Forest

kayhan-batmanghelich commented 6 years ago

@jrahimik thanks. The PR curves show the results better.

Add the one-class classification to the list too.

Now it is time to apply the entire pipeline on the entire data and repeat the rest of the experiments in Raul's paper on this data.

sumedhasingla commented 6 years ago

We are tracking the first two tasks in a separate issue: #52. @jrahimik, please update that issue with the approach followed and the findings.

sumedhasingla commented 6 years ago

To replicate Raul's paper we need results for some statistical methods. We are tracking this in issue #66. @jrahimik, please update that issue with the approach followed and the findings.

sumedhasingla commented 6 years ago

The updated pipeline is available at: https://github.com/sumedhasingla/COPD_Project/blob/master/ExploratoryAnalysis.ipynb

Please note that the image features used there are not from the cross-validation set. I will update this here once we have cross-validation features.

sumedhasingla commented 6 years ago

- [x] Plot the columns in file: /pghbio/dbmi/batmanlab/Data/COPDGene/ClinicalData/phase 1 Final 10K/phase 1 Pheno/Final10000_Phase1_Rev_28oct16.txt against each other to see which columns hold important information.
  - For FinalGold: Color by Gold score
  - Important columns: FEV1, FEV1_FVC
  - Randomly shuffle the data and plot a subset
- [x] Figure out which columns in the file we can predict with image features. As of now we have:
  - Exacerbation
  - Mortality
  - Change in FEV1
  - Change in FEV1/FVC
- [x] Compare image features with:
  - Only Age, gender and pack of smoking
  - Age + gender + pack of smoking + image features
  - Age + gender + pack of smoking + FEV1
  - image features
- [x] Come up with a plot/diagram/figure to show the above comparison in the most informative manner
- [x] Finalize the models for predictions
  - Binary / unbalanced multi-label classification
  - Regression model
- [ ] Finalize the statistics we are going to report for each model. Example: AUC, R-squared, confusion matrix

The first two tasks are tracked by issue #52. We are comparing six kinds of features:

1. Age-Gender-Smoking
2. Image features (PCA)
3. Age-Gender-Smoking-ImageFeatures
4. Age-Gender-Smoking-FEV1
5. Age-Gender-Smoking-FEV1-ImageFeatures
6. Age-Gender-Smoking-FEV1-%Emph

If our image features are good, we expect 5 to perform better than 6, and 3 to perform better than 4.

For the classification problem, we are plotting

ROC curve with AUC
Precision recall curve with AP
5 folds of the best feature definition to show robustness
Confusion matrix
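As a sketch of what the AP number on the precision-recall curve summarizes, the snippet below computes average precision from continuous scores. The data are toy values made up for illustration, `average_precision` is a hypothetical helper, and it assumes there are no tied scores; a library routine such as scikit-learn's `average_precision_score` is the usual choice in practice.

```python
def average_precision(y_true, scores):
    """Mean of precision@k over the ranks k at which a positive is retrieved.

    Assumes all scores are distinct (ties are not grouped).
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k  # precision among the top-k ranked samples
    return total / n_pos


# Toy data (assumption): binary labels and a continuous model score per subject.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.90]

ap = average_precision(y_true, scores)
prevalence = sum(y_true) / len(y_true)  # AP of a random ranking is near this value
print(ap, prevalence)
```

Reporting AP next to the positive-class prevalence makes it easier to judge how far a model is above chance on unbalanced labels.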

For the regression problems we are showing

R-squared
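For reference, a minimal sketch of the R-squared statistic, using made-up values. `r_squared` is a hypothetical helper; scikit-learn's `r2_score` computes the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy regression targets (assumption), e.g. a change in a lung-function measure.
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                # predictions equal the targets
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
print(perfect, mean_only)
```

A model that always predicts the mean scores 0, so on held-out folds a value near or below 0 means the features add nothing over that baseline.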

For unbalanced data we are using

Stratified KFold
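A minimal sketch of what stratified splitting does, assuming made-up class counts. `stratified_folds` is a hypothetical helper written for illustration; scikit-learn's `StratifiedKFold` is the standard implementation.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds so each fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy unbalanced labels (assumption): 80 negatives and 20 positives.
labels = [0] * 80 + [1] * 20
folds = stratified_folds(labels, n_folds=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)
```

Each fold ends up with the same share of positives as the full data set, so per-fold AUC and AP are computed on comparable class balances.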

Other points

We are not predicting the specific number of exacerbations. We are predicting 0 vs. 1, where all counts in [1-6] are merged into class 1.
We are doing 5-fold cross-validation.
We choose model parameters via experimentation.
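The label merging described above can be sketched as follows. The counts are made-up values and `binarize_exacerbations` is a hypothetical helper, not the function used in the notebook.

```python
def binarize_exacerbations(counts):
    """Map an exacerbation count to a binary label: 0 stays 0, any count >= 1 becomes 1."""
    return [1 if count >= 1 else 0 for count in counts]


# Toy exacerbation counts per subject (assumption), in the range 0-6.
counts = [0, 3, 0, 1, 6, 2, 0, 0]
labels = binarize_exacerbations(counts)
positive_rate = sum(labels) / len(labels)
print(labels, positive_rate)
```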

sumedhasingla commented 6 years ago

We will track the performance of the pipeline on real image features using these issues:

#10 [10K subjects old COPD Data]

#54 [10K subjects from new COPD data from hard-drive. Cross Validation Image Features]

batmanlab / BatmanLabWiki

[COPD-MICCAI 2018] Exploratory analysis #57