batmanlab / BatmanLabWiki

Documents and Wiki of the lab
Apache License 2.0

[COPD-MICCAI 2018] Exploratory analysis #57

Closed sumedhasingla closed 6 years ago

sumedhasingla commented 6 years ago
jrahimik commented 6 years ago

https://github.com/sumedhasingla/COPD_Project/blob/master/Results-For_Sumedha.ipynb

kayhan-batmanghelich commented 6 years ago

@jrahimik @sumedhasingla please update incrementally here

jrahimik commented 6 years ago

This is the latest confusion matrix for exacerbation. The method still needs to be updated; so far I am using a random forest. image

kayhan-batmanghelich commented 6 years ago

@jrahimik, first of all, explain the experiment, model, labels (0, 1, 2, ...), etc. The explanation should be sufficient. Second, you should start with a simple experiment and then go to the complex one.

You need to show the results for exacerbation vs. no exacerbation. That makes the data more balanced.

Third, I need a test of (Confounder, Img) vs. (Confounder).

Also, what is the model applied on: the reduced-dimension features or the original 128-dim features?
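For reference, the exacerbation vs. no-exacerbation grouping requested above could be sketched as follows. This is only a minimal illustration: the counts are made-up values, not COPDGene data, and the helper names `binarize_exacerbations` and `class_balance` are hypothetical.

```python
from collections import Counter


def binarize_exacerbations(counts):
    """Map exacerbation counts to 0 (none) or 1 (one or more)."""
    return [1 if c > 0 else 0 for c in counts]


def class_balance(labels):
    """Return the fraction of samples that fall in each class."""
    n = len(labels)
    return {k: v / n for k, v in sorted(Counter(labels).items())}


# Hypothetical per-subject exacerbation counts (illustrative only).
counts = [0, 0, 1, 0, 3, 0, 2, 0, 6, 1]
labels = binarize_exacerbations(counts)

print("multi-class balance:", class_balance(counts))
print("binary balance:     ", class_balance(labels))
```

With the raw counts, the rare classes (2, 3, 6) each hold a single subject; after merging, the two classes are of comparable size, which is the balancing effect mentioned above.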

jrahimik commented 6 years ago

The result for exacerbation vs. no exacerbation is in the IPython notebook: image

The result above is one level after that.

We are going to report the results as follows:

Only Age, gender and pack of smoking

Age + gender + pack of smoking + image features

Age + gender + pack of smoking + FEV1

image features

This will give us a better picture of the feature quality.

jrahimik commented 6 years ago

image This result is with random forest. I will update it with logistic regression shortly. Each of these results includes the confounders.

kayhan-batmanghelich commented 6 years ago

@jrahimik I need the following:

  1. Age (A) + gender (G) + pack of smoking (P)
  2. A + G + P + FEV1
  3. A + G + P + different image features

for linear and RF models. You need to plot AUC-ROC and AUC-PR. Also, I need the same results for one-class SVM and one-class RF, as I discussed here.

jrahimik commented 6 years ago

The results above are for random forest on feature sets 1, 2, and 3. I will add the other models and the plots you mentioned above.

jrahimik commented 6 years ago

We can define a new baseline as Cs + fev1 + pctEmp and compare it with Cs + fev1 + PCs_image. This will tell us how much added value the image has.

kayhan-batmanghelich commented 6 years ago

@jrahimik there is something else suspicious about this plot: it is too sharp, which indicates that the outputs are very discrete! Did you plot this on the classification output (which is discrete)? If so, this is wrong. It should be done on the continuous decision function. In the linear case, this can be done on $w^T x + b$; for RF, it can be done on the probability output of the RF (unless the RF already produces a continuous output). Stop by tomorrow.
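To make the point about discrete vs. continuous outputs concrete, here is a small sketch in plain Python. The weights, bias, feature rows, and labels are invented for illustration, and `roc_points`, `auc`, and `linear_score` are hypothetical helpers, not functions from the notebook. With scikit-learn, the continuous scores would typically come from `decision_function` (linear models) or `predict_proba` (random forest) rather than `predict`.

```python
def linear_score(w, x, b):
    """Continuous decision function w^T x + b for one sample."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def roc_points(y_true, scores):
    """ROC curve as (FPR, TPR) points, one per distinct threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical linear model and data (illustrative only).
w, b = [1.0, -0.5], -0.2
X = [[0.9, 0.2], [0.1, 0.8], [0.6, 0.1], [0.3, 0.3], [0.2, 0.9], [0.7, 0.6]]
y = [1, 0, 0, 1, 0, 1]

scores = [linear_score(w, x, b) for x in X]   # continuous output
hard = [1 if s > 0 else 0 for s in scores]    # thresholded class labels

roc_hard = roc_points(y, hard)        # only 3 points: a single sharp corner
roc_cont = roc_points(y, scores)      # one point per distinct score
print(len(roc_hard), round(auc(roc_hard), 3))
print(len(roc_cont), round(auc(roc_cont), 3))
```

The curve built from hard labels has a single interior point, which produces the sharp shape described above, while the curve built from the continuous scores has one point per distinct score and generally gives a different AUC.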

kayhan-batmanghelich commented 6 years ago

@sumedhasingla @jrahimik these experiments should be finished and finalized by 10/3/2018.

jrahimik commented 6 years ago

The Results for linear model image

jrahimik commented 6 years ago

The Results for Random Forest image

jrahimik commented 6 years ago

The Results for Gradient Boosting:

image

jrahimik commented 6 years ago

@kayhan-batmanghelich The linear model performs almost the same as the non-linear models, which suggests a mostly linear relationship between the features and exacerbation. Moreover, the AUCs of percent emphysema and the new features are close together, which indicates that these features perform similarly in predicting exacerbation.

kayhan-batmanghelich commented 6 years ago

@jrahimik these are more reasonable plots. I expected something like this. Also plot the PR curves and post them here.

jrahimik commented 6 years ago

Precision Recall plot Linear Model image

Precision Recall plot Gradient Boosting image

Precision Recall plot Random Forest image

kayhan-batmanghelich commented 6 years ago

@jrahimik thanks. The PR curves show the results better.

Add the one-class classification to the list too.

Now it is time to apply the entire pipeline on the entire data and repeat the rest of the experiments in Raul's paper on this data.

sumedhasingla commented 6 years ago

We are tracking the first 2 tasks in a separate issue: #52. @jrahimik Please update that issue with the approach followed and the findings.

sumedhasingla commented 6 years ago

To replicate Raul's paper, we need results from some statistical methods. We are tracking this in issue #66. @jrahimik Please update that issue with the approach followed and the findings.

sumedhasingla commented 6 years ago

The updated pipeline is available at: https://github.com/sumedhasingla/COPD_Project/blob/master/ExploratoryAnalysis.ipynb

Please note that the image features used there are not from the cross-validation set. I will update it here once we have the cross-validation features.

sumedhasingla commented 6 years ago

  • [x] Plot the columns in file: /pghbio/dbmi/batmanlab/Data/COPDGene/ClinicalData/phase 1 Final 10K/phase 1 Pheno/Final10000_Phase1_Rev_28oct16.txt against each other to see which columns hold important information.
    • For FinalGold: Color by Gold score
    • Important columns: FEV1, FEV1_FVC
    • Randomly shuffle the data and plot a subset
  • [x] Figure out the columns in the file which we can predict with image features. As of now we have
    • Exacerbation
    • Mortality
    • Change in FEV1
    • Change in FEV1/FVC
  • [x] Compare image features with
    • Only Age, gender and pack of smoking
    • Age + gender + pack of smoking + image features
    • Age + gender + pack of smoking + FEV1
    • image features
  • [x] Come up with a plot/diagram/figure to show the above comparison in the most informative manner
  • [x] Finalize the models for predictions
    • Binary / unbalanced multi-label classification
    • Regression Model
  • [ ] Finalize the statistics we are going to report for each model. Example: AUC, R-Square, confusion matrix

The first two tasks are tracked by issue #52. We are comparing 6 kinds of features:

  1. Age-Gender-Smoking
  2. Image features (PCA)
  3. Age-Gender-Smoking-ImageFeatures
  4. Age-Gender-Smoking-FEV1
  5. Age-Gender-Smoking-FEV1-ImageFeatures
  6. Age-Gender-Smoking-FEV1-%Emph

We hope that, if our image features are good, 5 will perform better than 6 and 3 will perform better than 4.
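One way to keep the six feature definitions and the two comparisons above explicit in code is a small lookup table. This is only a sketch: the column names (`age`, `gender`, `smoking`, `fev1`, `pct_emph`, `img_pc_*`) and the number of PCA components are placeholders, not the actual columns of the COPDGene file.

```python
# Hypothetical column names; the real data file may use different ones.
CONFOUNDERS = ["age", "gender", "smoking"]
IMAGE_PCS = [f"img_pc_{i}" for i in range(1, 4)]  # assumed 3 PCA components

FEATURE_SETS = {
    1: CONFOUNDERS,
    2: IMAGE_PCS,
    3: CONFOUNDERS + IMAGE_PCS,
    4: CONFOUNDERS + ["fev1"],
    5: CONFOUNDERS + ["fev1"] + IMAGE_PCS,
    6: CONFOUNDERS + ["fev1", "pct_emph"],
}

# (set with image features, reference set) pairs named in the text.
COMPARISONS = [(5, 6), (3, 4)]


def added_columns(a, b):
    """Columns present in feature set `a` but not in feature set `b`."""
    return [c for c in FEATURE_SETS[a] if c not in FEATURE_SETS[b]]


for a, b in COMPARISONS:
    print(f"set {a} vs set {b}: "
          f"only in {a} = {added_columns(a, b)}, "
          f"only in {b} = {added_columns(b, a)}")
```

Listing the differing columns makes it clear what each comparison isolates: 5 vs. 6 swaps percent emphysema for the image features, while 3 vs. 4 swaps FEV1 for the image features.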

For the classification problem, we are plotting

  1. ROC curve with AUC
  2. Precision recall curve with AP
  3. 5 folds of the best feature definition to show robustness
  4. Confusion matrix
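As a reference for the AP number reported with the precision-recall curve, here is a minimal plain-Python sketch. It uses the step-wise definition (sum of precision weighted by the increase in recall at each threshold); the scores and labels are invented, and `precision_recall_points` and `average_precision` are hypothetical helper names.

```python
def precision_recall_points(y_true, scores):
    """(recall, precision) at every distinct score threshold, high to low."""
    n_pos = sum(y_true)
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((tp / n_pos, tp / (tp + fp)))
    return points


def average_precision(points):
    """Sum of precision weighted by the recall gained at each threshold."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical imbalanced example: 3 positives out of 8 subjects.
y = [1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

ap = average_precision(precision_recall_points(y, scores))
prevalence = sum(y) / len(y)  # precision of a no-skill classifier
print(round(ap, 3), prevalence)
```

Comparing AP against the positive-class prevalence, rather than against 0.5 as with ROC AUC, is what makes the PR view informative on unbalanced labels.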

For the regression problems we are showing

  1. RSquare
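For clarity on what the R-squared number means, a minimal sketch follows. The target values stand in for a regression target such as change in FEV1 and are made up; `r_squared` is a hypothetical helper using the usual definition, 1 minus the ratio of residual to total sum of squares.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions (illustrative only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(r_squared(y_true, y_pred))         # model predictions
print(r_squared(y_true, [2.5] * 4))      # always predicting the mean
```

A model that always predicts the mean scores 0, so the reported value can be read as the improvement over that trivial baseline; on held-out folds it can also be negative.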

For un-balanced data we are doing

  1. Stratified KFold
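The idea behind stratified K-fold can be sketched in plain Python as below. This is an illustration of the technique, not the code used in the notebook (scikit-learn provides `StratifiedKFold` for this); the label vector, the seed, and the function name `stratified_kfold` are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) with class proportions kept per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(n_splits)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[j % n_splits].append(i)  # deal each class round-robin

    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for f in folds[:k] + folds[k + 1:] for i in f)
        yield train, test


# Hypothetical unbalanced labels: 40 negatives, 10 positives.
labels = [0] * 40 + [1] * 10
splits = list(stratified_kfold(labels, n_splits=5))

for train, test in splits:
    print(len(train), len(test), sum(labels[i] for i in test))
```

Each test fold here receives 2 of the 10 positives, so every fold sees the same 20% positive rate as the full data instead of occasionally ending up with almost no positive cases.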

Other points

  1. We are not predicting the specific number of exacerbations. We are predicting 0 vs. 1, where all counts in [1-6] are merged into class 1.
  2. We are doing 5 fold cross validation.
  3. We choose model parameters via experimentation.

sumedhasingla commented 6 years ago

We will track the performance of the pipeline on real image features using these issues:

#10 [10K subjects old COPD Data]

#54 [10K subjects from new COPD data from hard-drive. Cross Validation Image Features]