Closed sumedhasingla closed 6 years ago
@jrahimik @sumedhasingla please update incrementally here
This is the latest confusion matrix for exacerbation. The method still needs to be updated; so far I am using a random forest.
@jrahimik, first of all, explain the experiment, the model, the labels (0, 1, 2, ...), etc. The explanation should be sufficient to follow the results. Second, you should start with a simple experiment and then move on to the complex one.
You need to show the results for exacerbation vs. no exacerbation. That makes the data more balanced.
Third, I need a test of (Confounder, Img) vs. (Confounder).
Also, what is the model applied to: the reduced-dimension features or the original 128-dim features?
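A minimal sketch of the exacerbation vs. no exacerbation split suggested above. The label values are made up for illustration, and treating `0` as "no exacerbation" is an assumption, since the thread does not define the label coding.

```python
from collections import Counter


def binarize_exacerbation(levels):
    """Map multi-level labels to 0 (no exacerbation) / 1 (any exacerbation)."""
    return [1 if level > 0 else 0 for level in levels]


# Hypothetical multi-level labels; 0 is ASSUMED to mean "no exacerbation".
levels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
binary = binarize_exacerbation(levels)

multi_counts = Counter(levels)
binary_counts = Counter(binary)
print("multi-level counts:", dict(multi_counts))
print("binary counts:", dict(binary_counts))
```

Comparing the two counts shows how collapsing the rare higher levels into one positive class reduces the imbalance.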
The result for exacerbation vs. no exacerbation is in the ipython notebook:
The result above is one level after that.
We are going to report the results as follows:
1. Only age, gender and pack of smoking
2. Age + gender + pack of smoking + image features
3. Age + gender + pack of smoking + FEV1
4. Image features
This will give us a better picture of feature quality.
This result is with random forest. I will update with logistic regression shortly. Each of these results is with confounders.
@jrahimik I need the following:
for the linear model and RF. You need to plot the ROC and PR curves and report AUC-ROC and AUC-PR. Also, I need the same results for one-class SVM and one-class RF, as I discussed here.
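A rough sketch of how a one-class SVM could be scored with AUC-ROC and AUC-PR, assuming scikit-learn. The data is synthetic (not COPDGene), training on the negative class only is an assumption about the setup, and average precision is used here as one common summary of the PR curve. A one-class RF is not covered because scikit-learn does not ship one.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in data: the one-class model only sees the majority class.
X_train_neg = rng.normal(0.0, 1.0, size=(300, 4))
X_test_neg = rng.normal(0.0, 1.0, size=(180, 4))
X_test_pos = rng.normal(2.0, 1.0, size=(20, 4))  # rare positive class
X_test = np.vstack([X_test_neg, X_test_pos])
y_test = np.r_[np.zeros(180, dtype=int), np.ones(20, dtype=int)]

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train_neg)

# decision_function is larger for inliers, so negate it to get a
# continuous "positive class" score.
scores = -ocsvm.decision_function(X_test)

auc_roc = roc_auc_score(y_test, scores)
auc_pr = average_precision_score(y_test, scores)
print(f"AUC-ROC = {auc_roc:.3f}, AUC-PR (average precision) = {auc_pr:.3f}")
```

Both metrics are computed from the continuous score rather than from `predict`, which only returns +1/-1.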
The results above are for random forest on feature sets 1, 2 and 3. I will add the other models and the plots as you mentioned above.
We can define a new baseline as Cs + fev1 + pctEmp and compare it with Cs + fev1 + PCs_image. This will tell us how much added value the image has.
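The baseline comparison could be sketched as below, assuming scikit-learn and logistic regression. All arrays are synthetic stand-ins (not COPDGene data), and the variable names, dimensions and coefficients are hypothetical; only the two feature-set labels come from the comment above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins; shapes and names are illustrative assumptions.
confounders = rng.normal(size=(n, 3))  # Cs
fev1 = rng.normal(size=(n, 1))
pct_emph = rng.normal(size=(n, 1))     # pctEmp
image_pcs = rng.normal(size=(n, 5))    # PCs_image

# Made-up outcome that depends on a confounder, FEV1 and one image PC.
logit = 0.8 * confounders[:, 0] - 1.0 * fev1[:, 0] + 0.7 * image_pcs[:, 0]
y = (logit + rng.normal(size=n) > 0).astype(int)

feature_sets = {
    "Cs + fev1 + pctEmp": np.hstack([confounders, fev1, pct_emph]),
    "Cs + fev1 + PCs_image": np.hstack([confounders, fev1, image_pcs]),
}

results = {}
for name, X in feature_sets.items():
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # Same 5-fold stratified CV and metric for every feature set.
    aucs = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    results[name] = float(aucs.mean())
    print(f"{name}: mean AUC-ROC = {results[name]:.3f}")
```

Keeping the model, folds and metric fixed means any difference in AUC can be attributed to the feature set alone.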
@jrahimik there is something else suspicious about this plot: it is too sharp, which indicates that the outputs are very discrete. Did you plot this on the classification output (which is discrete)? If so, this is wrong. It should be done on the continuous decision function. In the linear case, this can be done on $w^T x + b$; for RF, it can be done on the probability output of the RF (unless the RF already produces a continuous output). Stop by tomorrow.
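To illustrate the point about discrete outputs, here is a small standard-library sketch with made-up labels and scores. It builds the ROC curve once from a continuous score (standing in for $w^T x + b$) and once from the thresholded 0/1 predictions; `roc_points` and `auc` are hypothetical helpers written for this example.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points from sweeping a threshold over the scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under (x, y) points ordered by increasing x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: labels and a continuous decision value per sample.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [-2.1, -1.3, -0.4, 0.2, -0.1, 0.6, 0.9, 1.7]
hard = [1 if s >= 0 else 0 for s in scores]  # discrete classifier output

cont_curve = roc_points(y_true, scores)
hard_curve = roc_points(y_true, hard)
print("points from continuous scores:", len(cont_curve))
print("points from hard predictions: ", len(hard_curve))
print("AUC continuous vs hard:", auc(cont_curve), auc(hard_curve))
```

With hard predictions the curve has only one operating point between (0, 0) and (1, 1), which produces the sharp, angular shape described above.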
@sumedhasingla @jrahimik these experiments should be finished and finalized by 10/3/2018.
The Results for linear model
The Results for Random Forest
The Results for Gradient Boosting:
@kayhan-batmanghelich The linear model performs almost the same as the non-linear models, which suggests a largely linear relationship between the features and exacerbation. Moreover, the AUCs of percent emphysema and the new features are close, which indicates that these features perform similarly in predicting exacerbation.
@jrahimik these are more reasonable plots. I expected something like this. Also plot the PR curves and post them here.
Precision Recall plot Linear Model
Precision Recall plot Gradient Boosting
Precision Recall plot Random Forest
@jrahimik thanks. The PR plots show the results better.
Add the one class classification to the list too.
Now it is time to apply the entire pipeline on the entire data and repeat the rest of the experiments in Raul's paper on this data.
We are tracking the first two tasks in a separate issue: #52. @jrahimik, please update that issue with the approach followed and the findings.
For replicating Raul's paper we need results on some statistical methods. We are tracking this in issue #66. @jrahimik, please update that issue with the approach followed and the findings.
The updated pipeline is available at: https://github.com/sumedhasingla/COPD_Project/blob/master/ExploratoryAnalysis.ipynb
Please note that the image features used there are not from the cross-validation set. I will update it here once we have the cross-validation features.
- [x] Plot the columns in file: /pghbio/dbmi/batmanlab/Data/COPDGene/ClinicalData/phase 1 Final 10K/phase 1 Pheno/Final10000_Phase1_Rev_28oct16.txt against each other to see which columns hold important information.
- For FinalGold: Color by Gold score
- Important columns: FEV1, FEV1_FVC
- Randomly shuffle the data and plot a subset
- [x] Figure out the columns in the file which we can predict with image features. As of now we have
- Exacerbation
- Mortality
- Change in FEV1
- Change in FEV1/FVC
- [x] Compare image feature with
- Only Age, gender and pack of smoking
- Age + gender + pack of smoking + image features
- Age + gender + pack of smoking + FEV1
- image features
- [x] Come up with a plot/diagram/figure to show the above comparison in most informative manner
- [x] Finalize the models for predictions
- Binary/ unbalanced multi label classification
- Regression Model
- [ ] Finalize the statistics we are going to report for each model. Example: auc, R-Square, confusion matrix
The first two tasks are tracked by issue #52. We are comparing 6 kinds of features:
We are hoping that, if our image features are good, 5 will perform better than 6 and 3 will perform better than 4.
For the classification problem, we are plotting
For the regression problems we are showing
For un-balanced data we are doing
Other points
We will track the performance of the pipeline on real image features using issue: