alizat / EnsemDT-EnsemKRR

This repo holds the code and data for my 2017 paper published in the Methods journal (Elsevier).

Looking for help regarding scores presented in Drug-target interaction prediction using ensemble learning and dimensionality reduction #3

Open mislam5285 opened 3 years ago

mislam5285 commented 3 years ago

Dear Sir, I have some queries regarding the AUC score reported in your article titled "Drug-target interaction prediction using ensemble learning and dimensionality reduction".

Query 1: I have run your code provided at https://github.com/alizat/EnsemDT-EnsemKRR. The AUC score computed by "ensemkrr_S1_DimRed1" (predictionMethods is ensemkrr, the CV setting is S1, and the dimensionality reduction method is SVD) is:

FOLD 1: 0.873 FOLD 2: 0.878 FOLD 3: 0.879 FOLD 4: 0.874 FOLD 5: 0.875

AUC: 0.876

In the published article (page 5, Table 5), the AUC score is reported as 0.942.

Could you kindly specify for which hyperparameter setting in the grid search the AUC score of 0.942 is obtained?

Sir, I need this to compare the efficiency of your method with that of other methods.

Query 2: In your article, only the AUC score is given. I want to compute the F1 score and G-mean score from your scoring matrix named "predScoresMatrix". What I have done to compute the F1 and G-mean scores is given below. I would appreciate your comments on whether the computed scores are correct, as I would like to present them in an article for comparison.

We have seen that every ensemble-learning-based method predicts a score between 0 and 1 for each test sample rather than a 0/1 class label. To compute the F1 and G-mean scores from the predicted scores, we map them to crisp class labels using a single, optimally determined threshold. Because the classification problem in this setting has severe class imbalance, the default threshold of 0.5 may result in poor performance. For the G-mean, the optimal threshold is determined from the ROC curve, which provides three lists of values: false positive rates (fpr), true positive rates (tpr), and thresholds. These three lists have the same length, i.e., for each value in thresholds there is one corresponding value in fpr and one in tpr. We compute the G-mean for each threshold and select the threshold that yields the largest G-mean. For the F1 score, the optimal threshold is instead selected from the precision-recall curve: the precision and recall values corresponding to the optimal threshold are used to compute the F1 score.

My code in Python:

```python
import pandas as pd
from numpy import sqrt, argmax
from sklearn.metrics import roc_curve, precision_recall_curve

url1 = './ensemkrr_S1_DimRed1_Ytrue.txt'
url2 = './ensemkrr_S1_DimRed1_YPredProb.txt'

# flatten to 1-D arrays of labels and prediction scores
testy = pd.read_csv(url1, delimiter="\t", header=None).values.ravel()
yhat = pd.read_csv(url2, delimiter="\t", header=None).values.ravel()

######################### G-mean score #########################
# calculate the ROC curve
fpr, tpr, thresholds = roc_curve(testy, yhat)
# calculate the G-mean for each threshold
gmeans = sqrt(tpr * (1 - fpr))
# locate the index of the largest G-mean
ix = argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

######################### F1 score #############################
# calculate the precision-recall curve
precision, recall, thresholds = precision_recall_curve(testy, yhat)
# convert to F-score (note: thresholds has one element fewer than precision/recall)
fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest F-score
ix = argmax(fscore)
print('Best Threshold=%f, F-Score=%.3f' % (thresholds[ix], fscore[ix]))
```

Output:

```
Best Threshold=0.074662, G-Mean=0.875
Best Threshold=2.907300, F-Score=0.342
```

Your comments are important to me: before presenting these scores, I would like your guidance on whether I have computed them correctly.

alizat commented 3 years ago

Dear @mislam5285 , Thanks for reaching out to me.

I am sending this message to let you know that I am currently looking at your queries so that I can reply to them. Please note that I am quite busy these days, so I am unable to reply right away and will need at least a few days to investigate the matter raised in your first query and to check the code in your second query.

By the way, a quick run of "EnsemKRR with SVD and CV setting S1" gave me much better results than what you got (the AUC for FOLD 1 was 0.937), so I am not exactly sure what went wrong on your side. Try re-downloading the code from GitHub and running it again (in start.m, change line 27 to "predictionMethods = {'ensemkrr'};" AND line 33 to "dimRedTechniques = 1;", then press F5).

As for the second query, optimizing the threshold (for converting scores to 0/1 class labels) using the F-score is fine and commonplace, and the G-mean approach seems to make sense as well. Note that I haven't checked your code yet. By the way, have you considered an evaluation metric called the MCC (Matthews correlation coefficient)? A minimal sketch of it follows.
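For illustration only, a minimal sketch of the MCC with scikit-learn, reusing the `testy`/`yhat` arrays from the script above; `best_threshold` is a placeholder for whatever threshold the G-mean or F-score search selects:

```python
from sklearn.metrics import matthews_corrcoef

# binarize the predicted scores at the chosen (placeholder) threshold
y_pred = (yhat >= best_threshold).astype(int)
print('MCC=%.3f' % matthews_corrcoef(testy, y_pred))
```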

In the upcoming days, I will have another look at your queries and try to reply as soon as I can.

Yours Sincerely, Ali Ezzat

mislam5285 commented 3 years ago


Thank you, Sir, for your reply. I will follow your instructions for running "EnsemKRR with SVD and CV setting S1".

alizat commented 3 years ago

Dear @mislam5285 , I downloaded the code in this repo and followed the same instructions that I gave you earlier (in start.m, change line 27 to "predictionMethods = {'ensemkrr'};" AND line 33 to "dimRedTechniques = 1;", then press F5). This time, I followed the experiment through to the end, and the following are my results:

FOLD 1: 0.937 FOLD 2: 0.943 FOLD 3: 0.941 FOLD 4: 0.943 FOLD 5: 0.939

If your results are not similar to the above and you are still getting unsatisfactory results (specifically when using SVD for dimensionality reduction), then maybe look into the documentation for the svds() function (which is used in reduceDimensionality.m). Please note that I ran the experiment on MATLAB 2015a. Is it possible that the svds() function has been modified since then, and that this modification changed the results? In other words, the svds() function of MATLAB 2015a may differ slightly from that of the MATLAB version you are currently using, and that may be what caused the difference. I am assuming that your prior results were obtained from MATLAB as well (or were they from a re-implementation of EnsemKRR that you coded in Python?).
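For what it's worth, a rough Python analogue of the svds(A, k) call, assuming a placeholder feature matrix A and target rank k (this is not the repo's actual reduceDimensionality.m); it illustrates the kind of convention difference between versions/libraries that is worth ruling out:

```python
import numpy as np
from scipy.sparse.linalg import svds

A = np.random.rand(500, 200)  # hypothetical feature matrix
k = 50                        # hypothetical target dimensionality

# k largest singular triplets; NOTE: scipy returns singular values in
# ASCENDING order, whereas MATLAB's svds returns them descending, and
# convergence tolerances can also differ between implementations.
U, s, Vt = svds(A, k=k)
order = np.argsort(s)[::-1]
A_reduced = U[:, order] * s[order]  # rank-k representation of the rows of A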

As for the code for G-mean and F-score for evaluating your results: deciding on the best evaluation metric by which to gauge prediction performance can be tough. However, my guidance would be that predicting the "interactions" correctly is more important than predicting the "non-interactions" correctly. In practice, if a model ranks a few drug-target pairs highly, then those are what the upcoming wet-lab experiments will focus on verifying next (such wet-lab experiments are expensive, which is why the top predictions are considered before others). For that reason, we always want the drug-target pairs ranked at the very top of your predictions to be highly accurate, and it is usually a good idea to include metrics that confirm that this is indeed the case with your prediction model; a rough sketch of one such metric follows.
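For instance, a minimal "precision at k" sketch (the function name and k = 100 are purely illustrative, not something from the paper), using the flattened `testy`/`yhat` arrays from your script above:

```python
import numpy as np

def precision_at_k(y_true, y_score, k=100):
    """Fraction of true interactions among the k highest-scoring pairs."""
    top_k = np.argsort(y_score)[::-1][:k]
    return float(np.asarray(y_true)[top_k].mean())

# e.g., precision_at_k(testy, yhat, k=100)
```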

Let's have a look at your two metrics now:

* A good F-score implies good precision [TP / (TP + FP)] and recall [TP / (TP + FN)] on the **positive minority class** (i.e. the "interactions"). A good precision means that a positive prediction from your model is usually true, and a good recall means that most actual interactions will be detected by your model. Consequently, a good F-score means that both these favorable characteristics are met in your model.

  * You may notice that neither precision nor recall has a TN term. That is fine, because what we really care about is the interactions and whether they are detected successfully and accurately without too many false positives (FPs).

* On the other hand, your G-mean score gives equal importance to the prediction of BOTH the positive and negative classes: it is basically trying to strike a good balance between sensitivity and specificity. At a quick glance, the G-mean does not seem to fully match the general objective of the upcoming wet-lab experiments that would make use of your computational model. There might be some value in keeping it around (it might turn out that your model produces a better G-mean score than other competing models), BUT I personally would not use it to choose the threshold for binarizing the prediction results.

As for the code you shared above: at first glance, it seems to be accurate. To be honest, I haven't coded in either Python or MATLAB for a long time now and have forgotten things, so I may have missed something.

A few final notes here:

* In a couple of my earlier [publications](https://scholar.google.com.eg/citations?pli=1&user=5zVp10IAAAAJ), I used the AUPR (area under the precision-recall curve), which theoretically matches the objective described above, but I found that it punishes highly-ranked incorrect predictions too severely (a minimal sketch of computing it follows this list).

* [This paper](https://jcheminf.biomedcentral.com/track/pdf/10.1186/s13321-016-0128-4.pdf) assesses the prediction results of its proposed model in a different way. Do have a look. If its evaluation method seems a bit too "non-standard" for your taste, leave it for now, but keep it in the back of your mind and revisit it later, as it is interesting and worth checking out.

* By the way, this is a link to my thesis, which has a lot of commentary that you may find useful:

  * https://www.researchgate.net/publication/329075265_Challenges_and_Solutions_in_Drug-Target_Interaction_Prediction
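A minimal sketch of the AUPR computation with scikit-learn, assuming the flattened `testy`/`yhat` arrays from the script earlier in this thread:

```python
from sklearn.metrics import precision_recall_curve, auc

# Area under the precision-recall curve (AUPR); it heavily penalizes
# false positives that appear near the top of the ranking.
precision, recall, _ = precision_recall_curve(testy, yhat)
print('AUPR=%.3f' % auc(recall, precision))
```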

Hope this helps. Let me know if you need anything else.

Best Regards, Ali Ezzat

mislam5285 commented 3 years ago

Dear Sir, Thank you very much for giving your valuable time and suggestions regarding my queries. This time, following your instructions, I have obtained the results that you reported in your article. I fully agree with you about deciding on performance metrics. Once again, I confirm that your reported results are now correctly reproduced by the code on GitHub.

Binarizing the prediction scores with a threshold and then computing F1 is not giving good scores. I think thresholding-based binarization is not appropriate here, so in my work I will report only your AUC score.

Thanks for sharing and recommending your thesis and publications.

JAZAKALLAH KHAIR Best Regards, Sk Mazharul Islam