How to get the FDR score when predicting new proteins without labels?

Violet969 commented 11 months ago

Hi, I have a question after reading your paper. The paper proposes an ensemble method that improves model performance by computing an FDR score. How do I get the FDR score when I predict new proteins whose labels I don't know?

Wenhao-Jin commented 11 months ago

Hi @Violet969, the FDR scores are calculated from pre-calculated reference score tables, one for each individual classifier (i.e. SONAR3.0, seqSVM, seqCNN and ProteinBERT-RBP). Each table lists proteins with their classification scores and their labels. Using the reference score tables, we can calculate the FDR of a given protein from its classification scores. The code is here (from line 505 to line 550).
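To illustrate the idea, here is a minimal sketch of an empirical FDR lookup against one reference score table. It is not the HydRA code referred to above; the function name `empirical_fdr`, the toy scores, and the convention that label 1 means positive and 0 means negative are all assumptions made for this example. The FDR at a query score is taken as the fraction of negatives among the reference proteins that score at least as high.

```python
def empirical_fdr(query_score, ref_scores, ref_labels):
    """Estimate the FDR at `query_score` from a reference score table.

    ref_scores: classification scores of the reference proteins.
    ref_labels: 1 for a positive, 0 for a negative (assumed convention).
    """
    # Reference proteins that would be called positive at this threshold.
    called = [label for score, label in zip(ref_scores, ref_labels)
              if score >= query_score]
    if not called:
        # Assumption: no reference protein scores this high, so report 0.0.
        return 0.0
    false_positives = sum(1 for label in called if label == 0)
    return false_positives / len(called)


# Toy reference table (hypothetical values, not taken from the paper).
ref_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
ref_labels = [1, 1, 0, 1, 0, 0]

fdr = empirical_fdr(0.7, ref_scores, ref_labels)
print(f"FDR at score 0.7: {fdr:.3f}")
```

Note that the new protein's own label is never needed: only its score is compared against the labelled reference proteins.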

If you are going to use our trained HydRA model for prediction, you don't need to worry about the FDR calculation: the default setting of HydRa2_0_predict.py will do that for you. If you are going to train your own HydRA model on a training set very different from the one used in this paper, you need to supply the reference score tables generated by your own HydRA model. The following is an example of how we obtained the reference score tables in this paper.

The reference score tables were collected during the model selection and evaluation steps. Taking seqSVM as an example, we ran K-fold cross-validation (CV) within the training set and collected the seqSVM scores and class labels of the held-out validation set from each CV fold. Because every reference score comes from a model that did not see that protein during training, the distribution of the reference scores should be very similar to that of the prediction scores for new proteins (unseen by the model).
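The collection step described above could be sketched roughly as follows. This is an illustration, not the HydRA pipeline: `collect_reference_table`, `fit_midpoint` and `score_offset` are hypothetical names, the classifier is a toy one-feature stand-in for seqSVM, and the folds are plain shuffled splits rather than stratified ones.

```python
import random


def collect_reference_table(features, labels, fit, score, k=5, seed=0):
    """Return (protein_index, held_out_score, label) rows from K-fold CV."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    table = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in indices if i not in held_out_set]
        # Train only on the other folds, then score the held-out proteins.
        model = fit([features[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            table.append((i, score(model, features[i]), labels[i]))
    return sorted(table)


def fit_midpoint(xs, ys):
    """Toy 'classifier': midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def score_offset(model, x):
    """Toy score: how far the feature lies above the learned midpoint."""
    return x - model


# Hypothetical one-feature data set: 5 positives followed by 5 negatives.
features = [0.9, 0.8, 0.7, 0.85, 0.6, 0.3, 0.2, 0.4, 0.1, 0.5]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

reference_table = collect_reference_table(
    features, labels, fit_midpoint, score_offset, k=5)
for row in reference_table:
    print(row)
```

Each protein appears exactly once in the resulting table, scored by a model that was trained without it.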

Hope this helps. Feel free to let me know if you have any questions!

Wenhao-Jin commented 1 month ago

Hi @Violet969, I'm closing this issue for now, but feel free to reopen it if you still have questions.

Wenhao-Jin / HydRA
