dimiboeckaerts / BacteriophageHostPrediction

Code and database related to our manuscript "Predicting bacteriophage hosts based on receptor-binding proteins"
MIT License
9 stars 4 forks source link

Possible bug in Protein Feature Composition (CTDC) feature extraction #2

Closed aaossa closed 3 years ago

aaossa commented 3 years ago

Hi,

I was using the RBP_functions.py module as a reference to extract features of certain proteins, but I noticed a possible bug. In the CTDC function, the loop overrides the encoding variable in each iteration, which is returned at the end. This is the code:

https://github.com/dimiboeckaerts/BacteriophageHostPrediction/blob/333cf9c9d01453248687134ed417c9760a52b4c7/RBP_functions.py#L245-L251

This causes to only consider the last of the properties (list defined in this line), which is not the desired behavior, I think. You can confirm this by replacing the whole list of properties with a list only containing the last value: ["solventaccess"]. I guess that the desired behavior is to calculate the encoding for each property and accumulate the whole sequence, resulting in a list of 39 values instead of only 3.

This possible error is also present in the supplementary material of the corresponding paper, as Table S3 (b) shows only 3 features for Protein composition, when it might be 39. In the reference paper (iFeature) the value is correctly described as 39 numerical features for the same purpose. The code of the reference paper does not have this possible bug and works as described:

https://github.com/Superzchen/iFeature/blob/942fb2760eae540a59214c540bd3329e8dca76e2/codes/CTDC.py#L75-L79

for p in property:
    c1 = Count(group1[p], sequence) / len(sequence)
    c2 = Count(group2[p], sequence) / len(sequence)
    c3 = 1 - c1 - c2
    code = code + [c1, c2, c3]

Is this actually a bug? If that's the case, does it affect the results displayed in the notebook significantly? I guess that more features might help to improve the models (depending on the nature of the missing features). Thanks.

Best, Antonio.

dimiboeckaerts commented 3 years ago

Hi Antonio

Thank you for bringing this to my attention! You are indeed correct that the CTDC function in RBP_functions.py is only considering the last property (solvent accessibility), instead of all the properties. In that sense, our paper is consistent with the code provided, as it also describes a total of 47 features (39 CTDT, 3 CTDC and 5 Z-scale features) used from the iFeature paper by Chen et al.. So I wouldn't call it a bug, but for the sake of clarity, I will adjust the function to better reflect how we exactly used it in our analyses for the paper!

However, I might also guess that adding the other properties could result in a more performant model! On the other hand, some properties are already represented by some of the other features that were computed (aromaticity, frac_hydrophobic, fractions of helix, turn and sheet, etc...). I'd say the CTDC function also considers these properties in a slightly more comprehensible/complex way, which could lead to new information for the model, but the information gain might also be insignificant compared to the 'evolutionary signal' present in the feature representation. I will repeat the analysis with a 'complete' implementation of the CTDC function and keep you updated!

Thanks again and kind regards

Dimi

dimiboeckaerts commented 3 years ago

Here are the resulting PR curves when a Random Forest model was trained as described in the paper, but now including all 39 CTDC features instead of only 3 (all other settings and parameters remained the same). Interestingly, at some thresholds the model slightly outperforms the original one, while at other thresholds the AUC is lower than before. While we would have predicted a slight improvement overall, the AUC scores point towards a more nuanced influence of extra features across different thresholds for RBP similarity.

If you have any remaining questions, comments or ideas, don't hesitate to ask!

weighted_PR_curves_all_feats