Kuhlman-Lab / ThermoMPNN

GNN trained to predict changes in thermodynamic stability for protein point mutants
MIT License

Correlation Coefficients #10

Closed: twidatalla closed this issue 8 months ago

twidatalla commented 9 months ago

Hi all,

I'm interested in comparing whether vanilla ESM-IF likelihoods are better or worse than vanilla ProteinMPNN likelihoods for ddG prediction on Fireprot-HF. With ESM-IF, if I take spearman(all likelihoods, all stabilities) I get basically 0, but if I evaluate spearman(protein likelihoods, protein stabilities) separately for each protein, I get decent correlations, and the average of these per-protein correlations is about 0.5 (similar to your findings for ProteinMPNN).

So I just wanted to clarify: (1) did you use the former or the latter method to get your correlations, and (2) with ProteinMPNN, were you using zero-shot likelihoods or doing some sort of supervised fine-tuning?

hdieckhaus commented 9 months ago

Good questions!

(1) We used the former method: we took the log-probabilities from ProteinMPNN and multiplied them by -1, then simply calculated all statistics (correlation, RMSE, etc.) across the entire test set in all cases. So the Fireprot-HF Spearman correlation is calculated across all ~2500 mutations for all ~100 proteins. The code for this analysis is included in the repository at ThermoMPNN/analysis/thermompnn_benchmarking.py (see the example marked 'ProteinMPNN'). I am not surprised that you would get better correlations when considering only the mutations within a specific protein, and it is perhaps an interesting finding. But to be clear, per-protein averages are not what is typically reported in the literature when you see correlation values.
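
For anyone comparing the two calculations discussed above, here is a minimal sketch of the difference. It is not the code from thermompnn_benchmarking.py; the function names, the (protein, log-probability, ddG) record layout, and the toy numbers are all made up for illustration. The only step taken from this thread is multiplying the log-probabilities by -1 before correlating.

```python
from collections import defaultdict
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pooled_spearman(records):
    """One correlation over every mutation from every protein."""
    scores = [-logp for _, logp, _ in records]  # negated log-probability
    ddg = [d for _, _, d in records]
    return spearman(scores, ddg)


def mean_per_protein_spearman(records):
    """Correlate within each protein separately, then average the results."""
    by_protein = defaultdict(list)
    for protein, logp, ddg in records:
        by_protein[protein].append((-logp, ddg))
    rhos = [
        spearman([s for s, _ in rows], [d for _, d in rows])
        for rows in by_protein.values()
    ]
    return sum(rhos) / len(rhos)


# Hypothetical toy data: (protein id, log-probability, measured ddG).
# Each protein is perfectly ranked internally, but the two proteins sit on
# different score and ddG scales, which hurts the pooled correlation.
toy_records = [
    ("A", -1.0, 0.0), ("A", -2.0, 1.0), ("A", -3.0, 2.0),
    ("B", -11.0, -5.0), ("B", -12.0, -4.0), ("B", -13.0, -3.0),
]

print("pooled:", pooled_spearman(toy_records))
print("mean per-protein:", mean_per_protein_spearman(toy_records))
```

On this toy set the per-protein average is far higher than the pooled value, which is the same pattern described in the question. On real data, proteins with a single mutation or identical values would have to be skipped in the per-protein version, since the correlation is undefined there.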

(2) The results included in the preprint labeled "ProteinMPNN" are zero-shot ProteinMPNN likelihoods, as shown in the analysis script I mentioned. We tried fine-tuning, which you can find in the Table 1 ablation study, and this works about as well as the feature-extraction approach we ended up settling on, but is more prone to overfitting (see Figure S1, I think). In the final manuscript (hopefully posted soon), we converted the ProteinMPNN likelihoods to kcal/mol values using a linear regression fitted on the training set so that we could compare RMSE values.
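
The kcal/mol conversion mentioned above could look roughly like the sketch below: fit a one-variable least-squares line on training-set scores, apply it to test-set scores, and compute RMSE. This is an illustration under assumptions, not the manuscript's code. The helper names and all numbers are hypothetical, and the input scores are assumed to be the negated log-probabilities described earlier in this thread.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def rmse(predicted, target):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical training data: negated log-probabilities and ddG in kcal/mol.
train_scores = [0.5, 1.0, 2.0, 3.5]
train_ddg = [0.0, 0.5, 1.5, 3.0]

# Fit on the training set only, so the test set stays untouched.
slope, intercept = fit_line(train_scores, train_ddg)

# Hypothetical test data, converted with the training-set fit.
test_scores = [1.5, 2.5]
test_ddg = [1.5, 1.5]
test_pred = [slope * s + intercept for s in test_scores]

print("slope:", slope, "intercept:", intercept)
print("test RMSE (kcal/mol):", rmse(test_pred, test_ddg))
```

A linear rescaling like this leaves Spearman correlations unchanged as long as the fitted slope is positive, so it only matters for scale-dependent metrics such as RMSE.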

Hope this helps!