Kuhlman-Lab / ThermoMPNN

GNN trained to predict changes in thermodynamic stability for protein point mutants
MIT License

Fireprot Results #16

Closed lordim closed 6 months ago

lordim commented 7 months ago

Hi! I really enjoyed the paper and am looking forward to using this model! I had a question about the FireProt results, however. The points in Figure 2a for the FireProt (HF) results after training on Megascale show correlations of about 0.55, while Table 1 suggests the correlations on FireProt (HF) for ThermoMPNN are around 0.65. What was done differently in Table 1 compared to Figure 2a to get this performance boost?

hdieckhaus commented 7 months ago

Hi @lordim,

Thanks! Happy to clear up any ambiguity. The difference in performance comes from the fact that Fig. 2a and Table 1 report results on two different splits (subsets) of the FireProt dataset.

You are correct that Table 1 reports FireProt (HF) performance, but Fig. 2a actually shows a different split, which we call FireProt (test). This is a much smaller subset (about 350 data points, compared to about 2,500 in the HF split), and it happens to be a bit harder for all of our models to predict. You can see how we made these splits in the Methods and in Fig. 1b, and if you want to extract them yourself, the dataset_splits/fireprot_splits.pkl file in the repo contains the full PDB list.
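If it helps, here is a minimal sketch for inspecting that file. It assumes the pickle holds a dict mapping split names to lists of PDB IDs, which is an assumption on my part, so adjust the keys as needed:

```python
import pickle

# Load the split definitions shipped with the repo.
with open("dataset_splits/fireprot_splits.pkl", "rb") as f:
    splits = pickle.load(f)

# Assuming a dict of {split_name: [pdb_id, ...]}, report the size of each split.
for split_name, pdb_ids in splits.items():
    print(f"{split_name}: {len(pdb_ids)} proteins")
```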

The reason we do this is that we want to compare the Megascale-trained model against a FireProt-trained model, and we need most of the FireProt dataset for training that comparison model, which we can't do if we hold out the entire FireProt (HF) set for testing. So we set aside 10% of the FireProt dataset (excluding any homologues to Megascale) as the test split and use the rest for training the comparison model. A rough sketch of that idea is below.
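To be clear, this is only an illustrative sketch of the splitting logic described above, not the actual code we used; the function name, arguments, and the simple membership-based homologue filter are all assumptions here:

```python
import random

def make_fireprot_split(fireprot_pdbs, megascale_homologues, test_frac=0.1, seed=0):
    """Hold out ~test_frac of FireProt proteins as a test set, keeping any
    protein homologous to the Megascale training data out of that test set."""
    # Only proteins with no Megascale homologue are eligible for the test split.
    eligible = [p for p in fireprot_pdbs if p not in megascale_homologues]

    rng = random.Random(seed)
    rng.shuffle(eligible)

    n_test = int(len(fireprot_pdbs) * test_frac)
    test_pdbs = set(eligible[:n_test])

    # Everything not held out for testing is available for training the comparison model.
    train_pdbs = [p for p in fireprot_pdbs if p not in test_pdbs]
    return train_pdbs, sorted(test_pdbs)
```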

As for why this split is harder than the other one, I have some theories, but no definitive answer. One factor is that we excluded proteins with many mutations (>250) from the (test) split, since they would badly skew the results, and those heavily mutated proteins tend to have better-behaved (or better-curated) measurements because they are well studied. The test set may also contain more large proteins, which tend to confound most stability models. It is hard to say for certain!

Let me know if this clears things up!