Kuhlman-Lab / ThermoMPNN

GNN trained to predict changes in thermodynamic stability for protein point mutants
MIT License

Fireprot Results #16

Closed lordim closed 6 months ago

lordim commented 7 months ago

Hi! I really enjoyed the paper and am looking forward to using this model! I had a question about the FireProt results, however. The points in Figure 2a for the FireProt (HF) results after training on Megascale show correlations of about 0.55, while Table 1 suggests the correlations on FireProt (HF) for ThermoMPNN are around 0.65. What was done differently in Table 1 compared to Figure 2a to get this performance boost?

hdieckhaus commented 7 months ago

Hi @lordim,

Thanks! Happy to clear up any ambiguity. The difference in performance comes from the fact that Fig. 2a and Table 1 report results on two different splits (subsets) of the FireProt dataset.

You are correct that Table 1 reports FireProt (HF) performance, but Fig. 2a actually shows a different split, which we call FireProt (test). This is a much smaller subset (about 350 data points, compared to about 2,500 in the HF split), and it happens to be a bit harder for all of our models to predict. You can see how we made these splits in the Methods and in Fig. 1b, and if you want to extract them yourself, the dataset_splits/fireprot_splits.pkl file in the repo contains the full PDB list.
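If it helps, here is a minimal sketch for inspecting that file. It assumes the pickle holds a dict mapping split names to lists of PDB IDs, which is an assumption on my part, so adjust the keys as needed:

```python
import pickle

# Load the split definitions shipped with the repo.
with open("dataset_splits/fireprot_splits.pkl", "rb") as f:
    splits = pickle.load(f)

# Assuming a dict of {split_name: [pdb_id, ...]}, report the size of each split.
for split_name, pdb_ids in splits.items():
    print(f"{split_name}: {len(pdb_ids)} proteins")
```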

The reason we do this is that we want to compare the Megascale-trained model against a FireProt-trained model, and we need most of the FireProt dataset for training that comparison model, which we can't do if we hold out the entire FireProt (HF) set for testing. So we set aside 10% of the FireProt dataset (excluding any homologues to Megascale) as the test split and use the rest for training the comparison model. A rough sketch of that idea is below.
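To be clear, this is only an illustrative sketch of the splitting logic described above, not the actual code we used; the function name, arguments, and the simple membership-based homologue filter are all assumptions here:

```python
import random

def make_fireprot_split(fireprot_pdbs, megascale_homologues, test_frac=0.1, seed=0):
    """Hold out ~test_frac of FireProt proteins as a test set, keeping any
    protein homologous to the Megascale training data out of that test set."""
    # Only proteins with no Megascale homologue are eligible for the test split.
    eligible = [p for p in fireprot_pdbs if p not in megascale_homologues]

    rng = random.Random(seed)
    rng.shuffle(eligible)

    n_test = int(len(fireprot_pdbs) * test_frac)
    test_pdbs = set(eligible[:n_test])

    # Everything not held out for testing is available for training the comparison model.
    train_pdbs = [p for p in fireprot_pdbs if p not in test_pdbs]
    return train_pdbs, sorted(test_pdbs)
```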

As for why this split is harder than the other one, I have some theories, but no definitive answer. One factor is that we excluded proteins with many mutations (>250) from the (test) split, since they would badly skew the results, and those heavily mutated proteins tend to have better-behaved (or better-curated) measurements because they are well studied. The test set may also contain more large proteins, which tend to confound most stability models. It is hard to say for certain!

Let me know if this clears things up!