OATML-Markslab / ProteinGym

Official repository for the ProteinGym benchmarks
MIT License
211 stars 20 forks

Target sequence clarification #27

gavinmdouglas closed 4 months ago

gavinmdouglas commented 5 months ago

Hi there,

Thanks so much for building this resource!

Regarding the target_seq column in DMS_substitutions.csv, could you clarify whether this is the full protein sequence that was experimentally tested, or whether (at least in some cases) it can correspond to only the specific region of a protein that was mutated, while the full protein was used in the assays?

For instance, the sequence for TCRG1_MOUSE is GATAVSEWTEYKTADGKTYYYNNRTLESTWEKPQELK, but this sequence is a substring of the isoforms in UniProt: https://www.uniprot.org/uniprotkb/Q8CGF7/entry#sequences. However, I see that there is a PDB file for this shorter sequence specifically, which implies that this is the full tested protein (edit: in the ProteinGym database).

Many of the target sequences also do not start with methionine, which perhaps is due to post-translational modifications, but because of this I was less sure whether they correspond to independent proteins or not.
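The observation about missing initial methionines can be checked mechanically. The sketch below is illustrative only: it reads a tiny in-memory stand-in for DMS_substitutions.csv and lists the entries whose target_seq does not start with M. Only the target_seq column is confirmed in this thread; the DMS_id column name and the second row are assumptions made for the example.

```python
import csv
import io

# Toy stand-in for DMS_substitutions.csv. Only target_seq is confirmed by this
# thread; the DMS_id column name and the second row are illustrative assumptions.
SAMPLE_CSV = """DMS_id,target_seq
TCRG1_MOUSE,GATAVSEWTEYKTADGKTYYYNNRTLESTWEKPQELK
HYPOTHETICAL_FULL,MKTAYIAKQR
"""


def ids_without_initial_met(handle, id_col="DMS_id", seq_col="target_seq"):
    """Return the IDs whose target sequence does not begin with methionine (M)."""
    reader = csv.DictReader(handle)
    return [
        row[id_col]
        for row in reader
        if not row[seq_col].upper().startswith("M")
    ]


missing_met = ids_without_initial_met(io.StringIO(SAMPLE_CSV))
print(missing_met)
```

To run this against the real file, pass an open file handle instead of the `io.StringIO` object and adjust `id_col` to whatever identifier column the file actually uses.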

Could you clarify this point?

Thanks!

Gavin

loodvn commented 4 months ago

Hi @gavinmdouglas

Thanks for the question, it’s definitely something we’ve encountered when mapping between IDs. We’ve aimed to make the “target_seq” columns match the exact sequence used in the experiment, or in some cases remove extra material (for example, protein tags), provided it doesn’t have an effect on the measured mutational effects.

In the case of TCRG1_MOUSE from the mega-scale stability dataset, this subsequence is correct and the same one they report under aa_seq in their dataset (which originally had padding sequences on each side, but they also removed these).

We know that the sequence under the UniProt ID often doesn’t match the target_seq exactly, especially for some of the domains tested in the mega-scale stability study. We just tried to choose the ID that matched most closely, preferring the Swiss-Prot (reviewed) UniProt ID.
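For readers who want to relate a target_seq to a longer reference sequence, here is a minimal sketch of one way to do it: find the offset of the subsequence and use it to convert positions. This is not ProteinGym code. The helper names are hypothetical, and the full-length sequence is a toy built by adding made-up flanking residues around the TCRG1_MOUSE target_seq quoted above, not the real UniProt entry.

```python
def locate_subsequence(target_seq, full_seq):
    """Return the 0-based offset of target_seq within full_seq, or None if absent."""
    idx = full_seq.find(target_seq)
    return idx if idx != -1 else None


def to_full_position(pos_in_target, offset):
    """Map a 1-based position in target_seq to a 1-based position in full_seq."""
    return pos_in_target + offset


TARGET = "GATAVSEWTEYKTADGKTYYYNNRTLESTWEKPQELK"

# Made-up flanks for illustration only; this is NOT the real UniProt sequence.
TOY_FULL = "MAAAAA" + TARGET + "CCCC"

offset = locate_subsequence(TARGET, TOY_FULL)
print(offset)                       # 6
print(to_full_position(1, offset))  # 7
```

Note that `str.find` reports only the first exact match, so a target_seq that occurs more than once in the reference, or that differs from it by even one residue, would need a more careful alignment step.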

It could be that models might provide better fitness estimates when using the full context, but in our experience the performance was equivalent in both cases (subsequence vs full sequence).

gavinmdouglas commented 4 months ago

Hi @loodvn,

Ok great -- thanks for clarifying!

Cheers,

Gavin