OATML-Markslab / ProteinGym

Official repository for the ProteinGym benchmarks
MIT License

Matching clinical, MSA and DMS protein and variant IDs #25

Closed BarKetPlace closed 5 months ago

BarKetPlace commented 6 months ago

Hi,

I am trying to identify proteins across the DMS, clinical and MSA datasets (only substitutions for now). The purpose is to train models on DMS assays or MSA data and to evaluate them with clinical scores.

  1. In the reference file "clinical_substitutions.csv", the proteins are identified by IDs such as NP_689699.3, while in "DMS_substitutions.csv" they are identified by their "uniprot_id", e.g. A0A140D2T1_ZIKV. How do I translate one ID into the other?

Similarly, I'd like to identify variants across the MSA, DMS and clinical datasets.

  2. Is there currently a way to ensure that variants in the clinical dataset are not present in the DMS and MSA datasets? This is to ensure that there is no circularity between the training and testing sets, and to have an idea of the quality of the scores reported in the benchmark.

I hope I haven't missed anything obvious.

Thanks for the great work

Antoine

loodvn commented 5 months ago

Hi Antoine!

Thanks for your query. I think matching between DMS and clinical data is a common use case.

To summarize the full answer: The best way to match the IDs is to match on the actual sequences as opposed to trying to match the IDs together. Explanation below.

On Question 1:

The clinical substitutions benchmark is separated into proteins according to the RefSeq ID (NP_<>), while the DMS substitutions file contains UniProt IDs (and the DMS filenames start with a UniProt ID).

  a) We did this because the RefSeq IDs were easier to use with the human clinical data. In the raw substitutions benchmark there should be more columns, including UniProt IDs and (unmutated) wild-type sequences, but these mappings may also be incomplete.
  b) Importantly, the assayed wild-type sequence (denoted by target_seq in the reference file) sometimes has some mutations relative to the UniProt ID, and this makes it difficult to match clinical variants with DMS variants when just using the IDs.
  c) The most reliable way to map DMS and clinical variants is to match the sequences together (i.e. the target_seq in the DMS file and the protein_sequence in the raw clinical substitutions file).
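As a rough illustration of point c), matching on sequences rather than IDs could look like the sketch below. The two toy tables stand in for the DMS reference file and the raw clinical substitutions file. `target_seq` and `protein_sequence` are the column names mentioned above; `DMS_id`, `RefSeq_ID` and every sequence shown are placeholders assumed for the example, not values from the benchmark.

```python
import pandas as pd

# Toy stand-ins for the two reference tables (all values are made up).
dms = pd.DataFrame({
    "DMS_id": ["toy_assay_1", "toy_assay_2"],
    "target_seq": ["MKTAYIAKQR", "GSHMLEDPVA"],
})
clinical = pd.DataFrame({
    "RefSeq_ID": ["NP_000001.1", "NP_000002.1"],
    "protein_sequence": ["MKTAYIAKQR", "MSSGSHMLEDPVAKK"],
})

# 1) Exact matches: the assayed sequence equals the clinical protein sequence.
exact = dms.merge(
    clinical, left_on="target_seq", right_on="protein_sequence", how="inner"
)

# 2) Containment matches: the assayed sequence is a fragment of the clinical
#    protein. A cross join compares every DMS row with every clinical row.
pairs = dms.merge(clinical, how="cross")
is_contained = [
    target in protein
    for target, protein in zip(pairs["target_seq"], pairs["protein_sequence"])
]
contained = pairs[is_contained]

print(exact[["DMS_id", "RefSeq_ID"]])
print(contained[["DMS_id", "RefSeq_ID"]])
```

Exact and substring matching will miss assays whose wild type carries mutations relative to the reference sequence (point b), so an alignment-based comparison may still be needed for those cases.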

Question 2:

In our results, we didn't include a supervised train-test setup for the clinical benchmark - most of the clinical predictors (supervised and unsupervised, each based on different train-test splits or amounts of information) that we included were downloaded from http://database.liulab.science/dbNSFP.

I would recommend holding out any genes according to a sequence similarity threshold, i.e. where a DMS sequence (target_seq) has at least X% sequence similarity to a gene. Note that this isn’t symmetric, so you should compute the maximum similarity in either direction. Consider a DMS that tests a short section of a human protein: that section has 100% identity to the human protein, but might only span e.g. 20% of the protein’s length. You should exclude clinical variants from this whole protein, not just the matching portion, to avoid the Type 2 circularity described in [1]. If you'd like to go further, you could also exclude based on structural similarity instead of just sequence.
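The "max similarity in either direction" idea can be sketched as follows. This is only an approximation under stated assumptions: it uses matching blocks from Python's `difflib` as a crude stand-in for a real sequence alignment (a dedicated alignment tool would be the better choice in practice), the function names are hypothetical, and the 0.3 threshold is an arbitrary placeholder for the X% above.

```python
from difflib import SequenceMatcher


def matched_residues(seq_a: str, seq_b: str) -> int:
    """Count residues that fall in blocks shared by both sequences."""
    # autojunk=False: the default heuristic treats frequent characters as
    # junk, which is harmful for a 20-letter amino-acid alphabet.
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def max_directional_identity(dms_seq: str, clinical_seq: str) -> float:
    """Matched fraction relative to each sequence; return the larger one.

    Assumes both sequences are non-empty.
    """
    matched = matched_residues(dms_seq, clinical_seq)
    return max(matched / len(dms_seq), matched / len(clinical_seq))


def should_hold_out(dms_seq: str, clinical_seq: str, threshold: float = 0.3) -> bool:
    """Flag the whole clinical protein if either direction reaches the threshold."""
    return max_directional_identity(dms_seq, clinical_seq) >= threshold


# Toy example: the assay covers a 20-residue fragment of a longer protein.
full_protein = (
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
)
fragment = full_protein[10:30]

print(max_directional_identity(fragment, full_protein))
print(should_hold_out(fragment, full_protein))
```

Taking the maximum means the short fragment, which is fully contained in the longer protein, still triggers the hold-out even though it covers only part of that protein's length.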

Question 3: Matching MSAs and DMS/clinical:

In my opinion, though, it is not necessary to hold out any MSAs, since they are also used in the clinical set and don’t contain labelled assay data; some models (like EVE) are trained on MSAs to predict on clinical variants. I think you can use either the DMS MSAs or the clinical MSAs wherever there are overlaps.

Hope this helps! Please let me know if you have any follow-up questions.

Reference [1]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9235876/

loodvn commented 5 months ago

I played around a bit with the clinical data yesterday, and I noticed that you can get pretty far with the default UniProt-to-RefSeq mappings. I used the UniProt mapping tool (https://www.uniprot.org/id-mapping/) to get the following file (attached).

"From" column: The UniProt ID in the DMS_substitutions reference file.
"RefSeq" column: Matching RefSeq IDs (you can use the clinical_substitutions.csv file to compare the sequences and match the DMS sequence up with the RefSeq sequence; you might need to offset the mutant positions to match up the DMS mutants and clinical variants).

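The position offset mentioned above could be handled along these lines. This is a minimal sketch, assuming that DMS mutants are written like `A123G` with 1-based positions relative to the assayed sequence, that multiple mutations are joined by `:`, and that the assayed sequence occurs verbatim inside the RefSeq sequence. The function names and toy sequences are hypothetical.

```python
import re
from typing import Optional

# One substitution: wild-type residue, 1-based position, mutant residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def shift_mutant(mutant: str, offset: int) -> str:
    """Shift every position in a ':'-separated mutant string by `offset`."""
    shifted = []
    for mutation in mutant.split(":"):
        match = MUTATION_RE.match(mutation)
        if match is None:
            raise ValueError(f"Unrecognised mutation format: {mutation!r}")
        wild_type, position, mutated = match.groups()
        shifted.append(f"{wild_type}{int(position) + offset}{mutated}")
    return ":".join(shifted)


def dms_to_refseq_mutant(mutant: str, dms_seq: str, refseq_seq: str) -> Optional[str]:
    """Renumber a DMS mutant into RefSeq coordinates, or return None."""
    start = refseq_seq.find(dms_seq)  # 0-based start of the assayed region
    if start == -1:
        return None  # no verbatim match; an alignment would be needed
    renumbered = shift_mutant(mutant, start)
    # Sanity check: the wild-type residue must agree with the RefSeq sequence.
    for mutation in renumbered.split(":"):
        wild_type, position, _ = MUTATION_RE.match(mutation).groups()
        if refseq_seq[int(position) - 1] != wild_type:
            return None
    return renumbered


# Toy example: the assayed region starts at residue 6 of the longer sequence.
refseq_seq = "MSTNPKPQRKTKRNT"
dms_seq = refseq_seq[5:12]

print(dms_to_refseq_mutant("K1A", dms_seq, refseq_seq))
print(dms_to_refseq_mutant("K1A:Q3E", dms_seq, refseq_seq))
```

Returning `None` when the wild-type residue disagrees is a deliberate choice here, so that mismatched numbering surfaces early instead of silently pairing the wrong variants.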
I did the following, which might help:

refseq_regex = r'NP_\d+\.\d+'
df_ref_subsequence["RefSeq_list"] = df_ref_subsequence["RefSeq"].str.findall(refseq_regex)
# Then use df.explode() to expand these lists to one line per RefSeq ID, then just remember later to handle duplicate matches if there are any
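Put together, the snippet above might look like the following self-contained sketch. The toy DataFrame stands in for the downloaded mapping file; its UniProt entry names, RefSeq IDs and the semicolon-separated layout of the "RefSeq" column are made up for illustration, and `RefSeq_ID` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-in for the UniProt ID-mapping output (all values are made up).
df_ref_subsequence = pd.DataFrame({
    "From": ["TOY1_HUMAN", "TOY2_HUMAN", "TOY3_HUMAN"],
    "RefSeq": [
        "NP_000001.1 [TOY1-1];NP_000002.2 [TOY1-2];",
        "NP_000003.1;",
        "",  # an entry with no RefSeq mapping
    ],
})

# Pull every RefSeq protein accession out of the free-text column.
refseq_regex = r"NP_\d+\.\d+"
df_ref_subsequence["RefSeq_list"] = df_ref_subsequence["RefSeq"].str.findall(refseq_regex)

# One row per (UniProt ID, RefSeq ID) pair; empty lists become NaN and are dropped.
df_long = (
    df_ref_subsequence.explode("RefSeq_list")
    .rename(columns={"RefSeq_list": "RefSeq_ID"})
    .dropna(subset=["RefSeq_ID"])
    .drop_duplicates(subset=["From", "RefSeq_ID"])
    .reset_index(drop=True)
)

# RefSeq IDs that map to more than one UniProt ID need manual attention.
ambiguous = df_long[df_long["RefSeq_ID"].duplicated(keep=False)]

print(df_long[["From", "RefSeq_ID"]])
print(ambiguous)
```

The resulting `RefSeq_ID` column can then be joined against the RefSeq IDs in the clinical file, with the sequence comparison used to confirm each candidate pair.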

Sorry I can't help much more but I hope this gets you a bit further!

BarKetPlace commented 5 months ago

Thanks for the answer! It clarifies my problem quite a lot.

I can't see the file you mentioned in your second response, though?

loodvn commented 5 months ago

Oops! Here you go (from https://www.uniprot.org/id-mapping/, where I selected a bunch of extra columns to include in the output when I downloaded it): proteingym_uniprot_mappings.csv

pascalnotin commented 5 months ago

@BarKetPlace - ok to close this issue?

BarKetPlace commented 5 months ago

yes, thanks