XieResearchGroup / DISAE

MSA-Regularized Protein Sequence Transformer toward Predicting Genome-Wide Chemical-Protein Interactions: Application to GPCRome Deorphanization

question about "gpcr_uniprot2triplets.json" #4

Open wawpaopao opened 2 years ago

wawpaopao commented 2 years ago

I wonder what the file "gpcr_uniprot2triplets.json" contains. I want to add some protein sequences to fine-tune the model, but I don't know how to obtain a representation like the 210 triplets in this file.

lxie21 commented 2 years ago

Hi,

Thanks for your interest. The procedure is described in the paper: https://pubs.acs.org/doi/abs/10.1021/acs.jcim.0c01285 We will add scripts for triplet extraction soon.

In brief, triplet extraction involves the following steps:

  1. Obtain a multiple sequence alignment (MSA) for the sequence of your interest.
  2. Identify the top 210 most conserved positions in your sequence from the MSA.
  3. Extract triplets for these positions.
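The three steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' extraction script. It assumes the MSA is already available as equal-length aligned strings with `-` for gaps, that conservation is scored as the fraction of rows sharing the most common residue in a column, and that a triplet is the three consecutive residues starting at a conserved position in the ungapped target sequence. The function names `column_conservation` and `extract_triplets` are hypothetical, and the paper should be consulted for the exact definitions.

```python
from collections import Counter

GAP = "-"


def column_conservation(msa):
    """Score each column as the fraction of all rows (gaps included)
    that carry the column's most common residue."""
    scores = []
    for column in zip(*msa):
        residues = [r for r in column if r != GAP]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def extract_triplets(msa, target_index=0, top_k=210):
    """Return lowercase triplets for the top_k most conserved positions
    of the target row, in sequence order (assumed definition)."""
    target = msa[target_index]
    scores = column_conservation(msa)
    ungapped = target.replace(GAP, "")

    # Map each alignment column to its position in the ungapped target.
    col_to_pos = {}
    pos = 0
    for col, residue in enumerate(target):
        if residue != GAP:
            col_to_pos[col] = pos
            pos += 1

    # Keep only columns where a full triplet fits in the target sequence.
    candidates = [c for c, p in col_to_pos.items() if p + 3 <= len(ungapped)]

    # Highest score first; ties are broken by column order.
    best = sorted(candidates, key=lambda c: (-scores[c], c))[:top_k]

    # Restore sequence order before slicing out the triplets.
    positions = sorted(col_to_pos[c] for c in best)
    return [ungapped[p:p + 3].lower() for p in positions]


# Toy alignment for illustration only (not real GPCR data).
toy_msa = ["MKTAYIAK", "MKTA-IAK", "MRTAYLAK", "MKSAYIAR"]
print(" ".join(extract_triplets(toy_msa, target_index=0, top_k=3)))
```

Joining the result with spaces gives a string in the same style as the 'pnt ntp tpl pls' entries mentioned in this thread. Whether the published file was produced with this exact scoring and windowing is not confirmed here.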

Best, Lei



wawpaopao commented 2 years ago

Hi, Thank you very much. I am still a bit confused about this file. Take the first sequence, 'A0A016RYG7', in 'gpcr_uniprot2triplets.json' as an example: there are 210 conserved positions. I don't understand how to get the distilled triplets of 'A0A016RYG7', such as 'pnt ntp tpl pls' at the start, because they seem to have no relation to the sequence of 'A0A016RYG7' in UniProt. I mean, there is no 'pnt' in the sequence.
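One way to check a mismatch like this is to test every triplet against the UniProt sequence, ignoring case since the triplets in the file are lowercase. The sketch below uses a made-up sequence rather than the real 'A0A016RYG7' entry, and the helper name `missing_triplets` is hypothetical.

```python
def missing_triplets(sequence, triplets):
    """Return the triplets that do not occur as substrings of the sequence.

    `triplets` is a space-separated string such as 'pnt ntp tpl pls'.
    The comparison is case-insensitive.
    """
    seq = sequence.lower()
    return [t for t in triplets.split() if t.lower() not in seq]


# Made-up sequence for illustration only (not the real UniProt entry).
example_sequence = "MAPNTPLSKV"
print(missing_triplets(example_sequence, "pnt ntp tpl pls"))
print(missing_triplets(example_sequence, "pnt ggg"))
```

An empty list means every triplet was found in the sequence. A non-empty list shows exactly which triplets cannot be located, which helps narrow down whether the mismatch affects all triplets or only some of them.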

Best, Aowen