baker-laboratory / RoseTTAFold-All-Atom

Protein sequence truncated internally? #124

Open amorehead opened 3 weeks ago

amorehead commented 3 weeks ago

Hello.

When running RoseTTAFold-All-Atom with the input YAML config file below, I observed that the inference code returned a .pdb file whose predicted protein structure is missing the first 19 residues of the input FASTA sequence for this protein chain. Is there a reason why this particular sequence motif would be removed automatically by the code? For context, the sequence alignment files (e.g., the one generated by HHblits) seem to show that the sequence was already truncated before being passed downstream to the alignment tools.

Config file input:

# RoseTTAFold-All-Atom inference configuration file for input 'XYZ'
defaults:
  - base
job_name: "XYZ"

protein_inputs:
  a:
    fasta_file: /tmp/tmpvwt59__7/XYZ_A.fasta

sm_inputs:
  b:
    input: 'CCOC(=O)N(C)Cc1c(C(=O)O)n(Cc2cccc3ccccc23)c2ccc(F)cc12'
    input_type: "smiles"

where /tmp/tmpvwt59__7/XYZ_A.fasta has the following file contents:

>A
MLLLPLPLLLFLLCSRAEAGEIIGGTESKPHSRPYMAYLEIVTSNGPSKFCGGFLIRRNFVLTAAHCAGRSITVTLGAHNITEEEDTWQKLEVIKQFRHPKYNTSTLHHDIMLLKLKEKASLTLAVGTLPFPSQKNFVPPGRMCRVAGWGRTGVLKPGSDTLQEVKLRLMDPQACSHFRDFDHNLQLCVGNPRKTKSAFKGDSGGPLLCAGVAQGIVSYGRSDAKPPAVFTRISHYRPWINQILQAN

t000_.msa.a3m file contents:

>A
GEIIGGTESKPHSRPYMAYLEIVTSNGPSKFCGGFLIRRNFVLTAAHCAGRSITVTLGAHNITEEEDTWQKLEVIKQFRHPKYNTSTLHHDIMLLKLKEKASLTLAVGTLPFPSQKNFVPPGRMCRVAGWGRTGVLKPGSDTLQEVKLRLMDPQACSHFRDFDHNLQLCVGNPRKTKSAFKGDSGGPLLCAGVAQGIVSYGRSDAKPPAVFTRISHYRPWINQILQAN
>UniRef100_A0A0G2K4T4 Mast cell protease 8 n=1 Tax=Rattus norvegicus TaxID=10116 RepID=A0A0G2K4T4_RAT
GEIIWGTESKPHSRPYMASITFYDSNSDLNHCGGFLVAKDIVMTAAQCNGSNIKVTLGAHNIKKQENT-QVISVVKAKPHENYHKHSQFNDIMLLKLERKAQLNGAVKTIALPRSQDSVKPGQVCTMAGWGTLANCTLSNTL-QEVNLEVQKGQKCQgMSEDYNDSIQLCVGNPNEMKATAGGDSGGPFVCDGVAQGIVSYRLCTGTLPRVFTRISSFIPWIQKTMKLL
...

Notably, in the output .pdb file (as attached), the first 19 residues (i.e., MLLLPLPLLLFLLCSRAEA) are not present in the resulting structure.

XYZ.pdb.txt
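The comparison described above can be reproduced with a short script. This is a minimal sketch rather than part of RoseTTAFold-All-Atom: `first_sequence` and `trimmed_prefix` are hypothetical helper names, and the two strings are shortened excerpts of the FASTA and .a3m contents quoted above, not the full files.

```python
from typing import Optional


def first_sequence(fasta_text: str) -> str:
    """Return the sequence of the first record in FASTA/A3M-formatted text."""
    seq_lines = []
    seen_header = False
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # a second header means the first record is complete
            seen_header = True
            continue
        if seen_header:
            seq_lines.append(line)
    return "".join(seq_lines)


def trimmed_prefix(full_seq: str, query_seq: str) -> Optional[str]:
    """Return the N-terminal residues of full_seq that are absent from query_seq.

    Returns None when query_seq is empty or is not a suffix of full_seq,
    i.e. when the difference is not a simple N-terminal truncation.
    """
    if not query_seq or not full_seq.endswith(query_seq):
        return None
    return full_seq[: len(full_seq) - len(query_seq)]


# Shortened excerpts of the input FASTA and the MSA quoted above.
input_fasta = ">A\nMLLLPLPLLLFLLCSRAEAGEIIGGTESKPHSRPYMAYLEIV\n"
msa_a3m = (
    ">A\nGEIIGGTESKPHSRPYMAYLEIV\n"
    ">UniRef100_A0A0G2K4T4\nGEIIWGTESKPHSRPYMASITFY\n"
)

prefix = trimmed_prefix(first_sequence(input_fasta), first_sequence(msa_a3m))
if prefix is None:
    print("MSA query is not an N-terminally truncated copy of the input")
else:
    print(f"{len(prefix)} residues trimmed: {prefix}")
```

Running the same functions on the real files (read with `open(path).read()`) would show whether the truncation already exists in the MSA query, which would place it before the structure prediction step.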

sky1ove commented 3 weeks ago

I guess it is to remove the signal peptide?
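One rough way to sanity-check that guess is to look at the removed prefix itself. The sketch below is only a crude heuristic, not how RoseTTAFold-All-Atom or any dedicated predictor such as SignalP works: the function name `signal_peptide_hints` is hypothetical, and the two residue sets are assumptions chosen for illustration. It reports the fraction of hydrophobic residues and whether small residues sit at positions -3 and -1 relative to the putative cleavage site, a pattern often seen in signal peptides.

```python
# Assumed residue sets for illustration only; published scales differ.
HYDROPHOBIC = set("AILMFVW")
SMALL = set("AGSCT")


def signal_peptide_hints(prefix: str) -> dict:
    """Summarize crude signal-peptide-like features of an N-terminal prefix."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    hydrophobic_count = sum(res in HYDROPHOBIC for res in prefix)
    # Positions -3 and -1 are counted back from the end of the prefix,
    # i.e. from the residue just before the putative cleavage site.
    small_at_cleavage = (
        len(prefix) >= 3 and prefix[-3] in SMALL and prefix[-1] in SMALL
    )
    return {
        "length": len(prefix),
        "hydrophobic_fraction": hydrophobic_count / len(prefix),
        "small_at_minus3_minus1": small_at_cleavage,
    }


# The 19 residues reported missing from the output structure.
hints = signal_peptide_hints("MLLLPLPLLLFLLCSRAEA")
print(hints)
```

A mostly hydrophobic prefix ending in a small-X-small motif is consistent with the signal peptide explanation, but it does not confirm what the pipeline actually does; that would need to be checked in the preprocessing scripts.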