Model is adding an amino acid to the original sequence

tony-res commented 9 months ago

I'm using a nanobody PDB as the input to ProteinMPNN. I simply changed examples/submit_example_1.sh to point to the directory containing the attached PDB file.

When I run ProteinMPNN, the FASTA file has an additional proline inserted into the sequence. Is there something different about my PDB file that may be causing this behavior?

The PDB file has this sequence:

NANOBODY_TESTING.H
EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYY
ADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS

ProteinMPNN gives this for the FASTA:

NANOBODY_TESTING, score=1.6920, global_score=1.6920, fixed_chains=[], designed_chains=['H'], model_name=v_48_020, git_hash=8907e6671bfbfc92303b5f79c4b5e6ce47cdef57, seed=37
EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYY
ADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDTPAPNDYWGQGTLVTVSS

Note the highlighted proline is not in the original PDB file.

I'm probably doing something wrong, but I'm having trouble seeing it. Any help would be greatly appreciated.

Thanks! -Tony

NANOBODY_TESTING.txt

tony-res commented 9 months ago

I've tracked it down a little more. I think it is coming from the script that creates the parsed_pdbs.jsonl file because that file looks like this:

{"seq_chain_H": "EVQLVESGP-GLVQPGKSLRLSCVASGFTF----SGYGMHWVRQAPGKGLEWIALIIYD--ESNKYYADSVK-GRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDTPAPNDYWGQGTLVTVSS", "coords_chain_H": {"N_chain_H": [[-11.394, -4.934, -17.026], [-9.415, -5.505, -14.037], [-7.515, -2.819, -12.586], [-4.768, -1.263, -10.961], [-2.865, 1.551, -10.077], [-0.044, 2.882, -8.846], [1.493, 5.892, -7.505], [3.436, 8.866, -6.619], [6.281, 8.208, -5.929], [NaN, NaN, NaN], [9.799, 8.533, -4.849], [13.014, 10.127, -3.929], [13.389, 10.802, -0.512], [14.366, 12.408, 2.32], [15.559, 11.094, 5.561], [13.32, 11.104, 8.369], [11.537, 12.932, 7.067], [8.281, 13.081, 5.352], [6.011, 11.214, 3.226], [3.476, 10.527, 0.716], [1.955, 8.134, -0.727], [-0.102, 7.201, -3.541], [-2.64, 5.229, -4.93], [-4.499, 4.565, -7.639], [-7.528, 2.983, -8.404], [-9.327, 0.896, -10.749], [-11.897, -0.72, -12.862], [-13.244, -2.535, -11.189], [-15.546, -2.74, -8.358], [-14.286, -0.709, -6.165], [NaN, NaN, NaN], [NaN, NaN, NaN], [NaN, NaN, NaN], [NaN, NaN, NaN],

Note that the P is already there, so the bug seems to be in the parse_multiple_chains.py script.
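One way to pin down where the two sequences diverge is to strip the gap characters ("-") from seq_chain_H and compare the result with the sequence from the PDB file. The following is only a minimal sketch: it assumes each line of parsed_pdbs.jsonl is a JSON object with a seq_chain_H key, as in the excerpt above, and the helper name first_mismatch is made up for illustration.

```python
import json


def first_mismatch(a, b):
    """Return the 0-based index where a and b first differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No differing character in the shared prefix: differ only if lengths differ.
    return None if len(a) == len(b) else min(len(a), len(b))


# Sequence reported for the PDB file in this thread.
pdb_seq = (
    "EVQLVESGPGLVQPGKSLRLSCVASGFTFSGYGMHWVRQAPGKGLEWIALIIYDESNKYY"
    "ADSVKGRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDPTAPNDYWGQGTLVTVSS"
)

# Stand-in for one line of parsed_pdbs.jsonl, reduced to the sequence field.
jsonl_line = json.dumps({
    "seq_chain_H": (
        "EVQLVESGP-GLVQPGKSLRLSCVASGFTF----SGYGMHWVRQAPGKGLEWIALIIYD--"
        "ESNKYYADSVK-GRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDTPAPNDYWGQGTLVTVSS"
    )
})

# Gaps mark positions with no residue, so drop them before comparing.
parsed_seq = json.loads(jsonl_line)["seq_chain_H"].replace("-", "")

idx = first_mismatch(pdb_seq, parsed_seq)
if idx is None:
    print("sequences are identical")
else:
    print(idx, pdb_seq[idx:idx + 5], parsed_seq[idx:idx + 5])
```

With a real file, jsonl_line would instead be read from parsed_pdbs.jsonl, one entry per line.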

tony-res commented 9 months ago

I printed out the intermediate variable seq in parse_multiple_chains.py. At position 111, there is a dictionary with two values rather than one. That's where the bug is coming from. I'll see if I can find a patch and submit it.

tony-res commented 9 months ago

It's this: the PDB file is IMGT-numbered, so some residue numbers carry an insertion letter (e.g. 112A). The parsing code assumes the residue number is just an integer, so positions 112A and 112 get lumped together.
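For context, the PDB fixed-column format stores the residue sequence number in columns 23-26 and the insertion code in column 27, so a position such as 112A can be kept distinct from 112 by treating the (number, insertion code) pair as the residue identity. The sketch below illustrates that idea only and is not the code from parse_multiple_chains.py; the function names, residue names, and coordinates are made up.

```python
def split_residue_id(atom_line):
    """Return (residue_number, insertion_code) from a PDB ATOM record.

    Columns 23-26 hold the residue sequence number and column 27 holds
    the insertion code (empty string here when the column is blank).
    """
    resseq = int(atom_line[22:26])
    icode = atom_line[26].strip()
    return resseq, icode


def make_atom_line(res_name, chain, resseq, icode=" "):
    """Hypothetical helper: build a fixed-width ATOM record with dummy coordinates."""
    return (
        "ATOM  {:>5} {:<4}{}{:>3} {}{:>4}{}   {:8.3f}{:8.3f}{:8.3f}".format(
            1, " CA", " ", res_name, chain, resseq, icode, 0.0, 0.0, 0.0
        )
    )


# Two made-up records: one with insertion code "A", one without.
lines = [
    make_atom_line("ALA", "H", 112, "A"),
    make_atom_line("GLY", "H", 112),
]

ids = [split_residue_id(line) for line in lines]
print(ids)
```

Keyed this way, the two records stay separate entries in file order instead of sharing one integer position.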

tony-res commented 9 months ago

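            if resn[-1].isalpha():
                # Insertion code present (e.g. "112A"): split off the letter.
                print(resn)  # debug: raw residue label
                resa,resn = resn[-1],int(resn[:-1])-1
                print(resn)  # debug: resulting 0-based index
            else:
                # Plain integer residue number.
                resa,resn = "",int(resn)-1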

Note that if the position ends in an alphabetic character (e.g. "112A"), the code strips the character and subtracts 1 from the integer. So "112A" is stored at position 111, the same position as a plain "112".
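The collision can be reproduced in isolation. This is a rough sketch that mirrors only the branch quoted above. The nested dictionary layout (index, then insertion code) is an assumption based on the "dictionary with two values" observation earlier in the thread, and the residue labels and three-letter names are placeholders.

```python
def to_index(resn):
    """Mirror of the quoted branch: drop a trailing letter, then subtract 1."""
    if resn[-1].isalpha():
        return resn[-1], int(resn[:-1]) - 1
    return "", int(resn) - 1


# Assumed layout: seq[index][insertion_code] = residue name.
seq = {}

# Residue labels in file order; the names are placeholders, not real data.
for label, res_name in [("111", "ALA"), ("112A", "GLY"), ("112", "SER")]:
    resa, idx = to_index(label)
    seq.setdefault(idx, {})[resa] = res_name

# Indices that hold more than one residue are the lumped positions.
collisions = {i: codes for i, codes in seq.items() if len(codes) > 1}
print(collisions)
```

Both "112A" and "112" land on index 111, so that entry ends up holding two residues while every other index holds one.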

tony-res commented 9 months ago

I created a patch and opened a PR.

dauparas / ProteinMPNN

Model is adding an amino acid to the original sequence #92