Open tony-res opened 9 months ago
I've tracked it down a little more. I think it is coming from the script that creates the parsed_pdbs.jsonl file, because that file looks like this:
{"seq_chain_H": "EVQLVESGP-GLVQPGKSLRLSCVASGFTF----SGYGMHWVRQAPGKGLEWIALIIYD--ESNKYYADSVK-GRFTISRDNSKNTLYLQMSSLRAEDTAVFYCAKVKFYDTPAPNDYWGQGTLVTVSS", "coords_chain_H": {"N_chain_H": [[-11.394, -4.934, -17.026], [-9.415, -5.505, -14.037], [-7.515, -2.819, -12.586], [-4.768, -1.263, -10.961], [-2.865, 1.551, -10.077], [-0.044, 2.882, -8.846], [1.493, 5.892, -7.505], [3.436, 8.866, -6.619], [6.281, 8.208, -5.929], [NaN, NaN, NaN], [9.799, 8.533, -4.849], [13.014, 10.127, -3.929], [13.389, 10.802, -0.512], [14.366, 12.408, 2.32], [15.559, 11.094, 5.561], [13.32, 11.104, 8.369], [11.537, 12.932, 7.067], [8.281, 13.081, 5.352], [6.011, 11.214, 3.226], [3.476, 10.527, 0.716], [1.955, 8.134, -0.727], [-0.102, 7.201, -3.541], [-2.64, 5.229, -4.93], [-4.499, 4.565, -7.639], [-7.528, 2.983, -8.404], [-9.327, 0.896, -10.749], [-11.897, -0.72, -12.862], [-13.244, -2.535, -11.189], [-15.546, -2.74, -8.358], [-14.286, -0.709, -6.165], [NaN, NaN, NaN], [NaN, NaN, NaN], [NaN, NaN, NaN], [NaN, NaN, NaN],
Note that the P is already there, so the bug seems to be in the script parse_multiple_chains.py.
I printed out an intermediate variable, seq, in parse_multiple_chains.py. It gives me this:
Note that at position 111, there is a dictionary with two values rather than one. That's where the bug is coming from. I'll see if I can find a patch and submit it.
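A quick way to spot such positions is to scan the intermediate structure for entries that hold more than one residue. This is only a sketch: it assumes seq maps an integer position to a dictionary keyed by insertion code, as the printout described above suggests, and both the helper name find_shared_positions and the sample data are hypothetical.

```python
def find_shared_positions(seq):
    """Return {position: insertion codes} for positions holding more than one residue."""
    return {pos: sorted(entries) for pos, entries in seq.items() if len(entries) > 1}


# Hypothetical data, shaped like the structure described above.
# The residue names are placeholders, not values from the real file.
seq = {
    110: {"": "ALA"},
    111: {"": "ASN", "A": "PRO"},  # two residues share one position
    112: {"": "ASP"},
}

shared = find_shared_positions(seq)
print(shared)
```

Any position reported here is one where two distinct residues were mapped to the same index.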
It's this. The PDB file is IMGT numbered. The parse code is assuming that it is just an integer. So the 112A and the 112 positions are getting lumped together.
    if resn[-1].isalpha():
        print(resn)
        resa,resn = resn[-1],int(resn[:-1])-1
        print(resn)
    else:
        resa,resn = "",int(resn)-1
Note that if the position ends in an alphabetic character (e.g. "112A"), the code strips the character and subtracts 1 from the integer. So "112A" is listed as position 111, which is the same position that plain "112" gets.
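The collision can be reproduced in isolation. The sketch below is not the patch from the PR and not the actual ProteinMPNN code; the helper names (split_residue_id, collapsed_index) and the residue IDs are made up for illustration. It first mimics the index arithmetic from the snippet above, then shows one possible alternative: keying each residue by the pair (number, insertion code) and assigning slots in file order. File order is used instead of sorting because numbering schemes do not necessarily place insertion codes in alphabetical order along the chain.

```python
def split_residue_id(resn):
    """Split a residue ID such as '112A' into (number, insertion_code)."""
    resn = resn.strip()
    if resn and resn[-1].isalpha():
        return int(resn[:-1]), resn[-1]
    return int(resn), ""


def collapsed_index(resn):
    """Mimic the snippet above: drop the insertion code, then subtract 1."""
    number, _ = split_residue_id(resn)
    return number - 1


# Hypothetical residue IDs in the order they appear in a file.
ids = ["111", "112A", "112", "113"]

# Integer-only indexing: "112A" and "112" both land on index 111.
old = [collapsed_index(r) for r in ids]
print(old)

# Alternative: one slot per distinct (number, insertion code), in file order.
slot = {}
for r in ids:
    slot.setdefault(split_residue_id(r), len(slot))
print(slot)
```

With the integer-only index, four residues occupy three positions; with the paired key, each residue keeps its own slot.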
I created a patch and did a PR.
I'm using a nanobody PDB as the input to ProteinMPNN. I simply changed examples/submit_example_1.sh to point to the directory for the attached PDB file. When I run ProteinMPNN, the FASTA file has an additional proline inserted into the sequence. Is there something different about my PDB file that may be causing this behavior?
The PDB file has this sequence:
ProteinMPNN gives this for the FASTA:
Note the highlighted proline is not in the original PDB file.
I'm probably doing something wrong, but I'm having trouble seeing it. Any help would be greatly appreciated.
Thanks! -Tony
NANOBODY_TESTING.txt