Closed: tony-res closed this issue 1 year ago
It looks like the `parsed_pdbs.jsonl` file includes a `-` (which I assume is a gap). So I needed to put an `X` in the original chain sequence at that gap position.
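A minimal sketch of that gap handling, in case it helps. The function names and the example sequence are hypothetical and not part of ProteinMPNN; it only assumes the parsed sequence is a plain string in which `-` marks a gap.

```python
def gap_positions(seq: str, gap: str = "-") -> list[int]:
    """Return the 0-based indices where the parsed sequence has a gap."""
    return [i for i, aa in enumerate(seq) if aa == gap]


def mask_gaps(seq: str, gap: str = "-", placeholder: str = "X") -> str:
    """Replace every gap character with a placeholder so the sequence length is preserved."""
    return seq.replace(gap, placeholder)


# Hypothetical chain sequence with three gap positions.
parsed_seq = "MKT-LV--A"
gaps = gap_positions(parsed_seq)
masked_seq = mask_gaps(parsed_seq)
```

Keeping the placeholder at the gap position means that any per-position data, such as a bias matrix, still lines up with the parsed sequence row by row.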
From their examples, it would look something like this: {A: -1.1, F: 0.7}. I tried one, and my JSON file looks like this, and it ran OK: {"A": -1.50487820683775, "C": -1.50336065414758, "D": 0.451363945615286, "E": 0.407588387244967, "F": 1.37065067139209, "G": -1.50487820683775, "H": 0.801568412577877, "I": 0.538915062355937, "K": -0.292820546680218, "L": 0.670241737466907, "M": -0.0301671964582751, "N": -0.227157209124733, "P": -1.50487820683775, "Q": -0.0301671964582751, "R": 0.889119529318528, "S": -0.774351688753783, "T": -0.686800572013134, "V": 0.0136083619120431, "W": 1.63330402161404, "Y": 1.28309955465145}
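For anyone building such a file by hand, here is a rough sketch of serializing a global amino-acid bias dictionary as a single JSON line. The helper name `bias_to_jsonl_line` is hypothetical, and the check that keys are one of the 20 standard one-letter codes is my own assumption about what is sensible, not something the tool is known to enforce.

```python
import json

STANDARD_AAS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def bias_to_jsonl_line(bias: dict[str, float]) -> str:
    """Validate the keys and return the bias dictionary as one JSON line."""
    unknown = [aa for aa in bias if aa not in STANDARD_AAS or len(aa) != 1]
    if unknown:
        raise ValueError(f"Unknown amino acid codes: {unknown}")
    return json.dumps(bias) + "\n"


# Same shape as the small example above: disfavor A, favor F.
line = bias_to_jsonl_line({"A": -1.1, "F": 0.7})
```

The returned string can be written to a file with a normal `open(path, "w")` call and then passed to whichever bias option you are using.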
One of the examples shows how to use one of their scripts to create the JSON file automatically.
I do not know if it is possible to have different biases for each position. It would be tedious, but maybe you could write a script that designs one position at a time, passing the respective bias each time. I would also like to know if there is a better approach here!
Thanks!
Yes. I've been successful with the per-residue bias. Essentially, it is a JSON object where the first level is the name, the second is the chain, and the third is an N x M matrix of bias values, where N is the number of residues and M is 21 (the 20 amino acids plus a gap character). This works well so long as you account for any gaps in the PDB file's sequence.
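Here is a rough sketch of building that nested structure. The column order `ACDEFGHIKLMNPQRSTVWYX` is an assumption on my part, so check it against the alphabet your ProteinMPNN version uses. The names `build_bias_matrix` and `my_protein`, the example sequence, and the chain ID are all hypothetical.

```python
import json

# Assumed column order: 20 amino acids plus "X" for the gap/unknown character.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def build_bias_matrix(seq: str, position_biases: dict[int, dict[str, float]]) -> list[list[float]]:
    """Return a len(seq) x 21 matrix of zeros with the requested biases filled in.

    position_biases maps a 0-based index into seq (gaps included as "X")
    to a {amino_acid: bias} dictionary for that position.
    """
    matrix = [[0.0] * len(ALPHABET) for _ in seq]
    for pos, biases in position_biases.items():
        if not 0 <= pos < len(seq):
            raise IndexError(f"Position {pos} is outside the sequence")
        for aa, value in biases.items():
            matrix[pos][ALPHABET.index(aa)] = value
    return matrix


# Hypothetical chain with one gap already replaced by "X".
seq = "MKTXLV"
matrix = build_bias_matrix(seq, {1: {"C": 0.5}})

# First level: name, second level: chain, third level: the N x 21 matrix.
record = {"my_protein": {"A": matrix}}
line = json.dumps(record)
```

Rows are indexed over the full parsed sequence, gap positions included, so a bias meant for a given residue does not shift when the structure has missing residues.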
I'm able to mutate just a few positions in the sequence. What I'd like to do now is to provide amino acid biases to these positions.
Does anyone have an example of how to do this?
My current attempt is to pass `--bias_by_res_jsonl` into `proteinmpnn_run.py`. For the JSON, I have a dictionary with a 2D array (sequence length by 21 amino acid IDs), where each row is a position in the sequence and each column is an amino acid. For example, the 0.5 above would weight the second amino acid ("C") as having the highest probability in that position.
Is this correct? When I run it, ProteinMPNN doesn't seem to respect the biases.