dauparas / ProteinMPNN

Code for the ProteinMPNN paper
MIT License

--bias_by_res_jsonl example? #73

Closed: tony-res closed this issue 9 months ago

tony-res commented 9 months ago

I'm able to mutate just a few positions in the sequence. What I'd like to do now is to provide amino acid biases to these positions.

Does anyone have an example of how to do this?

My current attempt is to pass --bias_by_res_jsonl to protein_mpnn_run.py. For the JSON, I have a dictionary containing a 2D array (sequence length by 21 amino acid IDs).

{"PROTEIN123": {"A": [[0.0001, 0.5, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.49, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0001], ... ]}}

where each row is a position in the sequence and each column is an amino acid. For example, the 0.5 above would weight the second amino acid ("C") as having the highest probability in that position.

Is this correct? When I run it, ProteinMPNN doesn't seem to respect the biases.
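For intuition about how the values in one row could translate into a preference, here is a minimal sketch. It assumes the per-residue bias is added to the model's logits before the softmax (that is how I read the sampling code, but treat it as an assumption) and that the columns follow the order ACDEFGHIKLMNPQRSTVWYX. The flat logits and the bias values are made up for illustration. If the assumption holds, the entries are relative offsets rather than probabilities, so a row does not need to sum to 1.

```python
import math

# ASSUMPTION: column order of the 21-wide bias row.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased_probs(logits, bias, temperature=1.0):
    """ASSUMPTION: the bias is simply added to the logits before softmax."""
    return softmax([(l + b) / temperature for l, b in zip(logits, bias)])


# Hypothetical model output with no preference at this position.
logits = [0.0] * len(ALPHABET)
c_index = ALPHABET.index("C")

# A row written like a probability vector: 0.5 for "C", 0.0001 elsewhere.
small = [0.0001] * len(ALPHABET)
small[c_index] = 0.5

# A row written as a log-space offset: +5.0 for "C", 0.0 elsewhere.
large = [0.0] * len(ALPHABET)
large[c_index] = 5.0

p_small = biased_probs(logits, small)[c_index]
p_large = biased_probs(logits, large)[c_index]
print(f"P(C) with bias 0.5: {p_small:.3f}")
print(f"P(C) with bias 5.0: {p_large:.3f}")
```

Under these assumptions the first row only nudges "C" slightly above the other residues, while the second makes it the dominant choice.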

tony-res commented 9 months ago

It looks like the parsed_pdbs.jsonl file includes a - (which I assume is a gap), so I needed to put an X in the original chain sequence at that gap position.
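A quick way to find those gap positions is to scan the parsed chain sequence for "-". This is only a sketch: it assumes each line of the parsed file is a JSON object with a per-chain key of the form seq_chain_<chain>, and the example record below is invented for illustration.

```python
import json


def chain_sequence_and_gaps(parsed_line, chain):
    """Return the chain sequence and the 0-based indices of '-' characters.

    ASSUMPTION: the parsed record stores the sequence under
    'seq_chain_<chain>'; adjust the key if your file differs.
    """
    record = json.loads(parsed_line)
    seq = record["seq_chain_" + chain]
    gaps = [i for i, aa in enumerate(seq) if aa == "-"]
    return seq, gaps


# Hypothetical one-line record standing in for a line of the parsed file.
example_line = json.dumps({"name": "PROTEIN123", "seq_chain_A": "MKT-LV"})

seq, gaps = chain_sequence_and_gaps(example_line, "A")
# Sequence with X at the gap positions, so it lines up with the parsed chain.
padded = seq.replace("-", "X")
print(len(seq), gaps, padded)
```

The length of seq, gaps included, is the number of rows the bias matrix needs so that each row lines up with the right residue.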

linamp commented 9 months ago

From their examples, it looks something like this: {A: -1.1, F: 0.7}. I tried one and it ran OK. My JSON file looks like this: {"A": -1.50487820683775, "C": -1.50336065414758, "D": 0.451363945615286, "E": 0.407588387244967, "F": 1.37065067139209, "G": -1.50487820683775, "H": 0.801568412577877, "I": 0.538915062355937, "K": -0.292820546680218, "L": 0.670241737466907, "M": -0.0301671964582751, "N": -0.227157209124733, "P": -1.50487820683775, "Q": -0.0301671964582751, "R": 0.889119529318528, "S": -0.774351688753783, "T": -0.686800572013134, "V": 0.0136083619120431, "W": 1.63330402161404, "Y": 1.28309955465145}

One of the examples shows how to use one of their scripts to create the JSON file automatically.

I do not know if it is possible to have different biases for each position. It would be tedious, but maybe you could write a script that designs one position at a time, passing the respective bias. I would also like to know if there is a better approach here!
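If you would rather build that flat dictionary by hand, a minimal sketch might look like the following. It assumes the file is a single JSON object mapping one-letter amino acid codes to bias values and that this is the format the global bias option (assumed to be --bias_AA_jsonl) expects. The function name and the output file name are made up.

```python
import json

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_global_bias_line(biases):
    """Serialize a {amino_acid: bias} dict as one JSON line.

    Hypothetical helper: rejects keys that are not standard one-letter codes.
    """
    unknown = sorted(set(biases) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"unknown amino acid codes: {unknown}")
    return json.dumps(biases)


# Disfavor alanine, favor phenylalanine (values taken from the snippet above).
line = make_global_bias_line({"A": -1.1, "F": 0.7})
print(line)

# To save it, write the line to a file, for example:
# with open("bias_AA.jsonl", "w") as fh:
#     fh.write(line + "\n")
```

Note that the keys have to be quoted strings for the file to be valid JSON.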

tony-res commented 9 months ago

Thanks!

Yes. I've been successful doing the per-residue bias. Essentially, it is a JSON where the first level is the name, the second is the chain, and the third is an N x M matrix of bias values, where N is the number of residues and M is 21 (the 20 amino acids plus a gap character). This works well as long as you account for any gaps in the PDB file's sequence.
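A minimal sketch of building that structure is below. The column order ACDEFGHIKLMNPQRSTVWYX is an assumption (it matches "C" being the second column, as mentioned earlier in the thread). The function name, the 0-based position indexing, and the bias values are choices made for this example.

```python
import json

# ASSUMPTION: column order of the 21-wide bias rows.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def make_bias_by_res(name, chain, seq_len, position_biases):
    """Build {name: {chain: N x 21 matrix}} with zeros except where given.

    seq_len must count gap positions too, so rows stay aligned with the
    parsed chain sequence. position_biases maps a 0-based position to a
    {amino_acid: bias} dict.
    """
    matrix = [[0.0] * len(ALPHABET) for _ in range(seq_len)]
    for pos, aa_bias in position_biases.items():
        if not 0 <= pos < seq_len:
            raise IndexError(f"position {pos} is outside the chain")
        for aa, value in aa_bias.items():
            matrix[pos][ALPHABET.index(aa)] = value
    return {name: {chain: matrix}}


# Hypothetical 6-residue chain: favor C (and, slightly less, N) at position 0.
bias = make_bias_by_res("PROTEIN123", "A", 6, {0: {"C": 2.0, "N": 1.9}})
line = json.dumps(bias)

# To save it for --bias_by_res_jsonl, write the line to a file, for example:
# with open("bias_by_res.jsonl", "w") as fh:
#     fh.write(line + "\n")
```

Leaving every other row at zero keeps those positions unbiased.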