PDB-REDO / alphafill

AlphaFill is an algorithm based on sequence and structure similarity that “transplants” missing compounds to the AlphaFold models. By adding the molecular context to the protein structures, the models can be more easily appreciated in terms of function and structure integrity.
https://alphafill.eu
BSD 2-Clause "Simplified" License

Documentation about the --pae-file option #42

Closed phupe closed 5 months ago

phupe commented 5 months ago

Dear @mhekkel ,

I used the --pae-file option with my local installation of AlphaFill and got an error.

alphafill process --min-hsp-identity 0.9 --pae-file pae.json --pdb-dir pdb-redo/mmcif_files --pdb-fasta pdb-redo/fasta/pdb-redo.fasta --ligands pdb-redo/ligands/af-ligands.cif ranked_0.pdb res.cif

Error when processing 7VUX for nohd
 >> The supplied PAE data is inconsistent with the residues in the AlphaFold structure for asym ID A

My data

My input fasta file contains 2 proteins:

I ran AlphaFold-multimer.

The PAE json file was generated using the function get_pae_json from the source code https://github.com/google-deepmind/alphafold/blob/v2.3.2/alphafold/notebooks/notebook_utils.py#L146. It corresponds to the format explained in the AlphaFold FAQ (see https://alphafold.ebi.ac.uk/faq). In my opinion, the format is correct and compliant with what your source code expects.

My PAE matrix in the json file is therefore 259x259.

Debug

I have patched the AlphaFill source code to print additional information before the following block: https://github.com/PDB-REDO/alphafill/blob/v2.1.1/src/alphafill.cpp#L774-L775

The values of the following variables in the source code are:

id = 1
seq.length() = 135
af_res.size() = 135
pae.dim_m() = 259
v_pae.empty() = 0

Therefore, the condition if (not v_pae.empty() and pae.dim_m() != af_res.size()) evaluates to true, because v_pae is not empty and 259 != 135 (https://github.com/PDB-REDO/alphafill/blob/v2.1.1/src/alphafill.cpp#L774), and this raises the error.
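The mismatch can be reproduced outside AlphaFill with a small Python sketch. It is not AlphaFill's C++ code; it only mirrors the size comparison described above. It assumes the JSON layout produced by get_pae_json (a one-element list holding a predicted_aligned_error matrix), and the helper names pae_dimension and pae_is_consistent are hypothetical.

```python
import json


def pae_dimension(pae_json_text: str) -> int:
    """Return N for an N x N PAE matrix stored in AFDB-style JSON (assumed layout)."""
    data = json.loads(pae_json_text)
    matrix = data[0]["predicted_aligned_error"]
    # Every row must have as many columns as there are rows.
    if any(len(row) != len(matrix) for row in matrix):
        raise ValueError("PAE matrix is not square")
    return len(matrix)


def pae_is_consistent(pae_json_text: str, chain_length: int) -> bool:
    """Mirror of the size comparison: matrix dimension must equal the chain length."""
    return pae_dimension(pae_json_text) == chain_length


# Toy stand-in for the multimer PAE file: a 259 x 259 matrix of zeros.
toy_pae = json.dumps([{
    "predicted_aligned_error": [[0.0] * 259 for _ in range(259)],
    "max_predicted_aligned_error": 31.75,
}])

print(pae_dimension(toy_pae))            # 259
print(pae_is_consistent(toy_pae, 135))   # False: 259 != 135
```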

Note that without using the --pae-file option, it just works fine.

Questions

My understanding is that the PAE matrix should contain the information only for the first chain, is it correct? Or is there anything I did wrong?

Do we expect a better prediction when we provide the PAE matrix, or can I just use AlphaFill without the --pae-file option?

Maybe additional information should be provided in the documentation of the --pae-file option.

Thanks.

mhekkel commented 5 months ago

AlphaFill was designed to work with just a single chain. It might do something with multiple chains, but apparently the PAE code is not taking into account multiple chains.
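Since the PAE check compares the matrix size against a single chain, one conceivable workaround is to cut the per-chain block out of the multimer matrix before passing it on. The Python sketch below is untested against AlphaFill and rests on two assumptions: the JSON layout is the one written by get_pae_json, and the residues in the matrix are ordered chain by chain. The function name slice_pae_for_chain and its parameters are hypothetical.

```python
import json


def slice_pae_for_chain(pae_json_text: str, chain_lengths: list[int], chain_index: int) -> str:
    """Return AFDB-style PAE JSON restricted to one chain's diagonal block.

    Assumes residues are laid out chain after chain along both matrix axes.
    """
    entry = json.loads(pae_json_text)[0]
    matrix = entry["predicted_aligned_error"]
    if sum(chain_lengths) != len(matrix):
        raise ValueError("chain lengths do not add up to the PAE matrix dimension")

    # Offsets of the requested chain inside the concatenated residue list.
    start = sum(chain_lengths[:chain_index])
    stop = start + chain_lengths[chain_index]

    # Keep only rows and columns that belong to the requested chain.
    block = [row[start:stop] for row in matrix[start:stop]]
    sliced = [{
        "predicted_aligned_error": block,
        "max_predicted_aligned_error": entry["max_predicted_aligned_error"],
    }]
    return json.dumps(sliced, separators=(",", ":"))


# Toy 5 x 5 matrix for two chains of lengths 2 and 3; cell (i, j) holds 10*i + j.
toy_pae = json.dumps([{
    "predicted_aligned_error": [[10 * i + j for j in range(5)] for i in range(5)],
    "max_predicted_aligned_error": 31.75,
}])

second_chain = json.loads(slice_pae_for_chain(toy_pae, [2, 3], 1))
print(second_chain[0]["predicted_aligned_error"])
```

For the case in this issue the call would use chain lengths that sum to 259 with 135 for the first chain, which implies 124 for the second. Note that the sliced matrix drops all inter-chain PAE values.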

The PAE file has the same format as the one downloaded from AlphaFold. The location used to fetch this data comes from 3D-Beacons.

Anyway, we're not doing anything special with this data. We were asked to provide the data and use it to calculate scores for placed ligands based on the PAE score of neighbouring atoms. But the results were not convincing and did not add more information. As a result, the pae scores are now filled in in the json file, but that's about it.

phupe commented 5 months ago

Thank you @mhekkel for your feedback.