PDB-REDO / alphafill

AlphaFill is an algorithm based on sequence and structure similarity that “transplants” missing compounds to the AlphaFold models. By adding the molecular context to the protein structures, the models can be more easily appreciated in terms of function and structure integrity.
https://alphafill.eu
BSD 2-Clause "Simplified" License
89 stars 16 forks source link

Alphafill can't find pdb files albeit finding them when generating the pdb-id-list #37

Closed sadiogo closed 9 months ago

sadiogo commented 11 months ago

Hello!

I`ve been having some trouble running alphafill. I've managed to create the conf file and correctly provide paths for the ligands, pdb-dir and pdb-fasta files. It looks like this:

ligands=/home/sadiogo/alphafill/af-ligands.cif
pdb-fasta=/home/sadiogo/alphafill/pdbredo_seqdb.txt
pdb-dir=./pdb

While the fasta file is the original seq file from https://pdb-redo.eu/others/pdbredo_seqdb.txt, my PDB database contains only 10 cif files downloaded manually from PDB-REDO. This database was organized as previously mentioned here, wherein cif files were placed in 4-letter folders (PDB-ID) and these folders were all placed in a 2-character folder named 00:

pdb/00/1s4p/1s4p_final.cif
pdb/00/1s4n/1s4n_final.cif
pdb/00/7xjv/7xjv_final.cif
pdb/00/5mm0/5mm0_final.cif
...

When I run alphafill process input.cif output.cif I get the following message:

PDB file for 1s4n not found
PDB file for 1s4n not found
PDB file for 1s4o not found
PDB file for 1s4o not found
PDB file for 1s4p not found
PDB file for 1s4p not found
PDB file for 7boo not found
PDB file for 7boo not found
PDB file for 7boo not found
PDB file for 7boo not found
PDB file for 7boo not found
PDB file for 7boo not found
PDB file for 7bop not found
PDB file for 7bop not found
PDB file for 7bop not found
PDB file for 7bop not found
PDB file for 7bop not found
PDB file for 7bop not found
PDB file for 5a07 not found
PDB file for 5a07 not found
PDB file for 5a08 not found
PDB file for 5a08 not found
  1. I assume that alphafill has extracted the sequence from the input cif file and ran blast against the pdbredo-seqdb.txt fasta file, thereby returning 7 hits with more than 25% identity, but was unable to find the structure file for these 7 hits. Is this correct?

Then I decided to create a pdb-id-list file by running alphafill prepare-pdb-list > pdb-id-list.txt. This correctly generated a pdb-id-list file containing the 10 proteins in my pdb folder:

1S4O, 1S4P, 2BO7, 2BO8, 2C59, 5GGI, 5MM0, 6YV9, 7E9K, 7XJV

Then, when I run alphafill process input.cif output.cif --pdb-id-list=pdb-id-list.txt, I don't get the previous error messages, but the output.cif file contains none of the ligands and the json file shows no hits:

{"alphafill_version":"2.0.0","date":"2023-09-28","file":"inputs\/caMNT1.pdb.cif","hits":[],"id":"CAMNT1"}

I have three further questions:

  1. When reading alphafill -h, it states that pdb-id-list will contain the proteins that have transplantable ligands. However, this is in contradiction with what is written in the README.md file, which states that the list will contain proteins "that can be skipped since they do not contain interesting ligands". Which one is true?

  2. What would be the successfull outputs from alphafill? For example, the json file would contain the hits with unique transplantable ligands and the cif file would depict my protein with all the ligands bound?

  3. Any idea why running alphafill process without the pdb-id-list results in the "PDB file not found" traceback? For example, the structure of 1s4p is available in the pdb folder but alphafill stated it could not find it.

I have also tried changing the min-hsp-identity setting to either 0.1 or 0.9 to see if different proteins would pop up in the "PDB file not found" messages (assuming those were the positive hits from blast), but always got the same results.

mhekkel commented 11 months ago

First of all, use full paths in your config file. That may help. And then use the --verbose flag for the process step, it will tell you why PDB entries are skipped.

The pdb-id-list contains the ID's of PDB files to process, not the ones to skip.

Output from alphafill should be the same as the files you can download from the website for alphafill. So, a cif with ligands and a json with details about the bound ligands.

So, try the --verbose flag and if that doesn't help, post the output here so I can have a look.

drlemmus commented 11 months ago

Also the typical file organisation requires the two-letter directory names to be the middle two characters of the PDBid.

sadiogo commented 11 months ago

Okay, I figured out some things. I will explain everything I did because it might be important going forward.

  1. I used full paths in the conf file but obtained the same results. Furthermore, using verbose resulted in:
    
    Blasting:
    MASTRSNARLIRFGIFALVLIGCGYILTRGSSFQPPNYQQTQSPAAHEKQTGNVAAGGGAGSGSAGAQVPLGKNRGPIPKAIMGAGEGGSDAPVPQQDIPDSYTLNDKIKATFVTLARNSDLYSLAESIRHVEDRFNKKFHYDWVFLNDEEFNDEFKETVGSLVSGNTKFGLIPKEHWSYPPWIDQEKAALVREQMREKKIIYGHSESYRHMCRFESGFFWRQEILNDYDYYWRVEPDIKLYCDIDYDIFKWMKDNNKDYAFTISLPEYKETIPTLWDTTKEFIEKNPQYLAQNNLMDWVSDDKGQTYNGCHFWSNFEIGSLAFWRSEAYRKYFEHLDKAGGFFYERWGDAPVHSIAAALFLPREKIHFFEDVGYYHVPFTNCPVDKEVRKARNCNCDPNKDFTWRGYSCTTKYYTLNNFKRQKGWEKYTA

Found 22 hits pdb id: 1s4n chain id: A PDB file for 1s4n not found pdb id: 1s4n chain id: B PDB file for 1s4n not found pdb id: 1s4o chain id: A PDB file for 1s4o not found pdb id: 1s4o chain id: B PDB file for 1s4o not found pdb id: 1s4p chain id: A PDB file for 1s4p not found pdb id: 1s4p chain id: B PDB file for 1s4p not found pdb id: 7boo chain id: A PDB file for 7boo not found pdb id: 7boo chain id: B PDB file for 7boo not found pdb id: 7boo chain id: C PDB file for 7boo not found pdb id: 7boo chain id: D PDB file for 7boo not found pdb id: 7boo chain id: E PDB file for 7boo not found pdb id: 7boo chain id: F PDB file for 7boo not found pdb id: 7bop chain id: A PDB file for 7bop not found pdb id: 7bop chain id: B PDB file for 7bop not found pdb id: 7bop chain id: C PDB file for 7bop not found pdb id: 7bop chain id: D PDB file for 7bop not found pdb id: 7bop chain id: E PDB file for 7bop not found pdb id: 7bop chain id: F PDB file for 7bop not found pdb id: 5a07 chain id: A PDB file for 5a07 not found pdb id: 5a07 chain id: B PDB file for 5a07 not found pdb id: 5a08 chain id: A PDB file for 5a08 not found pdb id: 5a08 chain id: B PDB file for 5a08 not found

2. Then I reorganized the pdb folder so that two-letter directory names to be the middle two characters of the PDBid. Like this:

pdb/s4/1s4p/1s4p_final.cif pdb/s4/1s4n/1s4n_final.cif pdb/xj/7xjv/7xjv_final.cif pdb/mm/5mm0/5mm0_final.cif

This resulted in a different output from verbose:

Blasting: MASTRSNARLIRFGIFALVLIGCGYILTRGSSFQPPNYQQTQSPAAHEKQTGNVAAGGGAGSGSAGAQVPLGKNRGPIPKAIMGAGEGGSDAPVPQQDIPDSYTLNDKIKATFVTLARNSDLYSLAESIRHVEDRFNKKFHYDWVFLNDEEFNDEFKETVGSLVSGNTKFGLIPKEHWSYPPWIDQEKAALVREQMREKKIIYGHSESYRHMCRFESGFFWRQEILNDYDYYWRVEPDIKLYCDIDYDIFKWMKDNNKDYAFTISLPEYKETIPTLWDTTKEFIEKNPQYLAQNNLMDWVSDDKGQTYNGCHFWSNFEIGSLAFWRSEAYRKYFEHLDKAGGFFYERWGDAPVHSIAAALFLPREKIHFFEDVGYYHVPFTNCPVDKEVRKARNCNCDPNKDFTWRGYSCTTKYYTLNNFKRQKGWEKYTA

Found 22 hits pdb id: 1s4n chain id: A PDB file for 1s4n not found pdb id: 1s4n chain id: B PDB file for 1s4n not found pdb id: 1s4o chain id: A hsp, identity 0.60 length 345 Nr of missing CA: 10 translate A: (-1.66,2.15,0.31) translate B: (-22.79,-55.29,-31.51) rotation: 240.19 degrees rotation around axis (-0.65,-0.20,-0.74) RMSd: 4.39 translate A: (-8.01,5.73,-8.41) translate B: (-8.22,5.85,-9.04) rotation: 356.96 degrees rotation around axis (0.09,0.93,0.36) RMSd: 0.30 CCP4 monomers library not found, CLIBD_MON is not defined Could not locate compound MN in the CCD components file Could not locate compound MET in the CCD components file Could not locate compound ALA in the CCD components file Could not locate compound SER in the CCD components file Could not locate compound THR in the CCD components file Could not locate compound ARG in the CCD components file Could not locate compound ASN in the CCD components file Could not locate compound LEU in the CCD components file Could not locate compound ILE in the CCD components file Could not locate compound PHE in the CCD components file Could not locate compound GLY in the CCD components file Could not locate compound VAL in the CCD components file Could not locate compound CYS in the CCD components file Could not locate compound TYR in the CCD components file Could not locate compound GLN in the CCD components file Could not locate compound PRO in the CCD components file Could not locate compound HIS in the CCD components file Could not locate compound GLU in the CCD components file Could not locate compound LYS in the CCD components file Could not locate compound ASP in the CCD components file Could not locate compound TRP in the CCD components file Error when processing 1s4o for CAMNT1

Trying to insert unknown compound MN (not found in CCD)

Strangely, when I added the `pdb-id-list.txt` file, which also contains structure `1s4o`, the output from verbose became totally uninformative:

Blasting: MASTRSNARLIRFGIFALVLIGCGYILTRGSSFQPPNYQQTQSPAAHEKQTGNVAAGGGAGSGSAGAQVPLGKNRGPIPKAIMGAGEGGSDAPVPQQDIPDSYTLNDKIKATFVTLARNSDLYSLAESIRHVEDRFNKKFHYDWVFLNDEEFNDEFKETVGSLVSGNTKFGLIPKEHWSYPPWIDQEKAALVREQMREKKIIYGHSESYRHMCRFESGFFWRQEILNDYDYYWRVEPDIKLYCDIDYDIFKWMKDNNKDYAFTISLPEYKETIPTLWDTTKEFIEKNPQYLAQNNLMDWVSDDKGQTYNGCHFWSNFEIGSLAFWRSEAYRKYFEHLDKAGGFFYERWGDAPVHSIAAALFLPREKIHFFEDVGYYHVPFTNCPVDKEVRKARNCNCDPNKDFTWRGYSCTTKYYTLNNFKRQKGWEKYTA

Found 22 hits


So I went back to NOT using the `pdb-id-list.txt` file. Lastly, I checked whether not using full paths in the conf file would affect the verbose output. Which it didn't.

In conclusion:
1. Using full paths in the conf file doesn't matter, although it is definitely a good practice.
2. Folder organization must comply with the two-letter directory names being equivalent to the middle two characters of the PDBid.
3. Using the `pdb-id-list.txt` file negatively affects the verbose output for some reason.
4. I appear to be missing a "CCD components file". How can I correct this?
drlemmus commented 11 months ago

The CCD componenets file is the description of all compounds in the PDB. You can get is from https://files.wwpdb.org/pub/pdb/data/monomers/components.cif

sadiogo commented 11 months ago

Thank you! And where should I place the file for alphafill to find it?

sadiogo commented 11 months ago

Nevermind, I did some investigating and found a components.cif file with 0kb in usr/local/share/libcifpp. I assumed this file was the problem and substituted it with the one I downloaded in the above link. It Worked! Alphafill has transplanted the ligands and generated a comprehensive json file

Thank you very much for the help and prompt replies!

sadiogo commented 11 months ago

However, I did notice that when I use the pdb-id-list.txt file, I get no hits. Two of the proteins in the list should have been detected, since they display 60% identity.

Any idea why this happens?

sadiogo commented 11 months ago

I found out the problem. I saw that pdb-id-list.txt was working when only one PDB-ID was given. However, the moment I added a new one, I got no hits.

So instead of separating the pdb-ids with commas, like the output of prepare-pdb-list, I kept each instance in individual lines:

1S4O
1S4P
2BO7
2BO8
2C59
5GGI
5MM0
6YV9
7E9K
7XJV

Now alphafill is working correctly with pdb-id-list.txt.

So I think the output of prepare-pdb-list needs to be fixed, or the parsing of pdb-id-list.txt needs to take comma-separated values into account.

I also experimented with the min-hsp-identity parameter. I found that it affects which hits will be considered transplantable, but it has no effect whatsoever on the amount of hits found. It appears that the list of possible hits is "set in stone", and will always amount to 22 for the model I used as input. Is it supposed to work this way? Is there any other way I can modify the list of hits?

drlemmus commented 11 months ago

Did you check whether the hit you expect is actually found? It is based on the sequence identity and the length of the HSP.

mhekkel commented 11 months ago

When you built libcifpp, you had the option to download the CCD file. If you did so, the file should have been installed in the proper location. 

If you want to download manually, you could try to place it in /var/cache/libcifpp

Or if you installed libcifpp in /usr/local, there should be a directory called /usr/local/share/libcifpp with some cif dictionary files in it. You can also place the CCD file there. 


Van: Diogo Martins de Sá @.***> verzonden: donderdag 28 september 2023 16:51 Aan: PDB-REDO/alphafill Cc: Maarten L. Hekkelman; Comment Onderwerp: Re: [PDB-REDO/alphafill] Alphafold can't find pdb files albeit finding them when generating the pdb-id-list (Issue #37)

Thank you! And where should I place the file for alphafill to find it? — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

sadiogo commented 11 months ago

Did you check whether the hit you expect is actually found? It is based on the sequence identity and the length of the HSP.

It doesn't matter how much I change min-hsp-identity or min-alignment-length, I always get 22 hits. Both of these paramaters only influence on whether any of the 22 hits will be further selected for transplants (whether they meet the thresholds or not).

The hit I expect is not among the 22 hits, but is present in the fasta file. The 22 hits appear to have already been selected previously to the Blast step, as both parameters do not affect this number.

My question is: how are these 22 hit being selected?

Sorry for all these inquiries, I do appreciate the program very much.

drlemmus commented 11 months ago

Hmmm, you might be going beyond what our BLAST implementation likes. Which hit would you expect that is currently missing? If I blast your sequence against the PDB (on rcsb.org) I only get 1s4o and 1s4p as hits (from the ones on your list). NCBI BLAST on the PDB only returns 3 entries with an E-value below 0.01.

sadiogo commented 11 months ago

When you say "what our BLAST implementation likes", might this translate to saying alphafill has a hidden e-value cap? That would explain it. Could this cap become user-defined? Of course, the only reason to do this is because I'm providing a very reduced list of structure for alphafill to focus on. In a sense, I would be telling alphafill which results from blast, among many, are transplantable despite low identity and e-value. This would be at my own risk. I understand that superposition of the active sites and transplantation of the ligand my not be ideal, but I would make that judgement myself.

All of the structures in my list bear a rossman-fold subdomain which binds nucleosides, but I would expect at least 5ggi, 2bo8 and 7xjv to be high* e-value hits in the first 500 hits

drlemmus commented 11 months ago

They don't give low E-values, otherwise they would have turned up in an independent BLAST against the PDB. If you want to transplant compounds from models that only have a similar fold, I would suggest to do structural alignment (e.g. with something like SSM) and then manually transfer compounds of interest. You could then refine the ligand in with some sort of MM/MD calculation, but I would be extremely careful in drawing any conclusions. You really need experimental evidence to support your claims.

sadiogo commented 11 months ago

I meant high (bad) e-values, not low (good) ones.

Your suggestion is exactly what I did. Alphifill could only transplant GDP for me, but I wanted GDD (due to experimental evidence), which has an additional mannose attached. So I performed a superposition of the GDD-containing structure onto the alphafill output. However, this is not ideal because structural alignments focus on reducing the RMSD of the backbone chain, not the active site (residues +ligand). This can be mitigated by selecting just the interesting parts of the structure to perform the superposition (which is what alphafill does). Any how, I normally lean towards performing my own manual superposition of the ligand and active site residues using the edit mode in Pymol. Then I manually transfer the ligand and perform an energy minimization using YASARA.

This strategy has worked so far and alphafill has helped even further by transplanting a template ligand to which I can align other similar ligands. But I was trying to avoid perfoming my own manual superposition of the active sites. Since the superposition from alphafill is active site-oriented, I thought it might be able to acomplish this automatically. I am now experimenting with LigAlign. However, their python code has some issues which I'm still adjusting. I've already managed to make it work for identical ligands, but now I'm writing my own code to make it work for similar ligands.

mhekkel commented 11 months ago

Unfortunately for you, some of the blast parameters are hard coded at the moment. See: https://github.com/PDB-REDO/alphafill/blob/6ff64a53038b0aa37aafe72c9ae67c74f35b155f/src/alphafill.cpp#L401

I can make these parameters you can provide on the command line. I will have a look at that after the weekend. In the mean time you might be able to edit that single line in the code yourself. 


Van: Diogo Martins de Sá @.***> verzonden: donderdag 28 september 2023 19:21 Aan: PDB-REDO/alphafill Cc: Maarten L. Hekkelman; Comment Onderwerp: Re: [PDB-REDO/alphafill] Alphafill can't find pdb files albeit finding them when generating the pdb-id-list (Issue #37)

I found out the problem. I saw that pdb-id-list.txt was working when only one PDB-ID was given. However, the moment I added a new one, I got no hits. So instead of separating the pdb-ids with commas, like the output of prepare-pdb-list, I kept each instance in individual lines:

1S4O 1S4P 2BO7 2BO8 2C59 5GGI 5MM0 6YV9 7E9K 7XJV

Now alphafill is working correctly with pdb-id-list.txt. So I think the output of prepare-pdb-list needs to be fixed, or the parsing of pdb-id-list.txt needs to take comma-separated values into account. I also experimented with min-hsp-identity parameter. I found that it affects which hits will be considered transplantable, but it has no effect whatsoever on the amount of hits found. It appears that the list of possible hits is "set in stone", and will always amount to 22 for the model I used as input. Is it supposed to work this way? — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>