Closed sadiogo closed 9 months ago
First of all, use full paths in your config file. That may help. And then use the --verbose flag for the process step, it will tell you why PDB entries are skipped.
The pdb-id-list contains the ID's of PDB files to process, not the ones to skip.
Output from alphafill should be the same as the files you can download from the website for alphafill. So, a cif with ligands and a json with details about the bound ligands.
So, try the --verbose flag and if that doesn't help, post the output here so I can have a look.
Also the typical file organisation requires the two-letter directory names to be the middle two characters of the PDBid.
Okay, I figured out some things. I will explain everything I did because it might be important going forward.
Blasting:
MASTRSNARLIRFGIFALVLIGCGYILTRGSSFQPPNYQQTQSPAAHEKQTGNVAAGGGAGSGSAGAQVPLGKNRGPIPKAIMGAGEGGSDAPVPQQDIPDSYTLNDKIKATFVTLARNSDLYSLAESIRHVEDRFNKKFHYDWVFLNDEEFNDEFKETVGSLVSGNTKFGLIPKEHWSYPPWIDQEKAALVREQMREKKIIYGHSESYRHMCRFESGFFWRQEILNDYDYYWRVEPDIKLYCDIDYDIFKWMKDNNKDYAFTISLPEYKETIPTLWDTTKEFIEKNPQYLAQNNLMDWVSDDKGQTYNGCHFWSNFEIGSLAFWRSEAYRKYFEHLDKAGGFFYERWGDAPVHSIAAALFLPREKIHFFEDVGYYHVPFTNCPVDKEVRKARNCNCDPNKDFTWRGYSCTTKYYTLNNFKRQKGWEKYTA
Found 22 hits pdb id: 1s4n chain id: A PDB file for 1s4n not found pdb id: 1s4n chain id: B PDB file for 1s4n not found pdb id: 1s4o chain id: A PDB file for 1s4o not found pdb id: 1s4o chain id: B PDB file for 1s4o not found pdb id: 1s4p chain id: A PDB file for 1s4p not found pdb id: 1s4p chain id: B PDB file for 1s4p not found pdb id: 7boo chain id: A PDB file for 7boo not found pdb id: 7boo chain id: B PDB file for 7boo not found pdb id: 7boo chain id: C PDB file for 7boo not found pdb id: 7boo chain id: D PDB file for 7boo not found pdb id: 7boo chain id: E PDB file for 7boo not found pdb id: 7boo chain id: F PDB file for 7boo not found pdb id: 7bop chain id: A PDB file for 7bop not found pdb id: 7bop chain id: B PDB file for 7bop not found pdb id: 7bop chain id: C PDB file for 7bop not found pdb id: 7bop chain id: D PDB file for 7bop not found pdb id: 7bop chain id: E PDB file for 7bop not found pdb id: 7bop chain id: F PDB file for 7bop not found pdb id: 5a07 chain id: A PDB file for 5a07 not found pdb id: 5a07 chain id: B PDB file for 5a07 not found pdb id: 5a08 chain id: A PDB file for 5a08 not found pdb id: 5a08 chain id: B PDB file for 5a08 not found
2. Then I reorganized the pdb folder so that two-letter directory names to be the middle two characters of the PDBid. Like this:
pdb/s4/1s4p/1s4p_final.cif pdb/s4/1s4n/1s4n_final.cif pdb/xj/7xjv/7xjv_final.cif pdb/mm/5mm0/5mm0_final.cif
This resulted in a different output from verbose:
Blasting: MASTRSNARLIRFGIFALVLIGCGYILTRGSSFQPPNYQQTQSPAAHEKQTGNVAAGGGAGSGSAGAQVPLGKNRGPIPKAIMGAGEGGSDAPVPQQDIPDSYTLNDKIKATFVTLARNSDLYSLAESIRHVEDRFNKKFHYDWVFLNDEEFNDEFKETVGSLVSGNTKFGLIPKEHWSYPPWIDQEKAALVREQMREKKIIYGHSESYRHMCRFESGFFWRQEILNDYDYYWRVEPDIKLYCDIDYDIFKWMKDNNKDYAFTISLPEYKETIPTLWDTTKEFIEKNPQYLAQNNLMDWVSDDKGQTYNGCHFWSNFEIGSLAFWRSEAYRKYFEHLDKAGGFFYERWGDAPVHSIAAALFLPREKIHFFEDVGYYHVPFTNCPVDKEVRKARNCNCDPNKDFTWRGYSCTTKYYTLNNFKRQKGWEKYTA
Found 22 hits pdb id: 1s4n chain id: A PDB file for 1s4n not found pdb id: 1s4n chain id: B PDB file for 1s4n not found pdb id: 1s4o chain id: A hsp, identity 0.60 length 345 Nr of missing CA: 10 translate A: (-1.66,2.15,0.31) translate B: (-22.79,-55.29,-31.51) rotation: 240.19 degrees rotation around axis (-0.65,-0.20,-0.74) RMSd: 4.39 translate A: (-8.01,5.73,-8.41) translate B: (-8.22,5.85,-9.04) rotation: 356.96 degrees rotation around axis (0.09,0.93,0.36) RMSd: 0.30 CCP4 monomers library not found, CLIBD_MON is not defined Could not locate compound MN in the CCD components file Could not locate compound MET in the CCD components file Could not locate compound ALA in the CCD components file Could not locate compound SER in the CCD components file Could not locate compound THR in the CCD components file Could not locate compound ARG in the CCD components file Could not locate compound ASN in the CCD components file Could not locate compound LEU in the CCD components file Could not locate compound ILE in the CCD components file Could not locate compound PHE in the CCD components file Could not locate compound GLY in the CCD components file Could not locate compound VAL in the CCD components file Could not locate compound CYS in the CCD components file Could not locate compound TYR in the CCD components file Could not locate compound GLN in the CCD components file Could not locate compound PRO in the CCD components file Could not locate compound HIS in the CCD components file Could not locate compound GLU in the CCD components file Could not locate compound LYS in the CCD components file Could not locate compound ASP in the CCD components file Could not locate compound TRP in the CCD components file Error when processing 1s4o for CAMNT1
Trying to insert unknown compound MN (not found in CCD)
Strangely, when I added the `pdb-id-list.txt` file, which also contains structure `1s4o`, the output from verbose became totally uninformative:
Blasting: MASTRSNARLIRFGIFALVLIGCGYILTRGSSFQPPNYQQTQSPAAHEKQTGNVAAGGGAGSGSAGAQVPLGKNRGPIPKAIMGAGEGGSDAPVPQQDIPDSYTLNDKIKATFVTLARNSDLYSLAESIRHVEDRFNKKFHYDWVFLNDEEFNDEFKETVGSLVSGNTKFGLIPKEHWSYPPWIDQEKAALVREQMREKKIIYGHSESYRHMCRFESGFFWRQEILNDYDYYWRVEPDIKLYCDIDYDIFKWMKDNNKDYAFTISLPEYKETIPTLWDTTKEFIEKNPQYLAQNNLMDWVSDDKGQTYNGCHFWSNFEIGSLAFWRSEAYRKYFEHLDKAGGFFYERWGDAPVHSIAAALFLPREKIHFFEDVGYYHVPFTNCPVDKEVRKARNCNCDPNKDFTWRGYSCTTKYYTLNNFKRQKGWEKYTA
Found 22 hits
So I went back to NOT using the `pdb-id-list.txt` file. Lastly, I checked whether not using full paths in the conf file would affect the verbose output. Which it didn't.
In conclusion:
1. Using full paths in the conf file doesn't matter, although it is definitely a good practice.
2. Folder organization must comply with the two-letter directory names being equivalent to the middle two characters of the PDBid.
3. Using the `pdb-id-list.txt` file negatively affects the verbose output for some reason.
4. I appear to be missing a "CCD components file". How can I correct this?
The CCD componenets file is the description of all compounds in the PDB. You can get is from https://files.wwpdb.org/pub/pdb/data/monomers/components.cif
Thank you! And where should I place the file for alphafill to find it?
Nevermind, I did some investigating and found a components.cif file with 0kb in usr/local/share/libcifpp
. I assumed this file was the problem and substituted it with the one I downloaded in the above link. It Worked! Alphafill has transplanted the ligands and generated a comprehensive json file
Thank you very much for the help and prompt replies!
However, I did notice that when I use the pdb-id-list.txt
file, I get no hits. Two of the proteins in the list should have been detected, since they display 60% identity.
Any idea why this happens?
I found out the problem. I saw that pdb-id-list.txt
was working when only one PDB-ID was given. However, the moment I added a new one, I got no hits.
So instead of separating the pdb-ids with commas, like the output of prepare-pdb-list
, I kept each instance in individual lines:
1S4O
1S4P
2BO7
2BO8
2C59
5GGI
5MM0
6YV9
7E9K
7XJV
Now alphafill is working correctly with pdb-id-list.txt
.
So I think the output of prepare-pdb-list
needs to be fixed, or the parsing of pdb-id-list.txt
needs to take comma-separated values into account.
I also experimented with the min-hsp-identity
parameter. I found that it affects which hits will be considered transplantable, but it has no effect whatsoever on the amount of hits found. It appears that the list of possible hits is "set in stone", and will always amount to 22 for the model I used as input. Is it supposed to work this way? Is there any other way I can modify the list of hits?
Did you check whether the hit you expect is actually found? It is based on the sequence identity and the length of the HSP.
When you built libcifpp, you had the option to download the CCD file. If you did so, the file should have been installed in the proper location.
If you want to download manually, you could try to place it in /var/cache/libcifpp
Or if you installed libcifpp in /usr/local, there should be a directory called /usr/local/share/libcifpp with some cif dictionary files in it. You can also place the CCD file there.
Van: Diogo Martins de Sá @.***> verzonden: donderdag 28 september 2023 16:51 Aan: PDB-REDO/alphafill Cc: Maarten L. Hekkelman; Comment Onderwerp: Re: [PDB-REDO/alphafill] Alphafold can't find pdb files albeit finding them when generating the pdb-id-list (Issue #37)
Thank you! And where should I place the file for alphafill to find it? — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>
Did you check whether the hit you expect is actually found? It is based on the sequence identity and the length of the HSP.
It doesn't matter how much I change min-hsp-identity
or min-alignment-length
, I always get 22 hits. Both of these paramaters only influence on whether any of the 22 hits will be further selected for transplants (whether they meet the thresholds or not).
The hit I expect is not among the 22 hits, but is present in the fasta file. The 22 hits appear to have already been selected previously to the Blast step, as both parameters do not affect this number.
My question is: how are these 22 hit being selected?
Sorry for all these inquiries, I do appreciate the program very much.
Hmmm, you might be going beyond what our BLAST implementation likes. Which hit would you expect that is currently missing? If I blast your sequence against the PDB (on rcsb.org) I only get 1s4o and 1s4p as hits (from the ones on your list). NCBI BLAST on the PDB only returns 3 entries with an E-value below 0.01.
When you say "what our BLAST implementation likes", might this translate to saying alphafill has a hidden e-value cap? That would explain it. Could this cap become user-defined? Of course, the only reason to do this is because I'm providing a very reduced list of structure for alphafill to focus on. In a sense, I would be telling alphafill which results from blast, among many, are transplantable despite low identity and e-value. This would be at my own risk. I understand that superposition of the active sites and transplantation of the ligand my not be ideal, but I would make that judgement myself.
All of the structures in my list bear a rossman-fold subdomain which binds nucleosides, but I would expect at least 5ggi, 2bo8 and 7xjv to be high* e-value hits in the first 500 hits
They don't give low E-values, otherwise they would have turned up in an independent BLAST against the PDB. If you want to transplant compounds from models that only have a similar fold, I would suggest to do structural alignment (e.g. with something like SSM) and then manually transfer compounds of interest. You could then refine the ligand in with some sort of MM/MD calculation, but I would be extremely careful in drawing any conclusions. You really need experimental evidence to support your claims.
I meant high (bad) e-values, not low (good) ones.
Your suggestion is exactly what I did. Alphifill could only transplant GDP for me, but I wanted GDD (due to experimental evidence), which has an additional mannose attached. So I performed a superposition of the GDD-containing structure onto the alphafill output. However, this is not ideal because structural alignments focus on reducing the RMSD of the backbone chain, not the active site (residues +ligand). This can be mitigated by selecting just the interesting parts of the structure to perform the superposition (which is what alphafill does). Any how, I normally lean towards performing my own manual superposition of the ligand and active site residues using the edit mode in Pymol. Then I manually transfer the ligand and perform an energy minimization using YASARA.
This strategy has worked so far and alphafill has helped even further by transplanting a template ligand to which I can align other similar ligands. But I was trying to avoid perfoming my own manual superposition of the active sites. Since the superposition from alphafill is active site-oriented, I thought it might be able to acomplish this automatically. I am now experimenting with LigAlign. However, their python code has some issues which I'm still adjusting. I've already managed to make it work for identical ligands, but now I'm writing my own code to make it work for similar ligands.
Unfortunately for you, some of the blast parameters are hard coded at the moment. See: https://github.com/PDB-REDO/alphafill/blob/6ff64a53038b0aa37aafe72c9ae67c74f35b155f/src/alphafill.cpp#L401
I can make these parameters you can provide on the command line. I will have a look at that after the weekend. In the mean time you might be able to edit that single line in the code yourself.
Van: Diogo Martins de Sá @.***> verzonden: donderdag 28 september 2023 19:21 Aan: PDB-REDO/alphafill Cc: Maarten L. Hekkelman; Comment Onderwerp: Re: [PDB-REDO/alphafill] Alphafill can't find pdb files albeit finding them when generating the pdb-id-list (Issue #37)
I found out the problem. I saw that pdb-id-list.txt was working when only one PDB-ID was given. However, the moment I added a new one, I got no hits. So instead of separating the pdb-ids with commas, like the output of prepare-pdb-list, I kept each instance in individual lines:
1S4O 1S4P 2BO7 2BO8 2C59 5GGI 5MM0 6YV9 7E9K 7XJV
Now alphafill is working correctly with pdb-id-list.txt. So I think the output of prepare-pdb-list needs to be fixed, or the parsing of pdb-id-list.txt needs to take comma-separated values into account. I also experimented with min-hsp-identity parameter. I found that it affects which hits will be considered transplantable, but it has no effect whatsoever on the amount of hits found. It appears that the list of possible hits is "set in stone", and will always amount to 22 for the model I used as input. Is it supposed to work this way? — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>
Hello!
I`ve been having some trouble running alphafill. I've managed to create the conf file and correctly provide paths for the ligands, pdb-dir and pdb-fasta files. It looks like this:
While the fasta file is the original seq file from https://pdb-redo.eu/others/pdbredo_seqdb.txt, my PDB database contains only 10 cif files downloaded manually from PDB-REDO. This database was organized as previously mentioned here, wherein cif files were placed in 4-letter folders (PDB-ID) and these folders were all placed in a 2-character folder named
00
:When I run
alphafill process input.cif output.cif
I get the following message:pdbredo-seqdb.txt
fasta file, thereby returning 7 hits with more than 25% identity, but was unable to find the structure file for these 7 hits. Is this correct?Then I decided to create a pdb-id-list file by running
alphafill prepare-pdb-list > pdb-id-list.txt
. This correctly generated a pdb-id-list file containing the 10 proteins in my pdb folder:Then, when I run
alphafill process input.cif output.cif --pdb-id-list=pdb-id-list.txt
, I don't get the previous error messages, but the output.cif file contains none of the ligands and the json file shows no hits:I have three further questions:
When reading
alphafill -h
, it states that pdb-id-list will contain the proteins that have transplantable ligands. However, this is in contradiction with what is written in the README.md file, which states that the list will contain proteins "that can be skipped since they do not contain interesting ligands". Which one is true?What would be the successfull outputs from alphafill? For example, the json file would contain the hits with unique transplantable ligands and the cif file would depict my protein with all the ligands bound?
Any idea why running
alphafill process
without the pdb-id-list results in the "PDB file not found" traceback? For example, the structure of 1s4p is available in the pdb folder but alphafill stated it could not find it.I have also tried changing the
min-hsp-identity
setting to either 0.1 or 0.9 to see if different proteins would pop up in the "PDB file not found" messages (assuming those were the positive hits from blast), but always got the same results.