PDB-REDO / alphafill

AlphaFill is an algorithm based on sequence and structure similarity that “transplants” missing compounds to the AlphaFold models. By adding the molecular context to the protein structures, the models can be more easily appreciated in terms of function and structure integrity.
https://alphafill.eu
BSD 2-Clause "Simplified" License

cif input problems #12

Closed mf-rug closed 1 year ago

mf-rug commented 1 year ago

Hi, the tool works well with structures downloaded from the alphafold database, but I'm having trouble making it accept cif files that come out of other programs, and especially custom colabfold predictions. As was the case in issue #11, yasara is particularly problematic:

$ yasara -txt
>loadpdb 5m10,download=yes
 - Model 1 loaded as object 1 (5034 atoms)
>delres !protein
 - OK
>savecif 1, test
 - OK
>exit

$ alphafill --verbose test.cif test_fill.cif
In Category pdbx_struct_oper_list the following mandatory fields are missing: type
undefined Category pdbx_model
In Category atom_sites the following mandatory fields are missing: entry_id
In Category database_PDB_matrix the following mandatory fields are missing: entry_id
In Category symmetry the following mandatory fields are missing: entry_id
In Category cell the following mandatory fields are missing: entry_id
In Category refine the following mandatory fields are missing: entry_id, pdbx_refine_id
Invalid mmCIF file.
In Category pdbx_struct_oper_list the following mandatory fields are missing: type
undefined Category pdbx_model
In Category atom_sites the following mandatory fields are missing: entry_id
In Category database_PDB_matrix the following mandatory fields are missing: entry_id
In Category symmetry the following mandatory fields are missing: entry_id
In Category cell the following mandatory fields are missing: entry_id
In Category refine the following mandatory fields are missing: entry_id, pdbx_refine_id
Invalid mmCIF file.
Missing residue for atom THR A:1 N [A:6]

...<many more Missing residue lines>...

Missing residue for atom GLY A:529 O [A:534]
Segmentation fault (core dumped)

When I prepare the same pdb in chimerax, it works. However, any pdb from a custom colabfold prediction that is opened with either yasara or chimerax and then saved as cif again leads to the segmentation fault. I attach example files (original pdb from colabfold, and converted cifs).

testfiles.zip

It would be great if you could check whether your cif parser has to be so finicky, implement pdb to cif conversion within alphafill, or otherwise provide more precise instructions on the requirements for the cif input. Thanks!!

mhekkel commented 1 year ago

You could try pdb2cif, a program that is part of the cif-tools package (https://github.com/PDB-REDO/cif-tools)

A seg fault is not nice, I agree.

mhekkel commented 1 year ago

BTW, precise instructions for mmCIF files are dictated by the mmcif_pdbx dictionary provided by the PDB.

mf-rug commented 1 year ago

Hello, thanks for the reply. Unfortunately, using the same files as before:

$ pdb2cif test.pdb
2022.10.12 11:50 | [pdb2cif]    INFO    - pdb2cif version: 1.0.4
2022.10.12 11:50 | [structure]  INFO    - Parsing pdb file: test.pdb
2022.10.12 11:50 | [structure]  INFO    - Writing new mmCif file: test.cif
$ alphafill --verbose test.cif test_fill.cif
Error trying to load file "test.cif"
Duplicate Key violation, cat: atom_site values: {group_PDB:C, id:C, type_symbol:., label_atom_id:MET, label_alt_id:A, label_comp_id:1, label_asym_id:1, label_seq_id:-4.912, pdbx_PDB_ins_code:-6.002, Cartn_x:-38.617, Cartn_y:1.0000, Cartn_z:61.0600, B_iso_or_equiv:1, pdbx_formal_charge:METAA, auth_seq_id:C, auth_comp_id:1, auth_asym_id:ATOM, auth_atom_id:4, pdbx_PDB_model_num:C}

I understand from the Yasara developers that #11 was in fact due to some problem with the cif format on the alphafill side, are you certain this is not again the case here?

This is the first time I've had to deal with cif files and boy oh boy does this format have compatibility and standardization issues. It'd be a big help if developers can support users here and prevent unnecessary format checks. Is it really important what all these fields look like, surely the only important information in the file for the purpose at hand is the coordinates of the atoms?

drlemmus commented 1 year ago

Could you send us the pdb file (test.pdb) so we can check what is going on?

For the record: the whole point of the more formal description of the mmCIF format is to avoid the problems with file structure that were so common in PDB files (or things that were made to look like PDB files). Structure models contain much more information than a set of atomic coordinates, and this can be passed on if there is a formal data structure (the mmCIF dictionary) to which all developers stick. So as long as all programs write valid mmCIF, all programs with a proper mmCIF parser can read it. There are complications here, particularly because a lot of software is still written from a hand-waving PDB format perspective (who needs more than coordinates?), but a lot of developers are trying to get this sorted in order to achieve better compatibility than we ever had with the PDB format.

mf-rug commented 1 year ago

The files were attached to the comment above; here is the pdb again (I changed the file extension from pdb to txt to be able to upload it to github).

test.txt

Regarding the format, I don't want to argue with you (don't bite the hand that feeds you), but your argument of important other info enabled by cif is moot in this case since I am converting pdb to cif for the input and thus there can, by definition, not be any more information in the cif than the pdb already had. The output from ColabFold (and AF) is literally only the positions of atoms plus pLDDT in the b-factor column, and the resulting pdb, at least from CF, is thus also nothing but lines of ATOM .... Also keep in mind that I did use your pdb2cif converter to make the cif file that alphafill rejects.

PS: your tool is great and I appreciate your help on this issue immensely!

drlemmus commented 1 year ago

Thanks for the file. We don't deal with ColabFold yet, only with the AlphaFold DB, which has proper mmCIF files. The file you sent isn't a proper pdb file, so pdb2cif should not have converted it in the first place. The mmCIF file that came out looks garbled (the values in the keys are offset by two positions). You can try two things to see if they help: 1) Complete the pdb file such that it has at least a HEADER line and a CRYST1 line. You can model these after a valid file from the PDB (e.g. 1d3z). This should help with the conversion to some extent. 2) Either remove the MODEL and ENDMDL lines or correct the offset of the model number. PDB files are column formatted and justification matters. You can use 1d3z as an example again.
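A minimal sketch of the two suggestions above, in Python. It assumes the placeholder unit cell that appears later in this thread (1.000 1.000 1.000 90.00 90.00 90.00, P 1, Z = 1) and the fixed column layout of the PDB format, where the MODEL serial is right-justified in columns 11-14. The name `patch_pdb_lines` is hypothetical, and whether pdb2cif needs anything beyond these two records is not established here.

```python
# Placeholder CRYST1 record (fixed columns: cell lengths, angles, space group, Z).
PLACEHOLDER_CRYST1 = "CRYST1%9.3f%9.3f%9.3f%7.2f%7.2f%7.2f %-11s%4d" % (
    1.0, 1.0, 1.0, 90.0, 90.0, 90.0, "P 1", 1)


def patch_pdb_lines(lines):
    """Return a copy of `lines` with a CRYST1 record and re-justified MODEL serials."""
    patched = []
    for line in lines:
        text = line.rstrip("\n")
        if text.startswith("MODEL") and text[5:].strip().isdigit():
            # Rewrite the record so the serial ends in column 14.
            text = "MODEL     %4d" % int(text[5:])
        patched.append(text)
    if not any(text.startswith("CRYST1") for text in patched):
        # Insert after a leading HEADER record if present, otherwise at the top.
        position = 1 if patched and patched[0].startswith("HEADER") else 0
        patched.insert(position, PLACEHOLDER_CRYST1)
    return patched
```

Reading a file with `open(path).readlines()`, passing the list through `patch_pdb_lines`, and writing the result back joined with newlines would give a file to try with pdb2cif. The placeholder cell carries no physical meaning for a predicted model.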

mf-rug commented 1 year ago

Hi, thanks for the hints as to what's actually going wrong. Here are some more (unsuccessful) tests:

- Tried to open the questionable test.pdb in chimera and yasara and save it as pdb again, then run pdb2cif, then alphafill --> gave the `Duplicate key violation` error again. Files attached, but to give an idea:

$ grep -o '^[^ ]*' test_chimfix.pdb | uniq
SEQRES
HELIX
SHEET
ATOM
TER
ATOM
CONECT
END
$ grep -o '^[^ ]*' test_yasfix.pdb | uniq
REMARK
SEQRES
ATOM
TER
ATOM
TER
END

- Tried to manually adjust as you suggested: i) added HEADER and CRYST1 by copy-pasting from 1d3z and added more spaces to the MODEL line, ii) removed the MODEL and ENDMDL lines, iii) both

$ head -3 test?.pdb
==> test0.pdb <==
MODEL 1
ATOM 1 N MET B 1 -3.903 -8.208 -38.212 1.00 61.06 N
ATOM 2 CA MET B 1 -3.806 -6.782 -37.914 1.00 61.06 C

==> test1.pdb <==
HEADER HYDROLASE 01-OCT-99 1D3Z
CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1
MODEL 1

==> test2.pdb <==
ATOM 1 N MET B 1 -3.903 -8.208 -38.212 1.00 61.06 N
ATOM 2 CA MET B 1 -3.806 -6.782 -37.914 1.00 61.06 C
ATOM 3 C MET B 1 -4.912 -6.002 -38.617 1.00 61.06 C

==> test3.pdb <==
HEADER HYDROLASE 01-OCT-99 1D3Z
CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1
ATOM 1 N MET B 1 -3.903 -8.208 -38.212 1.00 61.06

--> all of them gave the `Duplicate key violation` error

- I further inspected the converted cif file from the original test.pdb and noticed a formatting issue I didn't see in 1d3z.cif:

# test.cif marks chains as A and B, or contains an additional "A", both highlighted with ^
ATOM 1 N N . MET A 1 1 ? -3.903 -8.208 -38.212 1.0000 61.0600 ? 1 METAA N 1
...
ATOM 3123 N N . MET B 2 1 ? -20.709 6.466 35.278 1.0000 60.5700 ? 1 METAB N 1
                    ^                                                  ^^

# 1d3z.cif
ATOM 1 N N . MET A 1 1 ? 52.923 -90.016 8.509 1.00 9.67 ? 1 MET A N 1
...
ATOM 1232 N N . MET A 1 1 ? 54.015 -88.009 9.498 1.00 9.67 ? 1 MET A N 2

# remove one or two of these issues
sed -E 's/ ([A-Z]{3})A(A|B) / \1 \2 /' test.cif > testA.cif
sed -E 's/. ([A-Z]{3}) (B) /. \1 A /' test.cif > testB.cif
sed -E 's/ ([A-Z]{3})A(A|B) / \1 \2 /' test.cif | sed -E 's/. ([A-Z]{3}) (B) /. \1 A /' > testAB.cif
sed -E 's/ ([A-Z]{3})A(A|B) / \1 \2 /' test.cif | sed -E 's/ ([A-Z]{3}) B / \1 A /g' > testAA.cif

--> testA.cif, testAB.cif, and testAA.cif give

missing mandatory field entry_id for Category struct_keywords
missing mandatory field entry_id for Category struct
missing mandatory field entry_id for Category exptl
missing mandatory field entry_id for Category symmetry
missing mandatory field entry_id for Category cell
Invalid mmCIF file.
missing mandatory field entry_id for Category struct_keywords
missing mandatory field entry_id for Category struct
missing mandatory field entry_id for Category exptl
missing mandatory field entry_id for Category symmetry
missing mandatory field entry_id for Category cell
Invalid mmCIF file.
Missing residue for atom MET A:1 N
...
3181337 Segmentation fault (core dumped)

while testB.cif gives `Duplicate Key violation`

Repeated all that with test1.pdb from the manual change (see above), same result.

All files are also in the zip. Any further ideas?

[tests.zip](https://github.com/PDB-REDO/alphafill/files/9786237/tests.zip)

drlemmus commented 1 year ago

Thanks for testing. This looks like a weird bug in pdb2cif. We will need to investigate at our side.

mhekkel commented 1 year ago

Hi mf-rug,

Couple of questions. The file test1.cif you sent us is not created with pdb2cif, unless you have a version of that software I'm not aware of. If I convert your test1.pdb file to a cif file using pdb2cif I get a perfectly valid mmCIF file that also is processed correctly with alphafill.

The test1.cif file you sent is completely invalid. Look at line 6358 and following, there's an entity section without data. Curious, but OK.

The real problem is the atom_site records. This file must have been created by simply chopping up a PDB file into atom_site records. However, the formatting goes wrong since the auth_asym_id is two characters and clashes with the auth_comp_id. There should have been a space between the two.

This means that the number of fields per record is one less than what the description of atom_site claims. That's why you get an error saying:

    parse error at line 6358: Unexpected token, expected Value but found LOOP

Now, about the crash in alphafill, that's a bug. Alphafill uses the pdbx_poly_seq_scheme category to find out what polymers are supposed to be present in a file. This category is missing in your cif files and alphafill did not handle this correctly.

I'll see if I can add some auto-correcting code to libcifpp to reconstruct missing pieces of data. But of course, feeding proper data, correctly formatted and containing all the required fields would help a lot.

best regards,

-maarten
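The two problems described in this comment can be checked before a file is handed to alphafill. The Python sketch below is only an illustration and not how libcifpp parses mmCIF: it assumes the atom_site loop holds one record per line and that no value contains a quoted space, so plain whitespace splitting is enough. The function names are hypothetical, and the 21 item names are the usual PDB atom_site layout, used here only to build a sample from the two records quoted earlier in the thread.

```python
ATOM_SITE_FIELDS = (
    "group_PDB id type_symbol label_atom_id label_alt_id label_comp_id "
    "label_asym_id label_entity_id label_seq_id pdbx_PDB_ins_code "
    "Cartn_x Cartn_y Cartn_z occupancy B_iso_or_equiv pdbx_formal_charge "
    "auth_seq_id auth_comp_id auth_asym_id auth_atom_id pdbx_PDB_model_num"
).split()

# One valid record (from 1d3z.cif) and one garbled record (from test.cif).
SAMPLE = (
    "loop_\n"
    + "".join("_atom_site.%s\n" % name for name in ATOM_SITE_FIELDS)
    + "ATOM 1 N N . MET A 1 1 ? 52.923 -90.016 8.509 1.00 9.67 ? 1 MET A N 1\n"
    + "ATOM 1 N N . MET A 1 1 ? -3.903 -8.208 -38.212 1.0000 61.0600 ? 1 METAA N 1\n"
    + "#\n"
)


def find_bad_atom_site_rows(cif_text):
    """Return (declared field count, [(line number, token count), ...])."""
    fields = []
    bad_rows = []
    in_atom_site = False
    for number, line in enumerate(cif_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("_atom_site."):
            fields.append(stripped.split()[0])
            in_atom_site = True
        elif in_atom_site:
            # A blank line, comment, new loop or new item ends the atom_site data.
            if not stripped or stripped.startswith(("#", "loop_", "_", "data_")):
                break
            tokens = stripped.split()
            if len(tokens) != len(fields):
                bad_rows.append((number, len(tokens)))
    return len(fields), bad_rows


def has_category(cif_text, category):
    """True if any item of `category` (e.g. pdbx_poly_seq_scheme) is declared."""
    prefix = "_" + category + "."
    return any(line.lstrip().startswith(prefix) for line in cif_text.splitlines())
```

On `SAMPLE`, the record containing `METAA` is reported with 20 tokens against 21 declared fields, and `has_category(SAMPLE, "pdbx_poly_seq_scheme")` is false, which mirrors the missing category mentioned above.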


mf-rug commented 1 year ago

Hi, mea culpa! 🫣 In my desperation I had already looked for a pdb to cif converter and installed another pdb2cif program I found on github. Upon your comment I installed cif-tools and thought I had removed the first one, but $ pdb2cif was still executing the old, and clearly non-functional, one. Sorry! I tried with the proper pdb2cif and it does indeed work with pdb files that have a manually inserted CRYST1 record. That's easy enough to address by including a sed in my pipeline, but you may want to consider changing this behaviour in pdb2cif, given that CRYST1 matters mainly for x-ray structures. Many in silico models may not contain it. I also did a quick check: this line is also not written by yasara or chimerax when structures are built from scratch (e.g. Yasara's BuildRes), nor when they open a pdb without it and then save it again. In any case, thanks a ton for your help in sorting this out.

EDIT:

I was too quick: alphafill does work now with the file from before, but it still fails for other files, this time with a different error:

$ head -2 mypdb.pdb
MODEL     1
ATOM      1  N   MET B   1      14.248  -9.467 -29.681  1.00 60.93           N

$ echo 'CRYST1    1.000    1.000    1.000  90.00  90.00  90.00 P 1           1' > newtest.pdb && cat mypdb.pdb >> newtest.pdb

$ head -3 newtest.pdb
CRYST1    1.000    1.000    1.000  90.00  90.00  90.00 P 1           1
MODEL     1
ATOM      1  N   MET B   1      14.248  -9.467 -29.681  1.00 60.93           N

$ pdb2cif newtest.pdb newtest.cif

$ alphafill --verbose newtest.cif newtest_filled.cif
missing mandatory field dict_version for Category audit_conform
Invalid mmCIF file.
missing mandatory field dict_version for Category audit_conform
Invalid mmCIF file.
CCP4 monomers library not found, CLIBD_MON is not defined
Blasting:
MSIIDLRSDTVTQPTAGMLEAMTAAATGDDVYGEDPTVNHLEAELARRLGFAEALFVPTGTMSNLLGLMAHCGRGDEYIV
GQQAHTYKYEGGGAAVLGSIQPQPLEVQADGSLDLDQVAAAIKPDDFHFARTRLLALENTMQGKVLPLSYLAQARAFTRE
HGLALHLDGARLYNAAVKLGVDARQITQHFDSVSVCLSKGLGAPVGSVLCGSADVIGKARRLRKMVGGGMRQAGILAAAG
LYALDQHVARLADDHANALLLADGLREAGYEVEPVQTNMVYVSMGNRAEALKAFASERGVKLSAAPRLRMVTHMDVDRAQ
IEQVIGTFIDFSRN

blast done in 7.0s seconds
Found 115 hits
pdb id: 4lnj    chain id: A
hsp, identity 0.54 length 323
alphafill: ~/alphafill/src/alphafill.cpp:838: int a_main(int, char* const*): Assertion `af_ix_trimmed[i] < af_res.size()' failed.
422170 Aborted                 (core dumped) 

I checked whether the Z value in the last column of the CRYST1 line is involved in it and changed it to 2 and 4, but same error. Adding HEADER doesn't make a difference either.

Here are the files: newtest.zip

mhekkel commented 1 year ago

closing this issue since the problem seems to be fixed.

mf-rug commented 1 year ago

I installed the latest release of alphafill, and this problem does not occur anymore. There is another issue that I will open separately.