PDB-REDO / dssp

Application to assign secondary structure to proteins
BSD 2-Clause "Simplified" License

AlphaFold pdb files fail with "Expected record CRYST1 but found ATOM" #49

Closed rcedgar closed 1 year ago

rcedgar commented 1 year ago

This error is triggered by AlphaFold predicted structures, which do not have a CRYST1 record. IMHO this check is overly stringent; it should be a warning or not checked at all. I realize I can work around this by inserting a fake CRYST1 record, but this is very expensive for high-throughput applications: I'm dealing with 10^5 or 10^6 structures per batch.

I tried to fix this myself but got stuck trying to build libcifpp. I was unable to resolve this error:

CMake Error at /usr/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find ZLIB (missing: ZLIB_LIBRARY ZLIB_INCLUDE_DIR)

I tried installing zlib through apt (on Ubuntu) and also downloading and re-building from the zlib source then setting environment variables ZLIB_LIBRARY and ZLIB_INCLUDE_DIR, without success.

drlemmus commented 1 year ago

We are already quite forgiving with respect to the required records, see https://www.wwpdb.org/documentation/file-format-content/format33/sect1.html#Order

Assuming you made the models yourself, the overhead of adding the CRYST1 record is minute compared to the amount of CPU time needed to create 10^5 models. If you got the models elsewhere, then it would be good to ask the model creator to provide valid PDB or mmCIF files.
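
For reference, the workaround discussed here (adding a placeholder CRYST1 record) could look roughly like the Python sketch below. It assumes the only problem with the file is the missing CRYST1 line, and it uses the placeholder cell that the PDB format describes for non-crystallographic structures (a = b = c = 1.0, all angles 90, space group P 1, Z = 1). The function name `add_dummy_cryst1` is hypothetical and not part of DSSP or libcifpp, and whether DSSP accepts the patched output has not been verified here.

```python
# Placeholder CRYST1 record for non-crystallographic models, built with the
# fixed-column layout of the PDB format (cell lengths 9.3f, angles 7.2f,
# space group left-justified in 11 columns, Z value right-justified in 4).
DUMMY_CRYST1 = (
    f"CRYST1{1.0:9.3f}{1.0:9.3f}{1.0:9.3f}"
    f"{90.0:7.2f}{90.0:7.2f}{90.0:7.2f} {'P 1':<11}{1:4d}"
)

# Record names that mark the start of the coordinate section.
COORD_RECORDS = ("ATOM  ", "HETATM", "MODEL ")


def add_dummy_cryst1(pdb_text: str) -> str:
    """Return pdb_text with a placeholder CRYST1 line inserted before the
    first coordinate record, unless a CRYST1 record is already present.
    (Hypothetical helper, for illustration only.)"""
    lines = pdb_text.splitlines(keepends=True)
    if any(line.startswith("CRYST1") for line in lines):
        return pdb_text  # nothing to do
    for i, line in enumerate(lines):
        if line.startswith(COORD_RECORDS):
            lines.insert(i, DUMMY_CRYST1 + "\n")
            return "".join(lines)
    return pdb_text  # no coordinate records found; leave the file untouched


if __name__ == "__main__":
    example = (
        "HEADER    EXAMPLE MODEL\n"
        "ATOM      1  N   MET A   1      -0.500   1.200   3.400  1.00 50.00           N\n"
        "END\n"
    )
    print(add_dummy_cryst1(example), end="")
```

The function works on text in memory, so for a large batch it can be applied to each file as it is read, at a cost of one pass over the lines per structure.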

rcedgar commented 1 year ago

There are now many millions of pdb files for predicted protein structures available for download. These are created by AlphaFold, RosettaFold and other AI-based structure predictors, and they are a major new resource for biologists. They are not compatible with your code, and there is no realistic chance that AF, RF... will fix their code or update their databases to be compatible with yours.

rcedgar commented 1 year ago

It is a trivial fix to your code to support an assumed default set of CRYST1 parameters -- I was planning to do this and submit a PR, but I could not build from source as noted at the top of this issue. Perhaps you could help me fix the build problem?

drlemmus commented 1 year ago

We will look at your build problem. Note that the available AlphaFold models are provided in mmCIF-modelCIF in AFDB and they are already annotated by DSSP. All the major model providers (except Meta) support modelCIF and a little bird told me there will be a paper out about that soon.

mhekkel commented 1 year ago

> I tried installing zlib through apt (on Ubuntu) and also downloading and re-building from the zlib source then setting environment variables ZLIB_LIBRARY and ZLIB_INCLUDE_DIR, without success.

But did you install zlib1g-dev, or just zlib? The former is the development package which is required to develop software using zlib.

rcedgar commented 1 year ago

Hi guys -- Thanks for the quick responses, and apologies for multiple errors on my part. I am actually getting big batches of pdb files from collaborators. I had assumed these were coming from the public databases but in fact it seems they were generated by colabfold, and they may have undergone intermediate processing (I'm not sure about the details). And yes, I forgot to install the -dev version of zlib. I'm all set now, appreciate the help.