failing to read PDB files generated by VMD

lmiq commented 4 months ago

When trying to read this file (and other similar ones), I get:

julia> read("vmd.pdb", BioStructures.PDB)
ERROR: Two copies of the same atom have the same alternative location ID. Existing atom:
Atom N with serial 1, coordinates [-41.156, -40.019, 21.411]
New atom record to add:
AtomRecord N with serial 1501, coordinates [-42.523, -41.722, -19.881]

This is one example VMD pdb file.

vmd.pdb.zip

jgreener64 commented 4 months ago

One way to store segment information would be to have a ChainID-like type that stores chain_id and segment_id. Then you can index into a Model with a ChainID; the current behaviour of indexing with a string or character would access the ChainID with that chain_id and no segment_id (or any segment_id, if there is only one chain with that chain_id).

We do currently read all fields from mmCIF files into a MMCIFDict and use a subset to create the molecular representation. All fields are available in the MMCIFDict, though, which is an advantage of that format: files can store arbitrary data in a codified way rather than using the columns left undefined in the PDB format. This ability to store arbitrary data is one of the reasons for the push to mmCIF and in my opinion is a good thing.

timholy commented 4 months ago

there are objective advantages of having a format for which one can copy and paste chunks of data

Yep. There is a table format (expand the "styling plans"), one that exists now and one which seems planned (not sure what that means...). But it clearly isn't universally used.

lmiq commented 4 months ago

Found another detail about reading MD PDB files: in these files usually 4-letter residue names are accepted, using the character in position 21 as the fourth character (which does not have any function in the original PDB format). For instance, this is a very common way to represent water:

ATOM   4013  OH2 TIP3B   1     -28.223  19.920 -27.748  1.00  0.00      WAT1 O
ATOM   4014  H1  TIP3B   1     -27.453  20.358 -27.476  1.00  0.00      WAT1 H
ATOM   4015  H2  TIP3B   1     -27.834  19.111 -28.148  1.00  0.00      WAT1 H

where TIP3 is the residue name. Currently BioStructures reads the residue name as TIP.

This raises an additional complication because blank characters are currently meaningful as the residue or atom keys (IMO they should be stripped, but I don't know if this adds additional complications because of the variable length of each key).

jgreener64 commented 4 months ago

To me that seems like a clear format violation.

I did consider whether to strip atom and residue names for storage, but decided against it as there are common cases like " CA " and "CA " representing C-alpha and calcium respectively. The getter functions atomname and resname have a strip keyword argument, which is true by default.

lmiq commented 4 months ago

To me that seems like a clear format violation.

Yes, it is, but a common one. The question is should this package support that, or do we need another parser?

common cases like " CA " and "CA " representing C-alpha and calcium respectively

Ough. I didn't know about that. I've ever seen Calcium as "CAL". Also do think that is standardized?

jgreener64 commented 4 months ago

We could have it behind a flag I guess.

Atom and residue names don't seem to be well standardised but if you store the spacing you can at least write it out the same way it was read in.

jgreener64 commented 4 months ago

Based on discussions here and on Slack a few changes have recently gone in:

Selection strings from @lmiq.
Faster PDB parsing from @timholy.
ProteinStructure has been renamed to MolecularStructure.
Not exporting x, x!, etc.
The BioSequences.jl and BioAlignments.jl dependencies have been moved to extensions.

Now is a good time to mention any other breaking changes of interest as the next version will be v4 (I will unify the selection functions a bit first).

Regarding the heaviness of the package, for me using BioStructures now takes 0.15 s on Julia 1.11.0-beta2. @time_imports using BioStructures gives the following for me in case anyone has further suggestions.

      1.1 ms  TranscodingStreams
      1.8 ms  BioGenerics
      0.2 ms  PrecompileTools
      5.8 ms  BioSymbols
               ┌ 0.1 ms Zlib_jll.__init__() 
     89.3 ms  Zlib_jll 97.10% compilation time
      0.8 ms  CodecZlib
      3.7 ms  Format
      0.7 ms  Serialization
      1.3 ms  MsgPack
      0.3 ms  MMTF
      9.3 ms  RecipesBase
               ┌ 44.1 ms STRIDE_jll.__init__() 99.53% compilation time (100% recompilation)
     44.4 ms  STRIDE_jll 98.86% compilation time (100% recompilation)
               ┌ 0.1 ms DSSP_jll.__init__() 
      0.4 ms  DSSP_jll
      0.6 ms  Statistics
     63.2 ms  BioStructures

STRIDE_jll (and DSSP_jll) could be moved to extensions, I'm not sure how conventional that is though.

On Julia 1.10.3 Zlib_jll and STRIDE_jll take considerably less time to load, though the overall load time is longer (0.26 s) since SparseArrays is not a weak dependency in Statistics.

lmiq commented 4 months ago

From the discussion here up to now, my impression is that we cannot escape having a new type (MDPDB maybe) for reading the PDBs of MD simulations with their loose standards (particularly, now, the 4-letter residues), so I don´t think developments on that front will bring breaking changes.

timholy commented 3 months ago

The JLL load time is presumably something that should be fixed elsewhere. I wouldn't tie yourself in knots to work around the issue in this package.

Nice progress on decreasing load time!

marcom commented 3 months ago

One solution to the nonstandard PDB files would be to have an additional data type that serves the equivalent job of MMCIFDict (maybe PDBFile) that simply represents what data is present in the file, regardless of other considerations. That would allow reading and working with these files, even if they are not necessarily considered correct.

BioJulia / BioStructures.jl

failing to read PDB files generated by VMD #48