Open lmiq opened 4 months ago
One way to store segment information would be to have a ChainID
-like type that stores chain_id
and segment_id
. Then you can index into a Model
with a ChainID
; the current behaviour of indexing with a string or character would access the ChainID
with that chain_id
and no segment_id
(or any segment_id
, if there is only one chain with that chain_id
).
We do currently read all fields from mmCIF files into a MMCIFDict
and use a subset to create the molecular representation. All fields are available in the MMCIFDict
, though, which is an advantage of that format: files can store arbitrary data in a codified way rather than using the columns left undefined in the PDB format. This ability to store arbitrary data is one of the reasons for the push to mmCIF and in my opinion is a good thing.
there are objective advantages of having a format for which one can copy and paste chunks of data
Yep. There is a table format (expand the "styling plans"), one that exists now and one which seems planned (not sure what that means...). But it clearly isn't universally used.
Found another detail about reading MD PDB files: in these files usually 4-letter residue names are accepted, using the character in position 21 as the fourth character (which does not have any function in the original PDB format). For instance, this is a very common way to represent water:
ATOM 4013 OH2 TIP3B 1 -28.223 19.920 -27.748 1.00 0.00 WAT1 O
ATOM 4014 H1 TIP3B 1 -27.453 20.358 -27.476 1.00 0.00 WAT1 H
ATOM 4015 H2 TIP3B 1 -27.834 19.111 -28.148 1.00 0.00 WAT1 H
where TIP3
is the residue name. Currently BioStructures
reads the residue name as TIP
.
This raises an additional complication because blank characters are currently meaningful as the residue or atom keys (IMO they should be stripped, but I don't know if this adds additional complications because of the variable length of each key).
To me that seems like a clear format violation.
I did consider whether to strip atom and residue names for storage, but decided against it as there are common cases like " CA "
and "CA "
representing C-alpha and calcium respectively. The getter functions atomname
and resname
have a strip
keyword argument, which is true
by default.
To me that seems like a clear format violation.
Yes, it is, but a common one. The question is should this package support that, or do we need another parser?
common cases like
" CA "
and"CA "
representing C-alpha and calcium respectively
Ough. I didn't know about that. I've ever seen Calcium as "CAL". Also do think that is standardized?
We could have it behind a flag I guess.
Atom and residue names don't seem to be well standardised but if you store the spacing you can at least write it out the same way it was read in.
Based on discussions here and on Slack a few changes have recently gone in:
ProteinStructure
has been renamed to MolecularStructure
.x
, x!
, etc.Now is a good time to mention any other breaking changes of interest as the next version will be v4 (I will unify the selection functions a bit first).
Regarding the heaviness of the package, for me using BioStructures
now takes 0.15 s on Julia 1.11.0-beta2. @time_imports using BioStructures
gives the following for me in case anyone has further suggestions.
1.1 ms TranscodingStreams
1.8 ms BioGenerics
0.2 ms PrecompileTools
5.8 ms BioSymbols
┌ 0.1 ms Zlib_jll.__init__()
89.3 ms Zlib_jll 97.10% compilation time
0.8 ms CodecZlib
3.7 ms Format
0.7 ms Serialization
1.3 ms MsgPack
0.3 ms MMTF
9.3 ms RecipesBase
┌ 44.1 ms STRIDE_jll.__init__() 99.53% compilation time (100% recompilation)
44.4 ms STRIDE_jll 98.86% compilation time (100% recompilation)
┌ 0.1 ms DSSP_jll.__init__()
0.4 ms DSSP_jll
0.6 ms Statistics
63.2 ms BioStructures
STRIDE_jll (and DSSP_jll) could be moved to extensions, I'm not sure how conventional that is though.
On Julia 1.10.3 Zlib_jll and STRIDE_jll take considerably less time to load, though the overall load time is longer (0.26 s) since SparseArrays is not a weak dependency in Statistics.
From the discussion here up to now, my impression is that we cannot escape having a new type (MDPDB
maybe) for reading the PDBs of MD simulations with their loose standards (particularly, now, the 4-letter residues), so I don´t think developments on that front will bring breaking changes.
The JLL load time is presumably something that should be fixed elsewhere. I wouldn't tie yourself in knots to work around the issue in this package.
Nice progress on decreasing load time!
One solution to the nonstandard PDB files would be to have an additional data type that serves the equivalent job of MMCIFDict
(maybe PDBFile
) that simply represents what data is present in the file, regardless of other considerations. That would allow reading and working with these files, even if they are not necessarily considered correct.
When trying to read this file (and other similar ones), I get:
This is one example VMD pdb file.
vmd.pdb.zip