failing to read PDB files generated by VMD

lmiq commented 1 month ago

When trying to read this file (and other similar ones), I get:

julia> read("vmd.pdb", BioStructures.PDB)
ERROR: Two copies of the same atom have the same alternative location ID. Existing atom:
Atom N with serial 1, coordinates [-41.156, -40.019, 21.411]
New atom record to add:
AtomRecord N with serial 1501, coordinates [-42.523, -41.722, -19.881]

This is one example VMD pdb file.

vmd.pdb.zip

jgreener64 commented 1 month ago

The offending PDB lines are

ATOM      1  N   POPE    1     -41.156 -40.019  21.411  1.00  0.00      L11  N

and

ATOM   1501  N   POPE    1     -42.523 -41.722 -19.881  1.00  0.00      L21  N

which both try to put an atom with name N onto residue 1 of the chain with empty chain ID.

We treat this defensively since there are no or very few cases in the PDB that have such duplicate atoms. I reported a handful of cases to the PDB a while ago and they updated the records, so I believe it is considered a format violation.

In this case you could use different chain IDs (the empty column 22) or keep incrementing the residue number.

timholy commented 1 month ago

Since this file was written by VMD, I assume this should be reported to them? @lmiq, can you do that?

lmiq commented 1 month ago

Well, there is nothing wrong with that file for the use it was designed for, so I do not think there's anything to report there, upstream. The parsing of the residues in that file is simply done by incrementing the residue counter when the residue index changes (for more or less). Having limitations in the number of residues per chain, or number of chains, etc, is something that cannot be important in MD simulations PDB files.

This is the choice to be made here: have or not the possibility of parsing non-standard files to some degree.

I perfectly understand if BioStructures keeps strictly adhering to the standard, but that would not allow us using it in the most broad context of MD simulations.

Note that this limitation, specifically, is associated to trying to read the data into the hierarchical structure, so in some sense this is related to that initial choice of representation.

Also, note the L11 and L21 segment identifiers. These are used as the "top" level of hierarchy in the context of VMD and some simulations packages, but are barely used for reporting experimental structures.

timholy commented 1 month ago

If it's true that it violates https://www.wwpdb.org/documentation/file-format (I haven't checked), I'd say that's clearly a problem. It's not OK that there's a workaround. It introduces ambiguities and issues, precisely as being discussed here.

But if the format is ambiguous, then that's another issue entirely.

jgreener64 commented 1 month ago

It's certainly challenging to balance the considerations.

Note that this limitation, specifically, is associated to trying to read the data into the hierarchical structure, so in some sense this is related to that initial choice of representation.

This is the key, BioStructures guarantees that every atom has "meaning" about what it is (i.e. on which residue and chain), which is useful in many contexts (such as file interchange) but in turn necessitates some complexities in representation and strictness in parsing.

The only way the original case could work currently is if the existing atom is overwritten (or the new atom ignored) with a warning, which is unlikely to be desired behaviour even if it runs without error.

I guess philosophically the aim of BioStructures is to not so much to read structural files, but to represent unambiguously the molecules within them. This affects other design decisions, such as why countatoms will treat disordered versions of the same atom as one (unless expand_disordered is used).

But if the format is ambiguous, then that's another issue entirely.

Sadly, this seems to be the case. There is a lot of documentation on the column format, but as far as I know not so much on the row format (e.g. what duplicates mean). In general I prefer it when other tool authors use MODEL blocks to distinguish conformations in a MD trajectory though. That works well in BioStructures.

lmiq commented 1 month ago

From the description, that file does not adhere to the standard because:

- Non-blank alphanumerical character is used for chain identifier.

and there the chain identifier is blank. But the standard does not say anything about two residues in the same chain having the same number (which is actually the issue here). For actual bio-structures it is just assumed that that does not make sense. But when part of the system is thousands of water molecules, that is just unnecessarily limiting.

timholy commented 1 month ago

Given that the format is ambiguous, then perhaps we do need to deal with ambiguity. I don't think this should be open-ended: it's kind of a disaster to take the strategy "can I find something that seems to make sense in column 5? No? OK, let's see if things seem more sensible if I substitute the value in column 11? No? OK, maybe I can infer a good value from column 13?" But if instead one passes flavor=:vmd to the reader, and that follows strict rules, things are more sensible.

timholy commented 1 month ago

when part of the system is thousands of water molecules

is this part of the issue? That (legacy) PDB only lets you use a single character for encoding the chain? In which case shouldn't VMD be writing mmCIF instead?

lmiq commented 1 month ago

As a minimal example:

ATOM   4247  H   HOH     1     -28.310   3.370 -27.343  1.00  0.00      HATI                          
ATOM   4248  H   HOH     1     -27.809   4.855 -27.803  0.00  0.00      HATI                          
ATOM   4249  O   HOH     1     -27.550   4.020 -27.318  0.00  0.00      HATI                          
ATOM   4250  H   HOH     2     -19.010   5.493  44.300  0.00  0.00      HATI                          
ATOM   4251  H   HOH     2     -18.958   5.838  45.896  1.00  0.00      HATI                          
ATOM   4252  O   HOH     2     -19.119   6.215  44.984  0.00  0.00      HATI                          
ATOM   4253  H   HOH     1      12.951   6.641  28.351  1.00  0.00      HATI                          
ATOM   4254  H   HOH     1      13.590   7.926  29.130  0.00  0.00      HATI                          
ATOM   4255  O   HOH     1      13.304   6.973  29.226  0.00  0.00      HATI

VMD, Pymol, Packmol, and other common simulation packages interpret this without problems. There are three water molecules there.

The issue is that this does not fit into an hierarchical structure, unless the reader creates chains arbitrarily. That would come with a bunch of problems as well.

is this part of the issue? That (legacy) PDB only lets you use a single character for encoding the chain? In which case shouldn't VMD be writing mmCIF instead?

Well, in some sense yes, the PDB format is problematic. But that ship has sailed decades ago in the MD field. Not that there aren't other formats that are more appropriate, particularly for the limitations of the coordinates fields, but the legacy PDB format is still widely used because of its readability. This is something we (MD simulation package authors) just have to live with.

ps: MIToS is reading this if one adds occupancy and b-factor fields (added now), because its hierarchy is at the residue level only, and it just not cares about the "meaning" of two residues with identical numbers in the same chain.

timholy commented 1 month ago

So maybe we need a flavor=:md mode or something? @jgreener64 would that be acceptable?

jgreener64 commented 1 month ago

Yes, but the question still remains about how to put the two atoms into the same hierarchical structure.

We could assign an unused residue number or chain ID, but that feels clunky and runs into problems if the assigned value is found later during reading. We could have another data type that stores multiple coordinates for the same atom, but that raises concerns about how to write the file back out in the same order. We could assign duplicates into a new Model, which would mimic MODEL/ENDMDL tags but doesn't fit the case of many waters in a MD file.

lmiq commented 1 month ago

Maybe the flavor should be the type of hierarchy desired: :segment, :chain, :residue (MIToS now), :atom (PDBTools) or whatever. I don't know, however, if that would make most of the other functionality of BioStructures fail to interoperate now, and thus the package very complex.

timholy commented 1 month ago

I guess my thought is that we decide on concrete flavor-specific behavior and document it. Passing a flavor might require that the entire file be scanned first to identify "occupied" names/indices/whatever, which would be a bit of a performance hit, but perhaps the price of abusing a file format.

lmiq commented 1 month ago

I think it is a little more complicated than that. If we keep the hierarchy we are forced to attribute some of the fields arbitrarily. That will cause confusion.

timholy commented 1 month ago

Isn't the hierarchy about how you assign bonds? Meaning, you have implicit hierarchy any time you want to link entries in rows together.

jgreener64 commented 1 month ago

The issue is that multiple atoms want the same "spot" in the hierarchy, because the file contains multiple objects which have the same name. So you either have to discard one atom, have multiple versions of the hierarchy, or store duplicate atoms in that spot (beyond disorder, which we already do).

We could have a flag renumber_residues when reading that ignores the residue numbering in the file, and increments the assigned residue number every time the residue number changes. This would read cases like

ATOM   4247  H   HOH     1     -28.310   3.370 -27.343  1.00  0.00      HATI                          
ATOM   4248  H   HOH     1     -27.809   4.855 -27.803  0.00  0.00      HATI                          
ATOM   4249  O   HOH     1     -27.550   4.020 -27.318  0.00  0.00      HATI                          
ATOM   4250  H   HOH     2     -19.010   5.493  44.300  0.00  0.00      HATI                          
ATOM   4251  H   HOH     2     -18.958   5.838  45.896  1.00  0.00      HATI                          
ATOM   4252  O   HOH     2     -19.119   6.215  44.984  0.00  0.00      HATI                          
ATOM   4253  H   HOH     1      12.951   6.641  28.351  1.00  0.00      HATI                          
ATOM   4254  H   HOH     1      13.590   7.926  29.130  0.00  0.00      HATI                          
ATOM   4255  O   HOH     1      13.304   6.973  29.226  0.00  0.00      HATI

as

ATOM   4247  H   HOH     1     -28.310   3.370 -27.343  1.00  0.00      HATI                          
ATOM   4248  H   HOH     1     -27.809   4.855 -27.803  0.00  0.00      HATI                          
ATOM   4249  O   HOH     1     -27.550   4.020 -27.318  0.00  0.00      HATI                          
ATOM   4250  H   HOH     2     -19.010   5.493  44.300  0.00  0.00      HATI                          
ATOM   4251  H   HOH     2     -18.958   5.838  45.896  1.00  0.00      HATI                          
ATOM   4252  O   HOH     2     -19.119   6.215  44.984  0.00  0.00      HATI                          
ATOM   4253  H   HOH     3      12.951   6.641  28.351  1.00  0.00      HATI                          
ATOM   4254  H   HOH     3      13.590   7.926  29.130  0.00  0.00      HATI                          
ATOM   4255  O   HOH     3      13.304   6.973  29.226  0.00  0.00      HATI

but would still error on

ATOM   4247  H   HOH     1     -28.310   3.370 -27.343  1.00  0.00      HATI                          
ATOM   4248  H   HOH     1     -27.809   4.855 -27.803  0.00  0.00      HATI                          
ATOM   4249  O   HOH     1     -27.550   4.020 -27.318  0.00  0.00      HATI                          
ATOM   4250  H   HOH     1     -19.010   5.493  44.300  0.00  0.00      HATI                          
ATOM   4251  H   HOH     1     -18.958   5.838  45.896  1.00  0.00      HATI                          
ATOM   4252  O   HOH     1     -19.119   6.215  44.984  0.00  0.00      HATI                          
ATOM   4253  H   HOH     1      12.951   6.641  28.351  1.00  0.00      HATI                          
ATOM   4254  H   HOH     1      13.590   7.926  29.130  0.00  0.00      HATI                          
ATOM   4255  O   HOH     1      13.304   6.973  29.226  0.00  0.00      HATI

Renumbering like this is a useful feature anyway that is available in a number of PDB packages.

lmiq commented 1 month ago

I'm not sure if I understand your comment. In the MD files, I think it is safe to assume that the "residue" is the object by excellence. But the "residue number", or "name", "chain", or any other property, are just meaningless labels that help one to draw or select different parts of the structure.

Thus, if there was an option to just flatten the hierarchy to have "residues" at the top level, the MD files could be parsed without major issues (as MIToS does now, and PDBTools parses at the atom level but does provide an iterator over residues).

What I can´t say is that having the file parsed in a different hierarchy would require all the other functions of BioStructures to have additional methods, thus effectively creating a parallel package within it.

By the way, I noticed now that:

julia> pdb = read("./test.pdb", BioStructures.PDB)
ProteinStructure test.pdb with 1 models, 2 chains (A, ), 284 residues, 2078 atoms

julia> typeof(pdb)
BioStructures.ProteinStructure

that per-se would not fit quite well in MD files, as many of them do not have any protein whatsoever.

lmiq commented 1 month ago

We could have a flag renumber_residues when reading that ignores the residue numbering in the file,

That would help, but I think it would still fall short, as the hierarchy on chains is not really meaningful in this context.

Concerning this:

ATOM   4247  H   HOH     1     -28.310   3.370 -27.343  1.00  0.00      HATI                          
ATOM   4248  H   HOH     1     -27.809   4.855 -27.803  0.00  0.00      HATI                          
ATOM   4249  O   HOH     1     -27.550   4.020 -27.318  0.00  0.00      HATI                          
ATOM   4250  H   HOH     1     -19.010   5.493  44.300  0.00  0.00      HATI                          
ATOM   4251  H   HOH     1     -18.958   5.838  45.896  1.00  0.00      HATI                          
ATOM   4252  O   HOH     1     -19.119   6.215  44.984  0.00  0.00      HATI                          
ATOM   4253  H   HOH     1      12.951   6.641  28.351  1.00  0.00      HATI                          
ATOM   4254  H   HOH     1      13.590   7.926  29.130  0.00  0.00      HATI                          
ATOM   4255  O   HOH     1      13.304   6.973  29.226  0.00  0.00      HATI

That specifically is more complicated. VMD and Pymol recognize that only because they compute the connectivity from the distances and deduce that there are 3 residues there. That I think is beyond what is expected for a reading package. Also, Amber, for instance, just requires TER to be added after each molecule, and ignores everything else.

jgreener64 commented 1 month ago

as many of them do not have any protein whatsoever

The name ProteinStructure is an early design decision I have long regretted. Perhaps if we do a breaking release for https://github.com/BioJulia/BioStructures.jl/issues/49 we can rename it to MolecularStructure or something. Hopefully it shouldn't break too much code, and it should be easy enough to search and make PRs for in other repos.

timholy commented 1 month ago

But the "residue number", or "name", "chain", or any other property, are just meaningless labels that help one to draw or select different parts of the structure... Thus, if there was an option to just flatten the hierarchy to have "residues" at the top level, the MD files could be parsed without major issues

I think I begin to see your point. You're basically saying that it's OK with you if all the waters use the same label and you can't select them individually, right? That's a valid perspective if either (1) you don't actually care about bonds, or (2) there's an external program to assign bonds. I'd guess that few people are actually in camp 1, but camp 2 is presumably common. Is there any ambiguity, though, about assigning bonds simply by distance? If so, then the reader is still partially responsible for indicating hierarchy, like via the TER that Amber requires.

So to me it seems that some kind of renumbering, based on an unambiguous termination indicator, might be a reasonable solution?

lmiq commented 1 month ago

Bonds (or more generally the topology of the molecules) are defined in different files in MD simulations. In the PDB files they can be written with the CONECT keyword, at the end, but most commonly (never? not sure) these are not used.

There are certainly problems in assigning bonds based on distances. Sometimes atoms from different molecules are too close and that causes errors. VMD throws erros/warnings in these cases, when it detects that some atom appears to have more bonds that allowed. But assigning bonds based on distances is useful but only a workaround when the topology files are not provided.

I think I begin to see your point. You're basically saying that it's OK with you if all the waters use the same label and you can't select them individually, right?

You could still select one individually by a residue counter (which might or not match the "residue number" written in the PDB file), exactly how it happens when such files are read by MIToS), where the residue counter is just the index of that residue in the residue array that results from reading.

So to me it seems that some kind of renumbering, based on an unambiguous termination indicator, might be a reasonable solution?

The packages dissociate the residue and atom counts from the "residue number" and "atom index" as written in the PDB files. What we do, which I think is a general enough solution, is that the residue counter is increased whenever any of the residue labels change: residue number, chain, residue name, segment. If any of those change, we assume that a new residue started.

timholy commented 1 month ago

I'd favor strict-by-default and requiring that you pass renumber=true to activate that mode, but renumbering seems like a sensible solution.

timholy commented 1 month ago

is that the residue counter is increased whenever any of the residue labels change: residue number, chain, residue name, segment. If any of those change, we assume that a new residue started.

Just to check, the case in https://github.com/BioJulia/BioStructures.jl/issues/48#issuecomment-2129597753 doesn't fit that description, does it? Are you wishing that would work anyway or are you OK with requiring some kind of termination signal?

jgreener64 commented 1 month ago

What we do, which I think is a general enough solution, is that the residue counter is increased whenever any of the residue labels change: residue number, chain, residue name, segment. If any of those change, we assume that a new residue started.

I'd favor strict-by-default and requiring that you pass renumber=true to activate that mode, but renumbering seems like a sensible solution.

This is what I had in mind with renumber_residues, I can look into making the change. Having different molecules in the same chain is quite common in the PDB, e.g. for waters.

Note, this would mean you can't write the original PDB file back out as we would discard the residue numbers in the file.

lmiq commented 1 month ago

This is what I had in mind with renumber_residues, I can look into making the change. Having different molecules in the same chain is quite common in the PDB, e.g. for waters.

Uhm... I don't think residue renumbering is the correct approach. We do not want residue numbers to be changed when reading protein residues, for example.

What is necessary is to dissociate the residue counter from the residue number as written in the PDB file. (And I would say also dissociate the atom counter from the atom index as written in the PDB file). Note: VMD has different attributes for each: resnum and resid. In PDBTools I also have that, and also added pdb_index for the atom index as written in the PDB file.

I see now that the residues are stored in a Dict in BioStructures, which makes things slightly harder. If it was an ordered Dict, the index of the residue in the ordered Dict would be the residue counter.

I wander if having the residues of each chain in a vector wouldn't be more appropriate - I'm not very comfortable with the order of the residues in the file being meaningless. For instance this is how the residues of h6n6 appear:

julia> pdb[1].chains["A"].residues
Dict{String, BioStructures.AbstractResidue} with 282 entries:
  "407" => Residue 407:A with name LYS, 9 atoms
  "371" => Residue 371:A with name ARG, 11 atoms
  "447" => Residue 447:A with name ILE, 8 atoms
  "335" => Residue 335:A with name ASN, 8 atoms

not having them in order is somewhat strange.

lmiq commented 1 month ago

Just changing the dicts to OrderedDict provides:

julia> pdb[1].chains["A"].residues
OrderedDict{String, AbstractResidue} with 282 entries:
  "225" => Residue 225:A with name SER, 6 atoms
  "226" => Residue 226:A with name ALA, 5 atoms
  "227" => Residue 227:A with name ASN, 8 atoms
  "228" => Residue 228:A with name GLU, 9 atoms

which then, if read with the "new residue approach", would implicitly store both the counter and the residue numer. But I would probably prefer having a ResidueVector type with a overloaded getindex function to retrieve residue["225"] when appropriate.

Or having to explicitly use something like residue[ResID(225)], as having the residue number as a string is already not ideal.

Just to check, the case in #48 (comment) doesn't fit that description, does it? Are you wishing that would work anyway or are you OK with requiring some kind of termination signal?

@timholy sorry, missed this comment. Yes, that case I think we can leave out. Maybe support TER after each molecule as workaround, as an important MD package (Amber) uses that, but it would be very specific (I had to support that in Packmol, nevertheless).

jgreener64 commented 1 month ago

Am I understanding right that the key in the OrderedDict case would be the residue number from the file, and the ordering is based on the residue counter? How would this work in the case of multiple residues with the same residue number in the file (since that would be multiple values with the same key)?

I guess this could be solved with a vector type, but then what should be returned when you index into it with "225", considering that there might be multiple residues with that residue number in the file?

I'm not very comfortable with the order of the residues in the file being meaningless

The order is not stored directly in the object, but the residues are sorted when writing out. This means that different file representations of the same underlying molecule get written out to the same file, but you do lose the ordering of the input file. In the usual case of ascending residue numbers in the input file (possibly with gaps), this is preserved in the output file.

timholy commented 1 month ago

We do not want residue numbers to be changed when reading protein residues, for example.

I think @jgreener64's point is that this admirable goal makes things harder---if you only increment the counter when something is in conflict, what happens if a later structure in the file uses the number you "stole" in order to do the increment? Two possible solutions are (1) to parse the file twice to identify all conflicts in advance, or (2) transiently distinguish "unconflicted" ids and "conflicted" ids and then assign final ids after all unconflicted ones have been assigned. While I proposed (1) above, maybe (2) is the better choice.

I see now that the residues are stored in a Dict in BioStructures, which makes things slightly harder. If it was an ordered Dict, the index of the residue in the ordered Dict would be the residue counter.

I've learned that there's a separate field that gives the order of the keys, and you can access the dict with integer indexing as dict[keyorder[i]]. So effectively there is an OrderedDict even if it isn't represented that way. I'm not sure I fully understand how disordered locations (when you have more than one location for a residue or atom) interact that way, though, which I think is the main purpose behind using strings as residue keys.

@timholy...yes, that case I think we can leave out.

👍

lmiq commented 1 month ago

I think I need to give a step back and explain how we use the data here.

From our perspective, all fields of the PDB are just labels. Apart from the data of the fields, in the process of reading the file, one annotates, incrementally, an independent residue counter and an independent atom index counter.

At the end, we have a vector of atoms where each atom carries all that information:

julia> pdb = PDBTools.readPDB("/home/leandro/Downloads/6hn6.pdb")
   Array{Atoms,1} with 2090 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     SER     A      225        1   43.004   80.351   76.389  1.00 118.26     1       -         1
       2   CA     SER     A      225        1   42.216   81.220   75.509  1.00 116.77     1       -         2
       3    C     SER     A      225        1   42.952   81.518   74.177  1.00 118.83     1       -         3
                                                       ⋮ 
    2088    O     HOH     A      638      306   23.920   74.193   77.532  1.00 55.18     1       -      2089
    2089    O     HOH     A      639      307    6.395   85.305   51.528  1.00 104.62     1       -      2090
    2090    O     HOH     A      640      308   19.024   61.159   68.618  1.00 64.07     1       -      2091

There, for instance, index is the incremental atom counter, and residue is the incremental residue counter, while index_pdb is the content of the index field in the PDB file (rarely used) and resnum is the content of the residue number field in the PDB file.

Then, working with such a data structure consists in applying filters, which return vectors of atoms, or indexes:

julia> filter(sel"resnum < 300", pdb)
   Array{Atoms,1} with 587 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     SER     A      225        1   43.004   80.351   76.389  1.00 118.26     1       -         1
       2   CA     SER     A      225        1   42.216   81.220   75.509  1.00 116.77     1       -         2
       3    C     SER     A      225        1   42.952   81.518   74.177  1.00 118.83     1       -         3
                                                       ⋮ 
     585  CG1     ILE     A      299       75   23.370   60.986   68.597  1.00 44.83     1       -       585
     586  CG2     ILE     A      299       75   23.842   62.784   70.389  1.00 46.63     1       -       586
     587  CD1     ILE     A      299       75   24.450   60.003   68.979  1.00 49.51     1       -       587

julia> findall(sel"residue = 1", pdb)
6-element Vector{Int64}:
 1
 2
 3
 4
 5
 6

We do not need these filtering operations to be particularly fast, or lazy. But we do need to be able to select subsets of the structure with great versatility, potentially using incremental or pdb-defined residue numbers (very common), or any other atom property.

That data structure doesn't prevent us from iterating over residues, with an appropriate iterator:

julia> for res in eachresidue(filter(sel"residue <= 2", pdb))
           println(resname(res)," ",resnum(res))
       end
SER 225
ALA 226

and we could (but didn´t yet) define similar iterators for models, chains, etc, of course. We just don't use those as often.

These way dealing with the data does not have restrictions about duplicate residues, residue numbers, etc. If two residues have the same data, they will just be filtered together, as we would expect them to. Still we can differentiate them by their incremental residue counter and incremental atom indices. There are no conflicts when reading the data, and no special issues associated to repeated fields.

These way of storing and using the data is convenient for us. And I think that we must understand why or how it is not convenient for other people, to justify having different data structures, and to which extent.

For us, using a syntax like pdb["A"].residues["247"].atoms[1] is not very useful nor practical, since most of the time we are selecting subsets of atoms that may belong to different chains, residues, or molecules, like element O and water, or protein and backbone. Having the underlying data in a tree does not really help much.

timholy commented 1 month ago

We've also gotten a lot of benefit from random-access indexing. For example we might pick out all the positively-charged residues and examine their spatial distribution. We also compute displacement vectors from the alpha carbon to the side chain center-of-mass to determine whether residues in a 7TM are "interior" or "exterior." And so on.

I don't think the pdb["A"] portion of this will get in our way, but indeed when it gets down to the residue level we need utilities that seem missing from BioStructures. I'm optimistic all that is resolvable but it may take some careful design. And I agree that it can't successfully be a platform for the whole community unless we find ways to make that kind of thing easy.

That said, I think BioStructures is doing something important: it focuses on structural representation rather than file format, and it can get multiple file formats into that same representation. It also seems to take the complexities of that process seriously. It's why I'm prepared to do a fair amount of work to port our own code over to BioStructures, if the technical issues (convenience, comprehensiveness, speed, heaviness of dependency, etc) can all be resolved. I for one am quite optimistic that this is the case. And if not, I think the goal is worthy even if we have to have a fresh start. Fundamentally, the current fragmentation of the representations for structures is very detrimental to the long-term health of the Julia ecosystem for questions that relate to structural biology.

I have a conference deadline in a few weeks, so I am not sure how much I can do between now and then, but this is something I'm willing to put some work into to help make it happen.

jgreener64 commented 1 month ago

The PDBTools.jl approach is certainly appealing and self-consistent. You can access similar list functionality in BioStructures using collectatoms, collectresidues and selector functions of course, though selection is currently done via functions rather than a selection syntax. For example, something like collectresidues(struc, r -> resnumber(r) < 300) works similar to your filter(sel"resnum < 300", pdb) case. In general the underlying structure should not hinder this type of analysis, and if it does that is a bug.

when it gets down to the residue level we need utilities that seem missing from BioStructures

Do let me know of any calculations that BioStructures can't do here, they should be possible to add.

The limitation of the tree data structure is then if it restricts you from doing something, which here is adding two atoms with the same labels.

From our perspective, all fields of the PDB are just labels.

This, I think, highlights the difference in philosophy between the packages. BioStructures mirrors the PDB in not treating the fields as labels, but as semantically-relevant information that can lead to ambiguity. In those cases we error deliberately, so in a sense the inability of the data structure to represent this is a "feature not a bug", though I see why in your case it is frustrating.

I did have a quick look at other parsers and it seems they are varied. ProDy adopts a list-of-atoms approach and reads in duplicates. Biopython, probably the most widely-used parser in Python, ignores the second case in permissive mode (which seems like the worst of options) and errors on duplicates in strict mode. BioStructures hence mirrors the Biopython strict mode, which is not surprising given that the representation was based on Biopython.

It makes sense that visualisation packages would be more permissive, since you see what you open and they are often used for checking files of unknown validity. But for something like looping over all files in the PDB and doing some spatial analysis, which is a use case for BioStructures, you would probably want to guarantee that each atom appears only once (aside from disorder), hence why disallowing duplicates has not seemed a problem to me before.

lmiq commented 1 month ago

you would probably want to guarantee that each atom appears only once (aside from disorder), hence why disallowing duplicates has not seemed a problem to me before.

This is a crucial point I think. PDB files accept multiple conformations of the same residue and have a specific notation to it. I don't think having a format that inherently is incapable of representing such is appropriate for a general structure representation package. In MD we also have multiple topologies (two residues in the same position for performing FEP of mutations). I think we should have an underlying format able to represent all these possibilities.

For looping over PDB files requiring some more restricted data set, we should have explicit reading options or post-reading filtering or checking functions (filter!(!is_alternate_conformation, structure)).

What I meant from the previous post is that we should focus first in what is the utility of the packages before deciding the underlying format, and I wanted to understand if struct["A"].residue["274"].atoms[1] is something useful to justify the possible limitations of the representation that such a data structure brings.

timholy commented 1 month ago

What I meant from the previous post is that we should focus first in what is the utility of the packages before deciding the underlying format, and I wanted to understand if struct["A"].residue["274"].atoms[1] is something useful

Overall, this discussion seems a bit like a struct-vs-DataFrame debate. If you have a set of objects with containerization relationships, then x.y[2].z[3] is equivalent to the DataFrames-world x[:y => ==(2), :z => ==(3)]. Both are viable ways to get your work done. Some things might seem easier or more natural depending on

what the operation is
what your previous experience is
what utility functions are available

The last, of course, is something that can be enhanced any time there is an unmet need. But I don't think that all users would agree that x[:y => ==(2), :z => ==(3)] is a lot more natural than x.y[2].z[3].

I think this analogy is useful because it highlights that the three packages seem to pick different levels for their default representation:

BioStructures: fully embraces the struct perspective
PDBTools: fully embraces the DataFrame perspective, as it flattens everything down to the level of atoms and uses "columns" to indicate containerization relationships.
MIToS: hybrid. Selects the chain-level at reading time, and thus you only have one chain in a single variable. It then flattens DataFrame-like at the level of residues, but each residue is a struct.

This diversity highlights the fact that this choice is a bit arbitrary. To me that's a strong argument that there are many ways we can all get our work done, and that the most important thing to do is unify around a single representation, regardless of whether it's hierarchical or flat. We need a standard representation to build a large, high-quality stack in this space.

to justify the possible limitations of the representation that such a data structure brings

I'd propose that we distinguish the discussion about "hierarchical or flat" from the "strict vs permissive" discussion. While a Dict can only associate one value per key, if you really need to represent "both values" assigned to 'a' in

Dict('a' => 1, 'a' => 2)

you can always use

Dict('a' => [1, 2])

as long as the rest of the code knows that the value field should be interpreted as a container of values.

lmiq commented 1 month ago

fully embraces the struct perspective

Sort of, because PDBs have also "model" and "segment" identifiers. There is a lot of arbitrariness in the way the PDB format classifies parts of the structures. "chain" is a synonym of "molecule" from a chemical point of view, but if that was taken to heart then water molecules should split on different chains.

IMHO the underlying representation of molecular structures should be at the "molecule" level, and then flat at the atom level. A "molecule" is effectively meaningful, but even the content of the molecule is subject to ambiguities (as multiple conformations or incomplete atomic positions).

All the remaining information that a PDB (or other format) contains are just arbitrary annotations, which I think can only be represented generally enough if at the atom level with custom fields.

If we had an underlying molecular structure format that represented molecules where atoms have optional fields, that could effectively be useful for interoperability of the various molecular-structure packages. Anticipating the possibility of storing the topology of the molecule in the same format is probably a must.

ps: The only real constraint we have on the underlying format is concerning performance for specific operations. I think the choice of one specific format could be limited in view if we identify an important and common operation that needs to be very fast on very large data sets. But probably in those cases intermediate representations are needed anyway.

timholy commented 1 month ago

CC @anton083. This discussion started on Slack (I invited @murrellb but a search didn't turn up a user handle for you), but most of the content is here now. The bottom line is that I'm trying to build momentum around the idea that currently we have 3 main packages for reading PDBs and that we really should settle on just one. This productive discussion is mostly about identifying any barriers that currently prevent this, and what can be done to fix them.

timholy commented 1 month ago

if that was taken to heart then water molecules should split on different chains.

Personally that seems like a good idea to me, if the one-character limit of PDB files weren't a constraint. Of course you'd want tools to "find/discard all waters" but that's basically sugar.

IMHO the underlying representation of molecular structures should be at the "molecule" level, and then flat at the atom level. A "molecule" is effectively meaningful, but even the content of the molecule is subject to ambiguities (as multiple conformations or incomplete atomic positions).

Conceptually there's a lot of value in that idea---you're right that Residue is really awkward for things like, e.g., ligands which are unrelated to amino acids. I guess the only real value of Residue (or an equivalent "column in the dataframe") is that a flat list of atoms without annotation does not allow you to infer bonds, whereas a list of Residues does let you implicitly encode bonds since (1) proteins are (mostly) made up of specific building blocks, and (2) these blocks are connected linearly. That fails, obviously, for things that aren't proteins. So I agree that CO₂ (a non-AA chain) might be more fairly represented as

3×4 DataFrame
 Row │ id     symbol  residue  chain
     │ Int64  String  Missing  Int64
─────┼───────────────────────────────
   1 │     1  C       missing      1
   2 │     2  O       missing      1
   3 │     3  O       missing      1

than via

struct Chain
    residues::Vector{Residue}
end

struct Residue
    atoms::Vector{Atom}
end

For a hierarchical representation, we could fix that with

struct Chain
    residues::Union{Vector{Residue},Vector{Atom}}
end

or

struct Chain
    residues::Union{Vector{Residue}, Nothing}   # specify either (not both!)
    atoms::Union{Vector{Atom}, Nothing}
end

but presumably that would be viable.

The only real constraint we have on the underlying format is concerning performance for specific operations.

This is one place where hierarchical representations have advantages. If you have a completely flat representation, say with 20 chains, each with 300 residues, each with ~8 atoms, then any time you select anything from the "dataframe" you need to traverse 20 300 8 = 48000 rows. Whereas with a hierarchical representation, if you only need "chain D" then you immediately reduce down to 300 residues regardless of whether any other efficiencies are possible. Of course, hybrid strategies achieve hybrid performance.

As an example, one thing we do in our analyses all the time is compute the center-of-mass of the each residue side-chain. This is easy and efficient with a hierarchical representation. With a flat representation of atoms, you'd need the equivalent of DataFrames' groupby to make that efficient---it's an O(n) operation (n = number of atoms) rather than the naive O(n*m) (m = number of residues) if you did each residue as a separate query on the full list of atoms.

lmiq commented 1 month ago

Personally that seems like a good idea to me, if the one-character limit of PDB files weren't a constraint.

The problem here is that this not up to us to decide. PDBs are already used in which multiple molecules belong to the same chain.

If for generality, I would rather use

struct Molecule
    atoms::Vector{Atom}
    bonds::Vector{Tuple{Int,Int}}
end

where Atom is a data structure with a dictionary of potentially custom fields, including chain identifiers, etc. I would then rely on higher-level functions to deal with all possible variations and subtleties of how these fields can be used.

As an example, one thing we do in our analyses all the time is compute the center-of-mass of the each residue side-chain.

Yes, those things can be easier with the hierarchical approach. But, is that performance really important? And, at the same time, with the flat approach that specifically can also be done in O(n) (just mentioning because I think it is important to identify actual applications where one or other format is effectively limiting):

julia> using PDBTools, StaticArrays

julia> cm_side_chains = SVector{3,Float64}[]
       for residue in eachresidue(atoms) # lazy
           side_chain = select(residue, by = issidechain) # select side chain atoms of this residue 
           length(side_chain) > 0 && push!(cm_side_chains, center_of_mass(side_chain))
       end

(yes, if I needed the CM of one specific residue, I still have to traverse the array).

timholy commented 1 month ago

The problem here is that this not up to us to decide.

How we represent structures is/can be distinct from how they are encoded in the file. Several of our posts above concern renumbering schemes that ignore one or more tags in the file.

To the extent that Chain is a synonym for "molecule" (is it? I'm unsure), then obviously in reality each water molecule is its own chain.

struct Molecule

That's getting a bit into MolecularGraph.jl territory. It's an excellent package and we use it heavily.

with the flat approach that specifically can also be done in O(n)

Yes, that's basically what I meant by "groupby"; if your code example is meant to run in O(n), then presumably eachresidue makes a single pass to first collect atoms that are together in the same residue, or it assumes that atoms are listed in order by residue. The first is safe but impossible to do in a lazy manner; the latter is dangerous if the atoms are stored in a representation like Vector which permits reordering. You can do it safely without intermediate collection if you do something like

accum = [zeros(3) for _ = 1:nresidues]
count = fill(0, nresidues)
for atom in list
    idx = residueindex(atom)
    accum[idx] += atom.coords
    count[idx] += 1
end
centerofmass = accum ./ count

but of course that's a bit algorithm-specific (it works for linear measures on the residue, less obviously well for nonlinear measures).

lmiq commented 1 month ago

To the extent that Chain is a synonym for "molecule" (is it? I'm unsure), then obviously in reality each water molecule is its own chain.

It is because "chain" is thought as "polymer chain", which make sense for proteins or nucleic acids, or other polymers. Calling a water molecule a "chain" would be an abuse of notation, though.

Yet, a molecule could be a vector of general "ChemicalUnit"s, for which a single molecule is a one, or a polymer residue is also one.

struct Molecule
    units::Vector{ChemicalUnit}
end
struct ChemicalUnit
    atoms::Vector{Atom}
end

then a water molecule would have a single chemical unit, a polymer could have many and map what we call "chain" in the specific chain of PDB files.

Just throwing an idea, I'm not claiming "ChemicalUnit" not be necessarily a good name.

jgreener64 commented 1 month ago

I wanted to understand if struct["A"].residue["274"].atoms[1] is something useful to justify the possible limitations of the representation that such a data structure brings.

I find this useful for interactive structural biology applications, but as you say any representation can probably do this by overloading getindex so I wouldn't say it justifies the data structure by itself.

PDB files accept multiple conformations of the same residue and have a specific notation to it. I don't think having a format that inherently is incapable of representing such is appropriate for a general structure representation package.

I think we are all aware of this but to be clear, BioStructures allows the 3 types of multiple conformations that appear in the PDB: alternative atom (representing ambiguous experimental determination), alternative residue (ambiguous experimental determination or mutation) and MODEL blocks (representing NMR models or MD trajectories). What it doesn't allow is multiple atoms with the same label, because the only way to distinguish these is by the order in the file, and this is a property of a file not the underlying molecule.

I'd propose that we distinguish the discussion about "hierarchical or flat" from the "strict vs permissive" discussion.

I agree with this, I am suggesting that even if the data structure allows it we should think very carefully before being permissive in this sense.

Dict('a' => [1, 2])

We could think about making this change, but would have to think about how it plays with collectatoms etc., I guess we would expand the repeats when a flag like expand_duplicates is given.

If we had an underlying molecular structure format that represented molecules where atoms have optional fields, that could effectively be useful for interoperability of the various molecular-structure packages.

This is exactly the aim of AtomsBase.jl, though it hasn't taken off for the biomolecular case yet. They have some readers and writers too. It would be nice to have support for that in BioStructures.

IMHO the underlying representation of molecular structures should be at the "molecule" level, and then flat at the atom level

Residue-level is also natural and useful for biopolymers, for example for assigning parameters for a molecular dynamics simulation which are parameterised by residue.

lmiq commented 1 month ago

What it doesn't allow is multiple atoms with the same label, because the only way to distinguish these is by the order in the file, and this is a property of a file not the underlying molecule

But what's the label then? Two identical atoms belonging to different conformations of the same residue do not have a clear way to be distinguished by a label in the PDB. If we need to invent a label, I actually find more useful to just stick with the index of the atom in the "file" (or more generally, "MolecularSystem"), which in fact is a useful index.

timholy commented 1 month ago

I'm still coming to grips with this myself, but isn't this what alt_loc_id is about? For atoms, that's column 17 in the PDB file, see https://www.cgl.ucsf.edu/chimera/docs/UsersGuide/tutorials/pdbintro.html. In PDBTools I think you bundle that into the residue name, https://github.com/m3g/PDBTools.jl/blob/3e246aec4ecbb2922ab85678404394e3245000a1/src/read_atom.jl#L58

lmiq commented 1 month ago

It is. I meant it is not useful as a unique identifier, unless some malabarism is made.

jgreener64 commented 1 month ago

Why is it not useful? The alt loc ID can be A, B etc. which uniquely identifies the conformation.

As I see it small-scale conformational variability is handled okay. The case we are missing might appear when you have >9999 water molecules, you can't write 10000 as a residue number and so you have some scheme where you reuse residue numbers on the same chain to represent completely distinct molecules.

Another case would be where you have different conformations of the whole system, which could represent a MD trajectory, but MODEL blocks can be used there as written by many MD packages.

lmiq commented 1 month ago

I was just editing my comment above.

Now I could check what BioStructures do, which is have disordered atoms stored with a different type (that was the piece I was missing in terms of using the field for representing the multiple conformations):

julia> s["A"].residues["364"].atoms[" CB "]
DisorderedAtom CB with alt loc IDs A,B

julia> s["A"].residues["364"].atoms[" CB "]['A']
Atom CB with serial 1085, coordinates [25.31, 88.547, 65.041], alt loc ID A

julia> s["A"].residues["364"].atoms[" CB "]['B']
Atom CB with serial 1086, coordinates [25.318, 88.585, 65.034], alt loc ID B

I don't have any objection to that, in principle.

Retrospectively, I think, I didn't use BioStructures for my purposes because it failed reading any molecular simulation PDB files, but that might be just because of the issue here reported to begin with.

If a good solution to that is found, I might test if the typical files I have to handle are parsed and, then, I wouldn't mind moving the reading infrastructure and even the data formats of PDBTools to that. If someone smarter than me can implement a full-featured selection syntax like that of VMD (and which PDBTools implements only partially), I think PDBTools could be retired for good.

9999 water molecules, you can't write 10000 as a residue number

PDBTools (and VMD) switch to hexadecimal representations when that happens (also for the atom indices).

ps: I'm still uneasy about not using the sequence of atoms as written in the files. While I see the appeal in the fact that that sequence is somewhat arbitrary and non-physical, the fact is that editing these molecular systems is very commonly a back-and-forth lookup on the actual files, and I would rather have a clear correlation between the representation of the molecular system and what is written. I do not see any advantage of not being able to readily access the atoms by their incremental indices in the PDB file. Maybe I could get accustomed to using struct[AtomIndex(1)] or struct[sel"index 1"] (which would be an extension of the PDBTools style).

jgreener64 commented 1 month ago

Agree that a selection syntax would be great. MDAnalysis is another reference point there: https://userguide.mdanalysis.org/stable/selections.html.

PDBTools (and VMD) switch to hexadecimal representations when that happens (also for the atom indices).

We could probably extend to read this in.

I'm still uneasy about not using the sequence of atoms as written in the files

We could store this as an extra field. In general I think it is just the atom serial though, which we do store. The order that residues are read in we don't store.

lmiq commented 1 month ago

9999 water molecules, you can't write 10000 as a residue number

I think it is just the atom serial though, which we do store.

Not necessarily, because the atom indices too frequently overflow the format. VMD just ignores them and counts the atoms sequentially. PDBTools stores the serial numbers of the PDB in the pdb_index field, which is effectively rarely used.

Concerning levels of classification: there is the "segment" level as well, which is actually important. For instance, we are dealing now with a virus particle. Viruses have a lot of symmetry, and the deposited PDB has only a minimal unit that has to be replicated to compose the complete virus capsid. Within that minimal unit there are of course different chains. Upon replication of the unit, we have then multiple repeated chains, and we do not want to change the name of these chains, because it is inconvenient for many sorts of analysis. The different replicas of the minimal unit are then differentiated by the their "segment names". Thus, in this case, we have, in the same MODEL, many repeated CHAIN identifiers, with repeated fields for everything else, except the "segment" name.

timholy commented 1 month ago

Awesome, @lmiq. I'll be happy to help out once I get past my conference deadline.

Regarding atom order, one thing to think about is the overhead of allocating a Vector{Float64} for each atom xyz coordinate. One alternative would be to store locations at the Chain level in a 3xn matrix, and have each atom refer to the appropriate column in that matrix. That might fit nicely with the representation you're requesting. But the performance implications would have to be checked carefully.

timholy commented 1 month ago

Not necessarily, because the atom indices too frequently overflow the format.

It is kind of amazing that folks (including me!) haven't adopted mmCIF, because these are the kinds of problems it was designed to solve. In my own case, the small step from "PDB" the database to "PDB" the fileformat made me assume "oh, PDB is the right format to download." I only just read enough about this in the past few days to realize that I should switch to mmCIF files, and how weird it is that we all still prefer a file format heavily tailored to the era when keyboards only had capital letters, monitors had black backgrounds with glowing green fonts, and terminals were fixed at 80-character width. It's a little like the story about how the gap between two horses pulling a chariot ended up affected the design of the space shuttle, because the horse-gap set the wheel spacing, the wheel spacing determined the ruts in the ground and thus the road width, which ended up affecting the dimensions of tunnels in the modern US interstate highway system needed to transport assembled components over long distances.

I wonder if they had named it "PDB+" if it would have wider adoption? There seems to be a lot in a name.

lmiq commented 1 month ago

The alternative to store the coordinates in a matrix or Vector{SVector} is something that may be worth additional thinking. That would allow using the coordinates efficiently. If that comes with some other benefits, even better.

I'm still, though, not convinced that not having an order for the atoms in the complete system, following the input file, is really an advantage. There's the argument that chains or molecules could be randomly written in the original file, but that is something I've never seen. (on the contrary, the order of the chains, heteroatoms, etc, is usually carefully thought in terms of their importance).

Concerning the PDB format, there are objective advantages of having a format for which one can copy and paste chunks of data from one file to the other. (and it worries me a bit the fact that mmCIF can, as far as I understand, have other fields, with other levels of user-defined molecular architecture organization, which probably will appear as the systems become larger and larger - and a long-lasting reading format should probably be able to adapt to that).

BioJulia / BioStructures.jl

failing to read PDB files generated by VMD #48