biotite-dev / biotite

A comprehensive library for computational molecular biology
https://www.biotite-python.org
BSD 3-Clause "New" or "Revised" License
660 stars 101 forks source link

Multi-model PDB takes `b_factor` from first model #449

Closed BradyAJohnston closed 1 year ago

BradyAJohnston commented 1 year ago

When importing a multi-model pdb file, the b_factor is taken from the first model, for all subsequent models.

Example pdb:

TITLE     MDANALYSIS FRAMES FROM 0, STEP 1: Created by PDBWriter
CRYST1    1.000    1.000    1.000  90.00  90.00  90.00 P 1           1
REMARK     285 UNITARY VALUES FOR THE UNIT CELL AUTOMATICALLY SET
REMARK     285 BY MDANALYSIS PDBWRITER BECAUSE UNIT CELL INFORMATION
REMARK     285 WAS MISSING.
REMARK     285 PROTEIN DATA BANK CONVENTIONS REQUIRE THAT
REMARK     285 CRYST1 RECORD IS INCLUDED, BUT THE VALUES ON
REMARK     285 THIS RECORD ARE MEANINGLESS.
MODEL        1
ATOM      1  O   SOL X   1       0.000   0.000   0.000  1.00  0.17      SOL  O
ATOM      2  H1  SOL X   1       0.000   0.000   0.000  1.00  0.17      SOL  H
ATOM      3  H2  SOL X   1       0.000   0.000   0.000  1.00  0.17      SOL  H
ENDMDL
MODEL        2
ATOM      1  O   SOL X   1       0.000   0.000   0.000  1.00  0.33      SOL  O
ATOM      2  H1  SOL X   1       0.000   0.000   0.000  1.00  0.33      SOL  H
ATOM      3  H2  SOL X   1       0.959   0.959   0.959  1.00  0.33      SOL  H
ENDMDL
MODEL        3
ATOM      1  O   SOL X   1       0.959   0.959  -0.027  1.00  0.50      SOL  O
ATOM      2  H1  SOL X   1      -0.027  -0.027  -0.027  1.00  0.50      SOL  H
ATOM      3  H2  SOL X   1      -0.027   0.032   0.032  1.00  0.50      SOL  H
ENDMDL
MODEL        4
ATOM      1  O   SOL X   1       0.032   0.032   0.032  1.00  0.67      SOL  O
ATOM      2  H1  SOL X   1      -0.280  -0.280  -0.280  1.00  0.67      SOL  H
ATOM      3  H2  SOL X   1      -0.280  -0.280  -0.588  1.00  0.67      SOL  H
ENDMDL
MODEL        5
ATOM      1  O   SOL X   1      -0.588  -0.588  -0.588  1.00  0.83      SOL  O
ATOM      2  H1  SOL X   1      -0.588   0.706   0.706  1.00  0.83      SOL  H
ATOM      3  H2  SOL X   1       0.706   0.706   0.706  1.00  0.83      SOL  H
ENDMDL
END

Each frame outputs array([0.17, 0.17, 0.17]) from the first model.

import biotite.structure.io.pdb as pdb
file = pdb.PDBFile.read(file_path)
mol = pdb.get_structure(file, extra_fields = ['b_factor'])
mol[0].b_factor
mol[1].b_factor
padix-key commented 1 year ago

Thanks for bringing this up. I fear this is a limitation in the core of Biotite: AtomArray and AtomArrayStack objects handle the B-factor as optional annotation array. Therefore, by design, the b_factor is equal for all models which is true for all annotation arrays.

I see two future prospects to fix this:

  1. The quick variant would be a get_b_factor() method in PDBFile, analogous to get_coord(). This would allow relatively fast parsing of B-factors even over multiple models. However, the NumPy array of B-factors would not be integrated into the AtomArray itself but must be shipped separately in this variant.
  2. Adding b_factor as 2D array to AtomArray and AtomArrayStack. However, this would require changes in all file interfaces and some other functions. Furthermore, some of these changes would break backwards compatibility (which is not a hard barrier, as Biotite is still in beta).

A solution to this, that is equivalent to variant 1 but with lower efficiency, would be parsing each model separately and saving the array of B-factors:

file = pdb.PDBFile.read(
    "/home/kunzmann/data/coding/biotite/tests/structure/data/1l2y.pdb"
)
b_factor = []
for model in range(file.get_model_count()):
    atoms = file.get_structure(model=model+1, extra_fields=["b_factor"])
    b_factor.append(atoms.b_factor)
b_factor = np.stack(b_factor)

I tend to variant 1, as the B-factor is originally a property of crystal structures (although it has been exploited by a range of programs to store other data as well). As crystal structures in almost all cases have only a single model and multi-model (e.g. NMR) structures have no meaningful B-factor, there is little use for different B-factors along multiple models. Therefore, I think handling a 2D array of B-factors separate from the AtomArrayStack should be sufficient in such special cases. Alternatively, a list of AtomArray objects for each model could be used instead at the cost of some performance and convenience.

What is your opinion on this? Would using a separate 2D array work for your use case of atom coloring on a per-model basis?

BradyAJohnston commented 1 year ago

In my experience the only time a multi-model PDB or mmCIF contains differing b_factor values is when people are hijacking it for storing arbitrary data computed from something else, so it's understandably not an integrated part of the AtomArray.

For my own purposes, handling a separate 2D array would be totally fin. I can index it when creating the models, and I currently only need to actually have access to it one time really. The code solution you posted as well would also be useful and I could certainly implement it in the mean time if you were to plan on adding the get_b_factor() method later on.