mmtf parser concatenates models

MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.

https://mdanalysis.org

Other

1.29k stars 646 forks source link

mmtf parser concatenates models #1496

Closed kain88-de closed 3 weeks ago

kain88-de commented 7 years ago

EDIT by @jbarnoud: The issue actually comes from the handling of models by the mmtf paser. See discussion.

Expected behaviour

both give the same number of segments for 1ubq

Actual behaviour

the mmtf parser gives two segments and the pdb parser gives one

Code to reproduce the behaviour

import MDAnalysis as mda

u = mda.fetch_mmtf('1UBQ')
print(u.atoms.segments)
u = mda.Universe('1ubq.pdb')
print(u.atoms.segments)

....

Currently version of MDAnalysis:

0.17.0dev (couple days old)

kain88-de commented 7 years ago

Behavior is even worse for *1Q0W` where the mmtf parser finds 40 segments while the PDB parser finds only 2.

jbarnoud commented 7 years ago

It seems that the problem is not so much the segments, but the whole content of the files:

import MDAnalysis as mda

u_mmtf = mda.fetch_mmtf('1Q0W')
print(len(u_mmtf.atoms))  # 32420
u_pdb = mda.Universe('/home/jon/Downloads/1q0w.pdb')
print(len(u_pdb.atoms))  # 1621

The MMTF file contains 20 times the number of atoms the PDB file has. Looking at the PDB file, there are 20 models. The issue seems to be that we do not account for models in the same way in the PDB and the MMTF parsers.

richardjgowers commented 7 years ago

Yeah so PDB uses models to differentiate between frames in a trajectory. In MMTF we don't do this, but maybe we should? Ie we could use u.trajectory[i] to select the ith model. I think the problem with this was that models didn't necessarily have the same number of atoms, which breaks our trajectory iteration model.

kain88-de commented 7 years ago

We could default to the first model though.

jbarnoud commented 7 years ago

That is what we do with the PDB parser. Actually, we do not even read after the first model when reading the topology: https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/topology/PDBParser.py#L158.

richardjgowers commented 7 years ago

@jbarnoud for topology purposes, we read the first model and assume that all other models have the same topology, the coordinate reader will iterate (the coordinates) over models

orbeckst commented 7 years ago

One issue is the handling of models, the other the handling of segments or did I misunderstand?

For PDB, segments are parts of a model. Is the same true for mmtf? Are there even segments defined in MMTF?

Assuming that a model broadly means the same thing in both formats then I would also go with the current approach to choose the first model for topology building. I think MMTFParser effectively switches the topology when another model is selected, doesn't it?

richardjgowers commented 7 years ago

@orbeckst the problem is that different models can have a different number of atoms. We currently can't change the topology with frame (model) changes.

The PDB format using models as frames is a special case, as the number of atoms remains constant, which is why we can treat it as a traj.

kain88-de commented 7 years ago

The PDB format using models as frames is a special case, as the number of atoms remains constant, which is why we can treat it as a traj.

That is what most PDBs do. Actually I do not know of a rule in PDB that forbids that models have a different number of atoms. It's just what most people implicitly do, especially since it's abused as a trajectory format.

orbeckst commented 7 years ago

This sounds like a discussion that we already had and where we consulted Alex Rose; I think the conclusion was exactly what Max is saying but it might be worthwhile to look through the issue tracker again.

Ideally we come up with a smart way to deal with changing topologies.

But we also need a short term solution that makes these formats useable in a reasonably consistent way. What are the user expectations here?

Am Aug 22, 2017 um 11:08 schrieb Max Linke notifications@github.com:

The PDB format using models as frames is a special case, as the number of atoms remains constant, which is why we can treat it as a traj.

That is what most PDBs do. Actually I do not know of a rule in PDB that forbids that models have a different number of atoms. It's just what most people implicitly do, especially since it's abused as a trajectory format.

richardjgowers commented 3 weeks ago

probably not relevant now mmtf is going extinct