Minimum required information on Atoms

GoogleCodeExporter commented 9 years ago

Going through all the topology readers made me notice that most formats provide 
very different data, and MDA expects quite a lot of stuff.

The init for Atom currently looks like:

 def __init__(self, number, name, type, resname, resid, segid, mass, charge,
              residue=None, segment=None, radius=None, bfactor=None,
              resnum=None, serial=None, altLoc=None):

So required fields are:
 * number
 * name
 * type
 * resname
 * resid
 * segid
 * mass
 * charge

It would be good to reduce the required fields. As far as I can see, the only 
truly required field is the number. If we only populate atoms with 
what is provided, and build any secondary data structures (fragments, residues, 
bonds) lazily, it should make things more lightweight in many cases.
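
A minimal sketch of the lazy-building idea, assuming Python 3.8+ for `functools.cached_property`. The `Atom` and `AtomGroup` classes here are hypothetical stand-ins, not the MDAnalysis classes; the point is only that the residue grouping is computed on first access and then reused.

```python
from collections import defaultdict
from functools import cached_property


class Atom:
    """Hypothetical minimal atom: only `number` is required."""

    def __init__(self, number, **attrs):
        self.number = number
        # Store only the attributes the topology file actually provided.
        self.__dict__.update(attrs)


class AtomGroup:
    """Hypothetical container that builds residue groupings on first access."""

    def __init__(self, atoms):
        self.atoms = list(atoms)
        self.build_count = 0  # counts how often the grouping is built

    @cached_property
    def residues(self):
        # Runs once; later accesses return the cached dict.
        self.build_count += 1
        groups = defaultdict(list)
        for atom in self.atoms:
            # Atoms without a resid are skipped in this sketch.
            resid = getattr(atom, "resid", None)
            if resid is not None:
                groups[resid].append(atom)
        return dict(groups)
```

Creating an `AtomGroup` does no grouping work; it only happens when `group.residues` is first read, so code that never asks for residues never pays for them.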

To change this, could we make Atom raise a NoDataError for missing data, so 
that anything that needs a value (e.g. a Writer) can anticipate the NoDataError 
and fudge the value if necessary? It could also raise helpful warnings, e.g. 
"I wrote you a pdb but there were no masses, you can add masses with: atoms.masses =  ".
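
A rough sketch of how this could look. Everything here is an assumption for illustration: the `NoDataError` class is defined locally as a stand-in, `Atom` is a hypothetical minimal class, and `format_masses` is a made-up helper representing one step of a Writer.

```python
import warnings


class NoDataError(ValueError):
    """Stand-in exception: the topology did not provide this attribute."""


class Atom:
    """Hypothetical atom where only `number` is required."""

    OPTIONAL = ("name", "type", "resname", "resid", "segid", "mass", "charge")

    def __init__(self, number, **attrs):
        self.number = number
        self._data = attrs

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails.
        if key in Atom.OPTIONAL:
            try:
                return self._data[key]
            except KeyError:
                raise NoDataError(f"atom {self.number} has no {key!r}") from None
        raise AttributeError(key)


def format_masses(atoms, default=0.0):
    """Hypothetical Writer step: collect masses, fudging any missing ones."""
    masses, missing = [], 0
    for atom in atoms:
        try:
            masses.append(atom.mass)
        except NoDataError:
            masses.append(default)
            missing += 1
    if missing:
        # One summary warning instead of one per atom.
        warnings.warn(f"{missing} atom(s) had no mass; wrote {default} instead")
    return masses
```

Analysis code that needs real masses would let the NoDataError propagate, while a Writer that only needs a placeholder catches it and warns.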

This is a pretty big change, so will likely get split up into many little 
issues.

Thoughts?

Original issue reported on code.google.com by richardjgowers on 12 Feb 2015 at 9:21

Blocked on: #202

GoogleCodeExporter commented 9 years ago

Is it currently a problem that Atom wants to know too much, and is setting up 
the secondary data structures (fragments, residues, bonds) a real 
bottleneck? Do we have timing data on this?

My impression is that the first priority in terms of speed would be to 
accelerate the topology readers themselves (together with the GRO/PDB/CRD 
readers). Only then should we spend effort on a major rewrite of the inner core.

I'm generally on the side of "choose appropriate default values", e.g. assign 
segid "SYSTEM" if nothing else is known, and guess masses and charges when possible 
(of course, more difficult for coarse grained force fields). Therefore, I'd 
insert appropriate default values when building the topology and not when 
writing out. That makes it safer for analysis code, which then doesn't have to 
deal with NoDataErrors. We should, however, log a bunch of warnings during 
topology generation for anything that we guess.
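
The default-value approach could be sketched roughly as follows. This is an illustration under stated assumptions, not MDAnalysis code: `fill_defaults`, the dict-based atom record, the logger name and the tiny `MASS_BY_ELEMENT` table are all hypothetical, and guessing the element from the first letter of the atom name is deliberately crude.

```python
import logging

logger = logging.getLogger("topology_sketch")

# Illustrative values only; a real table would cover far more elements.
MASS_BY_ELEMENT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def fill_defaults(record):
    """Return a copy of a parsed atom record with guessed values filled in."""
    atom = dict(record)
    if atom.get("segid") is None:
        atom["segid"] = "SYSTEM"
        logger.warning("atom %s: no segid, using 'SYSTEM'", atom["number"])
    if atom.get("mass") is None:
        # Crude guess from the first letter of the name ("CA" is treated as
        # carbon, which is wrong for calcium); unknown letters give 0.0.
        element = (atom.get("name") or "")[:1].upper()
        atom["mass"] = MASS_BY_ELEMENT.get(element, 0.0)
        logger.warning(
            "atom %s: no mass, guessed %s", atom["number"], atom["mass"]
        )
    return atom
```

Because the guesses are made once while the topology is built, downstream analysis code always finds a value, and the log records which values were guessed.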

That's my opinion and I am curious to hear what others have to say.

Original comment by orbeckst on 13 Feb 2015 at 4:42

GoogleCodeExporter commented 9 years ago

Bonds & Fragments are already lazily built. I've got Residues & Segments being 
lazily built on my experimental features-residues branch, which I was going to 
move across to develop after 0.9. It shaves a little time off loading a 
Universe, but I agree that the real bottleneck is loading the ASCII files.

Original comment by richardjgowers on 13 Feb 2015 at 9:43

Minimum required information on Atoms #215