cannot pickle universe object

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Create a universe object by passing a psf file and a dcd file
          u = universe(psf, dcd)
2. Try to serialize this object by using python pickle module
          u_blob = pickle.dumps(u)

What is the expected output? What do you see instead?
Expected output should be a byte stream, but I got error message 'TypeError: 
'AtomGroup' object is not callable'

What version of the product are you using? On what operating system?

MDAnalysis 0.8.0
Python 2.7.6 / Python 2.6.2
Arch Linux / CentOS

Please provide any additional information below.

Original issue reported on code.google.com by charlesz...@gmail.com on 27 Mar 2014 at 9:18

GoogleCodeExporter commented 9 years ago

I've certainly brought this kind of thing up in the past, but I think the fact 
that universe objects are closely tied to open files makes it difficult to 
allow proper serialization via pickle. This also makes it challenging to pass 
universe objects between cores in a parallel workflow. I generally end up 
adjusting my workflow to produce something more tractable like a numpy array of 
coordinates and pickle that for storage and / or interprocess communication.

Original comment by tyler.je.reddy@gmail.com on 27 Mar 2014 at 10:14

GoogleCodeExporter commented 9 years ago

The way Universes are built at the moment makes it impossible to pickle them as 
they contain trajectory reader objects which in turn contain open file 
descriptors.

This won't change any time soon unless someone comes up with a smart way to do 
this. Therefore I am closing this with 'WontFix' – but feel free to start a 
discussion on the developer mailing list or in the comments to this issue. If a 
sensible approach and consensus emerges we will reopen.

Oliver

Original comment by orbeckst on 27 Mar 2014 at 10:16

GoogleCodeExporter commented 9 years ago

Original comment by orbeckst on 27 Mar 2014 at 10:16

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

Just to clarify the cryptic error message:

pickle checks whether a class has a __getstate__ function, and then executes it 
if it does. The way this is done is a sort of duck-typing, where 
object.__getstate__ is assigned to a variable, which is subsequently called.
If there is no __getstate__ an AttributeError is raised, in which case pickle's 
default behavior ensues.

The thing with some MDAnalysis onjects is that they manage their attributes and 
never raise an AttributeError. In particular if you try AtomGroup.__getstate__ 
you get a SelectionError which pickle does not handle; if you do the same with 
Segment.__getstate__ you get no error (!!) and an empty AtomGroup is returned. 
It is this last case that causes the 'TypeError: 'AtomGroup' object is not 
callable' when pickle tries to execute what it got for Segment.__getstate__.

This is a dangerous side-effect of the syntactic sugar for selection shortcuts. 
A lot of things can go silently ignored.

On the topic of parallelization, I'll post soon some code I have geared 
specifically for parallelizing trajectory reads. The serialization of 
MDAnalysis objects only becomes a problem if they are to be passed back and 
forth between workers. If the code avoids that (and takes care of renewing file 
descriptors) multiprocessing works fine.

Original comment by manuel.n...@gmail.com on 27 Mar 2014 at 10:29

GoogleCodeExporter commented 9 years ago

Is that possible to extract all the fields of the universe object, and 
serialize them to byte stream. When we wanna regenerate this object, we 
deserialize all the fields and construct the object.

Original comment by charlesz...@gmail.com on 27 Mar 2014 at 11:18

GoogleCodeExporter commented 9 years ago

That is pickle's default approach. MDAnalysis universes are unpicklable in this 
way due to, among other things, open file descriptors and unpicklable function 
objects (SWIG stuff).

This can all be managed instead of leaving pickle to do its default process. 
Look into the __getstate__ and __setstate__ functions. There probably is a way 
of coercing the serialization of enough information for __setstate__ to 
recreate the universe. As far as I went it looked messy and, in my case, not 
worth the trouble.

Original comment by manuel.n...@gmail.com on 27 Mar 2014 at 11:34

GoogleCodeExporter commented 9 years ago

For __getstate__():
* all the arguments of Universe()
* current frame in the trajectory (often that will just be frame 1)
* optional: index of the XTC/TRR reader

For __setstate__():
1) build the Universe again
2) optional: recreate the XTC/TRR reader index
3) go to the saved frame

Original comment by orbeckst on 27 Mar 2014 at 11:42

GoogleCodeExporter commented 9 years ago

I added explicit __setstate__() and __getstate__() methods to Universe and 
AtomGroup which raise a NotImplementedError. This should at least make this 
particular error message clearer (in 0.8.2-dev, commit 
4bb36e64e8d182226c9303777a166bdb56d08b34)

Original comment by orbeckst on 15 Apr 2014 at 8:25

GoogleCodeExporter commented 9 years ago

Maybe it would be worthwhile making pickling work along the lines of recreating 
a copy of the universe using the constructor information and information from 
the TrajectoryReader (which would need its own __getstate__/__setstate__) — 
see comments in this issue for more details.

I reopen as an enhancement and anyone interested can grab the ticket.

Original comment by orbeckst on 15 Apr 2014 at 8:29

Changed state: Accepted
Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

There is a _dcd_c_ptr attribute inside the DCDReader object. It is a C object 
defined in dcd.c, if we wanna serialize the DCDReader object, we have to 
serialize this c pointer as well. But it seems like pickle cannot dumps a 
pyobject.

Original comment by charlesz...@gmail.com on 15 Apr 2014 at 6:54

GoogleCodeExporter commented 9 years ago

I don't think that we should serialize a Reader wholesale but instead provide a 
way to re-instantiate. In this way you only serialize state information such as

- filename
- number of atoms
- current frame
- ... all other parameters set in __init__()

and then essentially create a brand new Reader with this information.

Original comment by orbeckst on 15 Apr 2014 at 7:08

GoogleCodeExporter commented 9 years ago

If we wanna initialize a reader object, we must call the _read_dcd_header() 
inside the initialize function which is implemented by c in dcd.c and it will 
set a new attribute named __dcd_c_ptr which is a PyObject in the reader object. 

This attribute is used by function like MDAnalysis.analysis.align.rmsd(A, B) 
and most other functions. So I think if we wanna serialize DCDReader we will 
have to serialize this attribute or we can just serialize the coordinate and 
topology file for regenerating universe object later.

Original comment by charlesz...@gmail.com on 15 Apr 2014 at 7:54

GoogleCodeExporter commented 9 years ago

Come to think it will be difficult to pickle the exact state of a Universe, 
e.g. if I renamed atoms or changed coordinates. 

What is quite do-able (I think) is to re-create the Universe in the same way as 
it was created in the first place as you just need to know the constructor 
arguments (namely, topology files and coordinate files and any keyword args). 
This might or might not be sufficient for many applications but it certainly is 
not true serialization of the current state of an object.

Original comment by orbeckst on 16 Apr 2014 at 8:53

djay0529 / mdanalysis

cannot pickle universe object #173