default vstack in _conclude() fails with scalar per frame data

orbeckst commented 5 years ago

Expected behaviour

I can just follow the typical example to make analysis from a single frame function such as

    import MDAnalysis as mda
    import MDAnalysisData
    data = MDAnalysisData.datasets.fetch_nhaa_equilibrium()

    u = mda.Universe(data.topology, data.trajectory)
    protein = u.select_atoms('protein')

    def rgyr(ag):
        return  ag.radius_of_gyration()

    import pmda.custom
    parallel_rgyr = pmda.custom.AnalysisFromFunction(rgyr, u, protein)

    parallel_rgyr.run(n_jobs=4, n_blocks=4)
    print(parallel_rgyr.results)

Actual behaviour

The above fails with

--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-30-27c1019c5274> in <module>
----> 1 parallel_rgyr.run(n_blocks=4)

~/MDAnalysis/pmda/pmda/parallel.py in run(self, start, stop, step, n_jobs, n_blocks)
    374             with timeit() as conclude:
    375                 self._results = np.asarray([el[0] for el in res])
--> 376                 self._conclude()
    377 
    378         self.timing = Timing(

~/MDAnalysis/pmda/pmda/custom.py in _conclude(self)
    101 
    102     def _conclude(self):
--> 103         self.results = np.vstack(self._results)
    104 
    105 

~/anaconda3/envs/pmda/lib/python3.6/site-packages/numpy/core/shape_base.py in vstack(tup)
    281     """
    282     _warn_for_nonsequence(tup)
--> 283     return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    284 
    285 

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Code to reproduce the behaviour

See above.

The problem is that the arrays in _results do not have the same lengths

>>> [len(a) for a in parallel_rgyr._results]
[1251, 1250, 1250, 1250]

and np.vstack() does not like this. Instead one would need np.hstack(). However, as soon as the return value of the function is a tuple, vstack() works as expected.

Currently version of MDAnalysis:

PMDA 0.2.1
MDAnalysis 0.19.2
dask 1.2.2

orbeckst commented 5 years ago

@VOD555 @kain88-de @richardjgowers suggestions welcome; see PR #88...

VOD555 commented 5 years ago

@orbeckst vstack cannot deal with array whose size is [1251, 1250, 1250, 1250]. It works for array with shape [(1251, 1), (1250, 1), (1250, 1), (1250, 1)], or [(1251, n), (1250, n), (1250, n), (1250, n)]. For example change the return value to

[ag.universe.trajectory.time, ag.radius_of_gyration()]

Although hstack works for this case, it doesn't work for array with size [(1251, n), (1250, n), (1250, n), (1250, n)].

I suggest we still use vstack, and add some preprocess to solve this problem. For example, reshape _result before _conclude()

if len(np.shape(self._results[0])) == 1:
    n = len(self.results[0])
    results = [a.reshape((n, 1)) for a in self.results]

orbeckst commented 5 years ago

If we do the reshape(ni, 1) (where ni is the number of frames in each block and t is the total number of frames, i.e., the sum of all the ni) then the output array will be shape (t, 1), which is really weird when you thought you'd just get a series. In this case we should then reshape back to (t, ).

Perhaps there's a less convoluted way to achieve the following for n frames:

If the return value of _single_frame() is a single scalar, the output result should be shape (t,)
If the return value is a tuple with m entries then the output result should be shape (t, m).

I am not sure what should happen if _single_frame() returns, e.g., a numpy array of shape (3, 4). Should the result be an array of objects with shape (n,) or a numpy array of shape (t, 3, 4)?

EDIT: made the notation clearer by introducing ni and t (as opposed to using n for everything)

VOD555 commented 5 years ago

I think the result should be a numpy array of shape (t, 3, 4). And this is what vstack does now.

EDIT: n -> t so that logic between comments makes sense

orbeckst commented 5 years ago

Ok, doing "what vstack() does" is a fairly consistent approach. I'd still want a scalar return to generate a simple 1D series, though.

orbeckst commented 5 years ago

Maybe np.concatenate() does what we want by default?

orbeckst commented 5 years ago

In [1]: import numpy as np

In [2]: a1 = np.arange(10)

In [3]: a2 = np.arange(10)

In [4]: a3 = np.arange(11)

In [5]: np.concatenate([a1, a2, a3])
Out[5]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9,  0,  1,  2,  3,  4,  5,  6,
        7,  8,  9,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [6]: b1 = np.arange(10).reshape(5,2)

In [7]: b2 = np.arange(10).reshape(5,2)

In [8]: b3 = np.arange(12).reshape(6,2)

In [9]: np.concatenate([b1, b2, b3])
Out[9]:
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

orbeckst commented 5 years ago

Also works for vectors:

In [10]: b1 = np.arange(30).reshape(5,2,3)

In [11]: b3 = np.arange(36).reshape(6,2,3)

In [12]: np.concatenate([b1, b1, b3])
Out[12]:
array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23]],

       [[24, 25, 26],
        [27, 28, 29]],

       [[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23]],

       [[24, 25, 26],
        [27, 28, 29]],

       [[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23]],

       [[24, 25, 26],
        [27, 28, 29]],

       [[30, 31, 32],
        [33, 34, 35]]])

In [13]: np.concatenate([b1, b1, b3]).shape
Out[13]: (16, 2, 3)

MDAnalysis / pmda