Applying length 0 trajectory slices in analysis modules

PicoCentauri commented 3 years ago

Expected behavior

Running an analysis module using a start/stop/step combination leading to length 0 trajectory slices should/could raise a meaningful error.

Actual behavior

If an analysis is run over zero frames, inconsistent errors are raised (See below). I know that looping over length zero lists is totally fine in Python. However, running a trajectory analysis over 0 frames is a bit bizarre.

Code to reproduce the behavior

RMSF

from MDAnalysisTests.datafiles import TPR, XTC
from MDAnalysis.analysis.rms import RMSF

u = mda.Universe(TPR, XTC)
rmsf = RMSF(u.atoms)
rmsf.run(10,0,1)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-60-4206b9b8eb79> in <module>
      1 u = mda.Universe(TPR, XTC)
      2 rmsf = RMSF(u.atoms)
----> 3 rmsf.run(10,0,1)

/usr/local/lib/python3.9/site-packages/MDAnalysis/analysis/base.py in run(self, start, stop, step, verbose)
    303             self._single_frame()
    304         logger.info("Finishing up")
--> 305         self._conclude()
    306         return self
    307 

/usr/local/lib/python3.9/site-packages/MDAnalysis/analysis/rms.py in _conclude(self)
    863 
    864     def _conclude(self):
--> 865         k = self._frame_index
    866         self.results.rmsf = np.sqrt(self.sumsquares.sum(axis=1) / (k + 1))
    867 

AttributeError: 'RMSF' object has no attribute '_frame_index'

RDF

from MDAnalysisTests.datafiles import TPR, XTC
from MDAnalysis.analysis.rdf import InterRDF

u = mda.Universe(TPR, XTC)
rdf = InterRDF(u.atoms, u.atoms)
rdf.run(10,0,1)

---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-67-66d4872b0bae> in <module>
      4 u = mda.Universe(TPR, XTC)
      5 rdf = InterRDF(u.atoms, u.atoms)
----> 6 rdf.run(10,0,1)

/usr/local/lib/python3.9/site-packages/MDAnalysis/analysis/base.py in run(self, start, stop, step, verbose)
    303             self._single_frame()
    304         logger.info("Finishing up")
--> 305         self._conclude()
    306         return self
    307 

/usr/local/lib/python3.9/site-packages/MDAnalysis/analysis/rdf.py in _conclude(self)
    387 
    388         # Average number density
--> 389         box_vol = self.volume / self.n_frames
    390         density = N / box_vol
    391 

ZeroDivisionError: float division by zero

Current version of MDAnalysis

Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)") 2.0.0b0
Which version of Python (python -V)? 3.9
Which operating system? MacOS

orbeckst commented 3 years ago

For any kind of parallel analysis it would be best if zero-length analysis produced zero-length output arrays or whatever is the identity element when aggregating results (e.g. 0 or a histogram of zeros when the results are added/averaged, empty list when results are concatenated etc).

@mnmelo do you have an opinion?

orbeckst commented 3 years ago

Should have added context: With parallel analysis it is possible that slices with 0 frames are generated (although not smart and unlikely) and then it's much better if these do not error but just return an identity element.

IAlibay commented 3 years ago

For any kind of parallel analysis it would be best if zero-length analysis produced zero-length output arrays or whatever is the identity element when aggregating results (e.g. 0 or a histogram of zeros when the results are added/averaged, empty list when results are concatenated etc).

I agree, 0 length should be an accepted input for run.

In some ways one wonders if it should be better handled at the individual analysis method level (i.e. _conclude methods should really account for the possibility of _single_frame not having been applied to any frame).

That being said, for the sake of simplicity we could just guard self._conclude with an if self.n_frames > 0?

edit: I guess I didn't think through @orbeckst's comment about parallel analysis. Skipping conclude fixes the problem but we don't populate the results with "zero frame" values. I guess if we want to go down that route we have to fix each individual analysis method to return the correct identity results.

mnmelo commented 3 years ago

There's two aspects here:

What to do about _conclude;
How to combine 0-frame analysis with others.

Number 2. is easiest, in that the combining function can be made to skip those 0-frame analyses; no need for them to implement identity results. For existing combination code that isn't aware of the possibility, read on.

The solution to number 1. depends on what comes downstream to a 0-frame analysis. Skipping _conclude will possibly get us AttributeErrors when the subsequent code tries to access Analysis.results.some_stuff_that_was_not_calculated. The analysis creator can be considerate in these cases and implement a _conclude that cleans up, placing zeros, empty lists, and np.nans in the appropriate results slots. But it would be rather harsh that the API demands all analyses to be well-behaved in this sense.

Booby-trapped `.results`

Perhaps a catch-all solution is to implement a custom exception associated to the results: the Analysis.run method booby-traps access to .results if there's no frames iterated over. The API then states that 0-frames are possible, but accessing results will always raise MDAnalysis.analysis.NoFramesError or suchlike. This way the outcome is consistent without the need to rewrite any Analysis. It won't solve the downstream problems, but at least it will error out consistently and obviously.

For the cases where an empty analysis actually makes sense (which?) or when _prepare/_conclude produce usable/combinable .results even with 0 frames, we could provide class attrs for the analysis creator to set and circumvent this mechanism.

Typed `results.attrs`?

Finally, I'm not following the Python typed variables developments very closely, but would it be possible to have the API demand a fully typed results namespace at _prepare time, from which 0-frame defaults could be set automatically? This would really solve everything.

lilyminium commented 3 years ago

typed results

Would something like https://pydantic-docs.helpmanual.io/usage/models/ be helpful?

IAlibay commented 3 years ago

I'm not really fond of enforcing typed results. I feel like AnalysisBase hits a sweet spot in terms of complexity at the moment, and I'm not sure we want to ask too much from downstream library users.

A simpler solution?

Rather than overcomplicating things, maybe the best answer here is to skip _prepare, _single_frame, and _conclude if n_frames == 0? In that case, results should remain empty. In most cases results attributes are not set until _prepare and in the cases where they are (GNM, Lineardensity, PCA, MSD, hbond, waterbridge), there's no reason I can think of for setting them in __init__ over _prepare (unless I'm missing a specific significance of setting attributes to None at __init__).

The only extra thing that would need to be done here is that for zero-length analyses we would need to flush results (and any other user-facing attributes set after class construction). I.e. "zero length analysis" == "not running an analysis" == "user facing class attributes as if newly constructed".

How this would work for `_reduce`

In this case, all one would have to do when doing a _reduce call is check that results is not empty (or a try around an AttributeError)?

mnmelo commented 3 years ago

I like that simplicity, @IAlibay, and besides, typed results is as much work as setting defaults, and equally overly burdensome.

As for flushing .results, I do think that a booby-trapped access is cleaner, API-wise. But it's nitpicking: the downstream user either gets an AttributeError or whatever exception we set to booby-trap (which could inherit from AttributeError).

lilyminium commented 3 years ago

typed results is as much work as setting defaults, and equally overly burdensome.

I'm devil's advocating myself because I'm not such a fan of pydantic, but IMO that burden could be worth it for being explicitly clear on what each analysis provides. If one can e.g.

class InterRDF:

    class Results(pydantic.BaseModel):
        # field is Field(default_value, description="helpful description")
        count: np.ndarray[float] = Field(np.zeros(0)) # or however pydantic does numpy
        edges: List[float] = Field([], description="histogram bin edges")
        bins: List[float] = Field([])
        volume: float = Field(0)
        rdf: List[float] = Field(0)

users know upfront what kind of attributes they can expect in the results and mdacli can probably leverage that somehow, like in help strings.

Edit:

you could initialize it in AnalysisBase.__init__ with self.results = Results()

Edit 2: which is, in fact, already in there...

IAlibay commented 3 years ago

If every analysis class has to implement its own Results class, then we might as well just ensure that all classes implement an identity element for each results variable?

I'll be honestly I'm kinda biased, my main agenda here is to a) avoid increasing our dependencies, b) reduce the amount of changes we need to do for a 2.0.0 release, c) avoid pydantic [I'm not super fond of it or the idea of making downstream developers learn yet another thing].

In an off-github chat @lilyminium off-hand mentioned dataclasses and I'm wondering if that might not be the answer (assuming we want to implement things at the analysis class level).

lilyminium commented 3 years ago

Yeah the pros/cons for pydantic/dataclasses > Results dict as I see it would be:

Pro:

typed attributes
I personally find that more readable
mdacli can inspect the expected attributes of the class instead of waiting for an object to get instantiated

Cons:

external dependency for pydantic
pydantic can be really irritating to use beyond simple textbook applications, although I think this is a simple textbook application
Probably less intuitive for new users to write analyses for?
Not as simple

richardjgowers commented 3 years ago

I’m not too keen on pydantic tbh.

On Tue, Jun 1, 2021 at 18:03, Lily Wang @.***> wrote:

Yeah the pros/cons for pydantic/dataclasses > Results dict as I see it would be:

Pro:

typed attributes

I personally find that more readable

mdacli can inspect the expected attributes of the class instead of waiting for an object to get instantiated

Cons:

external dependency for pydantic

pydantic can be really irritating to use beyond simple textbook applications, although I think this is a simple textbook application

Probably less intuitive for new users to write analyses for?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/3342#issuecomment-852290856, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGBZ3NI6XQS3MWSBEVVTTQUHHPANCNFSM45VHFUPQ .

mnmelo commented 3 years ago

I imagine we don't need to use pydantic to still be able to use typed results classes. I agree with @lilyminium that it is quite readable, more than just setting defaults. Certainly heavier on the analysis creator.

I'm also not sure how pydantic/typed results would handle dynamic types: in your example, @lilyminium, the size of histogram is set at _prepare time, from nbins(set at analysis instantiation time). How would these types indicate that the default should be an array of zeros of size nbins?

richardjgowers commented 3 years ago

Back to the original issue, I think it's probably best if both _prepare and _conclude are called in the 0 length case, this is probably the best bet at getting all expected arrays to the right length. It'll just have to be that _conclude might have to anticipate 0 length intermediate results.

jbarnoud commented 3 years ago

I agree with @richardjgowers here. The simplest solution is to have _conclude handle the 0-length case.

jbarnoud commented 3 years ago

Also, I do like typed things (I am learning rust at the moment, types are great!) and I am in favour of type hinting to anticipate bugs. But I would rather avoid enforcing types. If I wanted a statically typed language I would not be doing python.

mnmelo commented 3 years ago

So, besides adapting the existing analyses, we should word the AnalysisBase API to strongly encourage the catching of 0-frame cases in _conclude (despite it being an optional subclass method).

Interestingly, it seems that Analyses that do not need a _conclude are already somewhat robust to 0-frame cases: begin handwaving AnalysisBase._single_frame just updates state per frame, and should have a valid state to begin with. If no _conclude is needed, it follows that any frame's state can be a valid result, including the starting state. Of course things aren't as simple because an analysis can be aware whether it is in the first frame or not and adapt accordingly, but still... end handwaving

MDAnalysis / mdanalysis