PicoCentauri opened this issue 3 years ago
For any kind of parallel analysis, it would be best if a zero-length analysis produced zero-length output arrays, or whatever the identity element is when aggregating results (e.g. 0, or a histogram of zeros, when results are added/averaged; an empty list when results are concatenated; etc.).
@mnmelo do you have an opinion?
Should have added context: with parallel analysis it is possible that slices with 0 frames are generated (although not smart, and unlikely), and then it's much better if these do not error but just return an identity element.
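A minimal sketch of why identity elements make zero-length blocks harmless when aggregating (the data here is purely illustrative):

```python
import numpy as np

# Sketch: with identity elements, zero-length blocks drop out of the
# aggregation transparently (illustrative data, not a real analysis).
block_histograms = [np.zeros(4),                  # zero-frame block
                    np.array([1.0, 0.0, 2.0, 1.0]),
                    np.array([0.0, 3.0, 1.0, 0.0])]
total = np.sum(block_histograms, axis=0)          # zeros are the identity for sums

block_lists = [[], [1.2, 3.4], [5.6]]             # [] is the identity for concatenation
flat = [x for block in block_lists for x in block]
```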
> For any kind of parallel analysis, it would be best if a zero-length analysis produced zero-length output arrays, or whatever the identity element is when aggregating results (e.g. 0, or a histogram of zeros, when results are added/averaged; an empty list when results are concatenated; etc.).
I agree, 0 length should be an accepted input for `run`.
In some ways one wonders if it should be better handled at the individual analysis method level (i.e. `_conclude` methods should really account for the possibility of `_single_frame` not having been applied to any frame). That being said, for the sake of simplicity we could just guard `self._conclude` with an `if self.n_frames > 0`?
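A minimal sketch of that guard; the surrounding `run()` structure is an approximation of `AnalysisBase`, not a verbatim copy:

```python
# Sketch: guarding _conclude in AnalysisBase.run for zero-length slices.
def run(self, start=None, stop=None, step=None, verbose=None):
    self._setup_frames(self._trajectory, start, stop, step)
    self._prepare()
    for i, ts in enumerate(
            self._trajectory[self.start:self.stop:self.step]):
        self._frame_index = i
        self._ts = ts
        self._single_frame()
    if self.n_frames > 0:   # skip _conclude when no frames were iterated
        self._conclude()
    return self
```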
Edit: I guess I didn't think through @orbeckst's comment about parallel analysis. Skipping `_conclude` fixes the problem but we don't populate the results with "zero frame" values. I guess if we want to go down that route we have to fix each individual analysis method to return the correct identity results.
There are two aspects here:

1. what a 0-frame analysis itself should do about `_conclude`;
2. how 0-frame results are combined in parallel analysis.

Number 2. is easiest, in that the combining function can be made to skip those 0-frame analyses; no need for them to implement identity results. For existing combination code that isn't aware of the possibility, read on.
The solution to number 1. depends on what comes downstream of a 0-frame analysis. Skipping `_conclude` will possibly get us `AttributeError`s when the subsequent code tries to access `Analysis.results.some_stuff_that_was_not_calculated`. The analysis creator can be considerate in these cases and implement a `_conclude` that cleans up, placing zeros, empty lists, and `np.nan`s in the appropriate `results` slots. But it would be rather harsh for the API to demand that all analyses be well-behaved in this sense.
**`.results`**
Perhaps a catch-all solution is to implement a custom exception associated with the results: the `Analysis.run` method booby-traps access to `.results` if there are no frames iterated over. The API then states that 0 frames are possible, but accessing results will always raise `MDAnalysis.analysis.NoFramesError` or suchlike. This way the outcome is consistent without the need to rewrite any Analysis. It won't solve the downstream problems, but at least it will error out consistently and obviously.
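A minimal sketch of what such a booby-trap could look like; `NoFramesError` and the property-based guard are assumptions, not existing MDAnalysis API:

```python
# Sketch: booby-trapping .results for zero-frame runs (hypothetical API).
class NoFramesError(AttributeError):
    """Raised when accessing results of an analysis run over 0 frames."""


class AnalysisBase:
    @property
    def results(self):
        # guard access until at least one frame has been iterated over
        if getattr(self, "n_frames", 0) == 0:
            raise NoFramesError(
                "run() iterated over 0 frames; no results were calculated")
        return self._results
```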
For the cases where an empty analysis actually makes sense (which?), or when `_prepare`/`_conclude` produce usable/combinable `.results` even with 0 frames, we could provide class attrs for the analysis creator to set and circumvent this mechanism.
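A tiny sketch of that opt-out; the attribute name `_allow_zero_frames` is made up for illustration:

```python
from MDAnalysis.analysis.base import AnalysisBase


# Sketch: a class attribute the analysis creator sets to declare that
# 0-frame .results are valid/combinable (_allow_zero_frames is made up).
class SomeAnalysis(AnalysisBase):
    _allow_zero_frames = True
```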
**`results.attrs`?**

Finally, I'm not following the Python typed-variables developments very closely, but would it be possible to have the API demand a fully typed `results` namespace at `_prepare` time, from which 0-frame defaults could be set automatically? This would really solve everything.
> typed `results`

Would something like https://pydantic-docs.helpmanual.io/usage/models/ be helpful?
I'm not really fond of enforcing typed `results`. I feel like AnalysisBase hits a sweet spot in terms of complexity at the moment, and I'm not sure we want to ask too much from downstream library users.
Rather than overcomplicating things, maybe the best answer here is to skip `_prepare`, `_single_frame`, and `_conclude` if `n_frames == 0`? In that case, `results` should remain empty. In most cases results attributes are not set until `_prepare`, and in the cases where they are (GNM, LinearDensity, PCA, MSD, hbond, waterbridge), there's no reason I can think of for setting them in `__init__` over `_prepare` (unless I'm missing a specific significance of setting attributes to `None` at `__init__`).
The only extra thing that would need to be done here is that for zero-length analyses we would need to flush `results` (and any other user-facing attributes set after class construction). I.e. "zero-length analysis" == "not running an analysis" == "user-facing class attributes as if newly constructed".
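A minimal sketch of such a flush, assuming the `Results` container from `MDAnalysis.analysis.base`; the method name and placement are assumptions:

```python
from MDAnalysis.analysis.base import Results


# Sketch: reset user-facing state when run() saw zero frames, so the
# instance looks freshly constructed.
def _flush_results(self):
    if self.n_frames == 0:
        self.results = Results()
```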
**`_reduce`**

In this case, all one would have to do when doing a `_reduce` call is check that `results` is not empty (or put a `try` around an `AttributeError`)?
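A sketch of that check on the combining side; the block iteration and attribute names are illustrative, not an existing `_reduce` implementation:

```python
# Sketch: skip zero-frame blocks whose results were never populated.
def reduce_blocks(blocks):
    totals = []
    for block in blocks:
        try:
            totals.append(block.results.count)
        except AttributeError:  # zero-frame block: nothing was calculated
            continue
    return sum(totals)
```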
I like that simplicity, @IAlibay, and besides, typed results is as much work as setting defaults, and equally overly burdensome.
As for flushing `.results`, I do think that a booby-trapped access is cleaner, API-wise. But it's nitpicking: the downstream user either gets an `AttributeError` or whatever exception we set to booby-trap (which could inherit from `AttributeError`).
> typed results is as much work as setting defaults, and equally overly burdensome.
I'm playing devil's advocate against myself, because I'm not such a fan of pydantic, but IMO that burden could be worth it for being explicitly clear about what each analysis provides. If one can write, e.g.
```python
from typing import List

import numpy as np
from pydantic import BaseModel, Field

class InterRDF:
    class Results(BaseModel):
        class Config:
            arbitrary_types_allowed = True  # let fields be np.ndarray

        # field is Field(default_value, description="helpful description")
        count: np.ndarray = Field(default_factory=lambda: np.zeros(0))  # or however pydantic does numpy
        edges: List[float] = Field([], description="histogram bin edges")
        bins: List[float] = Field([])
        volume: float = Field(0.0)
        rdf: List[float] = Field([])
```
users would know upfront what kind of attributes they can expect in the results, and mdacli can probably leverage that somehow, e.g. in help strings; a sketch of such introspection follows below.
Edit: you could initialize it in `AnalysisBase.__init__` with `self.results = Results()`.
Edit 2: which is, in fact, already in there...
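For instance, a sketch of how help-string introspection could work with the pydantic v1 API, building on the hypothetical `InterRDF.Results` above:

```python
# Sketch: pull field names, types, and descriptions out of the typed
# Results model (pydantic v1 introspection API).
for name, field in InterRDF.Results.__fields__.items():
    print(f"{name}: {field.outer_type_} -- {field.field_info.description}")
```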
If every analysis class has to implement its own `Results` class, then we might as well just ensure that all classes implement an identity element for each results variable?
I'll be honest, I'm kinda biased; my main agenda here is to a) avoid increasing our dependencies, b) reduce the amount of changes we need to make for a 2.0.0 release, and c) avoid pydantic [I'm not super fond of it, or of the idea of making downstream developers learn yet another thing].
In an off-GitHub chat @lilyminium off-handedly mentioned dataclasses, and I'm wondering if that might not be the answer (assuming we want to implement things at the analysis-class level).
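A sketch of the same `Results` container using only stdlib dataclasses (field names follow @lilyminium's example above; no new dependency):

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


# Sketch: typed, defaulted results without pydantic.
@dataclass
class Results:
    count: np.ndarray = field(default_factory=lambda: np.zeros(0))
    edges: List[float] = field(default_factory=list)
    bins: List[float] = field(default_factory=list)
    volume: float = 0.0
    rdf: List[float] = field(default_factory=list)
```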
Yeah, the pros/cons for pydantic/dataclasses over a `Results` dict as I see them would be:

Pro:

- typed attributes
- I personally find that more readable
- mdacli can inspect the expected attributes of the class instead of waiting for an object to get instantiated

Cons:

- external dependency for pydantic
- pydantic can be really irritating to use beyond simple textbook applications, although I think this is a simple textbook application
- probably less intuitive for new users to write analyses for?
I’m not too keen on pydantic tbh.
I imagine we don't need to use pydantic to still be able to use typed results classes. I agree with @lilyminium that it is quite readable, more so than just setting defaults. Certainly heavier on the analysis creator, though.
I'm also not sure how pydantic/typed results would handle dynamic types: in your example, @lilyminium, the size of the histogram is set at `_prepare` time, from `nbins` (set at analysis instantiation time). How would these types indicate that the default should be an array of zeros of size `nbins`?
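For what it's worth, a sketch of the only place such a sized default could currently be created; the attribute names are illustrative:

```python
import numpy as np


# Sketch: sized defaults can only exist once parameters are known,
# i.e. in _prepare; self.nbins is set at instantiation time.
def _prepare(self):
    self.results.count = np.zeros(self.nbins, dtype=float)
```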
Back to the original issue: I think it's probably best if both `_prepare` and `_conclude` are called in the 0-length case; this is probably the best bet at getting all expected arrays to the right length. It'll just have to be that `_conclude` might have to anticipate 0-length intermediate results.
I agree with @richardjgowers here. The simplest solution is to have `_conclude` handle the 0-length case.
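A minimal sketch of a `_conclude` written to tolerate zero frames; the running-total/average pattern is illustrative, not taken from any specific analysis:

```python
import numpy as np


# Sketch: emit an identity result instead of dividing by n_frames == 0.
def _conclude(self):
    if self.n_frames == 0:
        self.results.average = np.zeros_like(self._running_total)
        return
    self.results.average = self._running_total / self.n_frames
```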
Also, I do like typed things (I am learning Rust at the moment; types are great!) and I am in favour of type hinting to anticipate bugs. But I would rather avoid enforcing types: if I wanted a statically typed language I would not be doing Python.
So, besides adapting the existing analyses, we should word the AnalysisBase API to strongly encourage catching 0-frame cases in `_conclude` (despite it being an optional subclass method).
Interestingly, it seems that Analyses that do not need a `_conclude` are already somewhat robust to 0-frame cases:

*begin handwaving*

`AnalysisBase._single_frame` just updates state per frame, and should have a valid state to begin with. If no `_conclude` is needed, it follows that any frame's state can be a valid result, including the starting state. Of course, things aren't as simple because an analysis can be aware of whether it is in the first frame or not and adapt accordingly, but still...

*end handwaving*
**Expected behavior**

Running an analysis module using a `start`/`stop`/`step` combination leading to length-0 trajectory slices should/could raise a meaningful error.

**Actual behavior**
If an analysis is run over zero frames, inconsistent errors are raised (see below). I know that looping over length-zero lists is totally fine in Python. However, running a trajectory analysis over 0 frames is a bit bizarre.
**Code to reproduce the behavior**

RMSF / RDF:
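A minimal sketch of the kind of zero-frame invocations in question; the test data files and atom selections here are assumptions, not the original snippets:

```python
import MDAnalysis as mda
from MDAnalysis.analysis.rdf import InterRDF
from MDAnalysis.analysis.rms import RMSF
from MDAnalysisTests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)

# stop=0 yields a zero-length trajectory slice
RMSF(u.select_atoms("protein")).run(stop=0)

ca = u.select_atoms("name CA")
InterRDF(ca, ca).run(stop=0)
```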
**Current version of MDAnalysis**

- Which version are you using? (run `python -c "import MDAnalysis as mda; print(mda.__version__)"`): 2.0.0b0
- Which version of Python (`python -V`)? 3.9