MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.31k stars 648 forks source link

Obscure Eigenvalue error when calling shape_parameter for empty ag #3837

Closed jaclark5 closed 2 years ago

jaclark5 commented 2 years ago

Expected behavior

When calculating the shape_parameter of an AtomGroup, provide error if the group is empty. This is best placed in center_of_mass and center_of_charge since all other functions reference these two methods first.

Actual behavior

Obscure error that can eventually be determined to result from an empty AtomGroup: numpy.linalg.LinAlgError: Eigenvalues did not converge

Code to reproduce the behavior

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PDB
universe = mda.Universe(PDB)
ag = universe.select_atoms("type 1")
print(len(ag), ag.shape_parameter())

Current version of MDAnalysis

jaclark5 commented 2 years ago

An example of how to resolve this issue is shown here: https://github.com/jaclark5/mdanalysis/tree/empty_atomgroup

orbeckst commented 2 years ago

Would mind making a PR? Thanks.

jaclark5 commented 2 years ago

I added a draft of a pull request, is that appropriate? The way I read the contributors page suggested that I need to have an issue with a conversation before making Pull Requests.

orbeckst commented 2 years ago

Thanks for the draft PR, that's great!

The way I read the contributors page suggested that I need to have an issue with a conversation before making Pull Requests.

I don't know if you're referring to https://userguide.mdanalysis.org/stable/contributing_code.html#contributing-to-the-main-codebase :

If your contribution is major, such as a bug fix or a new feature, start by opening an issue first. That way, other people can weigh in on the discussion before you do any work. If you also create a pull request, you should link it to the issue by including the issue number in the pull request’s description.

This is supposed to mean that we really appreciate any and all contributions but that we sometimes have a good idea if something fits into the core library or not or if a particular issue has a non-obvious solution, so we don't want you to waste your time doing a lot of work in advance. But at the end of the day, a PR is a clear way to show how you think a problem can be solved so it's always welcome.

orbeckst commented 2 years ago

I looked at the draft PR #3839 . At the moment, center_of_mass() and center_of_charge() return empty arrays of shape (0, 3) when applied to an empty AtomGroup. My initial reaction was that this seems to be reasonable behavior, even though it's not documented.

However, these empty arrays are somewhat peculiar objects. Let's make one:

import MDAnalysis as mda; from MDAnalysisTests import datafiles as data
u = mda.Universe(data.TPR, data.XTC)
empty = u.atoms[:0]   # empty AtomGroup

x = u.atoms[:5].positions
e =  empty.center_of_mass()

where x and e are

In [28]: x
Out[28]:
array([[52.02    , 43.560005, 31.550003],
       [51.190002, 44.11    , 31.720001],
       [51.550003, 42.83    , 31.04    ],
       [52.47    , 43.170006, 32.370003],
       [53.07    , 44.21    , 30.75    ]], dtype=float32)

In [29]: e
Out[29]: array([], shape=(0, 3), dtype=float32)

What's weird is how e behaves: It does not broadcast:

In [30]: x + e
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [30], line 1
----> 1 x + e

ValueError: operands could not be broadcast together with shapes (5,3) (0,3)

whereas x + x[0] works (x[0].shape == (1, 3)).

But using this shaped, literally empty array in any kind of operations creates another empty one

In [32]: x[0] + e
Out[32]: array([], shape=(0, 3), dtype=float32)

In [33]: e * x[0]
Out[33]: array([], shape=(0, 3), dtype=float32)

In [34]: 42 * e
Out[34]: array([], shape=(0, 3), dtype=float32)

It's kind of NaN value.

It's a fair question if we want the existing behavior or if we want to replace it, e.g., with raising an exception. More broadly, what should be the behavior of a method of an AtomGroup if the method cannot provide a sensible answer?

Opinions welcome, @MDAnalysis/coredevs .

hmacdope commented 2 years ago

I would argue an exception is the right behaviour.

orbeckst commented 2 years ago

I am also tending towards exception.

Will we consider switching to an exception a fix (then it can go into release 2.4) or is it a behavior change (then it might have to wait for 3.0). Any opinions, especially @IAlibay ?

IAlibay commented 2 years ago

Sorry there's a lot of threads here and it's hard to see what the impact is. May I ask for a bit of a ELI5 on which bits of the codes are we looking to specifically change?

jaclark5 commented 2 years ago

Wait which function are we talking about? The original issue is that the anisotropy methods fail (and soon multipole moments ;) )with an obscure error when the group is empty. Couldn't I add an exception there? Originally I suggested a change to centerof* because they were referenced in all those other features.

orbeckst commented 2 years ago

I had to look up ELI5... The question that @hmacdope and I have been moving towards extends beyond this particular issue: What should our AtomGroups do when a calculation is requested of an empty AtomGroup?

This question is a bit of an edge case because normally AtomGroups only have methods attached that are suitable for the attributes held by the group and so you typically don't run into a situation where a calculation will become non-sensical (incorrectly guessed masses being a glaring exception for things like center_of_mass() etc). This issue showed that it can happen that as part of normal calculations, empty AtomGroups are generated and they return weird numpy stuff (https://github.com/MDAnalysis/mdanalysis/issues/3837#issuecomment-1253033215), namely shaped empty-set arrays that function as NaNs in the sense that they turn all calculations that they participate in also into these weird arrays (maybe @tylerjereddy can say more about these arrays??). We have been moving towards the following view :

  1. Calculations on empty AtomGroups should raise ValueError (or maybe exceptions.NoDataError https://github.com/MDAnalysis/mdanalysis/blob/fa8b03b210827e288bcdf5c77898f6a4faca1385/package/MDAnalysis/exceptions.py#L35 which is derived from ValueError (and AttributeError)?).
  2. We consider the current behavior a bug/currently undefined behavior (and so we can fix it before 3.0).

One can make the case that the mass or charge of an empty group should be 0. But I have no idea what the center of mass or centroid of an empty group should be — certainly, [0, 0, 0] is not the correct answer because for certain molecules it could just be that. In order to be consistent (and following the Python Zen "fail early and often"), I'd suggest that any calculation on an empty AtomGroup raises an exception.

lilyminium commented 2 years ago

I voted for an exception or NaN in the Discord poll, because that seems more consistent with numpy:

>>> import numpy as np
>>> np.mean([])
/Users/lily/anaconda3/lib/python3.9/site-packages/numpy/core/fromnumeric.py:3432: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/Users/lily/anaconda3/lib/python3.9/site-packages/numpy/core/_methods.py:190: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
nan
orbeckst commented 2 years ago

We had a discussion and a straw-poll in the public #developer channel on Discord. Here's the outcome of the discussion:

  1. Nobody likes the current behavior so we will change it.
  2. The majority seems to favor a hard fail with an exception. The more numpy-style approach to return NaN (and let the user figure out eventually that something went wrong somewhere) also had some proponents.

My fear is that with NaN we get more obscure error reports that we have to help track down to an empty AtomGroup somewhere, thus creating more support burden. Therefore, I am very much inclined to go with raising an exception. (I'd be following Errors should never pass silently. (from the Zen of Python) and "fail early" (a programming principle dating back to at least 2004).)

Based on the characterization of ValueError

Raised when an operation or function receives an argument that has the right type but an inappropriate value [...]

I'd say we raise ValueError.

IAlibay commented 2 years ago

I was actually thinking NoDataError would make more sense? If it's a method then you're error-ing on a lack of data not the fact that it's the wrong value (because technically there are no values?)

Re: API break, I can live with this being a 2.4.0 change, but I would ask that we be very loud about it.

richardjgowers commented 2 years ago

I always thought NoDataError was more like missing topology attribute, rather than zero sized input.

On Fri, Sep 23, 2022 at 19:53, Irfan Alibay @.***> wrote:

I was actually thinking NoDataError would make more sense? If it's a method then you're error-ing on a lack of data not the fact that it's the wrong value (because technically there are no values?)

Re: API break, I can live with this being a 2.4.0 change, but I would ask that we be very loud about it.

— Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/3837#issuecomment-1256557676, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB2TRR26LECUIG4NZN3V7X4AZANCNFSM6AAAAAAQQSXAOE . You are receiving this because you are on a team that was mentioned.Message ID: @.***>

IAlibay commented 2 years ago

I always thought NoDataError was more like missing topology attribute, rather than zero sized input.

On Fri, Sep 23, 2022 at 19:53, Irfan Alibay @.***> wrote:

I was actually thinking NoDataError would make more sense? If it's a method then you're error-ing on a lack of data not the fact that it's the wrong value (because technically there are no values?)

Re: API break, I can live with this being a 2.4.0 change, but I would ask that we be very loud about it.

— Reply to this email directly, view it on GitHub https://github.com/MDAnalysis/mdanalysis/issues/3837#issuecomment-1256557676, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACGSGB2TRR26LECUIG4NZN3V7X4AZANCNFSM6AAAAAAQQSXAOE . You are receiving this because you are on a team that was mentioned.Message ID: @.***>

It's a bit of an academic discussion for not a big deal either way tbh, but that falls within my understanding of the chain of failure.

Empty atomgroup -> no masses -> can't use masses

The attribute is still there but only by virtue that our default is that we default to nothing instead of raising an error when there's nothing?

orbeckst commented 2 years ago

Similar to https://github.com/MDAnalysis/mdanalysis/issues/3837#issuecomment-1256579470 I'd also seen NoDataError more like to missing something from the topology. However, after a quick git grep NoDataError it looks as if we are using it quite liberally to mean "you haven't done something that would provide data that I need" or "this timestep doesn't have what I need right now (for whatever reasons)". Based on the current usage practice, I'd also be ok with NoDataError.

NoDataError is derived from ValueError (and AttributeError) so we can treat it as ValueError when catching exceptions. If we start out with ValueError then it's not a problem to make it a NoDataError later because code will still work. Turning a NoDataError into ValueError is a breaking change, though.

(RuntimeError is also a possibility but I'd prefer one of the other, more specific errors.)

Do we want NoDataError instead of ValueError?