MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org
Other
1.33k stars 653 forks source link

`analysis.atomicdistances.AtomicDistances` does not use Results #4819

Open orbeckst opened 13 hours ago

orbeckst commented 13 hours ago

Expected behavior

All analysis classes based on AnalysisBase store results in an instance of MDAnalysis.analysis.results.Results

Results are always stored in the attribute AnalysisBase.results, which is an instance of Results, a kind of dictionary that allows allows item access via attributes. Each analysis class decides what and how to store in Results and needs to document it. For time series, the AnalysisBase.times contains the time stamps of the analyzed frames.

The results should be accessible as, e.g., results.distances.

Actual behavior

analysis.atomicdistances.AtomicDistances.results directly contains a numpy array.

The lack of API conformity makes it impossible to directly parallelize the class, see PR #4808.

Code to reproduce the behavior

See docs: https://docs.mdanalysis.org/stable/documentation_pages/analysis/atomicdistances.html#MDAnalysis.analysis.atomicdistances.AtomicDistances.results

Current version of MDAnalysis

orbeckst commented 13 hours ago

Note that the only option for fixing this issue is to break the existing API: we need the results attribute to now hold a Results instance instead of a numpy array. Thus, we are not able to make a backwards-compatible update. Either we fix it and break it or we add a deprecation warning and fix it for 3.0.

orbeckst commented 13 hours ago

My view is that this was a mistake in the first place as it contradicts our specs and we should treat it as a bug and thus fix it.

This would also allow us to parallelize the class (see PR #4808).

P.S.: The AtomicDistances class was added in 2.5.0 and even our 2.5.0 docs state that results needs to be an instance of Results, see https://docs.mdanalysis.org/2.5.0/documentation_pages/analysis/base.html

IAlibay commented 12 hours ago

I have two questions here:

  1. What was the intention of the coredev(s) who approved it - was this an oversight or an intentional design decision?
  2. How big a deal would it be if we marked it as non-parallelizable and just held back on the change until v3.0.

The reasons I'm asking that second question:

  1. It's pretty clear to me that we really should just do a v3.0 release in the next 3-4 months. Our current status is making dealing with upgrading Python versions (amongst other things) a bit of a pain, and it would just be easier to finish up v2.x than spend ages implementing workarounds.
  2. If we don't think a lot of folks are using that analysis method, then the impact of either not doing the change or doing it is very low. In which case, why not just wait a bit (only if my first point stands true).
talagayev commented 12 hours ago

I have two questions here:

  1. What was the intention of the coredev(s) who approved it - was this an oversight or an intentional design decision?
  2. How big a deal would it be if we marked it as non-parallelizable and just held back on the change until v3.0.

The reasons I'm asking that second question:

  1. It's pretty clear to me that we really should just do a v3.0 release in the next 3-4 months. Our current status is making dealing with upgrading Python versions (amongst other things) a bit of a pain, and it would just be easier to finish up v2.x than spend ages implementing workarounds.
  2. If we don't think a lot of folks are using that analysis method, then the impact of either not doing the change or doing it is very low. In which case, why not just wait a bit (only if my first point stands true).

I could try to see if it is possible to adjust it to get Results if that's fine, especially if there are a couple of months time to adjust it, but can't promise if it works.

As for the parallelization and v3.0, should it be the case that for the upgrade to v3.0 all of the analysis functions should have either implemented parallelization or marked as non parallelizable or is it fine if it is with the current state, where half of the functions have an implementation parallelization/non parallelization marks.

orbeckst commented 12 hours ago

I could try to see if it is possible to adjust it to get Results if that's fine, especially if there are a couple of months time to adjust it, but can't promise if it works.

A PR that fixes this issue would be very welcome, no matter what — this will have to be fixed for 3.0. The question is just if there will be a 2.9.0 or if we will go directly to 3.0.

As for the parallelization and v3.0, should it be the case that for the upgrade to v3.0 all of the analysis functions should have either implemented parallelization or marked as non parallelizable or is it fine if it is with the current state, where half of the functions have an implementation parallelization/non parallelization marks.

For 3.0, all analysis classes should have a defined behavior for parallelization.

talagayev commented 12 hours ago

I could try to see if it is possible to adjust it to get Results if that's fine, especially if there are a couple of months time to adjust it, but can't promise if it works.

A PR that fixes this issue would be very welcome, no matter what — this will have to be fixed for 3.0. The question is just if there will be a 2.9.0 or if we will go directly to 3.0.

As for the parallelization and v3.0, should it be the case that for the upgrade to v3.0 all of the analysis functions should have either implemented parallelization or marked as non parallelizable or is it fine if it is with the current state, where half of the functions have an implementation parallelization/non parallelization marks.

For 3.0, all analysis classes should have a defined behavior for parallelization.

I see, then probably #4817 and #4818 would need to be adressed also first, since those two are lacking AnalysisBase. I raised an Issue for those and would look into them in the free time to see if I can implement the AnalysisBase there

orbeckst commented 12 hours ago

For 3.0, all analysis classes should have a defined behavior for parallelization.

My opinion is that anything that uses AnalysisBase should have well-defined behavior. Legacy code may still do whatever it used to do.

Ideally, all analysis tools that broadly do per-frame analysis use AnalysisBase but I don't think that we promised that for 3.0... but it would be great ;-).

talagayev commented 12 hours ago

For 3.0, all analysis classes should have a defined behavior for parallelization.

My opinion is that anything that uses AnalysisBase should have well-defined behavior. Legacy code may still do whatever it used to do.

Ideally, all analysis tools that broadly do per-frame analysis use AnalysisBase but I don't think that we promised that for 3.0... but it would be great ;-).

Yup, makes sense :) I think it should be quite doable especially for MDAnalysis.analysis.hydrogenbonds.hbond_autocorrel, who has some structural similarities of the code to other AnalysisBase analysis tools, so that should work, the other one MDAnalysis.analysis.distances can't say, but from the first look also doable :)