MDAnalysis / pmda

Parallel algorithms for MDAnalysis
https://www.mdanalysis.org/pmda/
Other
31 stars 22 forks source link

Fix 47 pleafletfinder #69

Closed iparask closed 5 years ago

iparask commented 5 years ago

Fixes #47

Changes made in this Pull Request:

This PR can be used as a place holder. I would like to initiate a discussion as to as the written code is based on the requirements. I would like to report two major differences from other cases. This code uses Dask Bags to execute instead of delayed. In addition, parallelization is done on a frame basis. This was the most efficient implementation we had in the paper (code linked in the respective issue).

Please review/ comment as necessary and code can grow.

Giannis

PR Checklist

orbeckst commented 5 years ago

@iparask sklearn is not getting installed. https://travis-ci.org/MDAnalysis/pmda/jobs/435384534#L1101

For MDAnalysis, sklearn is optional. Is BallTree required for your code or can you make it optional?

iparask commented 5 years ago

The most efficient one uses BallTrees. It also reduces the memory footprint a lot. Let me include KDTrees instead, which are part of Scipy. Otherwise we will have to go with cdist or something similar. @orbeckst, I missed the travis output. Sorry about that

orbeckst commented 5 years ago

Is BallTree a lot better?

MDAnalysis also recently got capped_distance which could perhaps be used instead. We now use this for neighborsearching. Importantly, it takes periodicity into account.

@richardjgowers can you comment on capped_distance() vs BallTree() – did this come up during @ayushsuhane 's benchmarking during GSOC?

richardjgowers commented 5 years ago

@orbeckst a BallTree is just a KDTree (I think...). grid benched faster than KDTree nearly all of the time, so if we can use that it would be better

iparask commented 5 years ago

@orbeckst @richardjgowers I have a wiki page at least with some information for KD-Trees and BallTrees. Also, I remember @drelu had created a plot comparing different trees, I will put it here as soon as I find it.

iparask commented 5 years ago

Here is Andre's experiment link

I also found this benchmarking blog

kain88-de commented 5 years ago

@iparask we had a student (@ayushsuhane) putting in a lot of work in the summer to efficiently calculate distances. Have a look at his blog. He implemented several algorithms in MDAnalysis to efficiently calculate distances with a cutoff. He has found that the cell-list algorithm yielded the best results. It's available in MDAnalysis as lib.distances.capped_distance in the develop branch. Please have look at these functions. All functions included in MDAnalysis also take care of periodicity for triclinic boxes.

orbeckst commented 5 years ago

Currently only as simple PEP8 check fails.

You can run the PEP8 checks locally, too (together with coverage):

pytest  --pep8 -v --cov pmda  pmda

(needs pytest-pep8 and pytest-cov).

orbeckst commented 5 years ago

@iparask there's still an import of BallTree somewhere:

==================================== ERRORS ====================================
__________________ ERROR collecting pmda/test/test_leaflet.py __________________
ImportError while importing test module '/home/travis/build/MDAnalysis/pmda/pmda/test/test_leaflet.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
pmda/test/test_leaflet.py:10: in <module>
    from pmda import leaflet
pmda/leaflet.py:28: in <module>
    from sklearn.neighbors import BallTree
E   ModuleNotFoundError: No module named 'sklearn'

Once the tests pass and you need a final review, please ping me. Thanks!

codecov[bot] commented 5 years ago

Codecov Report

Merging #69 into master will decrease coverage by 1.34%. The diff coverage is 94.64%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #69      +/-   ##
==========================================
- Coverage   99.27%   97.92%   -1.35%     
==========================================
  Files           7        8       +1     
  Lines         274      386     +112     
  Branches       26       48      +22     
==========================================
+ Hits          272      378     +106     
- Misses          1        4       +3     
- Partials        1        4       +3
Impacted Files Coverage Δ
pmda/leaflet.py 94.64% <94.64%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 3d78fc7...ce5d471. Read the comment docs.

iparask commented 5 years ago

@orbeckst, this is a request for a final review. I have added all parts as asked. I believe it is as close as I can see it getting. Travis fails with warnings mainly because I have overridden and changed the methods of the base class in terms of input parameters.

orbeckst commented 5 years ago

We'll have to fix the pylint warnings or exclude them; if we merge it like this, all future CI will fail.

Run

pylint pmda

locally while testing. I get:

Using config file /Volumes/Data/oliver/Biop/Projects/Methods/MDAnalysis/pmda/.pylintrc
************* Module pmda.leaflet
pmda/leaflet.py:73: [W0231(super-init-not-called), LeafletFinder.__init__] __init__ method from base class 'ParallelAnalysisBase' is not called
pmda/leaflet.py:115: [E1102(not-callable), LeafletFinder._find_connected_components] cKDTree is not callable
pmda/leaflet.py:159: [W0221(arguments-differ), LeafletFinder._single_frame] Parameters differ from overridden '_single_frame' method
pmda/leaflet.py:192: [W1619(old-division), LeafletFinder._single_frame] division w/o __future__ statement
pmda/leaflet.py:229: [W0221(arguments-differ), LeafletFinder.run] Parameters differ from overridden 'run' method

-----------------------------------
Your code has been rated at 9.77/10

Most of these warnings look fixable to me

Disable the other pylint warning in code by bracketing the code with # pylint: disable=.../# pylint: enable=... blocks; see for example https://github.com/MDAnalysis/pmda/blob/3d78fc7c90e19cdc72932412b05c4c5d2d567fe7/pmda/rdf.py#L90

orbeckst commented 5 years ago

On a second look, the mismatching call signatures might be a bit more difficult.... start with disabling the warnings in the code. I'll then comment in a code review.

(One way or another, the linter has to pass.)

orbeckst commented 5 years ago

Looking good – is there a way to increase test coverage? It will decrease our 99% to 96%...

orbeckst commented 5 years ago

Actually, just checked the diff – there's not much you can do for coverage.

iparask commented 5 years ago

Can you help me read the output of the codecov bot you included? I think that if we can test using the distributed scheduler the coverage will increase.

orbeckst commented 5 years ago

~@ianmkenney~ @iparask in PR #66 we will automatically run all possible parallelizations so no need to do this manually here.

Just add yourself as an author and we're done here.

EDIT: sorry, autocompleted the wrong username

orbeckst commented 5 years ago

Thank you @iparask , congratulations to your first contribution!