iparask commented 5 years ago

Fixes #47

Changes made in this Pull Request:

Adding new LeafletFInder class which analyzes in parallel a frame and reports the two leaflets.
Added a test file.
Add Timings functionality
Traverse over multiple frames.

This PR can be used as a place holder. I would like to initiate a discussion as to as the written code is based on the requirements. I would like to report two major differences from other cases. This code uses Dask Bags to execute instead of delayed. In addition, parallelization is done on a frame basis. This was the most efficient implementation we had in the paper (code linked in the respective issue).

Please review/ comment as necessary and code can grow.

Giannis

PR Checklist

[x] Tests?
[x] Docs?
[x] CHANGELOG updated?
[x] Issue raised/referenced?

orbeckst commented 5 years ago

@iparask sklearn is not getting installed. https://travis-ci.org/MDAnalysis/pmda/jobs/435384534#L1101

For MDAnalysis, sklearn is optional. Is BallTree required for your code or can you make it optional?

iparask commented 5 years ago

The most efficient one uses BallTrees. It also reduces the memory footprint a lot. Let me include KDTrees instead, which are part of Scipy. Otherwise we will have to go with cdist or something similar. @orbeckst, I missed the travis output. Sorry about that

orbeckst commented 5 years ago

Is BallTree a lot better?

MDAnalysis also recently got capped_distance which could perhaps be used instead. We now use this for neighborsearching. Importantly, it takes periodicity into account.

@richardjgowers can you comment on capped_distance() vs BallTree() – did this come up during @ayushsuhane 's benchmarking during GSOC?

richardjgowers commented 5 years ago

@orbeckst a BallTree is just a KDTree (I think...). grid benched faster than KDTree nearly all of the time, so if we can use that it would be better

iparask commented 5 years ago

@orbeckst @richardjgowers I have a wiki page at least with some information for KD-Trees and BallTrees. Also, I remember @drelu had created a plot comparing different trees, I will put it here as soon as I find it.

iparask commented 5 years ago

Here is Andre's experiment link

I also found this benchmarking blog

kain88-de commented 5 years ago

@iparask we had a student (@ayushsuhane) putting in a lot of work in the summer to efficiently calculate distances. Have a look at his blog. He implemented several algorithms in MDAnalysis to efficiently calculate distances with a cutoff. He has found that the cell-list algorithm yielded the best results. It's available in MDAnalysis as lib.distances.capped_distance in the develop branch. Please have look at these functions. All functions included in MDAnalysis also take care of periodicity for triclinic boxes.

orbeckst commented 5 years ago

Currently only as simple PEP8 check fails.

You can run the PEP8 checks locally, too (together with coverage):

pytest  --pep8 -v --cov pmda  pmda

(needs pytest-pep8 and pytest-cov).

orbeckst commented 5 years ago

@iparask there's still an import of BallTree somewhere:

==================================== ERRORS ====================================
__________________ ERROR collecting pmda/test/test_leaflet.py __________________
ImportError while importing test module '/home/travis/build/MDAnalysis/pmda/pmda/test/test_leaflet.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
pmda/test/test_leaflet.py:10: in <module>
    from pmda import leaflet
pmda/leaflet.py:28: in <module>
    from sklearn.neighbors import BallTree
E   ModuleNotFoundError: No module named 'sklearn'

Once the tests pass and you need a final review, please ping me. Thanks!

codecov[bot] commented 5 years ago

Codecov Report

Merging #69 into master will decrease coverage by 1.34%. The diff coverage is 94.64%.

@@            Coverage Diff             @@
##           master      #69      +/-   ##
==========================================
- Coverage   99.27%   97.92%   -1.35%     
==========================================
  Files           7        8       +1     
  Lines         274      386     +112     
  Branches       26       48      +22     
==========================================
+ Hits          272      378     +106     
- Misses          1        4       +3     
- Partials        1        4       +3

Impacted Files	Coverage Δ
pmda/leaflet.py	`94.64% <94.64%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 3d78fc7...ce5d471. Read the comment docs.

iparask commented 5 years ago

@orbeckst, this is a request for a final review. I have added all parts as asked. I believe it is as close as I can see it getting. Travis fails with warnings mainly because I have overridden and changed the methods of the base class in terms of input parameters.

orbeckst commented 5 years ago

We'll have to fix the pylint warnings or exclude them; if we merge it like this, all future CI will fail.

Run

pylint pmda

locally while testing. I get:

Using config file /Volumes/Data/oliver/Biop/Projects/Methods/MDAnalysis/pmda/.pylintrc
************* Module pmda.leaflet
pmda/leaflet.py:73: [W0231(super-init-not-called), LeafletFinder.__init__] __init__ method from base class 'ParallelAnalysisBase' is not called
pmda/leaflet.py:115: [E1102(not-callable), LeafletFinder._find_connected_components] cKDTree is not callable
pmda/leaflet.py:159: [W0221(arguments-differ), LeafletFinder._single_frame] Parameters differ from overridden '_single_frame' method
pmda/leaflet.py:192: [W1619(old-division), LeafletFinder._single_frame] division w/o __future__ statement
pmda/leaflet.py:229: [W0221(arguments-differ), LeafletFinder.run] Parameters differ from overridden 'run' method

-----------------------------------
Your code has been rated at 9.77/10

Most of these warnings look fixable to me

pmda/leaflet.py:115: [E1102(not-callable), LeafletFinder._find_connected_components] cKDTree is not callable — ???
pmda/leaflet.py:192: [W1619(old-division), LeafletFinder._single_frame] division w/o future statement — add the future import or replace with //
pmda/leaflet.py:159: [W0221(arguments-differ), LeafletFinder._single_frame] Parameters differ from overridden '_single_frame' method — I think you still need to be able to accept all arguments that the standard _single_frame() could and just ignore them.
pmda/leaflet.py:229: [W0221(arguments-differ), LeafletFinder.run] Parameters differ from overridden 'run' method — We might have decided on the different call signature during the review process. However, if at all possible we should keep the call generic signature intact.

Disable the other pylint warning in code by bracketing the code with # pylint: disable=.../# pylint: enable=... blocks; see for example https://github.com/MDAnalysis/pmda/blob/3d78fc7c90e19cdc72932412b05c4c5d2d567fe7/pmda/rdf.py#L90

orbeckst commented 5 years ago

On a second look, the mismatching call signatures might be a bit more difficult.... start with disabling the warnings in the code. I'll then comment in a code review.

(One way or another, the linter has to pass.)

orbeckst commented 5 years ago

Looking good – is there a way to increase test coverage? It will decrease our 99% to 96%...

orbeckst commented 5 years ago

Actually, just checked the diff – there's not much you can do for coverage.

iparask commented 5 years ago

Can you help me read the output of the codecov bot you included? I think that if we can test using the distributed scheduler the coverage will increase.

orbeckst commented 5 years ago

~@ianmkenney~ @iparask in PR #66 we will automatically run all possible parallelizations so no need to do this manually here.

Just add yourself as an author and we're done here.

EDIT: sorry, autocompleted the wrong username

orbeckst commented 5 years ago

Thank you @iparask , congratulations to your first contribution!

MDAnalysis / pmda

Fix 47 pleafletfinder #69

PR Checklist

Codecov Report