Replace KDTree library with alternative

kain88-de commented 9 years ago

Is there something against switching the KDTree library?

Currently you are using a pretty old C implementation. It would for example be possible to use the scikit-learn KDTree.

hainm commented 9 years ago

what's wrong with 'old'?

richardjgowers commented 9 years ago

There's also scipy.spatial.KDTree which wouldn't introduce a new dependency

So I guess things to prove would be

benchmarks of 250,000 3d coordinates, both creating the Tree and querying it
checking all methods have a counterpart (same functionality)

The other thing is all we currently use KDTree for is distance based selections I think... And the downside with it is it doesn't handle PBC. So a better solution might be to introduce a capped distance search method (using domain decomposition into cells). I was playing around with making one last month, https://github.com/richardjgowers/cellgrid

This would also be awesome for lots of common stuff in very large systems

kain88-de commented 9 years ago

It's SWIG interface currently doesn't support Python3. scikit-learn would support py2 and py3. I also have good experience with it's performance. It would also mean one less thing which has to be maintained.

kain88-de commented 9 years ago

@richardjgowers what would you suggest is a good test system?

richardjgowers commented 9 years ago

Timeit in ipython is good enough, and we can assume that particles are randomly space, so something like...

import numpy as np
from scipy import spatial
from MDAnalysis.lib import KDTree

coords = np.random.random(750000).reshape(250000, 3).astype(np.float32)

t = KDTree.KDTree(3, 10)
%timeit t.set_coords(coords)

%timeit t2 = spatial.KDTree(coords)

I get this:

In [13]: %timeit t.set_coords(coords) 1 loops, best of 3: 833 ms per loop

In [16]: %timeit t2 = spatial.KDTree(coords) 1 loops, best of 3: 805 ms per loop

I think the setup cost of the KDTree will be the largest cost, but it's also worth checking that all methods are available in the replacement. We don't directly use the KDTree, but wrap it in a MDAnalysis.lib.KDTree.NeighbourSearch.

kain88-de commented 9 years ago

http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4

It looks like the sklearn implementation is a lot faster then the one from scipy.

richardjgowers commented 9 years ago

Hmm ok, and the sklearn KDTree has 96%+ test coverage and supports py3.

For fairness the mda kdtree search was 2x as fast, so after 700,000 searches it's quicker :D

Ok, if you can rewrite NeighborSearch to use the sklearn kdtree then this should be ok unless there's something I've mussed

orbeckst commented 9 years ago

One reason for bundling KDTree was that we didn't really like introducing scipy or sklearn as hard dependencies for core MDAnalysis because in the past this was awful to just pip install.

By default, KDTree is only used for the around selection; point uses distance_matrix if I remember correctly. However, building manual KDTree selections has been pretty useful.

Issue #137 was looking at an implementation of KDTree with PBC.

richardjgowers commented 9 years ago

Re: #137, I don't think there's a way to add PBC at the C level. My best solution to that was to add too many coordinates to the KDTree (replicate the images), and this is done outside of the KDTree. So having our own KDTree doesn't affect the issue

kain88-de commented 9 years ago

I'm just porting the selection object to use the sklearn KDTree. Some of the tests are failing right now. But just seeing that the tests in test_analysis:TestHydrogenBondingAnalysisChecking fail doesn't tell me much what the problem is due to the test design. It is not even that easy to figure out that the test does exactly.

mnmelo commented 9 years ago

I hit a similar test failure in my half-way cleanup of the selection module (I left that on stand-by pending the development of the new AtomGroup model and use of pyparsing; see #371).

I had reflowed much of the point and around code, to reduce duplication, and was not surprised I broke something. But now you find a failure in the same tests. Maybe

Those are the only tests that really use the KDTree feature, and both our fixes are indeed borked; or
That module has a bug somewhere.

I'm not sure which I prefer to be true...

Here's my test error log (notice that something else also fails for test_atomgroup.py). Let me know if you'd like me to upload my half-done branch.

kain88-de commented 9 years ago

Well then we should first refactor the tests to be readable. I have serious problems understanding what the current tests are doing exactly

kain88-de commented 9 years ago

Ok the tests pass now. the test_atomgroup.py tests are much better to finding out what happens and what is wrong.

mnmelo commented 9 years ago

So, was it something in your code, then?

kain88-de commented 9 years ago

yup. didn't understand the old original code. now everything is working. you can check out my progress here. https://github.com/kain88-de/mdanalysis/tree/sklearn-kdtree

dotsdl commented 9 years ago

Did we decide we actually wanted to include scikit-learn as a dependency for a core component? I'm not sure we resolved the question yet.

orbeckst commented 9 years ago

On 16 Aug, 2015, at 12:00, David Dotson wrote:

Did we decide we actually wanted to include scikit-learn as a dependency for a core component? I'm not sure we resolved the question yet.

I am against introducing such a heavy dependency unless there's a REALLY good reason.

If the code is well encapsulated, we could simply take the code and replace the KD-tree one.

Alternatively, trying a grid-search type algorithm (and writing it in cython ourselves) might be a good alternative.

Oliver Beckstein * orbeckst@gmx.net skype: orbeckst * orbeckst@gmail.com

kain88-de commented 9 years ago

I am against introducing such a heavy dependency unless there's a REALLY good reason.

It's faster then the BioPython version and very well maintained.

If the code is well encapsulated, we could simply take the code and replace the KD-tree one.

I don't like this approach. It just puts burden on MDAnalysis to maintain the forked code and merge upstream bug fixes. It is a better and more pythonic approach to use the other libraries directly.

I have switched the KDTree in my branch to use the BioPython KDTree directly. This will also work for Python3

mnmelo commented 9 years ago

Then, if switching in place is that easy I'd say that we depend on either BioPython or scikit-learn. If none are present we install the lighter BioPython. If scikit-learn is available we use that. I'm not sure, though, how to teach these kind of dependencies to setuptools or pip.

The code then uses scikit-learn if available, otherwise it falls back to BioPython. We write a big notice on the install instructions, and maybe even issue a warning when BioPython is used saying speed is to be gained if scikit-learn is installed.

This is all assuming it's easy to just switch frameworks. @kain88-de might share more insight on that (I imagine it won't be as straightforward as having a try: except: on the imports...).

kain88-de commented 9 years ago

@mnmelo it is not that easy since the BioPython KDTree has a different API then the scikit-learn KDTree. Besides I would prefer to use just one to keep the code simple.

mnmelo commented 9 years ago

Ah, that's a shame. But I understand that putting effort into making wrappers around equivalent foreign libraries shouldn't really be our priority here.

orbeckst commented 9 years ago

Introducing sklearn as a dependency requires some serious discussion. I definitely don't want to rip out the old KD-Tree for 0.11.0.

For a start, one thing that vanilla Python (I mean pip/setuptools) is decidedly not good at is making sure that you can easily install packages—have you ever tried to get a package to work somewhere that does not have the full scientific python stack installed? In order to make this installation-hell sort-of-manageable you actually need 3rd party stuff like EPD or conda or at least a good understanding of your distribution's package management system. To be facetious: In some cases "pythonic" does not always equal "user friendly". Therefore, I think that inclusion of 3rd party libs into more complicated packages is actually the right thing to do in many instances because it shifts load from users to developers. And ultimately, I feel that firstly, MDAnalysis must make it easy for users to get work done.

I don't think that inclusion of sklearn as a hard dependency is justified at the moment because we are using only a tiny part of it to provide a tiny part of MDAnalysis functionality. If you look at the other hard dependencies, then they are either essential and provide a lot of functionality (numpy) or small-ish and/or install cleanly (networkx, GridDataFormats) or optional to provide very specific functionality (netCDF4 for Amber NCDF format). You could make a case that sklearn is similar to biopython, which also only contributes a fraction of functionality (one version of the PDBReader, sequence handling), but at least we never had any problems with biopython as a dependency as it always builds cleanly in my experience. (But I would like to get rid of that dependency, too, especially with a view towards Py3k!) Basically, I want to avoid one more stumbling block that prevents a user from easily installing the package (because these are the users that never come back and leave no feedback on what went wrong).

I am eager to listen to everyone's opinions.

hainm commented 9 years ago

I agree with @orbeckst about not to depend too much about sklearn. There is no point to replace the old lib if it works fine. Not all users is expert in Python and not all of them (currently) knows about conda.

dotsdl commented 9 years ago

I agree with @orbeckst as well. Although I very much respect scikit-learn and its high development standards, I don't consider it to be low-level enough in the python stack to make sense as a piece of the core of mdanalysis. It provides high-quality implementations of common machine learning algorithms with a consistent API; that is great, but I don't think that fits for this case.

I think a Cythonized KDTree that is easily able to handle periodic boundary conditions is something we would find desirable. That is something we will have to tackle, and I think that's the only reasonable choice.

richardjgowers commented 9 years ago

We've already got scipy as a dependency, and we could use the KDTree there, but @kain88-de has already shown it's slightly slower (http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4)

I didn't see a problem with using sklearn, it looks like installing it has the same dependencies as numpy/scipy, which would fall into @orbeckst 's "install cleanly" clause. I've just tried making a virtual environment and installing sklearn and it looks painless enough?

@dotsdl the periodic KDTree will be a wrapper around a regular KDTree, so I'd rather use an external (and hopefully optimised) solution and wrap around that

hainm commented 9 years ago

I 've looked at benchmark for 'Test finding in radius of time' mda has the lowest time.

On Aug 20, 2015, at 3:36 AM, Richard Gowers notifications@github.com wrote:

We've already got scipy as a dependency, and we could use the KDTree there, but @kain88-de has already shown it's slightly slower (http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4)

I didn't see a problem with using sklearn, it looks like installing it has the same dependencies as numpy/scipy, which would fall into @orbeckst 's "install cleanly" clause. I've just tried making a virtual environment and installing sklearn and it looks painless enough?

@dotsdl the periodic KDTree will be a wrapper around a regular KDTree, so I'd rather use an external (and hopefully optimised) solution and wrap around that

— Reply to this email directly or view it on GitHub.

kain88-de commented 9 years ago

The PR #395 is currently using the BioPython KDTree. That solution is slightly slower then the scikit-learn version but that shouldn't be noticed to much unless one is doing a lot of different selections in a loop. (scikit-learn will be overall faster since it builds up the Tree incredibly fast, so a little slower search doesn't matter that much here). As far as Py3 is concerned BioPython supports python 2.6 - 3.4 so we are good there as well.

I personally have scikit-learn installed by default anyway. They offer a huge variety of dimension reduction algorithms and clusterings. But I also understand if you want to keep the dependencies for MDAnalysis light.

With regard to installing things. In my experience pip works pretty great today if you are using it consistently. I install pip packages always only for my user with (--user) and have any system-wide installation handled by my distribution package-manager. But I also have seen people who have a messed up environment when they used pip to install packages system-wide and then updated with the distribution package manager. But that's a problem people will have independent of MDAnalysis.

orbeckst commented 9 years ago

Scipy is not a hard dependency, the core runs without it.

I think a fast grid search would be a viable alternative.

We've already got scipy as a dependency,

orbeckst commented 9 years ago

I 've looked at benchmark for 'Test finding in radius of time' mda has the lowest time.

Yes, but tree setup time really matters, as @richardjgowers pointed out in an earlier comment.

orbeckst commented 9 years ago

@kain88-de , using the BioPython KD-tree implementation seems a good compromise. Can you clean up #395 (or submit a new clean PR); at the moment it does not pass the tests.

However, the NeighborSearch wrapper is being used in MDAnalysis.analysis.density and other tools dependent on MDAnalysis so I'd like to keep it. (I am mentioning this because in #395 you asked about it).

kain88-de commented 9 years ago

Only the atomneighbor search is used there. This class still exists but in its own module now. @richardjgowers just has to tell me where that module should ideally live and then I'll fix the failing test.

richardjgowers commented 9 years ago

Keep it in lib, analysis is more for whole tools rather than components.

kain88-de commented 9 years ago

However, the NeighborSearch wrapper is being used in MDAnalysis.analysis.density and other tools dependent on MDAnalysis so I'd like to keep it. (I am mentioning this because in #395 you asked about it).

Oh I missed analysis.density because It doesn't have unit-tests. I change the API and I'm quite confident that it still works but I can't test that currently. I wanted to quickly run the example given in the doc string but that doesn't work anymore.

MDAnalysis / mdanalysis

Replace KDTree library with alternative #383