Closed kain88-de closed 9 years ago
what's wrong with 'old'?
There's also scipy.spatial.KDTree which wouldn't introduce a new dependency
So I guess things to prove would be
The other thing is all we currently use KDTree for is distance based selections I think... And the downside with it is it doesn't handle PBC. So a better solution might be to introduce a capped distance search method (using domain decomposition into cells). I was playing around with making one last month, https://github.com/richardjgowers/cellgrid
This would also be awesome for lots of common stuff in very large systems
It's SWIG interface currently doesn't support Python3. scikit-learn would support py2 and py3. I also have good experience with it's performance. It would also mean one less thing which has to be maintained.
@richardjgowers what would you suggest is a good test system?
Timeit in ipython is good enough, and we can assume that particles are randomly space, so something like...
import numpy as np
from scipy import spatial
from MDAnalysis.lib import KDTree
coords = np.random.random(750000).reshape(250000, 3).astype(np.float32)
t = KDTree.KDTree(3, 10)
%timeit t.set_coords(coords)
%timeit t2 = spatial.KDTree(coords)
I get this:
In [13]: %timeit t.set_coords(coords) 1 loops, best of 3: 833 ms per loop
In [16]: %timeit t2 = spatial.KDTree(coords) 1 loops, best of 3: 805 ms per loop
I think the setup cost of the KDTree will be the largest cost, but it's also worth checking that all methods are available in the replacement. We don't directly use the KDTree, but wrap it in a MDAnalysis.lib.KDTree.NeighbourSearch
.
http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4
It looks like the sklearn implementation is a lot faster then the one from scipy.
Hmm ok, and the sklearn KDTree has 96%+ test coverage and supports py3.
For fairness the mda kdtree search was 2x as fast, so after 700,000 searches it's quicker :D
Ok, if you can rewrite NeighborSearch to use the sklearn kdtree then this should be ok unless there's something I've mussed
One reason for bundling KDTree was that we didn't really like introducing scipy or sklearn as hard dependencies for core MDAnalysis because in the past this was awful to just pip install
.
By default, KDTree is only used for the around
selection; point
uses distance_matrix
if I remember correctly. However, building manual KDTree selections has been pretty useful.
Issue #137 was looking at an implementation of KDTree with PBC.
Re: #137, I don't think there's a way to add PBC at the C level. My best solution to that was to add too many coordinates to the KDTree (replicate the images), and this is done outside of the KDTree. So having our own KDTree doesn't affect the issue
I'm just porting the selection object to use the sklearn KDTree. Some of the tests are failing right now. But just seeing that the tests in test_analysis:TestHydrogenBondingAnalysisChecking
fail doesn't tell me much what the problem is due to the test design. It is not even that easy to figure out that the test
does exactly.
I hit a similar test failure in my half-way cleanup of the selection module (I left that on stand-by pending the development of the new AtomGroup
model and use of pyparsing
; see #371).
I had reflowed much of the point
and around
code, to reduce duplication, and was not surprised I broke something. But now you find a failure in the same tests. Maybe
I'm not sure which I prefer to be true...
Here's my test error log (notice that something else also fails for test_atomgroup.py
).
Let me know if you'd like me to upload my half-done branch.
Well then we should first refactor the tests to be readable. I have serious problems understanding what the current tests are doing exactly
Ok the tests pass now. the test_atomgroup.py
tests are much better to finding out what happens and what is wrong.
So, was it something in your code, then?
yup. didn't understand the old original code. now everything is working. you can check out my progress here. https://github.com/kain88-de/mdanalysis/tree/sklearn-kdtree
Did we decide we actually wanted to include scikit-learn as a dependency for a core component? I'm not sure we resolved the question yet.
On 16 Aug, 2015, at 12:00, David Dotson wrote:
Did we decide we actually wanted to include scikit-learn as a dependency for a core component? I'm not sure we resolved the question yet.
I am against introducing such a heavy dependency unless there's a REALLY good reason.
If the code is well encapsulated, we could simply take the code and replace the KD-tree one.
Alternatively, trying a grid-search type algorithm (and writing it in cython ourselves) might be a good alternative.
Oliver Beckstein * orbeckst@gmx.net skype: orbeckst * orbeckst@gmail.com
I am against introducing such a heavy dependency unless there's a REALLY good reason.
It's faster then the BioPython version and very well maintained.
If the code is well encapsulated, we could simply take the code and replace the KD-tree one.
I don't like this approach. It just puts burden on MDAnalysis to maintain the forked code and merge upstream bug fixes. It is a better and more pythonic approach to use the other libraries directly.
I have switched the KDTree in my branch to use the BioPython KDTree directly. This will also work for Python3
Then, if switching in place is that easy I'd say that we depend on either BioPython or scikit-learn. If none are present we install the lighter BioPython. If scikit-learn is available we use that. I'm not sure, though, how to teach these kind of dependencies to setuptools
or pip
.
The code then uses scikit-learn if available, otherwise it falls back to BioPython. We write a big notice on the install instructions, and maybe even issue a warning when BioPython is used saying speed is to be gained if scikit-learn is installed.
This is all assuming it's easy to just switch frameworks. @kain88-de might share more insight on that (I imagine it won't be as straightforward as having a try:
except:
on the imports...).
@mnmelo it is not that easy since the BioPython KDTree has a different API then the scikit-learn KDTree. Besides I would prefer to use just one to keep the code simple.
Ah, that's a shame. But I understand that putting effort into making wrappers around equivalent foreign libraries shouldn't really be our priority here.
Introducing sklearn
as a dependency requires some serious discussion. I definitely don't want to rip out the old KD-Tree for 0.11.0.
For a start, one thing that vanilla Python (I mean pip
/setuptools
) is decidedly not good at is making sure that you can easily install packages—have you ever tried to get a package to work somewhere that does not have the full scientific python stack installed? In order to make this installation-hell sort-of-manageable you actually need 3rd party stuff like EPD or conda or at least a good understanding of your distribution's package management system. To be facetious: In some cases "pythonic" does not always equal "user friendly". Therefore, I think that inclusion of 3rd party libs into more complicated packages is actually the right thing to do in many instances because it shifts load from users to developers. And ultimately, I feel that firstly, MDAnalysis must make it easy for users to get work done.
I don't think that inclusion of sklearn
as a hard dependency is justified at the moment because we are using only a tiny part of it to provide a tiny part of MDAnalysis functionality. If you look at the other hard dependencies, then they are either essential and provide a lot of functionality (numpy
) or small-ish and/or install cleanly (networkx
, GridDataFormats
) or optional to provide very specific functionality (netCDF4
for Amber NCDF format). You could make a case that sklearn
is similar to biopython
, which also only contributes a fraction of functionality (one version of the PDBReader, sequence handling), but at least we never had any problems with biopython
as a dependency as it always builds cleanly in my experience. (But I would like to get rid of that dependency, too, especially with a view towards Py3k!) Basically, I want to avoid one more stumbling block that prevents a user from easily installing the package (because these are the users that never come back and leave no feedback on what went wrong).
I am eager to listen to everyone's opinions.
I agree with @orbeckst about not to depend too much about sklearn
. There is no point to replace the old lib if it works fine. Not all users is expert in Python and not all of them (currently) knows about conda
.
I agree with @orbeckst as well. Although I very much respect scikit-learn and its high development standards, I don't consider it to be low-level enough in the python stack to make sense as a piece of the core of mdanalysis. It provides high-quality implementations of common machine learning algorithms with a consistent API; that is great, but I don't think that fits for this case.
I think a Cythonized KDTree that is easily able to handle periodic boundary conditions is something we would find desirable. That is something we will have to tackle, and I think that's the only reasonable choice.
We've already got scipy as a dependency, and we could use the KDTree there, but @kain88-de has already shown it's slightly slower (http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4)
I didn't see a problem with using sklearn, it looks like installing it has the same dependencies as numpy/scipy, which would fall into @orbeckst 's "install cleanly" clause. I've just tried making a virtual environment and installing sklearn and it looks painless enough?
@dotsdl the periodic KDTree will be a wrapper around a regular KDTree, so I'd rather use an external (and hopefully optimised) solution and wrap around that
I 've looked at benchmark for 'Test finding in radius of time' mda has the lowest time.
On Aug 20, 2015, at 3:36 AM, Richard Gowers notifications@github.com wrote:
We've already got scipy as a dependency, and we could use the KDTree there, but @kain88-de has already shown it's slightly slower (http://nbviewer.ipython.org/gist/kain88-de/3617d6ec20aaf3cb1ef4)
I didn't see a problem with using sklearn, it looks like installing it has the same dependencies as numpy/scipy, which would fall into @orbeckst 's "install cleanly" clause. I've just tried making a virtual environment and installing sklearn and it looks painless enough?
@dotsdl the periodic KDTree will be a wrapper around a regular KDTree, so I'd rather use an external (and hopefully optimised) solution and wrap around that
— Reply to this email directly or view it on GitHub.
The PR #395 is currently using the BioPython KDTree. That solution is slightly slower then the scikit-learn version but that shouldn't be noticed to much unless one is doing a lot of different selections in a loop. (scikit-learn will be overall faster since it builds up the Tree incredibly fast, so a little slower search doesn't matter that much here). As far as Py3 is concerned BioPython supports python 2.6 - 3.4 so we are good there as well.
I personally have scikit-learn installed by default anyway. They offer a huge variety of dimension reduction algorithms and clusterings. But I also understand if you want to keep the dependencies for MDAnalysis light.
With regard to installing things. In my experience pip works pretty great today if you are using it consistently. I install pip packages always only for my user with (--user
) and have any system-wide
installation handled by my distribution package-manager. But I also have seen people who have a messed up environment when they used pip to install packages system-wide and then updated with the distribution package manager. But that's a problem people will have independent of MDAnalysis.
Scipy is not a hard dependency, the core runs without it.
I think a fast grid search would be a viable alternative.
We've already got scipy as a dependency,
I 've looked at benchmark for 'Test finding in radius of time' mda has the lowest time.
Yes, but tree setup time really matters, as @richardjgowers pointed out in an earlier comment.
@kain88-de , using the BioPython KD-tree implementation seems a good compromise. Can you clean up #395 (or submit a new clean PR); at the moment it does not pass the tests.
However, the NeighborSearch
wrapper is being used in MDAnalysis.analysis.density
and other tools dependent on MDAnalysis so I'd like to keep it. (I am mentioning this because in #395 you asked about it).
Only the atomneighbor search is used there. This class still exists but in its own module now. @richardjgowers just has to tell me where that module should ideally live and then I'll fix the failing test.
Keep it in lib, analysis is more for whole tools rather than components.
However, the NeighborSearch wrapper is being used in MDAnalysis.analysis.density and other tools dependent on MDAnalysis so I'd like to keep it. (I am mentioning this because in #395 you asked about it).
Oh I missed analysis.density because It doesn't have unit-tests. I change the API and I'm quite confident that it still works but I can't test that currently. I wanted to quickly run the example given in the doc string but that doesn't work anymore.
Is there something against switching the KDTree library?
Currently you are using a pretty old C implementation. It would for example be possible to use the scikit-learn KDTree.