cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License

Dask implementation of pairwise distance #48

Open alimanfoo opened 8 years ago

alimanfoo commented 8 years ago

Investigate a dask-based implementation of pairwise distance computations.

May also be worth revisiting the existing chunked implementation to offer the alternative of sum or mean for chunk reduction.
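To make the sum-versus-mean idea concrete, here is a minimal NumPy sketch. It is not scikit-allel's existing chunked implementation: the function names `pdist_cityblock` and `chunked_pdist`, the `blen` and `reduce` parameters, and the layout (rows are variants, columns are samples) are all assumptions made for illustration. City-block distance is used because it is additive over rows, so summing per-chunk results reproduces the unchunked answer exactly.

```python
import numpy as np


def pdist_cityblock(x):
    """Condensed city-block distances between the columns of x."""
    n = x.shape[1]
    i, j = np.triu_indices(n, k=1)  # all column pairs with i < j
    return np.abs(x[:, i] - x[:, j]).sum(axis=0)


def chunked_pdist(x, blen, reduce="sum"):
    """Compute distances per block of rows, then reduce across blocks.

    Hypothetical helper: `blen` is the number of rows per block and
    `reduce` selects how the per-block distances are combined.
    """
    blocks = [
        pdist_cityblock(x[start:start + blen])
        for start in range(0, x.shape[0], blen)
    ]
    stacked = np.stack(blocks)
    if reduce == "sum":
        return stacked.sum(axis=0)
    if reduce == "mean":
        return stacked.mean(axis=0)
    raise ValueError("reduce must be 'sum' or 'mean'")


rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(1000, 5))  # assumed: 1000 variants x 5 samples

full = pdist_cityblock(x)
by_sum = chunked_pdist(x, blen=100, reduce="sum")
by_mean = chunked_pdist(x, blen=100, reduce="mean")
```

With equal-sized blocks, the mean reduction is the sum divided by the number of blocks, so it gives a per-block average distance and does not depend on how many blocks there are. For metrics that are not additive over rows, neither reduction recovers the unchunked distance, so both would be approximations.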

jakirkham commented 7 years ago

I recently finished putting together dask-distance (https://github.com/jakirkham/dask-distance), which mirrors the scipy.spatial.distance API but performs all computations with Dask Arrays. The docs (https://dask-distance.readthedocs.io) have more details on what is provided. It has a pdist function that computes pairwise distances, using a neat trick (https://github.com/jakirkham/dask-distance/blob/v0.1.0/dask_distance/__init__.py#L153) that avoids computing lots of duplicate values. You can inspect the graphs of cdist and pdist to see that it is doing the right thing. That said, chunk size will affect performance. If each chunk over points holds a single point, it will be optimal. Chunks holding groups of points incur some spillage because of the chunks on the diagonal, though the user can easily control this by changing the chunking going in.
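The effect of chunk size on the diagonal spillage can be estimated with a small counting model. This is a simplified sketch, not taken from the dask-distance source: it assumes that blocks strictly below the diagonal are skipped, that every block on or above the diagonal is computed in full, and that the chunk size divides the number of points evenly. The function name `pair_counts` is hypothetical.

```python
def pair_counts(n_points, chunk):
    """Return (needed, computed) distance counts for a block-wise pdist.

    Assumed model: only blocks on or above the diagonal are evaluated,
    and each evaluated block computes all chunk * chunk entries.
    """
    if n_points % chunk:
        raise ValueError("this sketch assumes chunk divides n_points")
    n_blocks = n_points // chunk
    # Distinct unordered pairs of points.
    needed = n_points * (n_points - 1) // 2
    # Number of blocks on or above the diagonal, times entries per block.
    computed = n_blocks * (n_blocks + 1) // 2 * chunk * chunk
    return needed, computed


for chunk in (1, 10, 100):
    needed, computed = pair_counts(1000, chunk)
    print(f"chunk={chunk}: needed={needed}, computed={computed}, "
          f"extra={computed - needed}")
```

Under these assumptions the extra work is n_points * (chunk + 1) / 2, so it grows linearly with the chunk size and is smallest when each chunk holds a single point. In practice very small chunks also add scheduling overhead, which this count does not capture.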

alimanfoo commented 7 years ago

Awesome news, thanks for letting us know.

jakirkham commented 7 years ago

No problem. If you run into any issues, please let us know. 🙂