MDAnalysis / pmda

Parallel algorithms for MDAnalysis
https://www.mdanalysis.org/pmda/

Any performance comparison between dask and multiprocessing? #155

Closed: appassionate closed this issue 2 years ago

appassionate commented 2 years ago

Hi, I believe dask has successfully solved the "cross-host" problem for parallel runs. On a single machine (I mean one computer or server), comparing Python multiprocessing with PMDA using dask, multiprocessing seems more straightforward and faster. So, is there any comparison of these two approaches to parallelization? Thanks! :)

orbeckst commented 2 years ago

On a single machine, you can select any one of the dask-supported schedulers (https://docs.dask.org/en/latest/scheduler-overview.html), including multiprocessing, by using the standard dask mechanism (https://docs.dask.org/en/latest/scheduler-overview.html#configuring-the-schedulers), as described in Parallelization. The default in PMDA is multiprocessing. But you can also use distributed on a single node, as described on the Parallelization: dask.distributed page.

The PMDA paper https://conference.scipy.org/proceedings/scipy2019/shujie_fan.html contains data for multiprocessing and distributed. In the analyzed cases, multiprocessing was more efficient for an I/O-dominated task (RMSD) whereas distributed was marginally better for a compute-dominated task such as RDF. I don't know how much these results generalize but I would imagine that either one is a decent choice for a single node.
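
For anyone who wants to try the scheduler switch on their own machine, here is a minimal sketch using plain dask, without PMDA. It assumes dask is installed; `block_sum` and `run` are hypothetical names made up for this illustration, and the toy workload only stands in for a real per-block analysis. It splits the work into blocks, computes them under a chosen scheduler via `dask.config.set`, and reports the wall-clock time, so timings from it say nothing about RMSD or RDF performance.

```python
import time

import dask
from dask import delayed


def block_sum(start, stop):
    # Hypothetical stand-in for a per-block analysis; not a PMDA function.
    return sum(i * i for i in range(start, stop))


def run(scheduler, n_items=200_000, n_blocks=4):
    """Split the work into blocks, compute them with `scheduler`, and time it."""
    step = n_items // n_blocks
    # The last block absorbs any remainder so that the whole range is covered.
    edges = [(k * step, n_items if k == n_blocks - 1 else (k + 1) * step)
             for k in range(n_blocks)]
    tasks = [delayed(block_sum)(a, b) for a, b in edges]
    t0 = time.perf_counter()
    # Inside this context, `scheduler` is the default for compute() calls.
    with dask.config.set(scheduler=scheduler):
        partials = dask.compute(*tasks)
    elapsed = time.perf_counter() - t0
    return sum(partials), elapsed


if __name__ == "__main__":
    # "processes" is dask's multiprocessing scheduler; "threads" and
    # "synchronous" are the other single-machine schedulers.
    for name in ("synchronous", "threads", "processes"):
        total, seconds = run(name)
        print(f"{name:12s} total={total} time={seconds:.3f}s")
```

For a workload this small, process start-up and task overhead will likely dominate, so a realistic comparison needs a realistic trajectory and analysis.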

appassionate commented 2 years ago

Many thanks!