biobakery / phylophlan

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes
https://huttenhower.sph.harvard.edu/phylophlan
MIT License

Scheduler system instead of multiprocessing #2

Open alexhbnr opened 4 years ago

alexhbnr commented 4 years ago

Hi everyone,

First of all, thanks for keeping up the great work. I was wondering about the design philosophy behind how PhyloPhlAn executes its commands, and whether Python's multiprocessing module is the best way to go for very large (> 10,000 species) data sets.

While multiprocessing's functions allow the individual tasks to be distributed across multiple cores of the same compute node, they are limited by the size of that node. In the environments I currently work in, we typically have a large number of nodes (> 50) that are relatively small (on average 36 cores), connected by a scheduling system. On my current system, PhyloPhlAn would benefit substantially if the individual tasks were submitted to different nodes rather than run on a single node.

I think the main functionality of PhyloPhlAn (the function standard_phylogeny_reconstruction in phylophlan.py) could easily be replaced by a pipeline built to run with Snakemake or Nextflow. Snakemake in particular should be an easy port because its workflows are written in a Python-based language. Such a pipeline would suit both scenarios, either one big node with many cores or many small nodes with fewer cores, because one could choose whether to run it locally or through a scheduling system, e.g. SLURM.
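To illustrate the idea of separating the tasks from where they run, here is a minimal Python sketch. It is not PhyloPhlAn code and not a Snakemake workflow: `process_marker` and `run_tasks` are hypothetical names, and the per-task work is a placeholder. The point is only that if the tasks are written against the standard `concurrent.futures.Executor` interface, the choice of back end (local workers or, in principle, a scheduler-backed executor offering the same interface) is left to the caller.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Iterable, List


def process_marker(marker_id: str) -> str:
    """Hypothetical stand-in for one independent per-marker task."""
    # A real task would do the actual work here; this only builds a file name.
    return f"{marker_id}.aln"


def run_tasks(executor: Executor, marker_ids: Iterable[str]) -> List[str]:
    """Run every task on whichever executor the caller supplies."""
    # Executor.map yields results in the same order as the inputs.
    return list(executor.map(process_marker, marker_ids))


if __name__ == "__main__":
    markers = [f"marker_{i}" for i in range(4)]
    # Local back end. A thread pool keeps the sketch lightweight;
    # ProcessPoolExecutor exposes the same interface for process-based work.
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(run_tasks(pool, markers))
```

With this split, `run_tasks` does not need to know whether it is running on one large node or across many small ones. A workflow manager such as Snakemake provides the same separation at the level of whole pipeline steps.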

I was wondering whether you had already considered this and, if so, what led you to decide against it.

Thanks, Alex