Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph
https://bluefog-lib.github.io/bluefog/
Apache License 2.0

Is it possible to run more agents than the number of my CPU cores? #113

Closed. 1qzhworld closed this issue 2 years ago

1qzhworld commented 2 years ago

This is great work for parallel computation and large-scale implementations. However, my laptop has only 4 CPU cores. Is it possible to run, e.g., 50 agents on my laptop?

If so, how do I set that up?

In addition, when running the examples in Jupyter, I find that I have to run ibfrun start -np 4 first. In scripts run without ibfrun start -np 4, print(bf.size()) prints 1.

BichengYing commented 2 years ago

We are happy to hear you like this project!

  1. Running more processes than you have physical cores is possible. You just need to run something like this example:

    bfrun -np 8 --extra-mpi-flags="--oversubscribe" python -c "import bluefog.torch as bf;bf.init();print(bf.rank())"

    However, because the process count exceeds the number of physical cores, you may see slowdowns from resource contention. That is outside our control, since we rely on the MPI implementation. Alternatively, we are developing bluefog-lite, which uses a pure Python implementation. Its benefit is that it can be faster when you need many more virtual nodes than physical cores; the downside is that it is otherwise slower than the MPI implementation. If you are interested, please give it a try. It is still a work in progress and supports only a minimal set of communication operators. Check out this simple consensus example.

  2. For the Jupyter notebook, yes, it is necessary. As shown in this illustration figure, if you don't first create the extra nodes (processes) running in the background, the Jupyter notebook cannot spawn the workers automatically.
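
For point 1, here is a minimal sketch of how one might decide when to add the oversubscribe flag. The helper name build_bfrun_command, its available_cores parameter, and the script name train.py are hypothetical and not part of Bluefog; the only bfrun options used are the ones from the command above. Note that os.cpu_count() reports logical CPUs, which can be more than the physical cores MPI counts as slots, so treat the threshold as an approximation.

```python
import os
import shlex


def build_bfrun_command(num_agents, script_args, available_cores=None):
    """Build a bfrun argument list (hypothetical helper, not a Bluefog API).

    Adds --extra-mpi-flags=--oversubscribe only when more agents are
    requested than cores are assumed to be available.
    """
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    if available_cores is None:
        # Assumption: logical CPU count is a good enough stand-in for
        # the number of slots MPI would allow without oversubscribing.
        available_cores = os.cpu_count() or 1

    cmd = ["bfrun", "-np", str(num_agents)]
    if num_agents > available_cores:
        # Same flag as in the shell example; the shell quotes there
        # collapse to this single argument.
        cmd.append("--extra-mpi-flags=--oversubscribe")
    cmd.extend(script_args)
    return cmd


if __name__ == "__main__":
    # train.py is a placeholder script name used only for illustration.
    cmd = build_bfrun_command(8, ["python", "train.py"], available_cores=4)
    print(shlex.join(cmd))
    # To actually launch, pass cmd to subprocess.run(cmd, check=True).
```

Building the command as a list rather than a string avoids shell quoting issues if it is later passed to subprocess.run.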

1qzhworld commented 2 years ago

Thank you for the clear reply.

It works with your code. But just as you said, MPI communication costs much more time than the computation when the agents oversubscribe the cores. Anyway, this repo should greatly reduce implementation time. Thanks for the contribution. I will follow both repositories and recommend them to my colleagues.

Best regards.