AdvancedPhotonSource / tike

Repository for ptychography software
http://tike.readthedocs.io
Other
29 stars 15 forks source link

NEW: Enable multi-node processing #113

Closed xiaodong-yu closed 3 years ago

xiaodong-yu commented 3 years ago

Purpose

Leverage mpi4py to reconstruct images using multiple nodes.

Approach

We create a class to wrap the inter-node p2p and collective communications using mpi4py. We then split the data into chunks and distribute them to corresponding nodes. We utilize the functions in our class to realize inter-node communications during ptychography.

Pre-Merge Checklists

Submitter

Reviewer

pep8speaks commented 3 years ago

Hello @xiaodong-yu! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 197:81: E501 line too long (86 > 80 characters)

Comment last updated at 2021-01-22 19:54:19 UTC
xiaodong-yu commented 3 years ago

The purpose of this pull request is to add MPI-based parallelism to pytchography object updates. The approach used is to add an additional object for map-reduce parallelism.

I don't like this approach because it adds a second object whose purpose is to manage parallelism, but we already have an object which manages parallelism. Since both the ThreadPool and MPIComm are implementing map-reduce parallelism, why not create a combined object which manages both threads and nodes together?

Adding MPI-based parallelism also requires a larger discussion about the expected movement of data between nodes. For example, is the data loaded from the disk by one node and then distributed to the others? Or is it read in parallel and then shuffled around into contiguous blocks.

If you would like, we can have a Bluejeans discussion involving also Tekin, to talk about this issue as well as some other problems.

carterbox commented 3 years ago

These are the things that we discussed in our Bluejeans Meeting:

The parallelism behavior: If less processes than views; retain previous behavior where multiple views share a single process and one or more threads. If more processes than views; each view gets multiple processes and one or more threads.

xiaodong-yu commented 3 years ago

These are the things that we discussed in our Bluejeans Meeting:

  • [ ] Combine thread and process parallelism classes into single abstract class composition
  • [ ] Assume the data is pre-split between processes before calling ptycho.reconstruct
  • [ ] Allow for data to be not only within a single view but also along views axis
  • [ ] Add tools for splitting data at load time?

The parallelism behavior: If less processes than views; retain previous behavior where multiple views share a single process and one or more threads. If more processes than views; each view gets multiple processes and one or more threads.

The pull request is updated accordingly.

xiaodong-yu commented 3 years ago

I'm going to run some tests locally with mpi enabled, but also we need some mpi-enabled tests on the CI.

Do you mean we should run CI with mpiexec?

xiaodong-yu commented 3 years ago

There are still unresolved comments from previous reviews, and yes, let's run the relevant tests with mpiexec. For example, add an additional test call like:

mpiexec -n 2 pytest tests/test_ptycho.py -k cgrad

How can I test mpiexec with CI? Do I need to add a test case to test_ptycho.py?

xiaodong-yu commented 3 years ago

Thanks, @xiaodong-yu! Also, I noticed your documentation is getting better; keep it up!

Thanks, @carterbox!