AdvancedPhotonSource / tike

Repository for ptychography software
http://tike.readthedocs.io
Other
29 stars 15 forks source link

NEW: Enable MPI for lstsq_grad method #133

Closed xiaodong-yu closed 3 years ago

xiaodong-yu commented 3 years ago

Purpose

Add multi-node to lstsq_grad method by using mpi4py.

Approach

Apply MPI operations onto the communication and sync spots in the lstsq_grad method, in addition to the threadpool operations.

Pre-Merge Checklists

Submitter

Reviewer

pep8speaks commented 3 years ago

Hello @xiaodong-yu! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2021-03-26 18:26:05 UTC
xiaodong-yu commented 3 years ago

@xiaodong-yu, I pulled these images from tests/results/lstsq_grad Can you see that the texture of the noise switches half way down the image with 2 processes?

1 process 2 processes 0-phase-mpi-1 0-phase-mpi-2 Something about this implementation is causing this. Any ideas? Do you want a large dataset to test?

@carterbox, I can't clearly see any differences between these two images. Could you please explain it a little bit more to me? Also, is the 1 process image reconstructed by multiple GPUs? I highly suspect that any differences come from the mean issue you mentioned before. I find that the splitting of the test data is not even.

carterbox commented 3 years ago

It's more obvious in this image (the amplitude of the first view reconstructed by the lstsq test). I increased the problem size and increased the number of processes to 4. I'm using a single GPU because the hardware layout shouldn't affect the quality.

0-ampli

You can see there are 4 distinct horizontal stripe regions in the reconstruction. Probably the common probe constraint isn't correctly implemented.

xiaodong-yu commented 3 years ago

It's more obvious in this image (the amplitude of the first view reconstructed by the lstsq test). I increased the problem size and increased the number of processes to 4. I'm using a single GPU because the hardware layout shouldn't affect the quality.

0-ampli

You can see there are 4 distinct horizontal stripe regions in the reconstruction. Probably the common probe constraint isn't correctly implemented.

OK, now I see the problem. I agree the hardware layout won't affect the results. But I don't think the issue comes from this MPI implementation since it doesn't really change the parallelization design. Any algorithmic problem very likely is caused by the multi-device implementation, even within a single node. I'll check the previous PR to see if it has the same issue with single node multi-GPU.

xiaodong-yu commented 3 years ago

@carterbox I revised the code and don't see the issue anymore with the test image. Please have a review and let me know if you still see the problem in your reconstructions.

carterbox commented 3 years ago

I plan to look at this on Thursday or Friday.

carterbox commented 3 years ago

@xiaodong-yu, the stripes still appears for me. I have pushed a commit which changes the ptycho tests so we are looking at the same size image. I run the following test and there is a noticeable line across the middle of the image.

mpirun -n 2 pytest tests/test_ptycho.py -vs -k 'lstsq and not variable'

Using more than two mpi processes causes an exception in mpi4py.

xiaodong-yu commented 3 years ago

@xiaodong-yu, the stripes still appears for me. I have pushed a commit which changes the ptycho tests so we are looking at the same size image. I run the following test and there is a noticeable line across the middle of the image.

mpirun -n 2 pytest tests/test_ptycho.py -vs -k 'lstsq and not variable'

Using more than two mpi processes causes an exception in mpi4py.

I still didn't see the stripe region in the image (2 processes, as attached). Did you refer to the phase images in tests/result/lstsq_grad?

1-phase

carterbox commented 3 years ago

No, I'm looking at the amplitude image.

xiaodong-yu commented 3 years ago

@carterbox Why the amplitude images I found in the folder are the following random images?

1-ampli

carterbox commented 3 years ago

There are two object views and two probes reconstructed, so there should be 8 images:

tests/result/lstsq_grad/0-ampli.png
tests/result/lstsq_grad/0-phase.png
tests/result/lstsq_grad/0-probe-ampli.png
tests/result/lstsq_grad/0-probe-phase.png
tests/result/lstsq_grad/1-ampli.png
tests/result/lstsq_grad/1-phase.png
tests/result/lstsq_grad/1-probe-ampli.png
tests/result/lstsq_grad/1-probe-phase.png

Image 0 has amplitude constant 1. Image 1 has amplitude "coins". Probe 0 has phase "cryptomeria". Probe 1 has phase "bombus".

The amplitude of image 0 isn't supposed to look like "satyre", but the lstsq method isn't very good.

xiaodong-yu commented 3 years ago

@carterbox, the issue is solved. Please try it again.