AdvancedPhotonSource / tike

Repository for ptychography software

NEW: Add multi-GPU feature to tike ptychography #72

Closed xiaodong-yu closed 4 years ago

xiaodong-yu commented 4 years ago


In this PR, I apply the multi-GPU ptychography design to tike. This new PR allows uneven partitioning of the input data.


Using multiple GPUs requires manually splitting and distributing the data to the devices, creating multiple threads, and binding devices. Since the nscan parameter has been removed, the input data can now be partitioned unevenly.

Accordingly, each data array (e.g., psi) becomes a list of cupy.ndarray. The length of each list equals the number of GPU devices. The splitting and distributing happen in src/tike/ptycho/ and the operators are in src/tike/operators/cupy/
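For readers following along, here is a minimal sketch of what an uneven partition into one chunk per device could look like. It is not the code from this PR: it uses NumPy's `array_split` as a CPU stand-in for the cupy arrays, and `split_uneven` is a hypothetical helper name.

```python
import numpy as np


def split_uneven(data, num_gpu):
    """Split `data` along axis 0 into `num_gpu` chunks; sizes may differ by one.

    Hypothetical helper: NumPy stands in for cupy so the sketch runs on a CPU.
    """
    return np.array_split(data, num_gpu, axis=0)


# 10 scan positions do not divide evenly across 3 devices.
scan = np.arange(10 * 2, dtype=np.float32).reshape(10, 2)
chunks = split_uneven(scan, num_gpu=3)
sizes = [len(c) for c in chunks]  # one entry per device
```

In the multi-GPU case each chunk would then be copied to its own device, so the list index doubles as the device index.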

Our hybrid parallelization model is applied at src/tike/. I apply gather-scatter mode to the dir computation and all-reduce mode to the line-search computation.
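To make the two communication modes concrete, here is a rough sketch of each pattern on plain Python lists of NumPy arrays. This only illustrates the generic patterns, not what the PR does internally: `gather_scatter` and `all_reduce_sum` are hypothetical names, and the real code moves data between GPUs rather than between list entries.

```python
import numpy as np


def gather_scatter(chunks, update):
    """Gather all chunks in one place, apply `update` once, scatter back."""
    sizes = [len(c) for c in chunks]
    gathered = np.concatenate(chunks, axis=0)
    updated = update(gathered)
    # Split at the original chunk boundaries so each device gets its part back.
    return np.split(updated, np.cumsum(sizes)[:-1], axis=0)


def all_reduce_sum(partials):
    """Sum the per-device partial results and give every device the total."""
    total = sum(partials)
    return [total for _ in partials]


chunks = [np.ones(4), np.ones(3), np.ones(3)]
doubled = gather_scatter(chunks, lambda x: 2 * x)
costs = all_reduce_sum([float(np.sum(c**2)) for c in chunks])
```

Gather-scatter funnels everything through one place, while all-reduce only exchanges small partial results, which suits a reduction such as a line-search cost.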

Creating multiple threads and binding devices happens in src/tike/operators/cupy/. Multiple threads are created using concurrent.futures.ThreadPoolExecutor in the entry points (i.e., the functions named XXX_multi). Devices are selected using cupy.cuda.Device in the device functions (XXX_device).
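The XXX_multi / XXX_device split might look roughly like the sketch below. It is an assumption-laden stand-in, not the PR's code: `norm_multi` and `norm_device` are hypothetical names, NumPy replaces cupy, and `contextlib.nullcontext()` marks the place where `with cupy.cuda.Device(gpu_id):` would select the device on a real multi-GPU machine.

```python
from concurrent.futures import ThreadPoolExecutor
from contextlib import nullcontext

import numpy as np


def norm_device(gpu_id, chunk):
    """Device function: do the work for one chunk on one device."""
    # On a GPU machine this would be `with cupy.cuda.Device(gpu_id):`;
    # nullcontext() is a CPU placeholder so the sketch runs anywhere.
    with nullcontext():
        return float(np.sum(np.abs(chunk) ** 2))


def norm_multi(chunks):
    """Entry point: one thread per chunk, results returned in chunk order."""
    gpu_ids = range(len(chunks))
    with ThreadPoolExecutor(max_workers=len(chunks)) as executor:
        return list(executor.map(norm_device, gpu_ids, chunks))


partial = norm_multi([np.ones(4), np.ones(3), np.ones(3)])
```

`executor.map` returns results in submission order, so the output list lines up with the input list of chunks regardless of which thread finishes first.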

Note: Each device requires its own RawKernel object, so cp.RawKernel() is moved into the class functions.


The current version does not support a multi-GPU probe update. In order to pass the tests, it temporarily gathers the chunks from the different GPUs onto a single GPU and updates the probe there. The multi-GPU probe update should be implemented later.

I made several attempts to keep uniform functions for numpy and cupy, but found this too complicated for the multi-GPU case. There, it is not simply a matter of changing np to cp; the code must explicitly assign GPU devices and manage the GPU streams. I preserved as many of the uniform functions as possible, but had to write some functions dedicated to multi-GPU. Any code optimizations can be discussed later.


pep8speaks commented 4 years ago

Hello @xiaodong-yu! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 168:49: E127 continuation line over-indented for visual indent

Comment last updated at 2020-06-15 20:25:23 UTC

carterbox commented 4 years ago

@xiaodong-yu, why does as_array_multi_split divide the scan positions into quadrants if the whole psi is broadcast to all GPUs? Wouldn't it be simpler to divide them randomly or interlace them?

xiaodong-yu commented 4 years ago

@carterbox, although we copy the whole psi to all GPUs, each GPU only reconstructs one sub-image of the psi. It is not worth splitting psi since it is very small, and we haven't yet applied any optimizations for the inter-GPU communications. However, defining the sub-images and exchanging only the sub-image borders are critical for later optimizations of GPU-GPU data transfer via NVLink.
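The difference between the two partitioning schemes discussed here can be sketched as follows. This is only an illustration on a toy 4x4 grid of scan positions, not the actual as_array_multi_split implementation; `split_quadrants` and `split_interlaced` are hypothetical names.

```python
import numpy as np


def split_quadrants(scan):
    """Group (N, 2) scan positions by quadrant of their bounding box."""
    mid = (scan.min(axis=0) + scan.max(axis=0)) / 2
    upper = scan[:, 0] >= mid[0]
    right = scan[:, 1] >= mid[1]
    quadrant = 2 * upper + right  # integer label 0..3 per position
    return [scan[quadrant == q] for q in range(4)]


def split_interlaced(scan, num_gpu):
    """Deal positions out round-robin; every chunk spans the whole image."""
    return [scan[i::num_gpu] for i in range(num_gpu)]


# Toy 4x4 grid of scan positions.
grid = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
scan = np.stack(grid, axis=-1).reshape(-1, 2).astype(float)

quads = split_quadrants(scan)
laced = split_interlaced(scan, 4)
```

With quadrants, each chunk touches only one compact sub-image, so only its borders overlap with the neighbouring chunks; with interlacing, every chunk touches the entire field of view.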