xiaodong-yu closed this pull request 4 years ago.
Hello @xiaodong-yu! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
src/tike/ptycho/ptycho.py
Line 168:49: E127 continuation line over-indented for visual indent
@xiaodong-yu, why does as_array_multi_split divide the scan positions into quadrants if the whole psi is broadcast to all GPUs? Wouldn't it be simpler to randomly divide or interlace them?
@carterbox, although we copy the whole psi to all GPUs, each GPU only reconstructs one sub-image of psi. It is not worth splitting psi itself since it is very small, and we have not applied the optimizations for inter-GPU communication yet. However, defining the sub-images and exchanging only the sub-image borders are critical for later GPU-GPU data transfer optimizations via NVLink.
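For illustration, here is a minimal sketch of the quadrant idea, not the actual as_array_multi_split implementation: it assumes scan is an (N, 2) array of (y, x) positions and splits it about the centroid so that each GPU works on one spatial sub-image.

```python
import numpy as np

def split_into_quadrants(scan):
    """Split an (N, 2) array of scan positions into four spatial quadrants.

    Minimal sketch only; the real as_array_multi_split may differ.
    """
    center = scan.mean(axis=0)
    masks = [
        (scan[:, 0] <  center[0]) & (scan[:, 1] <  center[1]),
        (scan[:, 0] <  center[0]) & (scan[:, 1] >= center[1]),
        (scan[:, 0] >= center[0]) & (scan[:, 1] <  center[1]),
        (scan[:, 0] >= center[0]) & (scan[:, 1] >= center[1]),
    ]
    # Each GPU reconstructs only the sub-image covered by its quadrant, so
    # later optimizations only need to exchange the sub-image borders.
    return [scan[m] for m in masks]
```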
Purpose
In this PR, I apply the multi-GPU ptychography design to tike. The new design also allows uneven partitioning of the input data.
Approach
Using multiple GPUs requires manually splitting and distributing data to the devices, creating multiple threads, and binding devices. Since the nscan parameter is removed, the input data can now be partitioned unevenly. Accordingly, each array (e.g., psi) becomes a list of cupy.ndarray, and the length of each list equals the number of GPU devices. The splitting and distributing happen in src/tike/ptycho/ptycho.py, and the operators are in src/tike/operators/cupy/operator.py.
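A rough sketch of this per-device layout, assuming the data is split along the scan axis; the helper below is hypothetical and its name does not appear in the code:

```python
import numpy as np
import cupy as cp

def distribute(chunks):
    """Copy each (possibly unevenly sized) chunk to its own GPU.

    Returns a list of cupy.ndarray whose length equals the number of chunks.
    """
    parts = []
    for gpu, chunk in enumerate(chunks):
        with cp.cuda.Device(gpu):
            parts.append(cp.asarray(chunk))  # allocated on device `gpu`
    return parts

# Uneven partition of the diffraction data along the scan axis.
data = np.ones((100, 64, 64), dtype='float32')
data_gpu = distribute(np.array_split(data, 3))  # chunk sizes 34, 33, 33
```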
Our hybrid parallelization model is applied in src/tike/opt.py. I apply a gather-scatter mode to the dir computation and an all-reduce mode to the line-search computation.
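The sketch below illustrates the two communication patterns conceptually; it is not the code in src/tike/opt.py, and the function names are made up for illustration:

```python
import cupy as cp

def gather_scatter(partials):
    """Gather per-GPU partial search directions to GPU 0, combine them,
    then copy the combined result back to every GPU."""
    with cp.cuda.Device(0):
        combined = sum(cp.asarray(p) for p in partials)  # cross-device copies
    out = []
    for gpu in range(len(partials)):
        with cp.cuda.Device(gpu):
            out.append(cp.asarray(combined))
    return out

def all_reduce_cost(costs):
    """Reduce the scalar line-search costs from every GPU on the host."""
    return sum(float(c) for c in costs)
```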
Creating multiple threads and binding devices happen in src/tike/operators/cupy/. Multiple threads are created with concurrent.futures.ThreadPoolExecutor in the entry points (i.e., the functions named XXX_multi), and devices are bound with cupy.cuda.Device in the device functions (XXX_device).
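A minimal sketch of the XXX_multi / XXX_device pattern, with hypothetical function names (fwd_multi, fwd_device) and a placeholder computation:

```python
import concurrent.futures
import cupy as cp

def fwd_device(gpu, psi, scan, data):
    """Device function: bind this worker thread to one GPU and compute there."""
    with cp.cuda.Device(gpu):
        # ... per-GPU forward computation on psi[gpu], scan[gpu], data[gpu] ...
        return cp.abs(psi[gpu]).sum()  # placeholder result on device `gpu`

def fwd_multi(num_gpu, psi, scan, data):
    """Entry point: one worker thread per GPU."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_gpu) as pool:
        futures = [
            pool.submit(fwd_device, gpu, psi, scan, data)
            for gpu in range(num_gpu)
        ]
        return [f.result() for f in futures]
```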
Note: each device requires its own RawKernel object, so cp.RawKernel() is moved into the class functions.
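A sketch of what a per-device RawKernel might look like; the kernel source and class name are invented for illustration:

```python
import cupy as cp

_SOURCE = r'''
extern "C" __global__ void scale(float *x, float a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
'''

class Scale:
    """Holds one compiled RawKernel per device instead of a module-level one."""

    def __init__(self, gpu):
        self.gpu = gpu
        with cp.cuda.Device(gpu):
            self.kernel = cp.RawKernel(_SOURCE, 'scale')

    def __call__(self, x, a):
        with cp.cuda.Device(self.gpu):
            n = x.size
            grid = ((n + 255) // 256,)
            self.kernel(grid, (256,), (x, cp.float32(a), cp.int32(n)))
            return x
```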
Issue
The current version does not support a multi-GPU probe update. To pass the tests, it temporarily gathers the chunks from the different GPUs onto a single GPU and updates the probe there. The multi-GPU probe update should be implemented later.
I have made several attempts to keep uniform functions for numpy and cupy, but found that this is too complicated for the multi-GPU case. In that case, it is not enough to simply change np to cp; the code must explicitly assign GPU devices and manage the GPU streams. I preserve the uniform functions as much as possible, but have to write some functions dedicated to multi-GPU. Any code optimizations can be discussed later.
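To illustrate the point, assuming at least two GPUs are available, the single-GPU and multi-GPU cases might look like this; this is generic CuPy usage, not code from this PR:

```python
import numpy as np
import cupy as cp

# Single GPU: swapping np for cp is often enough.
y = cp.fft.fft2(cp.asarray(np.ones((8, 8), dtype='complex64')))

# Multiple GPUs: every allocation and kernel must name a device, and
# overlapping work across devices needs explicit (non-default) streams.
streams, results = [], []
for gpu in range(2):
    with cp.cuda.Device(gpu):
        stream = cp.cuda.Stream(non_blocking=True)
        streams.append(stream)
        with stream:
            x = cp.asarray(np.ones((8, 8), dtype='complex64'))
            results.append(cp.fft.fft2(x))
# Wait for every device to finish before using the results.
for gpu, stream in enumerate(streams):
    with cp.cuda.Device(gpu):
        stream.synchronize()
```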
Pre-Merge Checklists
Submitter
Use yapf to format python code.
Reviewer