NVIDIA-Genomics-Research / rapids-single-cell-examples

Examples of single-cell genomic analysis accelerated with RAPIDS
Apache License 2.0

[REVIEW] Multi gpu #82

Closed · cjnolet closed this 3 years ago

cjnolet commented 3 years ago

One of the reasons the current multi-GPU notebook takes over 30 minutes to execute is UVM thrashing as the entire 1.3M x 4k matrix is brought back to a single GPU. Ideally, we would never need to do this. Eventually we should redo the plotting that filters cells by specific marker genes so that we can filter the distributed Dask array for the relevant cells and plot in batches. For now, I've removed these plotting pieces, since they don't add much to the end-to-end runtime anyway.
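
Roughly, that batched plotting path could look like the sketch below. This is a hypothetical illustration, not code from the notebook: `X`, `marker_idx`, and the helper names are assumptions, and the blocks are assumed to be CuPy arrays.

```python
import cupy as cp
import dask.array as da


def cells_expressing_marker(X: da.Array, marker_idx: int, threshold: float = 0.0) -> da.Array:
    """Rows of the distributed cells x genes matrix X whose expression of gene
    `marker_idx` exceeds `threshold`. The mask and the filtered rows stay
    distributed; only the much smaller filtered subset ever needs to be
    gathered for plotting."""
    mask = X[:, marker_idx] > threshold   # lazy, evaluated block by block
    return X[mask]                        # distributed boolean row selection


def iter_host_batches(filtered: da.Array):
    """Yield one block at a time as a NumPy array, for plotting in batches."""
    for delayed_block in filtered.to_delayed().ravel():
        yield cp.asnumpy(delayed_block.compute())
```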

This notebook performs 100% of the preprocessing without having to bring the 1.3M cells local. For some reason I'm encountering a strange error in cuML's distributed PCA, so I've also distributed our batch PCA training with Dask. Once the clustering/visualization section is fully distributed, we should be able to scale to multiple nodes, since we'll never need to bring the 1.3M cells (or the PCA-reduced cells) to a single GPU.
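
The batch PCA workaround follows roughly this pattern. It's a sketch under assumptions, not the exact notebook code: the names are placeholders, `X` is assumed to be a Dask array of CuPy blocks, and the fitted cuML model is assumed to be serializable to the workers.

```python
import dask.array as da
from cuml.decomposition import PCA


def fit_pca_on_subsample(X: da.Array, n_components: int = 50,
                         sample_fraction: float = 0.1, seed: int = 0) -> PCA:
    """Fit a single-GPU PCA on a random subsample of the distributed matrix."""
    rs = da.random.RandomState(seed)
    keep = rs.random_sample((X.shape[0],), chunks=X.chunks[0]) < sample_fraction
    sample = X[keep].compute()            # small enough to fit on one GPU
    pca = PCA(n_components=n_components)
    pca.fit(sample)
    return pca


def transform_distributed(X: da.Array, pca: PCA) -> da.Array:
    """Project every block with the fitted model; the result stays distributed."""
    X = X.rechunk({1: -1})                # each block must hold all genes
    return X.map_blocks(
        pca.transform,                    # the fitted model is shipped to each worker
        dtype=X.dtype,
        chunks=(X.chunks[0], (pca.n_components,)),
    )
```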

avantikalal commented 3 years ago

Two questions:

  1. Would it now be feasible to perform PCA on the complete dataset? (Not necessarily saying this is what we should do)
  2. Can we distribute the post-PCA steps as well? E.g. UMAP - https://docs.rapids.ai/api/cuml/stable/api.html#manifold (a rough sketch of the multi-GPU UMAP pattern is below)
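
For reference, the multi-GPU UMAP pattern from the cuML docs linked above looks roughly like this. It's a hedged sketch: the stand-in data, variable names, and exact keyword arguments (which vary a bit between cuML versions) are assumptions, not code from the notebook.

```python
import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.manifold import UMAP
from cuml.dask.manifold import UMAP as MNMG_UMAP

client = Client(LocalCUDACluster())       # one Dask worker per GPU

# Stand-in for the distributed PCA-reduced matrix (cells x 50 components).
X_dask = (da.random.random((100_000, 50), chunks=(25_000, 50))
            .astype("float32")
            .map_blocks(cp.asarray))
X_sample = X_dask[:10_000].compute()      # small subsample on the client GPU

# Train a single-GPU model on the subsample, then transform the full
# distributed array with the multi-node multi-GPU wrapper.
local_model = UMAP(n_neighbors=15, n_components=2)
local_model.fit(X_sample)

distributed_model = MNMG_UMAP(model=local_model)
embedding = distributed_model.transform(X_dask)   # stays distributed
```
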
cjnolet commented 3 years ago

@avantikalal,

  1. Yes, we should absolutely be able to distribute the PCA steps; however, I think I found a bug in the current implementation (still not sure of the root cause): https://github.com/rapidsai/cuml/issues/4183. The intended usage is sketched below.
  2. Because of 1, I opted to push the distribution of the remaining steps to a future update.
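
For reference, the fully distributed PCA path (the one currently hitting the bug above) would look roughly like the following sketch. The stand-in data and variable names are assumptions, and the exact cuml.dask API details may differ by version.

```python
import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.decomposition import PCA

client = Client(LocalCUDACluster())       # one Dask worker per GPU

# Stand-in for the distributed, preprocessed cells x genes matrix.
X_dask = (da.random.random((100_000, 4_000), chunks=(25_000, 4_000))
            .astype("float32")
            .map_blocks(cp.asarray))

pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X_dask)     # stays distributed: (cells, 50)
```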

So far I am seeing nice linear scaling for the preprocessing steps. In the chart below, you can see a speedup of a little over 8x going from a single V100 to 8x V100s. The preprocessing time also dropped to only 33% of the end-to-end workflow (down from over 75% on a single V100). I suspect this is largely because UVM on the client GPU is no longer oversubscribed by nearly as much, so we aren't waiting 3-5 minutes on some steps while UVM migrates pages from main memory back to the GPU.
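
For context, here is a minimal sketch of one way managed memory (UVM) can be enabled for this kind of setup, assuming dask-cuda and RMM; this is not necessarily how the notebook configures it. Managed memory is what lets a GPU oversubscribe into host RAM, at the cost of the multi-minute page-migration stalls described above when oversubscription gets large.

```python
import rmm
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Enable managed (unified) memory on the client process, since some results
# are still gathered there.
rmm.reinitialize(managed_memory=True)

# One worker per GPU, each allocating through managed memory as well.
cluster = LocalCUDACluster(rmm_managed_memory=True)
client = Client(cluster)
```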

If we remove the t-SNE step, I think we can distribute every step that remains, which would also let the notebook support multi-node environments.

[Chart: Single-Cell RNA Preprocessing Steps]

cjnolet commented 3 years ago

@avantikalal, any objections if I merge this? We definitely want to distribute the post-PCA steps as well, but right now the 1.3Mx50 matrix only takes up about 260 MB (1.3M x 50 float32 values at 4 bytes each ≈ 260 MB). I do think there's performance to be gained there, though, and we can distribute the remaining pieces in a follow-on.