Updates for rapids 21.12

NVIDIA-Genomics-Research / rapids-single-cell-examples

Examples of single-cell genomic analysis accelerated with RAPIDS

Apache License 2.0


Updates for rapids 21.12 #84

Closed cjnolet closed 2 years ago

cjnolet commented 2 years ago

So far, this updates the notebooks to use RAPIDS 21.12. The 1.3M cells UVM notebook can now perform PCA on the whole dataset. This also fixes the 1.3M cells CPU notebook, which broke at some point in a previous release.

avantikalal commented 2 years ago

It's great to see that we can now do full PCA and differential expression on 1.3 million cells!

we should change the text above cell 19 in https://github.com/clara-parabricks/rapids-single-cell-examples/blob/update_rapids_21.12/notebooks/1M_brain_cpu_analysis.ipynb
also remove pca_train_ratio and n_pca_batches from cell 4 in https://github.com/clara-parabricks/rapids-single-cell-examples/blob/update_rapids_21.12/notebooks/1M_brain_gpu_analysis_uvm.ipynb
will we be able to add differential expression to the multi-GPU notebook as well?

cjnolet commented 2 years ago

@avantikalal,

The corrections have been made. I made a push in the meantime, but I'm waiting for the execution of the 1.3M cells CPU notebook to complete, which will take a while. Aside from that, I've added differential expression to the multi-GPU notebook but still need to execute it. After those execute, I think the code side will be done and ready to benchmark.

Intron7 commented 2 years ago

There are some errors in the code that could be addressed. The final release of rapids-21.12 uses cupy-9.6.0, which breaks the clipping implementation after the scale function because the syntax for clip changed. This applies to most GPU notebooks as far as I can tell (not the 1.3M multi-GPU notebook). In addition, the 1.3 million multi-GPU notebook has an error in the scale function: in box 16 you zero-center the array with dask_sparse_arr -= mean and then clip at 0 and 10 with dask_sparse_arr = dask.array.clip(dask_sparse_arr, 0, 10).persist(), which sets below-mean counts to 0. Another small issue is that the multi-GPU version breaks in the regress-out function for arrays with fewer than 100000 cells. Shall I open a PR, or how do you want to address this?
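To illustrate the scale issue, here is a minimal NumPy sketch (not the notebook's Dask/CuPy code). The toy matrix and variable names are made up for illustration, and capping only the upper bound is just one possible fix, not necessarily what the PR ends up doing.

```python
import numpy as np

# Toy expression matrix: 3 cells x 2 genes (values are invented for illustration).
x = np.array([[0.0, 2.0],
              [1.0, 4.0],
              [5.0, 30.0]])

# Zero-center each gene (column), analogous to `dask_sparse_arr -= mean`.
centered = x - x.mean(axis=0)

# Clipping to [0, 10] after centering sets every below-mean value to 0.
clipped_both = np.clip(centered, 0, 10)

# Capping only the upper bound keeps the negative (below-mean) values.
clipped_upper = np.minimum(centered, 10)

print(clipped_both)
print(clipped_upper)
```

With the lower bound at 0, the first two cells become all zeros, so the information about how far below the mean they sit is lost.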

cjnolet commented 2 years ago

@Intron7 thank you for hunting down these issues. A new PR to this branch would absolutely be welcome!

Intron7 commented 2 years ago

@cjnolet @avantikalal would you guys want to include diffusion maps? I also found an issue with rank_gene_groups in the multi-GPU notebook (I ran this with the lung dataset on 2 GPUs). I had an issue in line 443, rankings_gene_names.append(var_names[global_indices].to_pandas()); switching to rankings_gene_names.append(var_names.iloc[global_indices].to_pandas()) fixed it. I also fixed this in the PR.
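For context on why .iloc can matter here, a small sketch using pandas as a stand-in for the cuDF Series in the notebook (an assumption; the actual failure in the notebook may have a different cause). The gene names and index values are hypothetical. When the index labels are integers that do not match the positions, plain [] looks up by label, while .iloc looks up by position.

```python
import pandas as pd

# Hypothetical gene names whose integer index labels do not match their positions.
var_names = pd.Series(["g0", "g1", "g2", "g3"], index=[3, 2, 1, 0])

# Positions we want to pick (the first two entries).
global_indices = [0, 1]

# Plain [] treats the integers as index labels, so it returns the wrong genes.
by_label = var_names[global_indices]

# .iloc treats the integers as positions, which is what a ranking needs.
by_position = var_names.iloc[global_indices]

print(by_label.tolist())
print(by_position.tolist())
```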

Intron7 commented 2 years ago

I found some other problems with rank_gene_groups. I'm currently working on a fix. groups doesn't work for 2 clusters. In addition to that the cluster names have to be numbers.

cjnolet commented 2 years ago

Sounds great @Intron7! Whenever you have a fix ready, we'd love to get it rolled in.

cjnolet commented 2 years ago

I'm going to try running the CPU notebook one last time and accept the results at this point. I've run it many times now w/ different environments but I cannot seem to reproduce the 2-3h benchmark. It's taking more like 18 hours. Right now the CPU notebook in master is corrupt so I think it's best we get that merged anyways. I'll also start working on a RAPIDS 22.02 update.

cjnolet commented 2 years ago

I think I've finally figured out the issue. It looks like the global scanpy n_jobs setting changed to sc._settings.ScanpyConfig.n_jobs = 16. Finishing the notebook run now and I'll push and merge.