Closed: cjnolet closed this issue 3 years ago
@cjnolet
In the last update to 21.06, I found that StandardScaler() works well for 70K cells but is, for some reason, very slow on 1.3M cells, so I used cupy directly in the 1.3M cell notebook (cell 10 here: https://github.com/clara-parabricks/rapids-single-cell-examples/blob/master/notebooks/1M_brain_gpu_analysis_uvm.ipynb). If this is still the case (and it seems to be), we should continue to use cupy instead of StandardScaler() in the 1.3M cell notebook.
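For reference, the cupy replacement amounts to doing the mean-centering and unit-variance scaling with array ops directly. A minimal sketch, using numpy as a stand-in (cupy mirrors the numpy API, so on GPU you would swap the import for `import cupy as np`); the clipping threshold follows scanpy's usual convention and is an assumption here, not taken from the notebook:

```python
# Sketch of scaling done with array ops instead of StandardScaler().
# numpy stands in for cupy here; the APIs match, so on a GPU you can
# replace this import with `import cupy as np`.
import numpy as np

def scale(X, max_value=10.0):
    """Mean-center each column, scale to unit variance, then clip."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # avoid division by zero on constant columns
    X = (X - mean) / std
    return np.clip(X, -max_value, max_value)

X = np.random.rand(100, 5).astype(np.float32)
Xs = scale(X)
```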
Is there a reason to continue using utils.pca in the 1.3M cell notebook if the full PCA can be run?
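The "full PCA" here is the exact decomposition rather than the approximate helper: center the matrix, take its SVD, and project onto the top components. A CPU sketch of that idea (numpy again standing in for cupy; cuml's PCA wraps this pattern behind a scikit-learn-style `fit_transform`):

```python
# Minimal full-PCA sketch via exact SVD. cupy's linalg API matches
# numpy's, so the same code runs on GPU with `import cupy as np`.
import numpy as np

def full_pca(X, n_components=50):
    Xc = X - X.mean(axis=0)                       # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # project onto top components

X = np.random.rand(200, 60)
emb = full_pca(X, n_components=50)
```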
Please update the Dockerfile too.
@avantikalal, while updating the Dockerfile I noticed that AtacWorks has some hard dependencies (for example, on scikit-learn version 0.21.3 here) which cause messages like this during the Docker build:
```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dask-cudf 21.8.2 requires cupy-cuda110, which is not installed.
cudf 21.8.2 requires cupy-cuda112, which is not installed.
umap-learn 0.5.1 requires scikit-learn>=0.22, but you have scikit-learn 0.21.3 which is incompatible.
dask-ml 1.9.0 requires scikit-learn>=0.23, but you have scikit-learn 0.21.3 which is incompatible.
dask-cuda 21.8.0 requires numba>=0.53.1, but you have numba 0.52.0 which is incompatible.
cudf 21.8.2 requires numba>=0.53.1, but you have numba 0.52.0 which is incompatible.
atacworks 0.3.4 requires numpy~=1.19.4, but you have numpy 1.21.2 which is incompatible.
atacworks 0.3.4 requires setuptools~=51.1.1, but you have setuptools 57.4.0 which is incompatible.
```
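The atacworks pins use pip's "compatible release" operator: `numpy~=1.19.4` is shorthand for `numpy>=1.19.4,<1.20.0`, which is why the installed numpy 1.21.2 conflicts. A small illustrative check of that semantics (`compatible` is a hypothetical helper written for this sketch, not a pip API):

```python
# Illustrative check of pip's "compatible release" operator (~=).
# `numpy~=1.19.4` means `numpy>=1.19.4,<1.20.0`, so numpy 1.21.2
# falls outside the allowed range. Not a pip API, just a sketch.

def parse(v):
    return tuple(int(p) for p in v.split("."))

def compatible(installed, pin):
    """True if `installed` satisfies `~=pin` (last component may float)."""
    lo = parse(pin)
    hi = lo[:-2] + (lo[-2] + 1,)   # bump the second-to-last component
    return lo <= parse(installed) < hi

print(compatible("1.21.2", "1.19.4"))   # False: outside ~=1.19.4
print(compatible("1.19.5", "1.19.4"))   # True: within ~=1.19.4
```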
Are all of those explicit version pins necessary for AtacWorks to function properly?
@cjnolet
I doubt they are absolutely necessary. You can check by running the Example 5 notebook: if it runs normally in the Docker container, there is no problem.
@avantikalal, I can't seem to find a stable configuration in the Dockerfile for RAPIDS 21.08 that works well given the hard requirements in the atacworks package. I've built the docker container using the RAPIDS 21.08-cuda11.0 container (no other changes to the Dockerfile) and the notebook won't even get past the imports:
```
ImportError: numpy.core.multiarray failed to import
```
I tried changing some of the other dependencies but got strange numba errors from cudf. I removed the versions from the requirements.txt in the AtacWorks repository and the notebooks executed successfully. I have the Dockerfile cloning the AtacWorks repository and doing a pip install in place. Should I submit a PR to AtacWorks, or should we just depend on my fork until the next release of the atacworks package?
The current changes in this branch work, btw, so we could also merge these for now and update the AtacWorks git repository in the Dockerfile once the changes are merged.
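A sketch of what the "clone and pip install in place" Dockerfile change looks like; the fork URL below is a placeholder, not the actual repository used in this branch:

```dockerfile
# Clone AtacWorks and install from the checkout so pip uses the
# repository's (relaxed) requirements instead of the pinned PyPI release.
# The fork URL below is a placeholder.
RUN git clone https://github.com/<fork>/AtacWorks.git /opt/AtacWorks \
    && pip install /opt/AtacWorks
```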
We have dropped official support for CUDA 10.x versions in RAPIDS, so I've dropped those conda environment files. I will also add new files for CUDA 11.1 and 11.2.
Notable changes since last supported version:
- StandardScaler, which will perform the mean centering and normalize to unit variance

I'm also finishing up a blog on HDBSCAN which showcases our lung notebook. I can also add HDBSCAN to that notebook in a follow-on PR.