NVIDIA-Genomics-Research / rapids-single-cell-examples

Examples of single-cell genomic analysis accelerated with RAPIDS
Apache License 2.0
324 stars 68 forks source link

RAPIDS implementation of Scanpy rank_genes_groups appears incorrect #29

Open rmovva opened 4 years ago

rmovva commented 4 years ago

I tried running the RAPIDS implementation of rank_genes_groups alongside the Scanpy CPU implementation on the same data matrix, but I'm getting very different results.

Here's my code for the GPU call:

cluster_labels = cudf.Series.from_categorical(adata.obs["louvain"].cat)
var_names = cudf.Series(var_names)
dense_gpu_array = cp.array(adata_raw.X.todense())

scores, names, reference = rapids_scanpy_funcs.rank_genes_groups(
    dense_gpu_array,
    cluster_labels, 
    var_names, 
    n_genes=n_top_diff_peaks, groups='all', reference='rest')

And the CPU call:

adata_raw.obs['louvain'] = adata.obs['louvain'].tolist()
sc.tl.rank_genes_groups(adata_raw, 
                       groupby="louvain", 
                       n_genes=n_top_diff_peaks, 
                       groups='all', 
                       reference='rest',
                       method='logreg'
                       )

When I look at the top differential gene for each cluster, the outputs reported by the GPU and CPU are disjoint. Also, I note that while the CPU output is sorted by score (i.e., the top 50 diff. genes have high scores, and are sorted in decreasing order), the GPU output seems to be unsorted, and some of the scores are very low. My suspicion is that the GPU output isn't actually being properly sorted by logistic regression coefficient, so the output is just some random set of differential genes & their scores instead of the top N.

When I scatterplot the results, the CPU results also seem to make much more sense than the GPU.

avantikalal commented 4 years ago

Possibly relevant cuML issue: https://github.com/rapidsai/cuml/issues/2478

cjnolet commented 4 years ago

. Also, I note that while the CPU output is sorted by score (i.e., the top 50 diff. genes have high scores, and are sorted in decreasing order), the GPU output seems to be unsorted, and some of the scores are very low.

~Are you referring to the resulting scores (.e.g, adata.uns['rank_genes_groups']['scores'])? For me, the output for both CPU and GPU seem to be unsorted.~

Correction: When using penalty='none', the major axis is indeed sorted in both the GPU and CPU notebooks.

While we wait for the release of the fix for rapidsai/cuml#2478, we have a couple options:

  1. Turn off regularization by passing penalty='none' into the rank_genes_groups functions for both CPU and GPU.
  2. Set the C hyper-parameter to the number of elements in X as recommended in rapidsai/cuml#2478.
teju85 commented 3 years ago

hey folks, just curious. Since the above cuML issue has been fixed, did any of you get a chance rerun the code afterwards? Are you still facing this issue?

avantikalal commented 3 years ago

@teju85 thanks for the reminder, we'll check and get back to you.

avantikalal commented 3 years ago

@teju85 We checked this and the problem still exists, despite the cuML bug being resolved. @cjnolet is looking into it.

avantikalal commented 3 years ago

This issue should be resolved now: https://github.com/rapidsai/cuml/issues/3645

Will test and close.