How to make scDREAMER communicate with Seurat/Scanpy

Rohit-Satyam commented 1 year ago

Hi developers

I was going through the tutorials and I realized that though the simplistic implementation of your integration algorithm using model.scDREAMER saves a lot of trouble at user end, it does not discuss the preprocessing steps and how the integration results of scDREAMER can be used with other packages such as scanpy or Seurat for further analysis such as DE analysis or marker identification. An additional tutorial on how to import the results of scDREAMER in R and manipulate the seurat object by changing the embeddings would be helpful,

I would also like to request if it is possible to expand the documentation a little bit such as including function parameter description and which functions should stay default and which should change.

Finally, I would like to ask if this package is suitable to use for experimental design, where we have wild type vs drug treated design at two or more different time points (without replicates). We have some parasite cells harvested at two different time point with each time point having control vs drug treated. We also have reference atlas available for this parasite so we are planning for reference guided integration but we are also thinking to try your unsupervised method as well to see if we find anything new.

ajitashree commented 1 year ago

Thank you Rohit, I have answered your questions in parts below.

Part 1: 1) You can alter the preprocessing of the input data using utils.py file. 2) scDREAMER takes Ann data object format (scanpy) as input. The output of scDREAMER is a csv file with 10 dim embeddings. There is also a provision to read .h5ad format in R for any downstream analysis. Output csv file can be read separately or merged in Anndata depending on the usecase for the downstream analysis in Python or R.

We are working on readthedocs format and will release a better documentation of scDREAMER soon.

Part 2: Yes, scDREAMER is applicable for experimental data and can be used to integrate multiple timepoints datasets. We can help in interpreting and analysing the results.

Rohit-Satyam commented 1 year ago

Hi @ajitashree. I see 5 files being generated in my current directory ending with latent_matrix_100/150/200/250/299.csv. Which one should I choose?

Rohit-Satyam commented 1 year ago

I am trying to run the scDREAMER on my normalized data and I get this error again and again and I have checked for NaN values using different methods

table(is.na(t(as.matrix(GetAssayData(m, "RNA","data")))))

    FALSE 
294203000 
Warning message:
In asMethod(object) :
  sparse->dense coercion: allocating vector of size 2.2 GiB

temp=sc.read("adata_bi_norm.h5ad")
 np.count_nonzero(np.isnan(temp.X))
Out[11]: 0
temp.X.size - np.count_nonzero(np.isnan(temp.X))
Out[12]:294203000

The error:

2023-07-16 20:11:49.289214: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
reading data
Data set to work on:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(55510, 2000)
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 ...
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]]
(55510, 3)
encoder input shape  Tensor("concat:0", shape=(?, 2003), dtype=float32)
WARNING:tensorflow:From /home/subudhak/miniconda3/envs/scdreamer/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /home/subudhak/miniconda3/envs/scdreamer/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
decoder input shape  Tensor("concat_2:0", shape=(?, 13), dtype=float32)
KL gaussian z Tensor("mul_10:0", shape=(?,), dtype=float32)
KL gaussian l Tensor("mul_9:0", shape=(?,), dtype=float32)
WARNING:tensorflow:From /home/subudhak/miniconda3/envs/scdreamer/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.

/home/subudhak/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/scDREAMER/scDREAMER/src/model.py:40: RuntimeWarning: divide by zero encountered in log
  log_library_size = np.log(np.sum(self.data_train, axis=1))
/home/subudhak/miniconda3/envs/scdreamer/lib/python3.9/site-packages/numpy/core/_methods.py:233: RuntimeWarning: invalid value encountered in subtract
  x = asanyarray(arr - arrmean)
Cluster DRA on DataSet adata_bi_norm.h5ad ... 
2023-07-16 20:12:00.968281: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
Epoch : [0] ,  a_loss = nan, d_loss: inf ,  g_loss: inf,  db_loss: nan
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[3], line 19
      8 with tf.Session(config = run_config) as sess:
     10     dreamer = model.scDREAMER(
     11         sess,
     12         epoch = 300,
   (...)
     16         name = name
     17         )
---> 19     dreamer.train_cluster()

File ~/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/scDREAMER/scDREAMER/src/model.py:286, in train_cluster(self)
    283     if (ep % 50 == 0 and ep >= 100):
    284         self.eval_cluster_on_test_(ep)
--> 286 self.eval_cluster_on_test(ep)

File ~/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/scDREAMER/scDREAMER/src/utils.py:249, in eval_cluster_on_test(self, epoch)
    246 Ann.obsm['final_embeddings'] = latent_matrix
    247 Ann.obs['group'] = labels.astype(str)
--> 249 sc.pp.neighbors(Ann, use_rep = 'final_embeddings') #use_rep = 'final_embeddings'
    250 sc.tl.umap(Ann)
    251 img = sc.pl.umap(Ann, color = 'group', frameon = False) # cells

File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/scanpy/neighbors/__init__.py:139, in neighbors(adata, n_neighbors, n_pcs, use_rep, knn, random_state, method, metric, metric_kwds, key_added, copy)
    137     adata._init_as_actual(adata.copy())
    138 neighbors = Neighbors(adata)
--> 139 neighbors.compute_neighbors(
    140     n_neighbors=n_neighbors,
    141     knn=knn,
    142     n_pcs=n_pcs,
    143     use_rep=use_rep,
    144     method=method,
    145     metric=metric,
    146     metric_kwds=metric_kwds,
    147     random_state=random_state,
    148 )
    150 if key_added is None:
    151     key_added = 'neighbors'

File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/scanpy/neighbors/__init__.py:794, in Neighbors.compute_neighbors(self, n_neighbors, knn, n_pcs, use_rep, method, random_state, write_knn_indices, metric, metric_kwds)
    792     X = pairwise_distances(X, metric=metric, **metric_kwds)
    793     metric = 'precomputed'
--> 794 knn_indices, knn_distances, forest = compute_neighbors_umap(
    795     X, n_neighbors, random_state, metric=metric, metric_kwds=metric_kwds
    796 )
    797 # very cautious here
    798 try:

File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/scanpy/neighbors/__init__.py:305, in compute_neighbors_umap(X, n_neighbors, random_state, metric, metric_kwds, angular, verbose)
    301     from umap.umap_ import nearest_neighbors
    303 random_state = check_random_state(random_state)
--> 305 knn_indices, knn_dists, forest = nearest_neighbors(
    306     X,
    307     n_neighbors,
    308     random_state=random_state,
    309     metric=metric,
    310     metric_kwds=metric_kwds,
    311     angular=angular,
    312     verbose=verbose,
    313 )
    315 return knn_indices, knn_dists, forest

File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/umap/umap_.py:328, in nearest_neighbors(X, n_neighbors, metric, metric_kwds, angular, random_state, low_memory, use_pynndescent, n_jobs, verbose)
    325     n_trees = min(64, 5 + int(round((X.shape[0]) ** 0.5 / 20.0)))
    326     n_iters = max(5, int(round(np.log2(X.shape[0]))))
--> 328     knn_search_index = NNDescent(
    329         X,
    330         n_neighbors=n_neighbors,
    331         metric=metric,
    332         metric_kwds=metric_kwds,
    333         random_state=random_state,
    334         n_trees=n_trees,
    335         n_iters=n_iters,
    336         max_candidates=60,
    337         low_memory=low_memory,
    338         n_jobs=n_jobs,
    339         verbose=verbose,
    340         compressed=False,
    341     )
    342     knn_indices, knn_dists = knn_search_index.neighbor_graph
    344 if verbose:

File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/pynndescent/pynndescent_.py:722, in NNDescent.__init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_prob, n_search_trees, tree_init, init_graph, init_dist, random_state, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, parallel_batch_queries, verbose)
    719 else:
    720     copy_on_normalize = False
--> 722 data = check_array(data, dtype=np.float32, accept_sparse="csr", order="C")
    723 self._raw_data = data
    725 if not tree_init or n_trees == 0 or init_graph is not None:

File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/sklearn/utils/validation.py:921, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    915         raise ValueError(
    916             "Found array with dim %d. %s expected <= 2."
    917             % (array.ndim, estimator_name)
    918         )
    920     if force_all_finite:
--> 921         _assert_all_finite(
    922             array,
    923             input_name=input_name,
    924             estimator_name=estimator_name,
    925             allow_nan=force_all_finite == "allow-nan",
    926         )
    928 if ensure_min_samples > 0:
    929     n_samples = _num_samples(array)

File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/sklearn/utils/validation.py:161, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    144 if estimator_name and input_name == "X" and has_nan_error:
    145     # Improve the error message on how to handle missing values in
    146     # scikit-learn.
    147     msg_err += (
    148         f"\n{estimator_name} does not accept missing values"
    149         " encoded as NaN natively. For supervised learning, you might want"
   (...)
    159         "#estimators-that-handle-nan-values"
    160     )
--> 161 raise ValueError(msg_err)

ValueError: Input contains NaN.

ajitashree commented 1 year ago

Hi Rohit,

There seems to be an issue with the input. After preprocessing, it is becoming Nan because of

/home/subudhak/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/scDREAMER/scDREAMER/src/model.py:40: RuntimeWarning: divide by zero encountered in log
  log_library_size = np.log(np.sum(self.data_train, axis=1))

This might be because one of the 2000 genes in the input is not expressed for any cells (or its column sum becoming 0), which results in log(0) -> infinity.

Rohit-Satyam commented 1 year ago

Hi @ajitashree. I checked my data for such features and I don't have such genes. I used the following code to make the metrices

mtx <- t(as.matrix(GetAssayData(m, "RNA","data")))
tt <- log(colSums(mtx))
mtx <- mtx[,!is.infinite(tt)]
l <- log(rowSums(mtx))
table(is.infinite(l))
dim(mtx)
write.csv(mtx,"all_rna_norm.csv")

I then use this matrix to make AnnData and the error persists. I also ran the code that was giving the error using np.log(np.sum(self.data_train, axis=1)) and no it doesn't show presence of NaN value or Inf value

I can share the h5ad file via email, if you can look at it?

ajitashree commented 1 year ago

Sure @Rohit-Satyam, please share the h5ad, I will look at the data. I am also open to meeting online over Zoom.

Rohit-Satyam commented 1 year ago

Have send the files. If there is an access issue, please let me know!! And let's have a meeting over the zoom whenever you find time. I am inclined to find out if scDREAMER is performing better than Seurat integration on our data or not!!

ajitashree commented 1 year ago

@Rohit-Satyam I am able to access the data, which of the obs (column name) represents the batch key?

Rohit-Satyam commented 1 year ago

Hi I mentioned it in the code I shared via email. I will mention it here too


dreamer = model.scDREAMER(
        sess,
        epoch = 300,
        dataset_name = 'adata_bi_norm.h5ad',
        batch = 'batch',
        cell_type = 'labels.stage',
        name = name
        )

ajitashree commented 8 months ago

scDREAMER's readthedocs is available with details on all the processing steps.

Zafar-Lab / scDREAMER

How to make scDREAMER communicate with Seurat/Scanpy #2