Closed Rohit-Satyam closed 8 months ago
Thank you Rohit, I have answered your questions in parts below.
Part 1: 1) You can alter the preprocessing of the input data in the utils.py file. 2) scDREAMER takes an AnnData object (scanpy) as input. The output of scDREAMER is a CSV file with 10-dimensional embeddings. The .h5ad format can also be read in R for downstream analysis. The output CSV file can be read on its own or merged into the AnnData object, depending on the use case for downstream analysis in Python or R.
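A minimal sketch of the merge step, assuming the CSV is a plain numeric matrix with one row per cell, in the same cell order as the AnnData object, and with no header or index column (check your file; this layout is an assumption). The helper name `attach_embedding` and the key `X_scdreamer` are hypothetical and not part of scDREAMER.

```python
import numpy as np

def attach_embedding(adata, csv_path, key="X_scdreamer"):
    """Load a latent matrix from CSV and store it in adata.obsm[key].

    Assumes a headerless, comma-separated numeric file whose rows are in
    the same cell order as `adata`.
    """
    latent = np.loadtxt(csv_path, delimiter=",", ndmin=2)
    if latent.shape[0] != adata.n_obs:
        raise ValueError(
            f"{latent.shape[0]} rows in {csv_path}, but adata has {adata.n_obs} cells"
        )
    adata.obsm[key] = latent
    return adata
```

With scanpy, the stored embedding could then be used through `sc.pp.neighbors(adata, use_rep="X_scdreamer")`, and the object saved with `adata.write_h5ad(...)` for use in R.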
We are working on Read the Docs documentation and will release better documentation of scDREAMER soon.
Part 2: Yes, scDREAMER is applicable to experimental data and can be used to integrate datasets from multiple time points. We can help in interpreting and analysing the results.
Hi @ajitashree. I see 5 files being generated in my current directory ending with latent_matrix_100/150/200/250/299.csv. Which one should I choose?
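The numeric suffixes look like epoch numbers. The `train_cluster` lines visible in the traceback later in this thread evaluate when `ep % 50 == 0 and ep >= 100`, and apparently once more after the last epoch, which with `epoch = 300` would give 100, 150, 200, 250 and 299. If that reading is right, the highest-numbered file comes from the fully trained model. A small sketch for picking it; the helper name `latest_latent_matrix` is hypothetical and the file-name pattern is assumed from the names above.

```python
import re

def latest_latent_matrix(filenames):
    """Return the latent-matrix CSV with the highest epoch suffix, or None."""
    pattern = re.compile(r"latent_matrix_(\d+)\.csv$")
    best_name, best_epoch = None, -1
    for name in filenames:
        match = pattern.search(name)
        if match and int(match.group(1)) > best_epoch:
            best_name, best_epoch = name, int(match.group(1))
    return best_name

# Epochs at which the loop in the traceback would write a file, assuming
# epoch = 300 and one final evaluation at the last epoch index (299).
checkpoint_epochs = [ep for ep in range(300) if ep % 50 == 0 and ep >= 100] + [299]
```

The earlier files can still be useful for comparing how the embedding changes during training.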
I am trying to run scDREAMER on my normalized data and I keep getting this error. I have checked for NaN values using different methods:
table(is.na(t(as.matrix(GetAssayData(m, "RNA","data")))))
FALSE
294203000
Warning message:
In asMethod(object) :
sparse->dense coercion: allocating vector of size 2.2 GiB
temp=sc.read("adata_bi_norm.h5ad")
np.count_nonzero(np.isnan(temp.X))
Out[11]: 0
temp.X.size - np.count_nonzero(np.isnan(temp.X))
Out[12]:294203000
The error:
2023-07-16 20:11:49.289214: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
reading data
Data set to work on:
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
(55510, 2000)
[[1. 0. 0.]
[1. 0. 0.]
[1. 0. 0.]
...
[0. 1. 0.]
[0. 1. 0.]
[0. 1. 0.]]
(55510, 3)
encoder input shape Tensor("concat:0", shape=(?, 2003), dtype=float32)
WARNING:tensorflow:From /home/subudhak/miniconda3/envs/scdreamer/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
WARNING:tensorflow:From /home/subudhak/miniconda3/envs/scdreamer/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
decoder input shape Tensor("concat_2:0", shape=(?, 13), dtype=float32)
KL gaussian z Tensor("mul_10:0", shape=(?,), dtype=float32)
KL gaussian l Tensor("mul_9:0", shape=(?,), dtype=float32)
WARNING:tensorflow:From /home/subudhak/miniconda3/envs/scdreamer/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
/home/subudhak/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/scDREAMER/scDREAMER/src/model.py:40: RuntimeWarning: divide by zero encountered in log
log_library_size = np.log(np.sum(self.data_train, axis=1))
/home/subudhak/miniconda3/envs/scdreamer/lib/python3.9/site-packages/numpy/core/_methods.py:233: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
Cluster DRA on DataSet adata_bi_norm.h5ad ...
2023-07-16 20:12:00.968281: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
Epoch : [0] , a_loss = nan, d_loss: inf , g_loss: inf, db_loss: nan
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[3], line 19
8 with tf.Session(config = run_config) as sess:
10 dreamer = model.scDREAMER(
11 sess,
12 epoch = 300,
(...)
16 name = name
17 )
---> 19 dreamer.train_cluster()
File ~/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/scDREAMER/scDREAMER/src/model.py:286, in train_cluster(self)
283 if (ep % 50 == 0 and ep >= 100):
284 self.eval_cluster_on_test_(ep)
--> 286 self.eval_cluster_on_test(ep)
File ~/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/scDREAMER/scDREAMER/src/utils.py:249, in eval_cluster_on_test(self, epoch)
246 Ann.obsm['final_embeddings'] = latent_matrix
247 Ann.obs['group'] = labels.astype(str)
--> 249 sc.pp.neighbors(Ann, use_rep = 'final_embeddings') #use_rep = 'final_embeddings'
250 sc.tl.umap(Ann)
251 img = sc.pl.umap(Ann, color = 'group', frameon = False) # cells
File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/scanpy/neighbors/__init__.py:139, in neighbors(adata, n_neighbors, n_pcs, use_rep, knn, random_state, method, metric, metric_kwds, key_added, copy)
137 adata._init_as_actual(adata.copy())
138 neighbors = Neighbors(adata)
--> 139 neighbors.compute_neighbors(
140 n_neighbors=n_neighbors,
141 knn=knn,
142 n_pcs=n_pcs,
143 use_rep=use_rep,
144 method=method,
145 metric=metric,
146 metric_kwds=metric_kwds,
147 random_state=random_state,
148 )
150 if key_added is None:
151 key_added = 'neighbors'
File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/scanpy/neighbors/__init__.py:794, in Neighbors.compute_neighbors(self, n_neighbors, knn, n_pcs, use_rep, method, random_state, write_knn_indices, metric, metric_kwds)
792 X = pairwise_distances(X, metric=metric, **metric_kwds)
793 metric = 'precomputed'
--> 794 knn_indices, knn_distances, forest = compute_neighbors_umap(
795 X, n_neighbors, random_state, metric=metric, metric_kwds=metric_kwds
796 )
797 # very cautious here
798 try:
File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/scanpy/neighbors/__init__.py:305, in compute_neighbors_umap(X, n_neighbors, random_state, metric, metric_kwds, angular, verbose)
301 from umap.umap_ import nearest_neighbors
303 random_state = check_random_state(random_state)
--> 305 knn_indices, knn_dists, forest = nearest_neighbors(
306 X,
307 n_neighbors,
308 random_state=random_state,
309 metric=metric,
310 metric_kwds=metric_kwds,
311 angular=angular,
312 verbose=verbose,
313 )
315 return knn_indices, knn_dists, forest
File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/umap/umap_.py:328, in nearest_neighbors(X, n_neighbors, metric, metric_kwds, angular, random_state, low_memory, use_pynndescent, n_jobs, verbose)
325 n_trees = min(64, 5 + int(round((X.shape[0]) ** 0.5 / 20.0)))
326 n_iters = max(5, int(round(np.log2(X.shape[0]))))
--> 328 knn_search_index = NNDescent(
329 X,
330 n_neighbors=n_neighbors,
331 metric=metric,
332 metric_kwds=metric_kwds,
333 random_state=random_state,
334 n_trees=n_trees,
335 n_iters=n_iters,
336 max_candidates=60,
337 low_memory=low_memory,
338 n_jobs=n_jobs,
339 verbose=verbose,
340 compressed=False,
341 )
342 knn_indices, knn_dists = knn_search_index.neighbor_graph
344 if verbose:
File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/pynndescent/pynndescent_.py:722, in NNDescent.__init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_prob, n_search_trees, tree_init, init_graph, init_dist, random_state, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, parallel_batch_queries, verbose)
719 else:
720 copy_on_normalize = False
--> 722 data = check_array(data, dtype=np.float32, accept_sparse="csr", order="C")
723 self._raw_data = data
725 if not tree_init or n_trees == 0 or init_graph is not None:
File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/sklearn/utils/validation.py:921, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
915 raise ValueError(
916 "Found array with dim %d. %s expected <= 2."
917 % (array.ndim, estimator_name)
918 )
920 if force_all_finite:
--> 921 _assert_all_finite(
922 array,
923 input_name=input_name,
924 estimator_name=estimator_name,
925 allow_nan=force_all_finite == "allow-nan",
926 )
928 if ensure_min_samples > 0:
929 n_samples = _num_samples(array)
File ~/miniconda3/envs/scdreamer/lib/python3.9/site-packages/sklearn/utils/validation.py:161, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
144 if estimator_name and input_name == "X" and has_nan_error:
145 # Improve the error message on how to handle missing values in
146 # scikit-learn.
147 msg_err += (
148 f"\n{estimator_name} does not accept missing values"
149 " encoded as NaN natively. For supervised learning, you might want"
(...)
159 "#estimators-that-handle-nan-values"
160 )
--> 161 raise ValueError(msg_err)
ValueError: Input contains NaN.
Hi Rohit,
There seems to be an issue with the input. After preprocessing, it becomes NaN because of
/home/subudhak/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/scDREAMER/scDREAMER/src/model.py:40: RuntimeWarning: divide by zero encountered in log
log_library_size = np.log(np.sum(self.data_train, axis=1))
This might be because one of the 2000 genes in the input is not expressed in any cell (i.e. its column sum is 0), which results in log(0) -> -infinity.
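A sketch of how one might check for this, assuming the training matrix is a dense cells × genes NumPy array like the (55510, 2000) array printed in the log. Note that `np.sum(..., axis=1)` sums over each row, so in that layout a zero would correspond to a cell with no counts across the selected genes rather than a gene with no counts across cells; checking both directions is cheap. The function names `zero_sum_report` and `drop_zero_cells` are hypothetical.

```python
import numpy as np

def zero_sum_report(X):
    """Return indices of all-zero rows (cells) and all-zero columns (genes)."""
    X = np.asarray(X, dtype=float)
    zero_cells = np.flatnonzero(X.sum(axis=1) == 0)
    zero_genes = np.flatnonzero(X.sum(axis=0) == 0)
    return zero_cells, zero_genes

def drop_zero_cells(X):
    """Keep only rows with a positive sum, so log(row sum) never hits log(0)."""
    X = np.asarray(X, dtype=float)
    return X[X.sum(axis=1) > 0]
```

If `zero_cells` is non-empty, filtering those cells before training would avoid the divide-by-zero warning from the library-size computation.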
Hi @ajitashree. I checked my data for such features and I don't have such genes. I used the following code to make the matrix:
mtx <- t(as.matrix(GetAssayData(m, "RNA","data")))
tt <- log(colSums(mtx))
mtx <- mtx[,!is.infinite(tt)]
l <- log(rowSums(mtx))
table(is.infinite(l))
dim(mtx)
write.csv(mtx,"all_rna_norm.csv")
I then use this matrix to make the AnnData object, and the error persists. I also ran the code that was giving the warning, np.log(np.sum(self.data_train, axis=1)), and no, it doesn't show the presence of NaN or Inf values.
I can share the h5ad file via email, if you can look at it?
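One thing that may be worth checking: the NaN counts above cover 294203000 entries, which is 55510 × 5300, while the log shows the model training on a (55510, 2000) matrix. So the checks may have run on more genes than the model actually sees, and a cell with non-zero counts overall can still sum to zero over a 2000-gene subset. A small sketch of that check, where `gene_idx` stands for whichever columns the preprocessing keeps (how scDREAMER selects them is not shown in this thread, so that part is an assumption, and `zero_cells_after_subset` is a hypothetical name):

```python
import numpy as np

def zero_cells_after_subset(X, gene_idx):
    """Indices of cells whose values sum to zero over the selected genes."""
    X = np.asarray(X, dtype=float)
    return np.flatnonzero(X[:, gene_idx].sum(axis=1) == 0)

# Toy example: cell 1 has counts overall, but none in the kept genes (0 and 2).
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [4.0, 1.0, 0.0]])
print(zero_cells_after_subset(X, [0, 2]))
```

Running the same check on the matrix after the model's own gene selection, rather than on the full matrix, would show whether this is what triggers the warning.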
Sure @Rohit-Satyam, please share the h5ad, I will look at the data. I am also open to meeting online over Zoom.
I have sent the files. If there is an access issue, please let me know! And let's have a meeting over Zoom whenever you find time. I am keen to find out whether scDREAMER performs better than Seurat integration on our data.
@Rohit-Satyam I am able to access the data. Which obs column represents the batch key?
Hi, I mentioned it in the code I shared via email. I will mention it here too:
dreamer = model.scDREAMER(
sess,
epoch = 300,
dataset_name = 'adata_bi_norm.h5ad',
batch = 'batch',
cell_type = 'labels.stage',
name = name
)
scDREAMER's readthedocs is available with details on all the processing steps.
Hi developers
I was going through the tutorials and realized that, although the simple implementation of your integration algorithm using model.scDREAMER saves a lot of trouble at the user's end, the tutorials do not discuss the preprocessing steps or how the integration results of scDREAMER can be used with other packages such as scanpy or Seurat for further analysis such as DE analysis or marker identification. An additional tutorial on how to import the results of scDREAMER into R and manipulate the Seurat object by changing the embeddings would be helpful. I would also like to request, if possible, that the documentation be expanded a little, for example with function parameter descriptions and which parameters should stay at their defaults and which should be changed.
Finally, I would like to ask whether this package is suitable for an experimental design with wild type vs. drug-treated samples at two or more time points (without replicates). We have parasite cells harvested at two time points, with each time point having control vs. drug-treated samples. We also have a reference atlas available for this parasite, so we are planning a reference-guided integration, but we are also thinking of trying your unsupervised method to see if we find anything new.