Zafar-Lab / Margaret

Metric learning-based graph-partitioned trajectory inference from single-cell data
MIT License

Is imputation necessary, and how does Margaret handle a control vs. treatment scenario? #37

Open Rohit-Satyam opened 1 year ago

Rohit-Satyam commented 1 year ago

Hi developers, I have a few queries!

First, I am trying to use Margaret for trajectory analysis. I have normalized and scaled counts obtained from Seurat for single cells. For our data, we know from other studies that the time point at which these cells were harvested is one at which very few genes are expressed. In Seurat, we see on average 25-38 genes expressed per cell, and I am therefore skeptical about whether imputation should be performed. Is imputation an essential part of your pipeline? I ask since I see it in all three tutorials.

Second, we have a control vs. treatment case (without technical or biological replicates), i.e. we have wild-type cells and drug-treated cells, and we wish to perform trajectory analysis. In the tutorials, I didn't see any integration step before trajectory analysis, so I am confused about whether I have to carry out trajectory analysis separately on the control and drug-treated data.

hamimzafar commented 1 year ago

Hi Rohit,

Thank you very much for your query. Imputation is not essential; it was used in the tutorials because it was recommended for those particular datasets.

Margaret does not perform integration itself. You are free to choose your own algorithm to integrate the control and drug-treated data first; the resulting embedding can then be used as input to Margaret.
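
For instance, a minimal sketch of one way to do this with Harmony via scanpy (a hypothetical example, not part of Margaret; it assumes a `batch` column in `adata.obs` marking control vs. drug-treated cells, and requires the `harmonypy` package):

```python
import scanpy as sc

# PCA first; Harmony then adjusts the PCA coordinates across batches.
sc.pp.pca(adata, n_comps=50)  # number of components is illustrative
sc.external.pp.harmony_integrate(adata, key='batch')  # writes adata.obsm['X_pca_harmony']

# The integrated embedding can then be handed to Margaret, e.g.
# train_metric_learner(adata, obsm_data_key='X_pca_harmony', ...)
```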

Rohit-Satyam commented 1 year ago

Thanks @hamimzafar. I have one more question. The train_metric_learner function requires the user to set obsm_data_key, which you have set to X_magic_pca. I have realized that, given the nature of our data, ALRA imputation fits our requirements better than MAGIC. ALRA returns an imputed matrix, and I was thinking of using this as input to the train_metric_learner function.

What's confusing to me is whether MAGIC gives us PCA embeddings or an imputed matrix (I hope MAGIC's fit_transform function also generates an imputed matrix), and whether I should first run your run_pca function or PHATE on the ALRA-imputed matrix and then use those embeddings as input to train_metric_learner.

Rohit-Satyam commented 1 year ago

Okay, so I used the ALRA-imputed matrix and got the following error. My input matrix dimensions are 48773 (cells) x 5300 (genes). Also, is there a way to speed this up?

Full error traceback:
```
Generating initial clusters
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[115], line 14
     11 with warnings.catch_warnings():
     12     # Filter out user warnings from PyTorch about saving scheduler state
     13     warnings.simplefilter("ignore")
---> 14 train_metric_learner(adata, n_episodes=5, n_metric_epochs=30, obsm_data_key='After_imput', code_size=10,
     15     backend='leiden', device='cpu', save_path='AI_metric',
     16     cluster_kwargs={'random_state': 0, 'resolution': 1.0}, nn_kwargs={'random_state': 0, 'n_neighbors': 50},
     17     trainer_kwargs={'optimizer': 'SGD', 'lr': 0.01, 'batch_size': 256}
     18 )

File ~/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/Margaret/margaret/utils/util.py:19, in compute_runtime.<locals>.f(*args, **kwargs)
     16 @wraps(func)
     17 def f(*args, **kwargs):
     18     start_time = time.time()
---> 19     r = func(*args, **kwargs)
     20     end_time = time.time()
     21     print(f"Runtime for {func.__name__}(): {end_time - start_time}")

File ~/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/Margaret/margaret/train_metric.py:37, in train_metric_learner(adata, n_episodes, n_metric_epochs, code_size, obsm_data_key, device, random_state, save_path, backend, nn_kwargs, trainer_kwargs, cluster_kwargs, loss_kwargs)
     35 # Generate initial clusters
     36 print("Generating initial clusters")
---> 37 communities, score = determine_cell_clusters(
     38     adata,
     39     obsm_key=obsm_data_key,
     40     backend=backend,
     41     cluster_key="metric_clusters",
     42     nn_kwargs=nn_kwargs,
     43     **cluster_kwargs,
     44 )
     45 clustering_scores.append(score)
     47 # Dataset

File ~/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/Margaret/margaret/utils/util.py:19, in compute_runtime.<locals>.f(*args, **kwargs)
     16 @wraps(func)
     17 def f(*args, **kwargs):
     18     start_time = time.time()
---> 19     r = func(*args, **kwargs)
     20     end_time = time.time()
     21     print(f"Runtime for {func.__name__}(): {end_time - start_time}")

File ~/Documents/zena_scrnaseq_singleR/extra_analysis_June2023/Margaret/margaret/utils/util.py:189, in determine_cell_clusters(data, obsm_key, backend, cluster_key, nn_kwargs, **kwargs)
    186     score = None
    187 elif backend == "leiden":
    188     # Compute nearest neighbors
--> 189     sc.pp.neighbors(data, use_rep=obsm_key, **nn_kwargs)
    190     sc.tl.leiden(data, key_added=cluster_key, **kwargs)
    191     data.obs[cluster_key] = data.obs[cluster_key].to_numpy().astype(np.int_)

File ~/miniconda3/envs/margaret/lib/python3.11/site-packages/scanpy/neighbors/__init__.py:139, in neighbors(adata, n_neighbors, n_pcs, use_rep, knn, random_state, method, metric, metric_kwds, key_added, copy)
    137     adata._init_as_actual(adata.copy())
    138 neighbors = Neighbors(adata)
--> 139 neighbors.compute_neighbors(
    140     n_neighbors=n_neighbors,
    141     knn=knn,
    142     n_pcs=n_pcs,
    143     use_rep=use_rep,
    144     method=method,
    145     metric=metric,
    146     metric_kwds=metric_kwds,
    147     random_state=random_state,
    148 )
    150 if key_added is None:
    151     key_added = 'neighbors'

File ~/miniconda3/envs/margaret/lib/python3.11/site-packages/scanpy/neighbors/__init__.py:794, in Neighbors.compute_neighbors(self, n_neighbors, knn, n_pcs, use_rep, method, random_state, write_knn_indices, metric, metric_kwds)
    792     X = pairwise_distances(X, metric=metric, **metric_kwds)
    793     metric = 'precomputed'
--> 794 knn_indices, knn_distances, forest = compute_neighbors_umap(
    795     X, n_neighbors, random_state, metric=metric, metric_kwds=metric_kwds
    796 )
    797 # very cautious here
    798 try:

File ~/miniconda3/envs/margaret/lib/python3.11/site-packages/scanpy/neighbors/__init__.py:305, in compute_neighbors_umap(X, n_neighbors, random_state, metric, metric_kwds, angular, verbose)
    301 from umap.umap_ import nearest_neighbors
    303 random_state = check_random_state(random_state)
--> 305 knn_indices, knn_dists, forest = nearest_neighbors(
    306     X,
    307     n_neighbors,
    308     random_state=random_state,
    309     metric=metric,
    310     metric_kwds=metric_kwds,
    311     angular=angular,
    312     verbose=verbose,
    313 )
    315 return knn_indices, knn_dists, forest

File ~/miniconda3/envs/margaret/lib/python3.11/site-packages/umap/umap_.py:328, in nearest_neighbors(X, n_neighbors, metric, metric_kwds, angular, random_state, low_memory, use_pynndescent, n_jobs, verbose)
    325 n_trees = min(64, 5 + int(round((X.shape[0]) ** 0.5 / 20.0)))
    326 n_iters = max(5, int(round(np.log2(X.shape[0]))))
--> 328 knn_search_index = NNDescent(
    329     X,
    330     n_neighbors=n_neighbors,
    331     metric=metric,
    332     metric_kwds=metric_kwds,
    333     random_state=random_state,
    334     n_trees=n_trees,
    335     n_iters=n_iters,
    336     max_candidates=60,
    337     low_memory=low_memory,
    338     n_jobs=n_jobs,
    339     verbose=verbose,
    340     compressed=False,
    341 )
    342 knn_indices, knn_dists = knn_search_index.neighbor_graph
    344 if verbose:

File ~/miniconda3/envs/margaret/lib/python3.11/site-packages/pynndescent/pynndescent_.py:722, in NNDescent.__init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_prob, n_search_trees, tree_init, init_graph, init_dist, random_state, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, parallel_batch_queries, verbose)
    719 else:
    720     copy_on_normalize = False
--> 722 data = check_array(data, dtype=np.float32, accept_sparse="csr", order="C")
    723 self._raw_data = data
    725 if not tree_init or n_trees == 0 or init_graph is not None:

File ~/miniconda3/envs/margaret/lib/python3.11/site-packages/sklearn/utils/validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    877         array = xp.astype(array, dtype, copy=False)
    878     else:
--> 879         array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
    880 except ComplexWarning as complex_warning:
    881     raise ValueError(
    882         "Complex data not supported\n{}\n".format(array)
    883     ) from complex_warning

File ~/miniconda3/envs/margaret/lib/python3.11/site-packages/sklearn/utils/_array_api.py:185, in _asarray_with_order(array, dtype, order, copy, xp)
    182 xp, _ = get_namespace(array)
    183 if xp.__name__ in {"numpy", "numpy.array_api"}:
    184     # Use NumPy API to support order
--> 185     array = numpy.asarray(array, order=order, dtype=dtype)
    186     return xp.asarray(array, copy=copy)
    187 else:

ValueError: setting an array element with a sequence. The requested array would exceed the maximum number of dimension of 32.
```
kpandey008 commented 1 year ago

Hi, from the error logs it looks like the problem happens when computing nearest neighbors using scanpy, and it could be related to the type of data you are passing to the nearest-neighbor method. A plausible solution would be to check the parameters required by sc.pp.neighbors in the Scanpy documentation.
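
One quick sanity check (a hedged sketch, assuming the `After_imput` key from the call in the traceback above): the final ValueError, "setting an array element with a sequence", typically means the stored matrix is an object array with ragged rows rather than a dense 2-D float matrix.

```python
import numpy as np

# Coerce the embedding to a dense 2-D float32 matrix before sc.pp.neighbors;
# this either fixes the input or fails immediately with a clearer message.
X = adata.obsm['After_imput']
adata.obsm['After_imput'] = np.asarray(X, dtype=np.float32)
print(adata.obsm['After_imput'].shape)  # expect (n_cells, n_features)
```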

Rohit-Satyam commented 1 year ago

Hi @kpandey008 @hamimzafar,

I think it is important to mention in the notebook what kind of data your train_metric_learner function ingests, because when I used non-negative ALRA values it threw the above-mentioned error (unlike MAGIC, which gives both positive and negative values). But when I run PCA on the ALRA-imputed data using your custom run_pca function from utils.util and then use these PCA scores (X_pca) as input to train_metric_learner, the function runs without any error. But I am not sure whether this is the right input. Is it?

Also, could you please add parameter descriptions for this function? When I look at ?train_metric_learner, I don't see any, so it's hard for me to understand what values could be passed to backend or nn_kwargs, for instance.

Rohit-Satyam commented 1 year ago

> Margaret does not perform integration itself. You are free to choose your own algorithm to integrate the control and drug-treated data first; the resulting embedding can then be used as input to Margaret.

When you say we can use the resulting embedding as input to Margaret, which step of your tutorial are you referring to? The embedding-computation step where we use the train_metric_learner function?

Rohit-Satyam commented 1 year ago

> Thanks @hamimzafar. I have one more question. The train_metric_learner function requires the user to set obsm_data_key, which you have set to X_magic_pca. I have realized that, given the nature of our data, ALRA imputation fits our requirements better than MAGIC. ALRA returns an imputed matrix, and I was thinking of using this as input to the train_metric_learner function.
>
> What's confusing to me is whether MAGIC gives us PCA embeddings or an imputed matrix (I hope MAGIC's fit_transform function also generates an imputed matrix), and whether I should first run your run_pca function or PHATE on the ALRA-imputed matrix and then use those embeddings as input to train_metric_learner.

Also, kindly reply to this.

kpandey008 commented 1 year ago

Hi, please find the responses to your queries below:

> I think it is important to mention in the notebook what kind of data your train_metric_learner function ingests, because when I used non-negative ALRA values it threw the above-mentioned error (unlike MAGIC, which gives both positive and negative values). But when I run PCA on the ALRA-imputed data using your custom run_pca function from utils.util and then use these PCA scores (X_pca) as input to train_metric_learner, the function runs without any error. But I am not sure whether this is the right input. Is it?
>
> Also, could you please add parameter descriptions for this function? When I look at ?train_metric_learner, I don't see any, so it's hard for me to understand what values could be passed to backend or nn_kwargs, for instance.

There is no constraint on the type of data the method can be applied to. The error you are referring to is most likely related to a scanpy method for computing nearest neighbors. I can work on improving the documentation, but that might take some time as I have very limited bandwidth. Feel free to raise a PR if you would be interested!

> Margaret does not perform integration itself. You are free to choose your own algorithm to integrate the control and drug-treated data first; the resulting embedding can then be used as input to Margaret.
>
> When you say we can use the resulting embedding as input to Margaret, which step of your tutorial are you referring to? The embedding-computation step where we use the train_metric_learner function?

The metric learning stage in Margaret takes an initial embedding matrix as input and "refines" it into a new set of embeddings based on a metric learning loss. In this case, the initial embedding can be one obtained by integration or by imputation with ALRA!

> What's confusing to me is whether MAGIC gives us PCA embeddings or an imputed matrix (I hope MAGIC's fit_transform function also generates an imputed matrix), and whether I should first run your run_pca function or PHATE on the ALRA-imputed matrix and then use those embeddings as input to train_metric_learner.
>
> Also, kindly reply to this.

You don't need to apply MAGIC/PHATE to your data (unless you feel you should). Margaret can work with any type of data embedding. In our tutorial, we apply MAGIC as a preprocessing step because the original paper that generated the data did so. This would be different for your specific use case. The same goes for PCA (you can use any other dimensionality-reduction method, like LLE, etc.). All these design choices are left open to the user in Margaret.
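
To make the pipeline shape concrete, here is a minimal sketch of the MAGIC-then-PCA preprocessing that the tutorials' X_magic_pca key suggests (component count and key name are illustrative; an ALRA-imputed matrix could be substituted at the same point):

```python
import magic  # the magic-impute package
from sklearn.decomposition import PCA

# MAGIC's fit_transform returns a denoised expression matrix (cells x genes),
# not an embedding.
X_imputed = magic.MAGIC().fit_transform(adata.to_df(), genes='all_genes')

# The embedding Margaret consumes comes from a separate dimensionality
# reduction step on the imputed matrix.
adata.obsm['X_magic_pca'] = PCA(n_components=50).fit_transform(X_imputed)
```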

Rohit-Satyam commented 1 year ago

Hi @kpandey008. Thanks for your prompt response. When you say

> There is no constraint on the type of data the method can be applied to.

it troubles me, since most functions specify the kind of data they ingest. I do see you using PCA scores in "MARGARET applied to simulated datasets" and "MARGARET applied to scRNA-seq data for early embryogenesis", but denoised gene expression in "MARGARET applied to scRNA-seq data for early human hematopoiesis". I am still curious whether there is an underlying assumption about the data distribution made by train_metric_learner, or by the scanpy functions you use within it. If the choice of input type does not matter, it should not affect the results, right? I will stick to X_pca until I get clarity on this.

Good news: the error I encountered above disappears when I run the code on another machine (strange; it could have been a memory issue, I guess). However, I still checked your util.py code where sc.pp.neighbors is used and ran the steps one by one, and the error disappeared (code given below):

```python
import scanpy as sc

# Because I wish to use the ALRA-imputed matrix as input, I choose use_rep="X"
sc.pp.neighbors(adata_ai, use_rep="X", random_state=12345, n_neighbors=50)
sc.tl.leiden(adata_ai, key_added="clusters", random_state=12345)
adata_ai.obs["clusters"] = adata_ai.obs["clusters"].to_numpy().astype(int)
```

> I can work on improving the documentation, but that might take some time as I have very limited bandwidth. Feel free to raise a PR if you would be interested!

I would be happy to work on the documentation if provided the necessary support from your end. But given my passing acquaintance with Python and your prior engagement, I doubt I would be able to decipher all future errors on my own. Nevertheless, I will try.

> In this case, the initial embedding can be one obtained by integration or by imputation with ALRA!

I will run PCA on the ALRA-imputed matrix and use those PCA scores for now.
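
For the record, a sketch of that plan (key names and component count are illustrative; the train_metric_learner arguments mirror the call in the traceback above, and the import assumes the margaret package directory is on the path):

```python
import numpy as np
from sklearn.decomposition import PCA
from train_metric import train_metric_learner

# PCA scores of the ALRA-imputed matrix become the initial embedding.
X_alra = np.asarray(adata.obsm['After_imput'], dtype=np.float32)
adata.obsm['X_alra_pca'] = PCA(n_components=50, random_state=0).fit_transform(X_alra)

# Margaret then refines this embedding via metric learning.
train_metric_learner(
    adata, n_episodes=5, n_metric_epochs=30, obsm_data_key='X_alra_pca', code_size=10,
    backend='leiden', device='cpu', save_path='AI_metric',
    cluster_kwargs={'random_state': 0, 'resolution': 1.0},
    nn_kwargs={'random_state': 0, 'n_neighbors': 50},
    trainer_kwargs={'optimizer': 'SGD', 'lr': 0.01, 'batch_size': 256},
)
```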