Rohit-Satyam opened this issue 1 year ago
Hi Rohit,
Thank you very much for your query. Imputation is not essential; we used it for those datasets because it was recommended for them.
Margaret does not perform integration itself. You are free to choose your own integration algorithm to integrate the control and drug-treated data first, and the resulting embedding can be used as input to Margaret.
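For instance, one of many possible choices is Scanpy's Harmony wrapper; a minimal sketch, assuming your control and drug-treated cells are already concatenated into a single AnnData object with a "condition" column in .obs (and that harmonypy is installed):

import scanpy as sc

## One possible integration choice (illustration only; Margaret itself is agnostic to it)
sc.pp.pca(adata, n_comps=50)                               # embedding to be batch-corrected
sc.external.pp.harmony_integrate(adata, key="condition")   # writes adata.obsm["X_pca_harmony"]
## adata.obsm["X_pca_harmony"] can then serve as the embedding passed to Margaret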
Thanks @hamimzafar. I have one more question. The train_metric_learner function requires the user to set obsm_data_key, which you have set to X_magic_pca. Now, I realized that ALRA imputation will fit our requirement better than MAGIC, given the nature of our data. ALRA returns an imputed matrix, and I was thinking of using it as the input to the train_metric_learner function.

What's confusing to me is whether MAGIC gives us PCA embeddings or an imputed matrix (I hope MAGIC's fit_transform function also generates an imputed matrix), and whether I should first run your run_pca function or PHATE on the ALRA-imputed matrix and then use those embeddings as input to train_metric_learner.
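Concretely, what I had in mind is something along these lines (just a sketch: X_alra_pca and alra_pca_scores are names I made up, and I am assuming from your tutorial that train_metric_learner only needs the AnnData object plus the obsm key):

## alra_pca_scores would be PCA scores computed on the ALRA-imputed matrix (hypothetical name);
## train_metric_learner is imported from the Margaret codebase (import omitted here)
adata.obsm["X_alra_pca"] = alra_pca_scores
train_metric_learner(adata, obsm_data_key="X_alra_pca")   # the tutorials use obsm_data_key="X_magic_pca"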
Okay, so I did use the ALRA-imputed matrix and I get the following error:
My input matrix dimensions are 48,773 cells × 5,300 genes. Also, is there a way to speed this up?
Hi, from the error logs it looks like the problem happens when computing nearest neighbors using Scanpy, and it could be related to the type of data you are passing to the nearest-neighbor method. A plausible solution could be to check the parameters needed by sc.pp.neighbors in the Scanpy documentation.
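For example, something along these lines could help narrow it down (a sketch only; I am assuming the ALRA values were stored under adata.obsm["X_alra"], so adjust the key to wherever you put them):

import numpy as np
import scanpy as sc

## Check that the representation handed to the neighbor search is a plain 2-D float array
print(adata.obsm["X_alra"].shape, adata.obsm["X_alra"].dtype)
adata.obsm["X_alra"] = np.asarray(adata.obsm["X_alra"], dtype=np.float32)
sc.pp.neighbors(adata, use_rep="X_alra", n_neighbors=50, random_state=12345)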
Hi @kpandey008 @hamimzafar,
I think it is important to mention in the notebook what kind of data your train_metric_learner function would ingest, because when I used non-negative ALRA values it threw the above-mentioned error (unlike MAGIC, which gives both positive and negative values). But when I run PCA on the ALRA-imputed data using your custom run_pca function from utils.util and then use these PCA scores (X_pca) as the input to train_metric_learner, the function runs without any error. I am not sure if this is the right input, though; is it?
Also, could you please add parameter descriptions for this function? When I look at ?train_metric_learner, I don't see any, so it's hard for me to understand what possible values could be passed to backend or nn_kwargs, for instance.
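For now, one way I can at least list the accepted parameter names and defaults is Python's inspect module (this assumes train_metric_learner has been imported from wherever it lives in your repo):

import inspect

## Prints the parameter names and their defaults, e.g. obsm_data_key, backend, nn_kwargs
print(inspect.signature(train_metric_learner))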
Margaret does not perform integration itself. You are free to choose your own integration algorithm to integrate the control and drug-treated data first, and the resulting embedding can be used as input to Margaret.
When you say we can use the resulting embedding as input to Margaret, which step of your tutorial were you referring to? The embedding-computation step where we use the train_metric_learner function?
Thanks @hamimzafar. I have one more question. The train_metric_learner function requires the user to set obsm_data_key, which you have set to X_magic_pca. Now, I realized that ALRA imputation will fit our requirement better than MAGIC, given the nature of our data. ALRA returns an imputed matrix, and I was thinking of using it as the input to the train_metric_learner function.
What's confusing to me is whether MAGIC gives us PCA embeddings or an imputed matrix (I hope MAGIC's fit_transform function also generates an imputed matrix), and whether I should first run your run_pca function or PHATE on the ALRA-imputed matrix and then use those embeddings as input to train_metric_learner.
Also, kindly reply to this.
Hi, please find the responses to your queries below:
I think it is important to mention in the notebook what kind of data your train_metric_learner function would ingest, because when I used non-negative ALRA values it threw the above-mentioned error (unlike MAGIC, which gives both positive and negative values). But when I run PCA on the ALRA-imputed data using your custom run_pca function from utils.util and then use these PCA scores (X_pca) as the input to train_metric_learner, the function runs without any error. I am not sure if this is the right input, though; is it?
Also, could you please add parameter descriptions for this function? When I look at ?train_metric_learner, I don't see any, so it's hard for me to understand what possible values could be passed to backend or nn_kwargs, for instance.
There is no constraint on the type of data that the method can be applied to. The error you are referring to is most likely related to a scanpy method for computing nearest neighbors. I can work on improving the documentation but that might take some time as I have very limited bandwidth to work on this. Feel free to raise a PR if you would be interested!
Margaret does not perform integration itself. You are free to choose your own integration algorithm to integrate the control and drug-treated data first, and the resulting embedding can be used as input to Margaret.
When you say we can use the resulting embedding as input to Margaret, which step of your tutorial were you referring to? The embedding-computation step where we use the train_metric_learner function?
The metric learning stage in Margaret takes an initial embedding matrix as input and "refines" it to generate a new set of embeddings based on a metric learning loss. In this case, the initial embedding can be one obtained by integration/imputation using ALRA!
What's confusing to me is whether MAGIC gives us PCA embeddings or an imputed matrix (I hope MAGIC's fit_transform function also generates an imputed matrix), and whether I should first run your run_pca function or PHATE on the ALRA-imputed matrix and then use those embeddings as input to train_metric_learner. Also, kindly reply to this.
You don't need to apply MAGIC/PHATE to your data (unless you feel it is needed). Margaret can work with any type of data embeddings. In our tutorial, we apply MAGIC as a preprocessing step because the original paper that generated the data did so; this would be different for your specific use case. The same goes for PCA (you can use any other dimensionality-reduction method, such as LLE). All of these design choices are left open to the user in Margaret.
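As an illustration that the embedding choice is open, here is a sketch using scikit-learn's locally linear embedding in place of PCA (parameter values are arbitrary, and adata.X is assumed to hold a dense, imputed expression matrix):

from sklearn.manifold import LocallyLinearEmbedding

## Any dimensionality reduction can produce the initial embedding for Margaret
lle = LocallyLinearEmbedding(n_components=10, n_neighbors=30)
adata.obsm["X_lle"] = lle.fit_transform(adata.X)
## "X_lle" could then be used as the obsm_data_key instead of "X_magic_pca"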
Hi @kpandey008. Thanks for your prompt response. When you say
There is no constraint on the type of data that the method can be applied to.
it troubles me, since most functions specify the kind of data they ingest. I do see you using PCA scores in "MARGARET applied to simulated datasets" and "MARGARET applied to scRNA-seq data for early embryogenesis", but denoised gene expression in "MARGARET applied to scRNA-seq data for early human hematopoiesis". I am still curious whether the train_metric_learner function, or the scanpy functions you use within it, makes an underlying assumption about the data distribution. If the choice of input data is not important, it should not affect the results, right? I will stick to X_pca until I get clarity about this.
Good news: the error I encountered above disappears when I run it on another machine (strange, though I guess it could have been a memory issue). However, I still checked your code in util.py where sc.pp.neighbors is used and ran the steps one by one, and the error disappeared (code given below):
import scanpy as sc

## Because I wish to use the ALRA-imputed matrix as an input, I choose use_rep="X"
sc.pp.neighbors(adata_ai, use_rep="X", random_state=12345, n_neighbors=50)
sc.tl.leiden(adata_ai, key_added="clusters", random_state=12345)
## Convert the Leiden cluster labels from categorical strings to integers
adata_ai.obs["clusters"] = adata_ai.obs["clusters"].to_numpy().astype(int)
I can work on improving the documentation but that might take some time as I have very limited bandwidth to work on this. Feel free to raise a PR if you would be interested!
I would be happy to work on the documentation if provided the necessary support from your end. But given my passing acquaintance with Python and your prior engagements, I doubt I would be able to decipher all the errors that occur in the future on my own. Nevertheless, I will try.
In this case, the initial embedding can be one obtained by integration/imputation using ALRA!
I will run PCA on the ALRA-imputed matrix and then use these PCA scores for now.
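For reference, something like the following is what I plan to do (using Scanpy's PCA here as a stand-in for your run_pca from utils.util; the train_metric_learner call is assumed from the tutorial):

import scanpy as sc

## PCA on the ALRA-imputed values stored in adata_ai.X; scores land in adata_ai.obsm["X_pca"]
sc.pp.pca(adata_ai, n_comps=50)
train_metric_learner(adata_ai, obsm_data_key="X_pca")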
Hi developers, I have a few queries!
First, I am trying to use Margaret for trajectory analysis. I have normalized and scaled counts obtained from Seurat for single cells. For our data, we know from other studies that the time point at which these cells were harvested is when very few genes are expressed; in Seurat we see on average 25-38 genes expressed per cell, and therefore I am skeptical about whether imputation should be performed. Is imputation an essential part of your pipeline, since I see it in all three tutorials?
Second, we have a control vs. treatment case (without technical or biological replicates), i.e., we have wild-type cells and drug-treated cells, and we wish to perform trajectory analysis. In the tutorials I didn't see any integration step before trajectory analysis, so I am confused about whether I have to carry it out separately on the control and drug-treated data.