Open Rohit-Satyam opened 2 years ago
Hi Rohit! You can run MELD on each pair of samples independently (i.e. t1_ctr vs t1_ko; t2_ctr vs tr_ko), then compare the magnitude of the MELD likelihoods if they were calculated on the same graph. To know if you can build a single graph, you can run PHATE on the datasets and then examine the amount of overlap between the pairs of samples. If they are totally overlapping, then you're good to go with one graph. Otherwise, I would run MELD start to finish separately on each pair of samples.
adata here is an Annotated Data Matrix (AnnData). https://anndata.readthedocs.io/en/latest/
Please let me know if this is helpful or if you still have questions
Below is my PHATE scatter plot using default parameters when I ran PHATE on all 4 Samples together as a single matrix. Is there a measure of overlap that we can use to determine if we can build a single graph or does it rely on visual inspection of the PHATE plot?
Besides, for AnnData (adata.X
) assay, should I use the raw counts (data
) or square root transformed data (data_sqrt
). I mostly do Single Cell in R so Anndata is new to me. The documentation here says it has to be just a data matrix.
Edit 1: It must be mentioned in the documentation that adata.X
must be libnorm
matrix and not the raw counts. I found it in a issue here
This looks pretty clearly like there's minimal overlap between the T1 vs T2 datasets so I would analyze them separately
Hi Rohit,
I am a co-author on MELD. I want to chime in on your question about quantifying the batch effect. I agree that by visual inspection alone it is clear that T1 and T2 are not well-mixed.
However, there are ways to quantify this without visual inspection. A very simple metric will be a multinomial test. First, treat the probability of a sample being in batch t1 or t2 as a multinomial with a single count. For example, let's say you have 400 t1 and 600 t2 cells. Then, for a given cell, if the batches are distributed uniformly, you have the probabilities in the null hypothesis that p(t1) = 0.4
and p(t2) = 0.6
. Call this the global distribution. Next, for each cell, you essentially check whether its local neighborhood of k
cells resembles that global multinomial - how close are the local batch distributions to the global distribution? This quantifies how "non-uniform" each cell's neighborhood is relative to the global frequency of the batches. For example, if the global probabilities are p(t1) = 0.4
and p(t2) = 0.6
, but the local probabilities for a neighborhood of radius 25 around some cell c
are p(t1 | c25) = 0.9
and p(t2 | c25) = 0.1
, then it's clear that t1
is enriched around c
and the local distribution is very different from the global null. If you want to get a total metric of non-uniformity based on the local cell distributions, you just take some aggregation of the test statistic over the cells, such as the mean, AUC, etc.
The precise test you need is this https://en.wikipedia.org/wiki/Multinomial_test. You can use Pearson's chi-squared test for this.
I think that this or a very similar approach was demonstrated for biological purposes in "A test metric for assessing single-cell RNA-seq batch correction" by Buttner et. al (2019) in Nature Methods. https://www.nature.com/articles/s41592-018-0254-1
Jay
@dburkhardt I split up my assay matrix separately for time points T1 and T2 and I was planning to run a benchmark for parameter search separately on them. However, I got an error curr_mse not defined
. Is it a typo in the tutorial?
Edit 1: Okay, I got it. It's curr_results
as discussed here.
I don't understand why Likelihood estimates for both Control and Knockout are all 1. Is it because of a lack of replicates? I will reiterate how I performed the analysis.
Dropletutils
and Seurat
and performed DE analysis thereafter. I was then asked to find out if the knockout cells are arrested in lifecycle more at time T1 or at Time T2.Is there another metric that I can use to quantify and say the cells at time T1 are more likely to get arrested because of knockout than at time T2??
Hi!! I was able to resolve the above problem now that I have two biological replicates for T1 (now we have restricted our analysis to T1 only). I have the following queries:
EDIT 1: When having a Control-knockout design, we can justify reassignment of some knockout cells transcriptionally as control cells (because some of the cells might not have undergone successful knockout). However, how can we justify some cells in control samples that are assigned knockout sample labels by MELD (they should be 100 percent assigned to control tags).
Seems like rep2 have the least number of cells getting new sample labels (13%) (the likelihood estimates agree with original sample labels) as compared to rep1 (21%). But how to decide sample tags for marginal cases such as a cell with T1_ctrl.avg=0.524 and T1_ko.avg=0.476
. Currently, for such cells, I assign the sample tag as T1_ctrl
.
I've already looked through the clusters and determined which the number of clusters looks best to me.
How did you shortlist what number of clusters look best
to pick in each cluster? For instance, I have the following clusters shown below. The first plot in each line represents the Average likelihood estimate of T1_ko
. Is the following choice correct?
picked_clusters = {
1:2,2:3,3:5}
Also I am a bit lost about the choice of the following parameters in code:
## code block 41
## How do I decide n_clusters parameter, can I use default 10
meld.VertexFrequencyCluster(n_clusters = 3)
## code block 42
## Since I am unable to view the contents of vfc_op_per_cluster, I wish to understand what does code block 42 does and
## the loop is given below
for n in [2,3,4,5]: ##why 2,3,4,5
clusters_by_n[n] = curr_vfc.predict(n)
If you are planning to organize a workshop on MELD, I would like to join. I have been watching the tutorials already present on youtube and I have many questions. How can I join the slack channel of old workshop where other MELD users can help me out?
Hi Developers!!
We have single-cell data for two-time points each containing control and knockout sample (t1_ctr, t1_ko, t2_ctr,tr_ko). We wish to find out how if the knockout cells are arrested in lifecycle more at time T1 or at Time T2. Is this achievable because we have one replicate per time_period?
Also, in the tutorial and the following code chunk:
what is
adata.X
as it is giving an error