KrishnaswamyLab / MELD

Quantifying experimental perturbations at single cell resolution
Other
105 stars 9 forks source link

Can we use MELD on single replicate #56

Open Rohit-Satyam opened 2 years ago

Rohit-Satyam commented 2 years ago

Hi Developers!!

We have single-cell data for two-time points each containing control and knockout sample (t1_ctr, t1_ko, t2_ctr,tr_ko). We wish to find out how if the knockout cells are arrested in lifecycle more at time T1 or at Time T2. Is this achievable because we have one replicate per time_period?

Also, in the tutorial and the following code chunk:

with Parallel(n_jobs=36) as p:
    for knn in knn_range:
        # doing this outside the parallel loop because building the graph takes the longest
        benchmarker.fit_graph(adata.X, knn=knn)
        print(knn)
        curr_results = p(delayed(simulate_pdf_calculate_likelihood)(benchmarker, seed, beta) \
                                       for seed in range(25) for beta in beta_range)
        curr_results = pd.DataFrame(curr_results, columns = ['MSE', 'seed', 'beta', 'knn'])
        results.append(curr_mse)

what is adata.X as it is giving an error

dburkhardt commented 2 years ago

Hi Rohit! You can run MELD on each pair of samples independently (i.e. t1_ctr vs t1_ko; t2_ctr vs tr_ko), then compare the magnitude of the MELD likelihoods if they were calculated on the same graph. To know if you can build a single graph, you can run PHATE on the datasets and then examine the amount of overlap between the pairs of samples. If they are totally overlapping, then you're good to go with one graph. Otherwise, I would run MELD start to finish separately on each pair of samples.

adata here is an Annotated Data Matrix (AnnData). https://anndata.readthedocs.io/en/latest/

Please let me know if this is helpful or if you still have questions

Rohit-Satyam commented 2 years ago

Below is my PHATE scatter plot using default parameters when I ran PHATE on all 4 Samples together as a single matrix. Is there a measure of overlap that we can use to determine if we can build a single graph or does it rely on visual inspection of the PHATE plot?

Besides, for AnnData (adata.X) assay, should I use the raw counts (data) or square root transformed data (data_sqrt). I mostly do Single Cell in R so Anndata is new to me. The documentation here says it has to be just a data matrix.

Edit 1: It must be mentioned in the documentation that adata.X must be libnorm matrix and not the raw counts. I found it in a issue here

temp1

dburkhardt commented 2 years ago

This looks pretty clearly like there's minimal overlap between the T1 vs T2 datasets so I would analyze them separately

stanleyjs commented 2 years ago

Hi Rohit,

I am a co-author on MELD. I want to chime in on your question about quantifying the batch effect. I agree that by visual inspection alone it is clear that T1 and T2 are not well-mixed.

However, there are ways to quantify this without visual inspection. A very simple metric will be a multinomial test. First, treat the probability of a sample being in batch t1 or t2 as a multinomial with a single count. For example, let's say you have 400 t1 and 600 t2 cells. Then, for a given cell, if the batches are distributed uniformly, you have the probabilities in the null hypothesis that p(t1) = 0.4 and p(t2) = 0.6. Call this the global distribution. Next, for each cell, you essentially check whether its local neighborhood of k cells resembles that global multinomial - how close are the local batch distributions to the global distribution? This quantifies how "non-uniform" each cell's neighborhood is relative to the global frequency of the batches. For example, if the global probabilities are p(t1) = 0.4 and p(t2) = 0.6, but the local probabilities for a neighborhood of radius 25 around some cell c are p(t1 | c25) = 0.9 and p(t2 | c25) = 0.1 , then it's clear that t1 is enriched around c and the local distribution is very different from the global null. If you want to get a total metric of non-uniformity based on the local cell distributions, you just take some aggregation of the test statistic over the cells, such as the mean, AUC, etc.

The precise test you need is this https://en.wikipedia.org/wiki/Multinomial_test. You can use Pearson's chi-squared test for this.

I think that this or a very similar approach was demonstrated for biological purposes in "A test metric for assessing single-cell RNA-seq batch correction" by Buttner et. al (2019) in Nature Methods. https://www.nature.com/articles/s41592-018-0254-1

Jay

Rohit-Satyam commented 2 years ago

@dburkhardt I split up my assay matrix separately for time points T1 and T2 and I was planning to run a benchmark for parameter search separately on them. However, I got an error curr_mse not defined. Is it a typo in the tutorial?

Edit 1: Okay, I got it. It's curr_results as discussed here.

Rohit-Satyam commented 2 years ago

I don't understand why Likelihood estimates for both Control and Knockout are all 1. Is it because of a lack of replicates? I will reiterate how I performed the analysis.

  1. We have single-cell RNAseq data obtained from the lifecycle of a pathogen. The cells were harvested from two-time points in the lifecycle T1 and T2 (each time point contains a control and knockout sample). Thus we have 4 samples.
  2. I performed the preprocessing using Dropletutils and Seurat and performed DE analysis thereafter. I was then asked to find out if the knockout cells are arrested in lifecycle more at time T1 or at Time T2.
  3. I, therefore, wish to use MELD to obtain Likelihood Estimates. I split the assay and metadata into T1_data and T2_data, ran PHATE and Parameter search and MELD separately. I ran MELD on Raw Counts matrix (not on sqrt normalized). But I get the likelihood estimates of 1 for both control and knockout.

Is there another metric that I can use to quantify and say the cells at time T1 are more likely to get arrested because of knockout than at time T2??

image image

Rohit-Satyam commented 2 years ago

Hi!! I was able to resolve the above problem now that I have two biological replicates for T1 (now we have restricted our analysis to T1 only). I have the following queries:

EDIT 1: When having a Control-knockout design, we can justify reassignment of some knockout cells transcriptionally as control cells (because some of the cells might not have undergone successful knockout). However, how can we justify some cells in control samples that are assigned knockout sample labels by MELD (they should be 100 percent assigned to control tags).

  1. Now that I have Average Likelihood Estimates for both Control and Knockout Samples, can I assign new sample labels for each cell based on these probabilities and then check how many cells have new sample labels (as a measure of QC) which replicate is good? For example something like this: image

Seems like rep2 have the least number of cells getting new sample labels (13%) (the likelihood estimates agree with original sample labels) as compared to rep1 (21%). But how to decide sample tags for marginal cases such as a cell with T1_ctrl.avg=0.524 and T1_ko.avg=0.476. Currently, for such cells, I assign the sample tag as T1_ctrl.

  1. Also in the tutorial you say

    I've already looked through the clusters and determined which the number of clusters looks best to me.

How did you shortlist what number of clusters look best to pick in each cluster? For instance, I have the following clusters shown below. The first plot in each line represents the Average likelihood estimate of T1_ko. Is the following choice correct?

picked_clusters = {
    1:2,2:3,3:5}

Also I am a bit lost about the choice of the following parameters in code:

## code block 41
## How do I decide n_clusters parameter, can I use default 10
meld.VertexFrequencyCluster(n_clusters = 3)

## code block 42
## Since I am unable to view the contents of vfc_op_per_cluster, I wish to understand what does code block 42 does and 
## the loop is given below

for n in [2,3,4,5]:                ##why 2,3,4,5
        clusters_by_n[n] = curr_vfc.predict(n)

image

Rohit-Satyam commented 2 years ago

If you are planning to organize a workshop on MELD, I would like to join. I have been watching the tutorials already present on youtube and I have many questions. How can I join the slack channel of old workshop where other MELD users can help me out?