MarioniLab / miloDE

Framework for sensitive DE testing (using neighbourhoods)
Other
57 stars 2 forks source link

Number of specific DE genes higher than number of DE genes? #34

Closed mihem closed 5 months ago

mihem commented 8 months ago

@amissarova I really like this cluster-free approach, however I am not sure if this result is plausible.

image

1) If i understood correctly "number of specific DE genes", means neighborhood specific. Shouldn't these numbers always be smaller than the number of DE genes (which mean number of DE genes in total between conditions, right?)? In the tutorial that's also the case, but not in my dataset. 2) The clusters that have high number of DE genes are very different from the clusters that have number of specific DE genes, which I find confusing. Again, is this a technical problem? In the tutorial the neighborhoods that have high DE genes, also have high specific DE genes.

image

DE testing

  de_stat <- miloDE::de_test_neighbourhoods(
    milo_DE,
    sample_id = "sample",
    design = ~ 0 + level2,
    covariates = c("level2"),
    contrasts = c("level2VN - level2CTRL"),
    output_type = "SCE",
    plot_summary_stat = TRUE,
    layout = "UMAP.SCVI.FULL",
    BPPARAM = NULL,
    verbose = TRUE,
    min_count = 10
  )

Thank you!!

amissarova commented 7 months ago

Hi @mihem , those are good points. While the expectation that those stats should give somewhat coherent results is fairly often happens in practice, it is not always true, and the reason is that different miloDE statistics. Speicifcally:

1) n-DE-genes will calculate number of genes based on p-value corrected across genes (which is BH FDR correction, across all tested genes which is usually about ~10k) 2) n-DE-specific-genes will calculate genes that are z-outliers using p-value corrected across nhoods (which is spatial BH correction, but number of hoods being smaller);

So being said that - it happens that different nhoods are being highlighted. However, i also find it very surprising the results you are observing. What are the settings you use there - what is the z-threshold? Have you dived in individual nhoods with strong signal and tried to understand what are genes that are driving it and why they dont pop up in the n-DE stat?

I would try to decrease z-threshold (for specific DE stat) and see if you are getting more comparable results

mihem commented 7 months ago

Hey @amissarova, thanks for your response!

Ah that makes sense.

I didn't specify the z threshold (following your tutorial), so just used the default, which seems to be ’-3’.

This is the plot with z score of -6

image

And z score of -20:

image

-> How to determine an appropriate z score? I would think that a z score -6 is maybe more plausible since it does not have soo many genes. On the other hand the pattern of neighborhoods with high numbers of specific DE genes (which is the most important result for us) seems similar and still confusing because so different.

How would yo u dig in individual nhoods regarding genes that are driving it?

An attempt would be (please correct me if i am wrong):

stat_de_magnitude |>
  dplyr::arrange(desc(n_specific_DE_genes)) |>
  head()

image

-> so neighborhood 37 seems to have a lot of specific DE genes.

pp_nhoods_37 <-
  de_stat@assays@data@listData$pval_corrected_across_nhoods |>
  data.frame() |>
  select(X37) |>
  rownames_to_column(var = "gene") |>
  tibble() |>
  dplyr::filter(X37 < 0.1)

only gives me 128 genes though? Is there a mistake in this approach?

When i compare this to pval_corrected_across_genes

p_genes_37 <-
  de_stat@assays@data@listData$pval_corrected_across_genes |>
  data.frame() |>
  select(X37) |>
  rownames_to_column(var = "gene") |>
  tibble() |>
  dplyr::filter(X37 < 0.1)

p_nhoods_37 |>
  anti_join(p_genes_37, join_by(gene == gene))

I get 114 genes .. don't see a particular pattern at first glance, but that's fairly difficult (but there should be many more genes)

image

Thanks!!

amissarova commented 7 months ago

Hi @mihem ,

1) How to select appropriate z - Im not entirely sure there is a fully unsupervised approach to do so. As an empirical rule of thumb, z threshold around 3/3.5 will give you a cut-off to select outliers, however, since in your case you observe quite a lot of outliers, it makes sense to play around with it and increase it. To the point, this particular function to select specifically DE genes is intended a rather exploratory tool to select potentially interesting neighbourhoods (and, along it, which genes are being specifically DE in there), in which yo can delve in with a more in depth analysis. Therefore, I dont think that a search of the most optimal z is necessarily a highly important priority - as you can see, changing your z-threshold results in different number of specifically DE genes but roughly same neighbourhood region of interest.

2) This bit of code:

  de_stat@assays@data@listData$pval_corrected_across_nhoods |>
  data.frame() |>
  select(X37) |>
  rownames_to_column(var = "gene") |>
  tibble() |>
  dplyr::filter(X37 < 0.1)

will return how many genes have pval_corrected_across_nhoods < 0.1, but it is not the same as z-scored pval_corrected_across_nhoods is lower than -3. You can have mismathces in both ways: you can have genes with corrected p-value < 0.1, but z-normalised corrected pvalues > -3 and likewise you can have corrected p-value > 0.1 and z-normalised corrected pvalues < -3 (if for this specific gene, in the most neighbourhoods its either not tested for at all or neighbourhoods are not DE).

So if you want to get the genes that are specifically DE, you need to first z-normalise your corrected pvalues and then select genes less than designated threshold.

Here is an approximate protocol that I would use to sort of try and understand what is going on in the neighbourhoods that have a lot of specific DE, but not as much n-DE:

  1. Identify neighbourhoods that have high n-DE-specific (and in your case, rather small n-DE)
  2. Select few top neighbourhoods from that list, easier to start with one (lets call it N1).
  3. Identify what are the genes that are annotated as specifically DE in N1. If you want to get this info, you can adapt some bits of this function: rank_neighbourhoods_by_DE_magnitude.R (dont forget to z-normalise).
  4. Look now into these genes - what is the reason they are specifically DE for N1. More specific sub-questions here:
    • Are these genes being tested in all/most neighbourhoods or only in the few including N1. One possibility (I think this might be going on here) is these neighbourhoods that you detect - they mark some specific sub-CT (in both reference and query, not necessarily DE) and that genes that you detect as specifically DE are the genes that mark this sub-CT and therefore are expressed in those neighbourhoods and not expressed (and therefore not tested for and have p-value NA)
    • Are these genes significantly DE in N1 or no? Coz if no, i think it supports hypothesis above.
    • For each gene, I would just plot UMAP colored with counts for the gene and faceted by condition, and looked into whats happening.

I think thats a good initial analysis to be performed. Lmk if what I said makes sense and if you have more questions!

mihem commented 7 months ago

Thanks, that makes sense. Trying to following your advice:

assay_de_specific <- assay(de_stat, "pval_corrected_across_nhoods")
idx_de_specific <- which(is.na(assay_de_specific))
assay_de_specific[idx_de_specific] <- 1
assay_de_specific_z <- t(apply(assay_de_specific, 1, function(x) {(x - mean(x, na.rm = TRUE))/sd(x, na.rm  = TRUE)}))
head(sort(assay_de_specific_z[,37][assay_de_specific_z[,37] < -3]), 25)

image

and only in neighborhood 37 it's not NA:

assay_de_specific["PPIAL4G",]

image

Plotting it as a feature plot split by condition (with order = TRUE in Seurat, which means that's the cells with an expression are always on top)

image

sorry, didn't know how to plot it by neighboorhood, which would be maybe easier to see then?

-> So if I unserstand this correctly, you are right with your assumption and PPIAL4G is only tested in this specific NH (although in the cell feature pplot it's also more broadyl expressed) , so not sure this nubmer of DE_specific_genes makes sense?

If i try for example HLA-DPB1 (which is one of the top DE if i do pseudobulk of the clusters nmSC where neighborhood 37 is located), it seems to be quite DE in that cluster (although also in other clusters)

image

Thank you!

amissarova commented 7 months ago

Hi @mihem , I'm not sure if there is a question in your last comment? I would think that HLA-DPB1 would also get detected in that subpopulation, but it doesnt have to be specifically DE - since its gonna be detected in multiple neighbourhoods.

I guess my advice is to treat this DE ranking function as exploratory, which sometimes might give interesting direction to pursue, but it seems in your case it ammends to rather nothing or perhaps even a little misleading. Seems in your case those genes are being detected not necessarily because they are DE (although some might), but simple because they are expressed in those neighbourhoods only. In either case, since its an exploratory function, of course, an additional follow up analysis is to be performed (like the one I was suggesting).

Did I miss your question then? If so, can you reinstate it a bit more clearly? thanks,

mihem commented 7 months ago

Thanks for your fast reply @amissarova .

One small question is how to plot counts or gene expression per neighborhood in a unap as you suggested with miloDE?

And a bigger question: if those genes like PPIAL4G are only significant because they are expressed in one neighborhood, wouldn t it make sense to only keep genes that are expressed in 3 or more neighborhoods?

Thanks

mihem commented 7 months ago

Sorry any further thoughts about this @amissarova ?

Thank you

amissarova commented 7 months ago

@mihem ,

  1. there is a function within plotting functions called plot_DE_single_gene. You can change settings to either plot logFC for all neighborhoods or only for significant ones.

  2. This is really up to you, its not within miloDE package to choose which genes to keep overall. If you think you dont want these genes (e.g. expressed in small number of cells), you can filter them out a priori. I cant advise for one or the other, since I think the importance of this filtering depends on downstream biological questions.

mihem commented 7 months ago

Thanks.

Well i would say that the difference between number of DE genes and number of DE specific genes is caused more by a technical issue (the gene only being present in one neighborhood) than biology (being truly DE between conditions), so removing those genes seems plausible. And this is probably true more generally.

Also I don't know how to systematically exlclude those genes that are only present in a neighborhood.

liz-is commented 6 months ago

Hi @amissarova,

First, thanks for the package - it's a cool and useful concept!

I'm having similar results to @mihem, where I have more "specific DE genes" than "DE genes", and some differences in their distribution across neighbourhoods. I'm still optimising the analysis though so I don't have a specific question about my results/interpretation. I just wanted to suggest that it would be really helpful to have a description in the vignette of how the "specific DE genes" are calculated, and why the Z-score of p-values approach is used. I couldn't find an explanation of why this approach is chosen and how to interpret the results in the vignette or in the paper, although I think I understand how to interpret it after reading this thread in detail and looking at the code. I think a bit more explanation of this in the vignette/docs would be helpful for a lot of users!

amissarova commented 5 months ago

Hi @liz-is and @mihem - thanks for your comments,

I took your suggestion and now updated the vignette that provides the description of the relevant function. Hopefully it will help to other users avoid the confusion that you guys faced. I would remove genes on my end tho, @mihem, since it seems it should be specific to a particular question in mind.

Please reopen the issue if there are more questions.