gersteinlab / PsychENCODE_SingleCell_Integrative

Integrative Analyses across 388 human brain samples at single-cell resolution
Other
11 stars 2 forks source link

DEG supporting data #5

Open patzaw opened 3 months ago

patzaw commented 3 months ago

Hi,

Sorry to bother you again.

I'm having difficulties interpreting the provided differential expression values.

I took values from the ASD_DEGcombined.csv file, which seem to be coherent with values displayed on the psychscreen the web interface.

However, the expression data from the pseudobulk matrix do not seem to support the differential expression results.

The script below shows how I compile data from your sources and provides 2 examples of genes identified as significantly differentially expressed in astrocytes or microglia of patients with ASD compared to controls. The logCPM boxplots do not seem to support this.

Could you have a look and tell me if I did anything wrong or how I could get DE supporting data?

Thank you so much for your support.

Patrice


library(dplyr)
library(data.table)
library(readr)
library(stringr)
library(ggplot2)
library(ggpubr)

#############################################################################@
## Files and general information ----

## Adapt the following lines to your file organization
data_dir <- "brainSCOPE/integrative_files"
md_file <- file.path(data_dir, "PEC2_sample_metadata.txt")
all_deg_file <- file.path(data_dir, "ASD_DEGcombined.csv")
pseudobulk_dir <- file.path(data_dir, "output-pseudobulk-expr/pseudobulk_expr")

md <- read_tsv(md_file)
all_deg <- read_csv(all_deg_file)
asd_cohorts <- md |> 
  filter(Disorder=="ASD") |> 
  pull(Cohort) |> 
  unique()

#############################################################################@
## DEG in Astro ----

cell_type <- "Astro"

### DEG ----
deg <- all_deg |>
  filter(cell_type==!!cell_type) |> 
  arrange(pvalue)

### Pseudobulk values ----
exp_file <- file.path(
  pseudobulk_dir,
  sprintf("%s.expr.bed.gz", cell_type)
)
ct_exp <- data.table::fread(exp_file)
ga <- ct_exp[,1:6] |> 
  as_tibble()
ct_exp <- ct_exp[,-c(1:6)]
ct_exp <- as.matrix(ct_exp)
rownames(ct_exp) <- ga$gene

### Top DEG ----
deg
#> # A tibble: 19,097 × 9
#>      ...1 gene   baseMean log2FoldChange lfcSE  stat   pvalue     padj cell_type
#>     <dbl> <chr>     <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl> <chr>    
#>  1 131101 NR4A3     44.0           1.74  0.777 308.  7.36e-69 1.41e-64 Astro    
#>  2 131072 AC108…     7.97          1.60  1.03  108.  2.56e-25 2.44e-21 Astro    
#>  3 129438 CD44     216.            0.556 0.626  52.2 4.89e-13 3.11e- 9 Astro    
#>  4 131108 G0S2       8.20          1.78  1.31   29.7 5.00e- 8 2.39e- 4 Astro    
#>  5 112073 TNFAI…    34.5          -3.33  0.697  20.5 5.90e- 6 2.25e- 2 Astro    
#>  6 112164 AC011…    10.9          -1.51  0.363  18.0 2.18e- 5 6.95e- 2 Astro    
#>  7 131088 OXTR      20.2           1.66  0.392  16.9 3.88e- 5 1.06e- 1 Astro    
#>  8 130747 NDRG1     88.9           1.04  0.257  15.8 7.03e- 5 1.66e- 1 Astro    
#>  9 112563 SLC9C2    70.0          -0.980 0.248  15.6 7.81e- 5 1.66e- 1 Astro    
#> 10 130340 IRS2     214.            0.797 0.214  13.8 2.01e- 4 3.83e- 1 Astro    
#> # ℹ 19,087 more rows

### Example ----

goi <- "NR4A3"
p1 <- ct_exp[goi,,drop=FALSE] |> 
  t() |> 
  as_tibble(rownames="Individual_ID") |> 
  setNames(c("Individual_ID", "logCPM")) |> 
  inner_join(md, by="Individual_ID") |> 
  filter(Disorder %in% c("ASD", "Control", "control")) |> 
  ggplot(aes(x=Disorder, y=logCPM, col=Disorder)) +
  geom_boxplot() +
  scale_y_continuous(breaks=-40:40)
p2 <- ct_exp[goi,,drop=FALSE] |> 
  t() |> 
  as_tibble(rownames="Individual_ID") |> 
  setNames(c("Individual_ID", "logCPM")) |> 
  inner_join(md, by="Individual_ID") |> 
  filter(Disorder %in% c("ASD", "Control", "control")) |> 
  filter(Cohort %in% asd_cohorts) |>
  ggplot(aes(x=Cohort, y=logCPM, col=Disorder)) +
  geom_boxplot() +
  scale_y_continuous(breaks=-40:40)
p <- ggarrange(
  p1, p2,
  labels = c("All cohorts", "Cohorts with ASD"),
  common.legend = TRUE,
  legend = "bottom"
)
annotate_figure(p, top = sprintf("%s in %s", goi, cell_type))


#############################################################################@
## DEG in Micro ----

cell_type <- "Micro"

### DEG ----
deg <- all_deg |>
  filter(cell_type==!!cell_type) |> 
  arrange(pvalue)

### Pseudobulk values ----
exp_file <- file.path(
  pseudobulk_dir,
  sprintf(
    "%s.expr.bed.gz",
    paste(cell_type, "PVM", sep=".")
  )
)
ct_exp <- data.table::fread(exp_file)
ga <- ct_exp[,1:6] |> 
  as_tibble()
ct_exp <- ct_exp[,-c(1:6)]
ct_exp <- as.matrix(ct_exp)
rownames(ct_exp) <- ga$gene

### Top DEG ----
deg
#> # A tibble: 17,737 × 9
#>     ...1 gene   baseMean log2FoldChange  lfcSE  stat   pvalue     padj cell_type
#>    <dbl> <chr>     <dbl>          <dbl>  <dbl> <dbl>    <dbl>    <dbl> <chr>    
#>  1 60230 AC009…    85.7          -0.552 0.661  265.  1.24e-59 1.74e-55 Micro    
#>  2 73247 RUNX1    800.            1.02  0.165   37.0 1.17e- 9 8.19e- 6 Micro    
#>  3 72558 SKAP2    262.            0.751 0.128   33.5 6.99e- 9 3.25e- 5 Micro    
#>  4 56895 IQCK      16.9          -1.60  0.301   28.0 1.21e- 7 4.22e- 4 Micro    
#>  5 71490 NUMB     227.            0.502 0.0969  26.6 2.56e- 7 7.15e- 4 Micro    
#>  6 56227 PBX4       3.74         -2.63  0.520   26.0 3.39e- 7 7.77e- 4 Micro    
#>  7 72377 RB1      265.            0.692 0.136   25.7 3.90e- 7 7.77e- 4 Micro    
#>  8 72706 PAG1     365.            0.792 0.157   24.3 8.11e- 7 1.42e- 3 Micro    
#>  9 56175 GALNT…     6.88         -3.36  0.701   23.6 1.22e- 6 1.62e- 3 Micro    
#> 10 73037 IKBIP     25.2           0.912 0.188   23.6 1.22e- 6 1.62e- 3 Micro    
#> # ℹ 17,727 more rows

### Example ----
goi <- "RUNX1"
p1 <- ct_exp[goi,,drop=FALSE] |> 
  t() |> 
  as_tibble(rownames="Individual_ID") |> 
  setNames(c("Individual_ID", "logCPM")) |> 
  inner_join(md, by="Individual_ID") |> 
  filter(Disorder %in% c("ASD", "Control", "control")) |> 
  ggplot(aes(x=Disorder, y=logCPM, col=Disorder)) +
  geom_boxplot() +
  scale_y_continuous(breaks=-40:40)
p2 <- ct_exp[goi,,drop=FALSE] |> 
  t() |> 
  as_tibble(rownames="Individual_ID") |> 
  setNames(c("Individual_ID", "logCPM")) |> 
  inner_join(md, by="Individual_ID") |> 
  filter(Disorder %in% c("ASD", "Control", "control")) |> 
  filter(Cohort %in% asd_cohorts) |>
  ggplot(aes(x=Cohort, y=logCPM, col=Disorder)) +
  geom_boxplot() +
  scale_y_continuous(breaks=-40:40)
p <- ggarrange(
  p1, p2,
  labels = c("All cohorts", "Cohorts with ASD"),
  common.legend = TRUE,
  legend = "bottom"
)
annotate_figure(p, top = sprintf("%s in %s", goi, cell_type))

Created on 2024-07-08 with reprex v2.1.1

prashantemani commented 2 months ago

Hi, Sorry for the delay in responding, but I can check with my colleague who ran the DEG analysis and get back to you. A quick response would be to say that the covariate correction would influence this but I will try to get you a more complete answer.

prashantemani commented 2 months ago

Hi, My colleague Ran Meng provided the following detailed discussion. Please take a look and see if it helps:

  1. Batch effects. A batch effect can arise when directly combining raw data from different batches/cohorts. For example, in the left panel, control data are sourced from all 12 cohorts, but disease data is only collected from a single or a few cohorts, this discrepancy can introduce batch effects. The blue box “Control” in your figure is presumably only from the“Velmeshev_et_al” cohort.
  2. Balancing the number of sample in disease and control. For example, the DevBrain Cohort has 3 control samples and 9 ASD patients. This cohort is not balanced and 3 control samples might not be representative enough.
  3. Normalization and outlier removal. DESeq2 assumes a negative binomial distribution for count matrix. In some low-fraction cell types, certain samples may only have very few cells counted, which should be excluded as outliers. For brain samples, individuals younger than 13 years should also be excluded from the analysis.
  4. Covariates. The DEG calculation needs to include more covariates for correction, which could influence the result. For example, gender, age, and experimental batch (especially when combining data from different cohorts) were all considered.
  5. There is a useful DEG calculation general pipeline with code examples that you can follow and might be useful: [https://hbctraining.github.io/scRNA-seq/lessons/pseudobulk_DESeq2_scrnaseq.html](DESeq2 tutorial)

Please let us know if this discussion clarifies things.

patzaw commented 2 months ago

Hi

Thank you for these clarifications.

I understand the different technical considerations that can influence the differential expression analysis.

It would help if you could share the model you applied to do the analysis. Also, how would you visualize the individual expression data to get a better understanding of the DEG results? I proposed boxplots in my former post, but it seems that those don't work in this case. Do you have a better alternative?

Thank you again for your support.