Cross-species/dataset comparison

Hi,

I don't think I've come across any threads for this. I was wondering if there are any tutorials/protocols for cross-species clustering conducted in the manuscript. I wanted to test this for multiple datasets with the same species and cross-species as well.

Please correct me if I am wrong, for this type of analysis (based on the manuscript):

Regulons are first calculated for individual datasets.
All datasets are merged into one expression matrix made up of the intersection of genes (or homologues) across datasets .
Regulons calculated from each of the datasets are combined (incase of cross-species converted to homologues)
Each cell is then tested for enrichment of all these regulons using AUCell in the merged expression matrix.

If this is the procedure, I was wondering if there are any functions for steps 3:

To combine regulons from multiple datasets e.g. currently I have separate csv files of regulons for each dataset.
For cross-species comparison to convert genes of regulons to the homologues of the reference species e.g. if comparing human with mouse, regulons constructed in mouse would have their gene names converted to human homologues so that they could be combined with human regulon, all of which would be tested for enrichment using AUCell in each cell of the merged human + mouse expression matrix.

Thanks!

Hi @anirudhpatir

The steps you described are almost the same as the steps that were done in the manuscript.

Only step 3 was not done, the regulons were not combined only the gene names were converted to their homolog gene name.

I cite from the methods

"The 259 human regulons from the Darmanis et al.12 data set and the human homologs of the mouse regulons were evaluated on this merged matrix to obtain the binary regulon activity containing 410 regulons."

For example if you look at this figure (from the manuscript), both the mouse and human DLX1 regulon are shown seperatly (they are in fact two seperate regulons indeed). .

As for the question on how to convert gene names to their homologs name in a different species. We don't provide a script for that. But this information is stored in gene trees databases for example in Ensembl (for more info see: https://www.ensembl.org/Help/View?id=137) and can be accessed using biomart. Biomart has an api which can be accessed both in R (biomaRt: https://bioconductor.org/packages/release/bioc/html/biomaRt.html) and python (pybiomart: https://github.com/jrderuiter/pybiomart).

When you start this analysis you'll quickly notice that there not always a one to one mapping of genes in different species (one human gene might map to multiple genes in mouse and vice verca). You could opt to only stick to the one-to-one mapping genes or use a more complicated approach, that's op to you.

I hope this helps.

Best,

Seppe

Below you can find the actual code which was used to generate these figures in the manuscript.

Note! This code is very old, so might be a bit outdated.


#####################################Create figures for Human Brain data ############################################

date()
setwd("/media/seq-srv-06/lcb/cbravo/HumanBrain_2.0")
load("new_esetHumanBrain_SC.RData")
setwd("/media/seq-srv-06/lcb/cbravo/Comparison_HM_all")
library(SCENIC)
library(data.table)

#####################################Comparison#############################################

#1. Load data

load("/media/seq-srv-06/lcb/saibar/1.scRNAseq/2016-04_SCENIC/00_WorkflowApplication/SCENIC_Workflow_MouseBrain/int/4.1_tsneModuleMatrix_PCA.RData")
Mousetsne <- tsneModuleMatrixPCA
rm(tsneModuleMatrixPCA)
load("/media/seq-srv-06/lcb/cbravo/HumanBrain_2.0/Processing/int/4.2_tsneAUC_PCA.RData")
Humantsne <- tsneAUCPCA
rm(tsneAUCPCA)
load("/media/seq-srv-06/lcb/cbravo/HumanBrain_2.0/Figure production/HumanVSMouse_Data/Mouse_cellsAUC.RData")
MouseAUC <- cellsAUC
rm(cellsAUC)
load("/media/seq-srv-06/lcb/cbravo/HumanBrain_2.0/Figure production/HumanVSMouse_Data/Human_cellsAUC.RData")
HumanAUC <- cellsAUC
rm(cellsAUC)
load("/media/seq-srv-06/lcb/saibar/1.scRNAseq/2016-04_SCENIC/00_WorkflowApplication/SCENIC_Workflow_MouseBrain/int/3.1_tfModules.RData")
MousetfModules <- tfModules
rm(tfModules)
load("/media/seq-srv-06/lcb/cbravo/HumanBrain_2.0/AUCell/int/3.1_tfModules.RData")
HumantfModules <- tfModules
rm(tfModules)

#2. Check with TFs have homologs for each

library(biomaRt)
human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

Mousetfs <- gsub( "\\s\\(\\d+g)","",colnames(MouseAUC))
Humantfs <- gsub( "\\s\\(\\d+g)","",colnames(HumanAUC))
Homog_tfs_MH <- getLDS(attributes = "mgi_symbol", filters = "mgi_symbol", values = Mousetfs, mart = mouse, attributesL = "hgnc_symbol", martL = human)
Homog_tfs_MH <- Homog_tfs_MH[-c(74, 13),] #Taf1 and Zfp160 have more than one homolog.
rownames(Homog_tfs_MH) <- Homog_tfs_MH[,1]
Homog_tfs_HM <- getLDS(attributes = "hgnc_symbol", filters = "hgnc_symbol", values = Humantfs, mart = human, attributesL = "mgi_symbol", martL = mouse)
#Homog_tfs_HM <- Homog_tfs_HM[-c(93,30,31,32,33,34,35,36,37,78,79,80,123,124,42,43,69),]
Homog_tfs_HM <- Homog_tfs_HM[-c(75,93,26:33,63,64,65,114,115,48,49),]
#SSX3,GTF3C2,ZFY and ZNF354A have more than one homolog.
#Homog_tfs_HM[Homog_tfs_HM[,1] == "ZNF354A",]
rownames(Homog_tfs_HM) <- Homog_tfs_HM[,1]
#When we have more than one homolog, we take one as representative.

#3. Create TF modules

# Homolog to mouse modules in human (mouse modules in human)
tfModulesM_H <- list()

for (name in Mousetfs){
  namex <- paste(name, "")
  Mouse_module <- colnames(MouseAUC)[grep(namex, colnames(MouseAUC))]
  Targets_mouse <- MousetfModules[[Mouse_module]]
  M_H_homog <- getLDS(attributes = "mgi_symbol", filters = "mgi_symbol", values = Targets_mouse, mart = mouse, attributesL = "hgnc_symbol", martL = human)
  nrgenes <- paste("(",length(unique(M_H_homog[,2])), sep="")
  nrgenes <- paste(nrgenes, "g)", sep="")                 
  text <- paste(Homog_tfs_MH[name, 2], nrgenes)
  tfModulesM_H[[text]] <- unique(M_H_homog[,2])
}

length(tfModulesM_H) == length(Mousetfs)
names(tfModulesM_H)
save(tfModulesM_H, file="int/tfModulesM_H.RData")

# Homolog to human modules in mouse (human modules in mouse)
tfModulesH_M <- list()

for (name in Humantfs){
  namex <- paste("^", name,sep="")
  namex <- paste(namex, "")
  Human_module <- colnames(HumanAUC)[grep(namex, colnames(HumanAUC))]
  Targets_human <- HumantfModules[[Human_module]]
  H_M_homog <- getLDS(attributes = "hgnc_symbol", filters = "hgnc_symbol", values = Targets_human, mart = human, attributesL = "mgi_symbol", martL = mouse)
  nrgenes <- paste("(",length(unique(H_M_homog[,2])), sep="")
  nrgenes <- paste(nrgenes, "g)", sep="")                 
  text <- paste(Homog_tfs_HM[name, 2], nrgenes)
  tfModulesH_M[[text]] <- unique(H_M_homog[,2])
}

length(tfModulesH_M) == length(Humantfs)
names(tfModulesH_M)
save(tfModulesH_M, file="int/tfModulesH_M.RData")

#3. AUCell
load("/media/seq-srv-06/lcb/cbravo/HumanBrain_2.0/AUCell/int/3.2_cellsRankings.RData")
cellsRankings_human <- cellsRankings
rm(cellsRankings)

load("/media/seq-srv-06/lcb/saibar/1.scRNAseq/2016-04_SCENIC/00_WorkflowApplication/SCENIC_Workflow_MouseBrain/int/3.2_cellsRankings.RData")
cellsRankings_mouse <- cellsRankings
rm(cellsRankings)

AUC_MH <- AUCell.calcAUC(tfModulesM_H, cellsRankings_human, nCores=10)
# Using 10 cores.
# Genes in the gene sets NOT available in the dataset: 
#   FOS (703g):     65 (9% of 703)
# ETS1 (676g):  79 (12% of 676)
# FLI1 (650g):  107 (16% of 650)
# SF1 (607g):   35 (6% of 607)
# FOSB (565g):  58 (10% of 565)
# SOX10 (539g):     36 (7% of 539)
# KMT2A (496g):     41 (8% of 496)
# ELK3 (451g):  40 (9% of 451)
# EGR4 (436g):  38 (9% of 436)
# IRF1 (431g):  74 (17% of 431)
# JUN (374g):   30 (8% of 374)
# NR3C1 (365g):     24 (7% of 365)
# UBE2K (343g):     23 (7% of 343)
# ATF2 (332g):  20 (6% of 332)
# HLF (328g):   19 (6% of 328)
# SOX8 (318g):  18 (6% of 318)
# TEF (317g):   17 (5% of 317)
# NFE2L2 (305g):    55 (18% of 305)
# ELF1 (283g):  37 (13% of 283)
# NFKB1 (287g):     43 (15% of 287)
# BPTF (275g):  17 (6% of 275)
# BCLAF1 (256g):    18 (7% of 256)
# GATA2 (268g):     41 (15% of 268)
# EBF1 (260g):  39 (15% of 260)
# RFX3 (210g):  28 (13% of 210)
# REL (222g):   55 (25% of 222)
# IRF2 (191g):  15 (8% of 191)
# ATF3 (177g):  27 (15% of 177)
# CREB1 (181g):     17 (9% of 181)
# STAT6 (173g):     48 (28% of 173)
# ZIC3 (166g):  10 (6% of 166)
# JUNB (143g):  20 (14% of 143)
# MAF (145g):   25 (17% of 145)
# CEBPB (140g):     41 (29% of 140)
# ID2 (125g):   17 (14% of 125)
# PRDM1 (122g):     16 (13% of 122)
# KLF4 (118g):  12 (10% of 118)
# LEF1 (118g):  7 (6% of 118)
# DLX2 (109g):  6 (6% of 109)
# PRDM16 (112g):    22 (20% of 112)
# IRF9 (101g):  16 (16% of 101)
# GMEB1 (107g):     14 (13% of 107)
# STAT1 (101g):     19 (19% of 101)
# DLX5 (97g):   10 (10% of 97)
# FOXN3 (95g):  10 (11% of 95)
# NFATC2 (94g):     20 (21% of 94)
# FOSL2 (91g):  11 (12% of 91)
# ATF4 (90g):   8 (9% of 90)
# NFIC (102g):  16 (16% of 102)
# E2F1 (86g):   8 (9% of 86)
# FOXO1 (84g):  10 (12% of 84)
# EGR1 (75g):   9 (12% of 75)
# NEUROD1 (76g):    4 (5% of 76)
# RORA (76g):   4 (5% of 76)
# KLF2 (76g):   7 (9% of 76)
# ZBTB7A (73g):     6 (8% of 73)
# DLX1 (68g):   7 (10% of 68)
# ZNF263 (66g):     4 (6% of 66)
# STAT3 (62g):  10 (16% of 62)
# HIF1A (59g):  6 (10% of 59)
# ARNT2 (57g):  2 (4% of 57)
# SOX2 (56g):   6 (11% of 56)
# NFATC1 (55g):     12 (22% of 55)
# ZMAT4 (54g):  4 (7% of 54)
# DEAF1 (52g):  4 (8% of 52)
# NFYB (49g):   3 (6% of 49)
# GABPA (51g):  4 (8% of 51)
# POU3F1 (48g):     2 (4% of 48)
# JUND (44g):   4 (9% of 44)
# ZNF462 (48g):     3 (6% of 48)
# SMC3 (42g):   3 (7% of 42)
# CEBPA (43g):  4 (9% of 43)
# PBX1 (41g):   4 (10% of 41)
# FOXC1 (42g):  3 (7% of 42)
# SOX17 (42g):  3 (7% of 42)
# RORB (38g):   1 (3% of 38)
# TCF3 (36g):   3 (8% of 36)
# BHLHE41 (38g):    4 (11% of 38)
# CHD2 (35g):   2 (6% of 35)
# ELK1 (34g):   4 (12% of 34)
# ETS2 (36g):   3 (8% of 36)
# MAX (49g):    15 (31% of 49)
# SMAD1 (35g):  4 (11% of 35)
# CREB3 (32g):  6 (19% of 32)
# GLI3 (29g):   2 (7% of 29)
# TIA1 (31g):   2 (6% of 31)
# TBR1 (29g):   4 (14% of 29)
# DBP (26g):    1 (4% of 26)
# CEBPZ (24g):  1 (4% of 24)
# NR2E1 (24g):  3 (12% of 24)
# MZF1 (23g):   5 (22% of 23)
# NR1D1 (25g):  4 (16% of 25)
# PKNOX2 (25g):     2 (8% of 25)
# SP1 (25g):    3 (12% of 25)
# SREBF2 (23g):     1 (4% of 23)
# JDP2 (24g):   1 (4% of 24)
# CUX1 (22g):   1 (5% of 22)
# EGR2 (22g):   4 (18% of 22)
# NA (18g):     2 (11% of 18)
# POU2F1 (19g):     5 (26% of 19)
# RELB (21g):   3 (14% of 21)
# ZEB1 (20g):   4 (20% of 20)
# ESRRA (19g):  1 (5% of 19)
# KLF5 (19g):   2 (11% of 19)
# RUNX1 (20g):  4 (20% of 20)
# E2F4 (15g):   1 (7% of 15)
# ERF (16g):    4 (25% of 16)
# MYNN (17g):   4 (24% of 17)
# PBX3 (16g):   3 (19% of 16)
# TEAD1 (16g):  1 (6% of 16)
# THRA (17g):   1 (6% of 17)
# FOXO6 (16g):  2 (12% of 16)
# GLIS2 (17g):  1 (6% of 17)
# HINFP (16g):  2 (12% of 16)
# IKZF2 (16g):  2 (12% of 16)
# TCF4 (14g):   1 (7% of 14)
# ZNF664 (17g):     2 (12% of 17)
# GTF2A1 (13g):     1 (8% of 13)
# SP3 (15g):    1 (7% of 15)
# MKX (14g):    4 (29% of 14)
# POU3F3 (15g):     1 (7% of 15)
# TCF7L2 (14g):     2 (14% of 14)
# NKX2-2 (13g):     1 (8% of 13)
# USF2 (14g):   1 (7% of 14)
# ZNF788 (14g):     4 (29% of 14)
# ACAA1 (13g):  1 (8% of 13)
# HMGA1 (13g):  1 (8% of 13)
# NR1H2 (13g):  3 (23% of 13)
# SP4 (12g):    1 (8% of 12)
# FOXN2 (12g):  1 (8% of 12)
# GTF2IRD1 (11g):   3 (27% of 11)
# EXO5 (11g):   3 (27% of 11)
# MAFK (11g):   3 (27% of 11)
# RFX7 (10g):   1 (10% of 10)
# SMAD3 (9g):   1 (11% of 9)
save(AUC_MH, file="int/cellsAUC_MH.RData")

AUC_HM <- AUCell.calcAUC(tfModulesH_M, cellsRankings_mouse, nCores=10)
# Using 10 cores.
# Genes in the gene sets NOT available in the dataset: 
# Egr1 (781g):  24 (3% of 781)
# Egr3 (518g):  14 (3% of 518)
# Hlf (293g):   3 (1% of 293)
# Mga (221g):   7 (3% of 221)
# Zmat4 (188g):     2 (1% of 188)
# Klf13 (169g):     2 (1% of 169)
# Celf4 (120g):     2 (2% of 120)
# Nfe2l1 (81g):     2 (2% of 81)
# Hnf4a (87g):  17 (20% of 87)
# Esrrg (76g):  2 (3% of 76)
# Hnf4g (75g):  9 (12% of 75)
# Prdm16 (67g):     5 (7% of 67)
# Arid3a (67g):     6 (9% of 67)
# Egr4 (77g):   10 (13% of 77)
# Ppargc1a (64g):   2 (3% of 64)
# Vdr (59g):    3 (5% of 59)
# Melk (59g):   1 (2% of 59)
# Nkx2-5 (57g):     8 (14% of 57)
# Spi1 (60g):   4 (7% of 60)
# Dlx2 (58g):   2 (3% of 58)
# Mafb (57g):   2 (4% of 57)
# Nr2c2 (53g):  2 (4% of 53)
# Tef (65g):    13 (20% of 65)
# Ar (47g):     2 (4% of 47)
# Pgr (51g):    12 (24% of 51)
# Gtf3c2 (40g):     1 (2% of 40)
# Runx1 (38g):  1 (3% of 38)
# Clock (39g):  3 (8% of 39)
# Trp53 (43g):  10 (23% of 43)
# Rel (34g):    2 (6% of 34)
# Bcl6 (32g):   1 (3% of 32)
# Gata3 (36g):  3 (8% of 36)
# Nr1h3 (34g):  3 (9% of 34)
# Yy1 (28g):    1 (4% of 28)
# Irf2 (30g):   3 (10% of 30)
# Elk1 (26g):   1 (4% of 26)
# Gata2 (28g):  5 (18% of 28)
# Tead4 (26g):  1 (4% of 26)
# Gata5 (23g):  4 (17% of 23)
# Nfkb1 (23g):  1 (4% of 23)
# Gata4 (30g):  13 (43% of 30)
# Ppard (24g):  2 (8% of 24)
# Pou5f1 (23g):     2 (9% of 23)
# Foxj3 (23g):  1 (4% of 23)
# Tfcp2l1 (44g):    16 (36% of 44)
# Zfy1 (24g):   5 (21% of 24)
# Arnt (21g):   1 (5% of 21)
# Polr2a (21g):     1 (5% of 21)
# Rxrb (21g):   1 (5% of 21)
# Egr2 (32g):   9 (28% of 32)
# Elf4 (20g):   3 (15% of 20)
# Foxo3 (18g):  1 (6% of 18)
# Grhl2 (18g):  2 (11% of 18)
# Lhx4 (19g):   1 (5% of 19)
# Tbx21 (19g):  3 (16% of 19)
# Zfp777 (18g):     1 (6% of 18)
# Ctcfl (18g):  5 (28% of 18)
# Myb (18g):    1 (6% of 18)
# Pou2f1 (17g):     1 (6% of 17)
# Smad1 (17g):  1 (6% of 17)
# Snai2 (15g):  1 (7% of 15)
# Zfp354a (17g):    2 (12% of 17)
# Mecom (14g):  1 (7% of 14)
# Nfya (16g):   2 (12% of 16)
# Ssx9 (27g):   9 (33% of 27)
# Stat5a (13g):     1 (8% of 13)
# E2f2 (27g):   13 (48% of 27)
# Elf1 (16g):   1 (6% of 16)
# Srebf1 (16g):     2 (12% of 16)
# Dlx1 (13g):   1 (8% of 13)
# Esr2 (16g):   2 (12% of 16)
# Gfi1b (13g):  2 (15% of 13)
# Sirt6 (14g):  1 (7% of 14)
# Figla (9g):   2 (22% of 9)
save(AUC_HM, file="int/cellsAUC_HM.RData")

Mousemod <- data.frame(colnames(MouseAUC), colnames(AUC_MH))
colnames(Mousemod) <- c("Mouse", "Human homolog")
WriteXLS(Mousemod, ExcelFileName="int/MouseModules.xls")

Humanmod <- data.frame(colnames(HumanAUC), colnames(AUC_HM))
colnames(Humanmod) <- c("Human", "Mouse homolog")
WriteXLS(Humanmod, ExcelFileName="int/HumanModules.xls")

#4. Mouse modules plots
rownames(Mousemod) <- Mousemod[,1]
colorPal <- grDevices::colorRampPalette(c("lightgrey", "red","darkred"))

pdf("MouseModules.pdf", width=11,height=5)
par(mfrow=c(1,2))
for (name in rownames(Mousemod)){
  cellColorMouse <- setNames(colorPal(5)[as.numeric(cut((MouseAUC[,name]), breaks=5, right=F,include.lowest=T))], rownames(MouseAUC))
  colMouse <- adjustcolor(cellColorMouse[rownames(Mousetsne$Y)], alpha=0.5)

  name_MH <- as.character(Mousemod[name, 2])
  cellColorHuman <- setNames(colorPal(5)[as.numeric(cut((AUC_MH[,name_MH]), breaks=5, right=F,include.lowest=T))], rownames(AUC_MH))
  colHuman <- adjustcolor(cellColorHuman[rownames(Humantsne$Y)], alpha=0.5)

  plot(Mousetsne$Y, col=colMouse, pch=16, xlab="tsne1", ylab="tsne2", main=name)
  plot(Humantsne$Y, col=colHuman, pch=16, xlab="tsne1", ylab="tsne2", main=name_MH)
}
dev.off()

rownames(Humanmod) <- Humanmod[,1]
pdf("HumanModules.pdf", width=11,height=5)
par(mfrow=c(1,2))
for (name in rownames(Humanmod) ){
  if (sum(HumanAUC[,name]) != 0){
    cellColorHuman <- setNames(colorPal(5)[as.numeric(cut((HumanAUC[,name]), breaks=5, right=F,include.lowest=T))], rownames(HumanAUC))
  }
  else{
    cellColorHuman <- setNames(rep("grey", length(rownames(HumanAUC))), rownames(HumanAUC))
  }
  colHuman <- adjustcolor(cellColorHuman[rownames(Humantsne$Y)], alpha=0.5)

  name_HM <- as.character(Humanmod[name, 2])
  cellColorMouse <- setNames(colorPal(5)[as.numeric(cut((AUC_HM[,name_HM]), breaks=5, right=F,include.lowest=T))], rownames(AUC_HM))
  colMouse <- adjustcolor(cellColorMouse[rownames(Mousetsne$Y)], alpha=0.5)

  plot(Humantsne$Y, col=colHuman, pch=16, xlab="tsne1", ylab="tsne2", main=name)
  plot(Mousetsne$Y, col=colMouse, pch=16, xlab="tsne1", ylab="tsne2", main=name_HM)
}
dev.off()

modules <- Humantfs[which(Humantfs %in% Homog_tfs_MH[,2])] #We can do the same in mouse

pdf("ModulesInBoth.pdf", width=11,height=11)
par(mfrow=c(2,2))
for (name in modules){
  namex <- paste("^", name, sep="")
  namex <- paste(namex, "")
  Human_module <- colnames(HumanAUC)[grep(namex, colnames(HumanAUC))]
  MH_module <- colnames(AUC_MH)[grep(namex, colnames(AUC_MH))]
  name_m <- Homog_tfs_HM[name,2]
  name_m <- paste("^", name_m, sep="")
  name_m <- paste(name_m, "")
  Mouse_module <- colnames(MouseAUC)[grep(name_m, colnames(MouseAUC))]
  HM_module <- colnames(AUC_HM)[grep(name_m, colnames(AUC_HM))]

  if (sum(HumanAUC[,Human_module]) != 0){
    cellColorHuman <- setNames(colorPal(5)[as.numeric(cut((HumanAUC[,Human_module]), breaks=5, right=F,include.lowest=T))], rownames(HumanAUC))
  }
  else{
    cellColorHuman <- setNames(rep("grey", length(rownames(HumanAUC))), rownames(HumanAUC))
  }
  colHuman <- adjustcolor(cellColorHuman[rownames(Humantsne$Y)], alpha=0.5)

  cellColorMH <- setNames(colorPal(5)[as.numeric(cut((AUC_MH[,MH_module]), breaks=5, right=F,include.lowest=T))], rownames(AUC_MH))
  colMH <- adjustcolor(cellColorMH[rownames(Humantsne$Y)], alpha=0.5)

  cellColorMouse <- setNames(colorPal(5)[as.numeric(cut((MouseAUC[,Mouse_module]), breaks=5, right=F,include.lowest=T))], rownames(MouseAUC))
  colMouse <- adjustcolor(cellColorMouse[rownames(Mousetsne$Y)], alpha=0.5)

  cellColorHM <- setNames(colorPal(5)[as.numeric(cut((AUC_HM[,HM_module]), breaks=5, right=F,include.lowest=T))], rownames(AUC_HM))
  colHM <- adjustcolor(cellColorHM[rownames(Mousetsne$Y)], alpha=0.5)

  plot(Humantsne$Y, col=colHuman, pch=16, xlab="tsne1", ylab="tsne2", main=Human_module)
  plot(Mousetsne$Y, col=colHM, pch=16, xlab="tsne1", ylab="tsne2", main=HM_module)
  plot(Mousetsne$Y, col=colMouse, pch=16, xlab="tsne1", ylab="tsne2", main=Mouse_module)
  plot(Humantsne$Y, col=colMH, pch=16, xlab="tsne1", ylab="tsne2", main=MH_module)
}
dev.off()

Hi @SeppeDeWinter,

Thank you for the prompt reply. I have three questions following from this regarding the gene conversion for pyscenic outputs and procedures followed for joint clustering. Question 3 is something I wished to test so I understand if this may not possible to troubleshoot.

Thank you for the code, the conversion of gene names makes sense for examining regulon activity across species. The R data types are definitely easier to work with. I am currently using the python based tutorials (as it seemed faster than the R implementation) and am unsure of how to manipulate the output files from these, in this case the regulon.csv files. I am unsure of how the file is formatted hence don't want to go about changing it. As an example I have added the first two entries from the file. If you could give some guidance on how to convert gene names within this file or if there is some other way, maybe converting it to R compatible object first.

| | Enrichment | Enrichment | Enrichment | Enrichment | Enrichment | Enrichment | Enrichment | Enrichment -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | | AUC | NES | MotifSimilarityQvalue | OrthologousIdentity | Annotation | Context | TargetGenes | RankAtMax TF | MotifID | | | | | | | | AU041133 | dbcorrdb__MYC__ENCSR000EGJ_1__m1 | 0.06134957 | 3.06712857 | 0.000225 | 0.47081 | motif similar to transfac_pro__M06039 ('V$ZNF878_02: ZNF878'; q-value = 0.000225) which is annotated for orthologous gene ENSG00000257446 in H. sapiens (identity = 47%) | frozenset({'activating', 'weight>75.0%', 'mm9-500bp-upstream-7species.mc9nr'}) | [('Utp11l', 0.4190139865647753), ('Ppfia4', 0.59622490938284), ('Sugp2', 4.864268360304388), ('Pabpc4', 0.8218154311808581), ('Gprc5c', 0.6436159200634183), ('Rasal3', 0.7505263082542397), ('B3galt2', 1.0929702612979697), ('Usp2', 0.6181841109820803), ('Nr2c2', 0.5166169646063524), ('Ceacam1', 2.6686498467913062), ('Znrf2', 0.4630347193433489), ('Prdx6', 1.3107060488146285), ('Hrsp12', 0.5976996293267264), ('Creld1', 0.4950311219269024), ('Atf1', 0.619279007557783), ('Poglut1', 0.9216056647216148), ('Akr1b3', 0.4407775175009705), ('Pcm1', 0.4845314890416012), ('Bsg', 0.5310128429183464), ('Hmga2', 0.549179397838494), ('Esr2', 0.4999150787651524), ('Rps27a', 0.5823538269371533), ('Akap12', 0.5700757677524456), ('Ier3', 10.064569864928544), ('Egr2', 2.797270011931476), ('Stx18', 0.5949288146950027), ('Kif20a', 1.2278466610463123), ('Iqgap2', 0.6870818592561624), ('Slc37a3', 2.022582946766236), ('Nup85', 1.486792037703781), ('Mtpap', 0.514939998424916), ('Thrap3', 0.43361467678054), ('Arl2', 0.8234712758195339), ('Acvr1b', 0.6375409967168627), ('Chchd5', 0.4983861522005255), ('Trmt61a', 0.5803355011023089), ('Pigv', 0.9641329034044874), ('Ddx17', 0.5778319656378248), ('Paip2', 0.4465254749155481), ('Rgs8', 0.8680721784498453), ('Atxn1l', 0.7457446405851063), ('Entpd1', 0.5684530490183239), ('Mgmt', 0.7636867967415881), ('Wsb1', 0.574940011678569), ('9130401M01Rik', 0.7038900187998567), ('Dohh', 1.5529429460529105), ('Nudt13', 0.5750153491612069), ('Pcnxl2', 0.8940997879634238), ('4933407L21Rik', 0.9370037182704052), ('Tram1', 0.4258458247668389), ('Erlin1', 1.2351225637266905), ('Tmem219', 0.6822851670138835), ('Lmo2', 0.5983497621285764), ('Dcaf8', 0.9124871026174316), ('Mdk', 0.7585350357234537)] | 3979 AU041133 | dbcorrdb__MAX__ENCSR000ECN_1__m1 | 0.06097656 | 3.03346899 | 0.000394 | 0.47081 | motif similar to transfac_pro__M06039 ('V$ZNF878_02: ZNF878'; q-value = 0.000394) which is annotated for orthologous gene ENSG00000257446 in H. sapiens (identity = 47%) | frozenset({'activating', 'weight>75.0%', 'mm9-500bp-upstream-7species.mc9nr'}) | [('Rasal3', 0.7505263082542397), ('Utp11l', 0.4190139865647753), ('Ppfia4', 0.59622490938284), ('Sugp2', 4.864268360304388), ('Prdx6', 1.3107060488146285), ('Ceacam1', 2.6686498467913062), ('Atf1', 0.619279007557783), ('B3galt2', 1.0929702612979697), ('Pabpc4', 0.8218154311808581), ('Ier3', 10.064569864928544), ('Akap12', 0.5700757677524456), ('Acy1', 1.056774664398637), ('Pcm1', 0.4845314890416012), ('Usp2', 0.6181841109820803), ('Znrf2', 0.4630347193433489), ('Creld1', 0.4950311219269024), ('Gprc5c', 0.6436159200634183), ('Hrsp12', 0.5976996293267264)] | 794

Within the cross-comparison I was actually more curios about how the joint clustering was conducted in the manuscript (the next part of figure 2 shown below). In this case were regulons from both mouse and human used in AUCell for a merged expression matrix of mouse and human cells or maybe was the clustering done on each binarised dataset (mouse, human) separately and the common cell type annotations were used to unify the cross-species comparison?

Screenshot 2022-01-31 at 10 57 37

As a test I did want to try combining regulons from different species into a common file and running AUCell on the merged expression matrix. I would imagine it would require the following steps:
- 1. Generate regulons e.g. regulon_mouse.csv & regulon_human.csv
- 1. Convert gene names of regulon_mouse.csv and the expression_matrix_mouse.csv to human.
- 1. Merge the expression matrices from both species based on common gene homologues.
- 1. Merge/concatenate the converted regulon_mouse.csv and the regulon_human.csv files (would require specifying which regulon came from which dataset)
- 1. Run AUCell for the regulon_merged.csv using the expression_merged.csv file.

If using pyscenic, what would you recommend for subpoint 2 & 4? Or would using R be easier for this?

Thank you.

Best,

Anirudh

I'll answer your question in parts, following your order.

The important column in the file you are referring to (for your analysis) is the TargetGenes column. This column contains all the target genes of a certain TF together with their importance scores (encoded as a tuple). e.g. These are the first three targets of the TF AU041133 (on a side node, I most say the TF name looks a bit weird to me... do all your TFs have this "AU..." naming convention? Might be something related to your analysis that I'm unaware of).


[
  ('Utp11l', 0.4190139865647753),  #Utp11l is a target of AU041133 with importance score 0.4190139865647753
  ('Ppfia4', 0.59622490938284),    #Ppfia4 is a target of AU041133 with importance score 0.59622490938284
  ('Sugp2', 4.864268360304388), #Sugp2 is a target of AU041133 with importance score 4.864268360304388
...
]

You can read this file both in R or in python that's up to you (it is just a flat text file, so both can read them). You just need to correctly parse this column.

For example, in python, you could make a dictionary with TF's as key and target genes as values as explained in the downstream analysis tutorial . From the tutorial:


# create a dictionary of regulons:
regulons = {}
for i,r in pd.DataFrame(lf.ra.Regulons,index=lf.ra.Gene).iteritems():
    regulons[i] =  list(r[r==1].index.values)

After that you can use pybiomart to do the conversion.

For this figure following steps were done: First, the gene names in the mouse expression matrix and mouse regulons were converted to human orthologs. Second, the expression matrix of the human-converted mouse data was merged with the human datasets. And third, the human regulons and human-converted-mouse regulons were scored in this datasets using AUCell (the regulons were not merged, so in fact one regulon can be scored twice: once for human and once for mouse like the DLX1 example above.). These AUC scores were used to do the clustering. So as you state: "In this case were regulons from both mouse and human used in AUCell for a merged expression matrix of mouse and human cells" --> this is correct.
As to your question as to wether R or python is more easy for this, that's up to you. Whatever you feel most comfortable with. Either way I think the workflow you suggest is reasonable. For running the aucell step of pyscenic using the command line on the merged regulons.tsv file, you would have to make sure that the formatting is correct. For separating human and mouse regulons you could append the species name to the TF name (for example AU041133 would become AU041133_mouse). To do the actual merging you can use pandas operations (for example pandas.concat). To do the conversion from mouse to human gene names I would recommend using pyBiomart.

Hi @SeppeDeWinter ,

Thank you so much for this, it really addresses the rational. Just a a few details I want to confirm regarding the creation of a merged regulon object/file which can be fed into pyscenic aucell.

Regarding the merged regulon.tsv, after generating the dictionaries (one for human and the other for human-converted-mouse regulons) with TargetGenes converted to human orthologs (using pybiomart) and TFs renamed based on their respective species, how would these two dictionaries be merged and subsequently used in AUCell CLI step. Based on the tutorial, pyscenic aucell takes the flat file "regs.csv" (formatted according to the sample I had attached previously) generated from pyscenic ctx .
For regulons to be scored twice (once for mouse and once for human) as you have described for DLX1, was a strategy similar to your point 3 applied? E.g.
1. For the human-converted-mouse regulons, TFs were annotated as "DLX1_mouse" and for human "DLX1_human".
2. Subsequent to this step a joint regulon file was created comprising of these human and human-converted-mouse regulons then tested for enrichment? or human regulons were first tested for enrichment via AUCell and then the human-converted-mouse regulons thereby generating two AUCell matrices which were joined based on the cells.

Thank you again,

Best,

Anirudh

Hi Anirudh

This will need some figuring out from your part. There are many approaches for doing this, as long as the final regs.csv file is formatted correctly it is fine. Maybe in this case it is easier to not make a dictionary but do the conversion within dataframes and merge them. I don't have a script for this on hand.
As for your question wether to run AUCell twice (once with the human regulons and once with the mouse once) and merge the matrix, or once with all together. As long as you don't do any intermediate scaling/normalization steps this shouldn't matter, the result will be the same in each case.

Hope this helps.

Best,

Seppe

Hi @SeppeDeWinter ,

Thanks for this. I'll keep you updated with any code I generate for this.

I'll just summarise our discussion here. You can close this issue if there isn't anything more you wish to add. Again thank you for your concise and prompt inputs.

Cross-species clustering, explained for mouse (mm) and human datasets (hu):

Expression matrix pre-processing:
1. mm_ExpMat gene names converted to human orthologues -> mm_ExpMat_huGene.
2. Merge mm_ExpMat_huGene and hu_ExpMat by shared genes -> merged_ExpMat.
Regulon pre-processing:
1. Calculate regulons for mouse mm_Regulon and human hu_Regulon separately using respective expression matrices (mm_ExpMat & hu_ExpMat).
2. For the mouse regulons mm_Regulon convert gene names to human orthologues mm_Regulon_huGene and if needed rename TFs to specify species e.g. Dlx1 -> Dlx1_mm.
AUC files: this can follow two approaches which should give the same results.
1. Approach 1: Merge regulons from mouse mm_Regulon_huGene and human hu_Regulon -> merged_Regulon. Calculate AUC using merged_Regulon & merged_ExpMat -> merged_AUC.
2. Approach 2: Calculate AUC on the merged_ExpMat using mm_Regulon_huGene and hu_Regulon seperately and join the two resultant AUC matrices (mm_AUC_huGene & hu_AUC) by cells -> merged_AUC.

Thank you,

Best,

Anirudh

Hi Anirudh

I think this summary captures everything.

Good luck! Feel free to reopen the issue when you encounter problems.

Best,

Seppe

aertslab / pySCENIC

Cross-species/dataset comparison #364