understanding the inputs for EWCE

gouthamatla commented 4 years ago

Hi,

I have a few questions regarding EWCE. It's a very useful tool. I have gone through the paper quickly and tried EWCE.

I would like to know if we need to normalize the scRNA data prior to running EWCE?
There will be many genes with zero counts in scRNA datasets. The drop.uninformative.genes function is meant to take care of noise, but I don't think it's doing the job well. When I input all genes (~19,000), drop.uninformative.genes still retains 16,000, which is a huge number given the low capture rate of scRNA. How does this function take care of genes with 0 counts across many cells?
bootstrap.enrichment.test: is this function testing whether the gene expression distribution is "higher" than the background set or "different" from the background set?

Also, the vignette is tailored to mouse/human conversions, so here is the code I am using, as I have human data. I would like to know if I am missing anything.

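# Cell-level metadata with level1class and level2class columns
annot <- read.table("metadata_for_EWCE_v2.txt", header = TRUE, sep = "\t")

# Expression matrix: genes in rows (gene names as row names), cells in columns
SCT <- read.table("scRNA_for_EWCE_GeneNames.txt", header = TRUE, row.names = 1, sep = "\t")

annotLevels <- list(level1class = annot$level1class, level2class = annot$level2class)

# Filter genes using the level 2 annotation
exp_DROPPED <- EWCE::drop.uninformative.genes(exp = SCT, level2annot = annot$level2class)

# Build the cell type data; the returned value holds the saved file name(s)
fNames <- EWCE::generate.celltype.data(exp = exp_DROPPED, annotLevels = annotLevels,
                                       groupName = "Foo", no_cores = 10)

# Loads the `ctd` object used below
load(fNames[1])

GWAS_named_genes <- as.vector(read.table("GWAS_Named_Loci.txt")$V1)

# Namespace prefix added so the call works without library(EWCE)
full_results <- EWCE::bootstrap.enrichment.test(sct_data = ctd,
                                                sctSpecies = "human",
                                                genelistSpecies = "human",
                                                hits = GWAS_named_genes,
                                                bg = rownames(exp_DROPPED),
                                                reps = 10000,
                                                annotLevel = 2)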

NathanSkene commented 3 years ago

Hi @gouthamatla ,

I would like to know if we need to normalize the scRNA data prior to running EWCE ?

It certainly doesn't need to be done. I've added scTransform to the tutorial as it may help but it is not a requirement: it was not done for the initial EWCE paper.

There probably is a better way of writing drop.uninformative.genes, but it's a balancing act. Some important genes can be lowly expressed (as far as I recall, Pvalb, a canonical interneuron marker, was an example of this in the original Zeisel 2015 dataset). So you don't want to throw things away just because they are lowly expressed. 16,000 seems quite reasonable to me for an scRNA-seq dataset spanning numerous cell types. Each cell type will have a different spread of genes expressed.
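
For anyone who wants to look at sparsity directly, here is a minimal base R sketch. It is a separate sanity check and not what drop.uninformative.genes does internally. The matrix exp_toy and its values are made up for illustration; with real data you would use your own genes-by-cells matrix instead.

```r
# Toy count matrix: genes in rows, cells in columns (values are made up).
exp_toy <- matrix(
  c(0, 0, 0, 0,
    5, 0, 0, 1,
    0, 0, 2, 0,
    9, 7, 8, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("cell1", "cell2", "cell3", "cell4"))
)

# Genes with no counts in any cell.
all_zero <- rowSums(exp_toy) == 0

# Fraction of cells in which each gene is detected (count > 0).
frac_detected <- rowMeans(exp_toy > 0)

print(sum(all_zero))
print(frac_detected)
```

A gene like geneC here is detected in only a quarter of the cells, yet it may still be informative if those cells all belong to one cell type, which is why low detection alone is a poor reason to drop a gene.
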
As for bootstrap.enrichment.test: it tests whether the gene set is 'more specific'. This is not the same as just 'highly expressed' (a gene may be similarly highly expressed in many other cell types).
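
To make the difference between 'specific' and 'highly expressed' concrete, here is a small base R sketch. It assumes specificity is the share of a gene's mean expression that falls in each cell type (each gene's values sum to 1 across cell types). The gene names, cell type names and values are hypothetical, and this is an illustration rather than the package's own code.

```r
# Toy mean expression per cell type: genes in rows, cell types in columns.
mean_exp <- matrix(
  c(100, 100, 100,
     10,   0,   0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("housekeeping", "marker"),
                  c("typeA", "typeB", "typeC"))
)

# Assumed definition: divide each gene's row by its total across cell types.
specificity <- mean_exp / rowSums(mean_exp)

print(specificity)
```

In typeA the housekeeping gene has ten times the expression of the marker gene, but its specificity is only one third, while the marker gene has a specificity of 1 because all of its expression is in typeA.
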

Code looks correct to me. Let me know if it doesn't work as expected.

gouthamatla commented 3 years ago

Thank you very much for the answers.

NathanSkene / EWCE

understanding the inputs for EWCE #19