churchmanlab / genewalk

GeneWalk identifies relevant gene functions for a biological context using network representation learning
https://churchman.med.harvard.edu/genewalk
BSD 2-Clause "Simplified" License
127 stars 14 forks source link

No regulators identified? #37

Closed npokorzynski closed 3 years ago

npokorzynski commented 3 years ago

Hi,

I've run 10 gene sets through genewalk from an RNA-seq experiment (various treatment conditions with up- or down-regulated genes).

While 9/10 gene sets have produced expected results, one particular gene set fails to identify any regulators, i.e. the scatterplot is empty with the exception of a few dots on the x-axis and the genewalk_regulators.csv is empty. Despite this, the barplot folder is populated with 688 figures, so its not clear to me if this is a true reflection of the gene set I've provided or some kind of error. I've attempted to re-run this analysis on a few different occasions by re-generating the source gene list file (thinking it was corrupted in some way maybe? Just a wild guess). Nothing has seemed to help.

For reference, the analysis is being conducted on MacOS 11.2.1 with Python v3.8. The code I'm using for the analysis is below:

$ genewalk --project UTD24_DOWN --genes UTD24_DOWN.txt --id_type ensembl_id --nproc 4 --nreps_graph 10 --nreps_null 10

I've also attached the output log file, results, scatter plot and regulators spreadsheet.

genewalk_all.log genewalk_results.csv.zip

genewalk_regulators.csv.zip regulators_x_gene_con_y_frac_rel_go.pdf

ri23 commented 3 years ago

Hi @npokorzynski

According to your genewalk_results.csv file, all the global_padj values are larger than 0.1. As described on our README page at input option --alpha_fdr: by default the visualization stage uses a FDR threshold of 0.1. In this case, no regulators were identified because none of the genes have any relevant (global_padj < alpha_fdr = 0.1 by default) GO annotations. So yes the output you got is as expected, no error.

You mentioned you have tried to rerun GeneWalk with the same gene set as input with no change in results. In that case I do think this really does mean that this gene set does not provide very strong statistically significant results.

However there are some genes with global_padj = 0.101, so very slightly above 0.1 so you could rerun the visualization stage with a more lenient FDR threshold --alpha_fdr 0.15 for instance.

genewalk --stage visual --alpha_fdr 0.15 --project UTD24_DOWN --genes UTD24_DOWN.txt --id_type ensembl_id --nproc 4 --nreps_graph 10 --nreps_null 10 

Then you'll get at least some genes with relevant functions global_padj < alph_fdr = 0.15, but perhaps still no genes classified as "regulators", which require the fraction of relevant GO annotations > 0.5, i.e. more than half of a gene's GO annotations need to have global_padj < alph_fdr and gene_padj < alph_fdr. See here for that code https://github.com/churchmanlab/genewalk/blob/master/genewalk/plot.py#L80

If I may give some more advice: I would recommend running genewalk with UP and DOWN regulated genes together in one list. So better not to separate UP and DOWN genes (as you are probably used to from enrichment analysis) for different genewalk runs. The reason is that GeneWalk also makes use of reaction statements like gene1 inhibits gene2. So if the list only contains UP or DOWN regulated genes, these statements are then not included in the network if gene1 was up and gene2 was down whilst it is actually relevant information for the algorithm to do its job.

So bottom line: use all your DE genes from a particular condition in one run. Then afterwards you can split the genewalk output yourself according to which genes were UP and DOWN.

Best of luck with your further research.

Robert

npokorzynski commented 3 years ago

@ri23 Ah, okay thank you for clarifying that. I didn't realize that, unlike the barplots, it would fail to define regulators if they didn't reach significance. I assumed that many of the regulators in other quadrants on the scatterplot didn't reach significance so it never occurred to me to look back over that.

As for the use of all DE genes in a list, can you explain in a little more detail what data defines the reaction statements? For example, is this based on ChIP data or something similar? And does it consider functional aspects of the regulation, such as up- or down-regulated? Relatedly, is there a way to input genes into the list such that they are annotated as up- or down-regulated? Thus far I have been inputting the gene list a single column of ensembl ids. Or is there an intuitive way to process them in this fashion after the genewalk analysis?

ri23 commented 3 years ago

Hi @npokorzynski

As for the use of all DE genes in a list, can you explain in a little more detail what data defines the reaction statements? For example, is this based on ChIP data or something similar? For this, I would recommend reading up on the primary publications of PC and INDRA: reference 18 (Cerami et al), 42 (Rodchenkov et al) for PC and reference 21 (Gyori et al) for INDRA in our publication. PC is a manually curated database, whilst INDRA also includes reaction statements generated from automated literature mining. It is based on various sorts of experimental evidence, like protein-protein interactions, kinase assays, etc.

And does it consider functional aspects of the regulation, such as up- or down-regulated? Yes, both INDRA and PC contain these sorts of a regulation as part of their reaction statements.

Relatedly, is there a way to input genes into the list such that they are annotated as up- or down-regulated? Thus far I have been inputting the gene list a single column of ensembl ids. Or is there an intuitive way to process them in this fashion after the genewalk analysis? For the GeneWalk algorithm, the order of the ensembl IDs input lists does not matter, but it only takes a single input list. So I would just concatenate the UP and DOWN list ensembl IDs into a single list of ensemb IDs. In this way, you know the UP genes come first and the DOWN list follows from the first DOWN ensembl ID onwards. The genewalk_results.csv and barplots in the html file output the genes in the same order as the input list, so you can afterwards simply split the results csv file: the UP genes first followed by the DOWN genes.

npokorzynski commented 3 years ago

@ri23 Great, thank you for the advice and the help! Much appreciated.