churchmanlab / genewalk

GeneWalk identifies relevant gene functions for a biological context using network representation learning
https://churchman.med.harvard.edu/genewalk
BSD 2-Clause "Simplified" License
127 stars 14 forks source link

Criteria for barplot output #33

Closed npokorzynski closed 3 years ago

npokorzynski commented 3 years ago

Hi all,

Quick question - can you explain the criteria used to determine which barplots are automatically generated in the output files? The github page says "barplots with GO annotations ranked by relevance for each input gene that GeneWalk was able to generate results for," but I'm not sure if this is supposed to explain why only a subset of the bar plots get produced and on what basis they are selected.

Thanks, Nick

ri23 commented 3 years ago

Hi @npokorzynski ,

In principle barplots are generated for all input genes. If a barplot for a gene is not generated, it is either because there are no GO annotations for that gene or it could not be included in the GeneWalk network for analysis. In the former case the gene is still listed in genewalk_results.csv, but with 0 GO annotations. The latter can happen when there is no uniprot ID for example: all those drop out genes with the reasons for drop out are listed in genewalk_all.log.

Cheers, Robert

npokorzynski commented 3 years ago

@ri23 Thank you for that reply. Perhaps I'm misunderstanding then: When I run genewalk, the genewalk_results.csv contains far more genes with associated GO terms (many of which are significant below FDR p 0.01) than bar plots produced. I also am observing that between conditions a variable number of bar plots are generated, so that gave me the impression that there was some criteria selecting which genes barplots would be produced for under a given condition. Perhaps you can clarify what is causing this?

ri23 commented 3 years ago

This is unexpected, at the moment I am not sure what is happening.

Can you clarify: do you mean some barplots are not generated in the figures/barplot folder as a .png file, and/or they are not visible in the index.html file?

Can you check what the confidence interval cilow_global_padj values are for genes that are present in genewalk_results.csv but without plot?

Moving forward, would you be able to share your genewalk_results.csv and genewalk_all.log files and mention a few genes that did not have a bar plot generated? My email is robert_ietswaart@hms.harvard.edu if you don't want your files here on github in the public domain. Alternatively a screen shot of the above could also help to figure this out.

Cheers, Robert

npokorzynski commented 3 years ago

@ri23 With some troubleshooting I may have revealed at least one part of the problem, though I'm not sure why this would cause issues:

It turns out that if I run the analysis with Ensembl IDs instead of HGNC symbols, I produce the expected index file and barplots folder containing all barplots. In the above, I was getting the results csv file with all the relevant genes, but only a portion of the barplots and no index.html file.

Here are the log and results files for the analysis I ran using hgnc_symbol as the ID:

genewalk_all.log

genewalk_results.csv.zip

This has led me to a related but distinct question - in the analysis run with ensembl_id (from the same starting gene list just with the IDs changed, which contains 1500 genes) I receive an output of 975 genes with associated GO terms (which seems high to me, e.g. 65% of my input is returned as functionally relevant?), with very few even reaching FDR p < 0.1. In fact, 0.0946 is the lowest p-value I get from any of the results. If needed, I can start a new issue for this, but I've attached the log and results files for this analysis as well.

genewalk_all.log

genewalk_results.csv.zip

ri23 commented 3 years ago

Hi @npokorzynski ,

From the log file with hgnc symbols as input, I suspect GeneWalk got killed after 3h. Could it be that you ran a job on a cluster or something with this time limit (from 14:52 to 17:52)? For the ensembl input, you ran it with parallelization (great!) so GeneWalk ran faster (~1h15min) and finished as expected. According to the log file: genewalk.nx_mg_assembler - Number of PC originating nodes 1027 vs genewalk.nx_mg_assembler - Number of PC originating nodes 1029 So the number of gene nodes in the network is almost identical (1027 vs 1029 gene nodes): almost all hgnc_symbols are recognized the same as the ensembl IDs. The rest of the 1500 input genes are filtered out because they are pseudogenes/ as and lncRNAs etc without Uniprot IDs as listed also in the log file.

If you want to check you can rerun the hgnc symbol input with parallelization and it should provide all the barplots and index.html within the same time frame as your ensembl ID input run.

Regarding your second question: Your genewalk_results.csv output contains 975 genes: those are all input genes with at least 1 GO annotation (subset of 1029 in the network). So, the genes are in there irrespective of whether the GO annotations are statistically significant or not.

Out of the 975 a lot can have 1 or more significant (global_padj<0.1) GO annotation per gene. In our publication, we mention that for the QKI context, this was the case for 54% of the genes. We interpret this as a lot genes have some significant GO annotation, but not all annotations are relevant for a given gene and different annotations are relevant for different genes (unlike the typical GO enrichment results, which test for global relevance, so you get the 1 GO term for a set of genes).

Also yes the global_padj does not extend much below 0.1, this is because of the FDR multiple testing correction across all ~11k gene-GO term pairs.

In the paper we do mention there is limited stochasticity between independent replicate runs (Supplementary Fig. S2), because GeneWalk relies on random walk sampling. As a result the cosine similarities, resulting p-value and FDR adjust p-values can vary slightly between runs. This is also why we provide 95% confidence intervals on the p-values. And this explains why some of the global_padj of your hgnc symbol run are slightly lower than 0.0946, the lowest value of your second run.

If such variability due to stochasticity bothers you: you can choose to rerun GeneWalk with --nreps_graph 10 and --nreps_null 10. This decreases the stochasticity because GeneWalk now runs the network representation learning DeepWalk part 10 times instead of 3 and the null distributions is also more populated from 10 replicate runs instead of 3.

If you are interested in further gene selection, then focus in on the regulator scatter plot in the top right corner. There are the genes with a high fraction of relevant GO annotations and many gene-gene connections. Once you find a gene of interest from there, then look at the barplot for that gene to understand what their relevant functions are.

So in conclusion all your output is as expected. I hope these clarifications help you interpret the results further.

Best, Robert

npokorzynski commented 3 years ago

@ri23 Thank you that was incredibly helpful! Do you happen to have a code snippet to recreate the scatterplots? If not, I can close out this issue.

ri23 commented 3 years ago

Hi @npokorzynski Great, yes so we save the scatterplots by default also as separate .png, .pdf and the interactive plot as .html files in the figures folder. If you want to recreate, you can call the GW_Plotter.scatterplot_regulators function or copy over code from there: https://github.com/churchmanlab/genewalk/blob/master/genewalk/plot.py#L57 The package dependencies are listed at the top of the plot.py file.

npokorzynski commented 3 years ago

@ri23 okay, I guess this must have also been a consequence of the incomplete run from my non-parallelized attempts - I had only previously observed the .png file as an output (the resolution of which is understandably not great). In the complete/parallelized run I get the .pdf file. In that case I'm all good, for now at least. Thanks again for all the help!