csbl-usp / CEMiTool

Co-Expression Module Identification Tool (CEMiTool) official repository

Error while running through singularity container #82

brettChapman closed 2 years ago

brettChapman commented 2 years ago

Hi

I've produced a singularity container from the Dockerfile here: https://github.com/csbl-usp/CEMiTool/blob/master/docker/Dockerfile

I run this command:

srun -n 1 singularity exec --bind ${PWD}:${PWD} ${CEMITOOL_IMAGE} /usr/local/lib/R/site-library/CEMiTool/exec/CEMiTool.R ${EXPRESSION_MATRIX} -s ${PHENOTYPE} -o ${OUTPUT}

I've been getting these errors while running:

mergeCloseModules: Merging modules whose distance is less than 0.2
   Calculating new MEs...
Error in .local(cem, ...) : 
  Invalid rank_method type. Valid values are 'mean' and 'median'
Calls: do.call -> <Anonymous> -> mod_gsea -> mod_gsea -> .local
In addition: Warning message:
executing %dopar% sequentially: no parallel backend registered 
Execution halted
srun: error: node-8: task 0: Exited with exit code 1

It complains about an incorrect rank_method, but I don't specify a rank_method so it should use the default 'mean' value.

I also tried running directly from Docker, but I received permission denied errors.

This is all running on my cluster. I also tried running on my MacBook Pro through RStudio, but during installation I get conflicts about R versions, so I don't think it's installed correctly. That's why I'm trying to use the container version. Thanks for any help you can provide.

pedrostrusso commented 2 years ago

Hi @brettChapman, I see you're using the CEMiTool.R script in the exec folder. Unfortunately, we haven't updated that script for the longest time, so it's wildly out of date. We currently only use it for some tests. You can try creating your own script based on it if you want. THAT SAID, I just glanced at it and saw a typo in the rank_method definition and I fixed it, so maybe it will work now lol. I still strongly recommend you don't use the script though.
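A minimal sketch of what such a custom script could look like, for running the analysis non-interactively inside a container. The file name run_cemitool.R, the positional arguments, and the assumed input layouts (genes in rows, an annotation table with SampleName and Class columns) are illustrative assumptions, not the interface of the script shipped in exec/. It relies on the package's exported cemitool(), write_files(), save_plots() and generate_report() functions.

```r
#!/usr/bin/env Rscript
# Hypothetical wrapper (run_cemitool.R); NOT the script shipped in exec/.
# Usage: Rscript run_cemitool.R <expression.tsv> <annotation.tsv> <output_dir>

run_cemitool <- function(expr_file, annot_file, out_dir) {
  library(CEMiTool)

  # Assumed layout: genes in rows (first column = gene ID), samples in columns.
  expr <- read.delim(expr_file, row.names = 1, check.names = FALSE)
  # Assumed layout: one row per sample, with "SampleName" and "Class" columns
  # (the column names cemitool() looks for by default).
  annot <- read.delim(annot_file, check.names = FALSE)

  # Run the full pipeline; the option values here are examples only.
  cem <- cemitool(expr, annot, apply_vst = TRUE, cor_method = "pearson",
                  plot = TRUE, verbose = TRUE)

  # Write tables, plots and the HTML report under the output directory.
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  write_files(cem, directory = file.path(out_dir, "Tables"), force = TRUE)
  save_plots(cem, "all", directory = file.path(out_dir, "Plots"), force = TRUE)
  generate_report(cem, directory = file.path(out_dir, "Report"), force = TRUE)
  invisible(cem)
}

args <- commandArgs(trailingOnly = TRUE)
if (length(args) == 3) {
  run_cemitool(args[1], args[2], args[3])
} else if (!interactive()) {
  message("Usage: Rscript run_cemitool.R <expression.tsv> <annotation.tsv> <output_dir>")
}
```

With a script like this in the bound working directory, it could be called with Rscript through singularity exec in place of the exec/CEMiTool.R path used in the original command.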

brettChapman commented 2 years ago

Hi @pedrostrusso

I built my container using this Dockerfile:

FROM csblusp/cemitool:builder

RUN R CMD INSTALL .

ENTRYPOINT ["/usr/bin/Rscript", "/usr/local/lib/R/site-library/CEMiTool/exec/CEMiTool.R"]

Which points to that CEMiTool.R script as an entrypoint. Singularity doesn't have entry points like Docker, so I call the script directly. Is there another Dockerfile I should be using to build the container or a different R script I should be calling?

I've been following the examples here: https://github.com/csbl-usp/CEMiTool/blob/master/docker/example.md

However, I cannot see the csblusp/cemitool folder in my container. Perhaps I should be building the container from the "Dockerfile.build" file instead of the "Dockerfile" file?

I'll try building the container again using the other Dockerfile and update you.

On a different note, once I'm up and running, should running from the container produce all different plots, much like this tutorial explains when running from RStudio: https://www.bioconductor.org/packages/release/bioc/vignettes/CEMiTool/inst/doc/CEMiTool.html, or would I have more control over different visualisations if I were to run within RStudio instead of the singularity container? Thanks.

brettChapman commented 2 years ago

Hi @pedrostrusso

I've managed to get it working in RStudio now, however when I load my data in I get what appears to be warning messages:

> cem <- cemitool(datExpr)
Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too large.Arguments 'x' and/or 'a' are too

I've attached a snapshot from RStudio of my expression matrix before it gets read into cemitool.

Thanks.

pedrostrusso commented 2 years ago

Hi @brettChapman, I'm glad you got it to work. Those warnings just mean that the algorithm for calculating the gamma functions used in the gene filtering step is operating outside of its optimal range. While the p-values obtained for gene filtering won't be perfectly exact, they should still be within an acceptable range. For more details, you can take a look at the pracma::gammainc function. I'll close the issue for now.

brettChapman commented 2 years ago

Hi @pedrostrusso

Thanks for your help. I find cemitool runs for quite a while through RStudio. I'd ideally like to get this working through my cluster to streamline analysis for a lot more RNA-seq data (pan-transcriptome level) instead of tying up my macbook pro indefinitely.

I think I've managed to get it working from a Dockerfile, but I'm not sure if I'm running from the correct R script. The one I found in /CEMiTool/R/ doesn't appear to be executable.

My Dockerfile:

FROM ubuntu:21.04

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y dialog apt-utils git r-base r-base-dev \
    libssl-dev libssh2-1-dev pandoc libcurl4 libcurl4-openssl-dev libxml2-dev libudunits2-dev \
    && rm -rf /var/lib/apt/lists/*

RUN git clone --recursive https://github.com/csbl-usp/CEMiTool.git

WORKDIR /CEMiTool

RUN /usr/bin/Rscript docker/install-deps.R 

RUN R -e "install.packages(c('BiocManager', 'httr', 'europepmc'))"

RUN R -e "BiocManager::install('CEMiTool')"

Running through Singularity:

singularity exec --bind ${PWD}:${PWD} cemitool.sif /usr/bin/Rscript /CEMiTool/R/cemitool.R
[1] "expr_data"
[1] "expr_data<-"
[1] "mod_colors"
[1] "mod_colors<-"
[1] "sample_annotation"
[1] "sample_annotation<-"
[1] "nmodules"
[1] "mod_gene_num"
[1] "mod_names"
[1] "module_genes"
[1] "write_files"

I'm also interested in producing the top 10 hub genes for each module. I've seen online that I can do this in RStudio using:

cem <- find_modules(cem)
hubs <- get_hubs(cem, 10)

However, I can't see how the script, when run from Docker, could also produce a list of top hub genes. The web version of cemitool does show the top 5 hub genes, but I can't find a text file listing them anywhere.
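One possible way to get such a text file is to flatten the list returned by get_hubs() and write it out. This is a sketch, not something the package does for you: hubs_to_table is a hypothetical helper, the toy list stands in for `hubs <- get_hubs(cem, 10)`, and the gene names are made up. The helper accepts either character vectors of gene names or named numeric vectors per module, since the exact shape may depend on the package version.

```r
# Flatten the per-module list from get_hubs() into one data frame.
# Accepts either character vectors of gene names or named numeric
# vectors (gene name -> connectivity score) for each module.
hubs_to_table <- function(hubs) {
  rows <- lapply(names(hubs), function(mod) {
    h <- hubs[[mod]]
    genes <- if (is.character(h)) h else names(h)
    if (length(genes) == 0) return(NULL)  # skip modules without hubs
    data.frame(module = mod, rank = seq_along(genes), gene = genes,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Toy stand-in for `hubs <- get_hubs(cem, 10)`; gene names are invented.
hubs <- list(M1 = c(geneA = 12.3, geneB = 10.1), M2 = c(geneC = 8.7))
hub_table <- hubs_to_table(hubs)

# Written to a temporary directory here; point this at your output folder.
out_file <- file.path(tempdir(), "hub_genes.tsv")
write.table(hub_table, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

The rank column simply records the order in which get_hubs() returned the genes within each module.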

I'm also interested in producing network visualisations as described here: https://www.bioconductor.org/packages/release/bioc/vignettes/CEMiTool/inst/doc/CEMiTool.html#adding-interactions, but I would like to visualise the hub genes and how they're connected in the network, instead of just some gene-gene interactions provided as a tab-separated input file. If I produce the top 10 hub genes for each module, could I provide these as input to produce the network plot? Do the edges in the network represent eigenvector values or correlation coefficients? I'm also interested in using external software to visualise the network. I'm familiar with D3.js, having produced some force-directed network plots before, and was wondering whether the network can be output as a JSON file to import into other software. Thanks.

pedrostrusso commented 2 years ago

@lecardozo would you mind giving us a hand with the Docker-related questions?

About your other questions: there is currently no implemented way to override the hubs. Looking at the source code for the plot_interactions function, though, you should be able to replace the output of the get_hubs function with your own list, but that will require writing your own custom version of the function. The edges in the network plots only indicate whether the given genes are connected or not. That's why the edges have no color gradient, and why the network output is given as gene-gene interactions in a tab-separated file.

brettChapman commented 2 years ago

Thanks @pedrostrusso

I'll look into writing my own function once I figure out how to get cemitool running properly on my cluster. Hopefully @lecardozo can help with this.

Since the network doesn't have any specific values assigned to the edges, what I think I'll also try is to extract the hub genes and run a Pearson correlation on this subset of genes from the original expression matrix, then pass the nodes and edges into D3.js to visualise how these hub genes correlate. I can also group the hub genes in the D3.js force-directed network by their module name, which should provide some interesting plots.
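A rough sketch of that idea, assuming the jsonlite package is available: correlate the hub genes, keep each gene pair once, and write a nodes/links JSON in the shape commonly used by D3.js force-directed examples. hub_network, the 0.5 cutoff, and the toy expression values are all illustrative assumptions, and the hubs list stands in for the output of get_hubs().

```r
library(jsonlite)

# Build a D3-style node/link list from hub genes and an expression matrix
# (genes in rows, samples in columns). min_abs_cor is an arbitrary cutoff.
hub_network <- function(expr, hubs, min_abs_cor = 0.5) {
  gene_lists <- lapply(hubs, function(h) if (is.character(h)) h else names(h))
  nodes <- data.frame(
    id = unlist(gene_lists, use.names = FALSE),
    group = rep(names(gene_lists), lengths(gene_lists)),  # module name
    stringsAsFactors = FALSE
  )
  nodes <- nodes[nodes$id %in% rownames(expr), ]

  # cor() correlates columns, so transpose to put genes in columns.
  cors <- cor(t(as.matrix(expr[nodes$id, , drop = FALSE])), method = "pearson")

  # Keep each pair once (upper triangle) and drop weak correlations.
  keep <- which(upper.tri(cors) & abs(cors) >= min_abs_cor, arr.ind = TRUE)
  links <- data.frame(
    source = rownames(cors)[keep[, "row"]],
    target = colnames(cors)[keep[, "col"]],
    value = cors[keep],
    stringsAsFactors = FALSE
  )
  list(nodes = nodes, links = links)
}

# Toy data; real input would be the expression matrix and get_hubs() output.
expr <- as.data.frame(rbind(
  geneA = c(1, 2, 3, 4),
  geneB = c(2, 4, 6, 8.5),
  geneC = c(4, 3, 2, 1),
  geneD = c(1, 3, 2, 4)
))
colnames(expr) <- paste0("sample", 1:4)
hubs <- list(M1 = c(geneA = 2, geneB = 1.5), M2 = c(geneC = 1))

net <- hub_network(expr, hubs)
json_file <- file.path(tempdir(), "hub_network.json")
write_json(net, json_file, dataframe = "rows", pretty = TRUE)
```

The group field carries the module name, so the nodes can be coloured or clustered by module on the D3.js side, and value holds the signed Pearson coefficient for each edge.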

brettChapman commented 2 years ago

I was just taking a look at the Docker version. It's 4 years old, so I'm assuming it was an early release, which should probably be updated.

I noticed some R scripts referring to QQ plots. Do you still produce these plots as part of the main cemitool.R script? Producing some QQ plots prior to running correlation analysis would be real handy to have.

Docker version:

singularity exec --bind ${PWD}:${PWD} /data/cemitool_builds/cemitool_docker_version.sif ls /CEMiTool/R/
cemitool.R  datasets.R  diagnostics.R  enrichment.R  filter.R  integrate.R  interactions.R  modules.R  report.R  stat-qq-line.R  stat-qq.R  utils.R  visualization.R

Github version:

ubuntu@node-0:/data/cemitool_builds$ singularity exec --bind ${PWD}:${PWD} /data/cemitool_builds/cemitool.sif ls /CEMiTool/R/
cemitool.R  datasets.R  diagnostics.R  enrichment.R  filter.R  interactions.R  modules.R  report.R  utils.R  visualization.R

brettChapman commented 2 years ago

It may be worth updating the executable file here: /CEMiTool/exec/CEMiTool.R to be in line with the latest R script in /CEMiTool/R/cemitool.R. That would be the only way I could pass in my expression data while running from the singularity container.

I imagine the web version of cemitool (https://cemitool.sysbio.tools/analysis) has its own custom scripts which pass in the expression data. Is there a repository of the web version I could access if that's the case?

brettChapman commented 2 years ago

I've been trying to compare the web version with the version I'm using in RStudio.

When I run in RStudio with this command:

cem <- cemitool(datExpr, rgt_planet_phenotype, apply_vst=TRUE, cor_method="pearson")

I get this output during the run:

The order of those tied genes will be arbitrary, which may produce unexpected results.
There were 1 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)
There are ties in the preranked stats (0.02% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
There were 2 pathways for which P-values were not calculated properly due to unbalanced (positive and negative) gene-level statistic values. For such pathways pval, padj, NES, log2err are set to NA. You can try to increase the value of the argument nPermSimple (for example set it nPermSimple = 10000)
There are ties in the preranked stats (0.06% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Arguments 'x' and/or 'a' are too large.
Arguments 'x' and/or 'a' are too large.
`fun.y` is deprecated. Use `fun` instead.
`fun.y` is deprecated. Use `fun` instead.
`fun.y` is deprecated. Use `fun` instead.
`fun.y` is deprecated. Use `fun` instead.
`fun.y` is deprecated. Use `fun` instead.
`fun.y` is deprecated. Use `fun` instead.
`fun.y` is deprecated. Use `fun` instead.
`fun.y` is deprecated. Use `fun` instead.
`fun.y` is deprecated. Use `fun` instead.
`fun.y` is deprecated. Use `fun` instead.
`fun.y` is deprecated. Use `fun` instead.

I'm then able to run find_modules and get_hubs, but I'm not sure if the results are accurate due to the errors/warnings.

If I run with apply_vst=FALSE these warnings do not display, but then after a while RStudio gets bogged down and unresponsive, often ending in my R session being killed.

If I run the web version of the tool without applying vst I get an error: No beta value found. It seems that the soft thresholding approach used by CEMiTool is not suitable for your data. Click here for more information about this limitation.

If I run the web version with apply vst I get results, but I assume somewhere those warnings/errors are appearing, which makes me question the results.

Ultimately I still want to get running from my singularity container so I can streamline the process and scale up to many more RNA-seq datasets.