eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis

Running a cluster with Slurm #64

Closed brettChapman closed 4 years ago

brettChapman commented 4 years ago

Hi

I'm trying to set up get_homologues-est on my cluster. I followed the instructions at https://hub.docker.com/r/csicunam/get_homologues, using Docker for the setup. I've mounted my local directory, downloaded Pfam and SwissProt with the perl install.pl script, and I have a file called HPC.conf with my Slurm configuration. Where is the HPC.conf file supposed to be placed? Will running with -m (runmode) set to cluster automatically detect the HPC.conf file? Will the script running from my master node automatically schedule jobs to my worker nodes? Does my master node conduct any of the analysis? It does not have the same compute resources as my worker nodes. Thank you, and thanks for such a great tool!

brunocontrerasmoreira commented 4 years ago

Hi @brettChapman, the file should be in the same folder as the get_homologues.pl script. It won't work with Docker unless you create a new container with Slurm installed. Hope this helps, Bruno

brettChapman commented 4 years ago

Hi Bruno

Thanks. Do you mean I need to create a new Docker container from a Dockerfile? I have tried that previously, creating a modified Dockerfile supplied from Docker hub, installing the latest bin files from https://github.com/eead-csic-compbio/get_homologues/releases/download/v3.3.3/bin.tgz, Pfam and uniprot databases. Would I need to install slurm inside the container as well, even though I have slurm installed on the host system?

I've also had issues before with R dependencies, particularly d3heatmap not installing as it's no longer available from r-studio. I saw mention of you creating your Docker images now with R binaries. Do you have an updated Dockerfile I could use with the R binaries included or a link to the R binaries so I no longer need to install the R dependencies?

Thanks.

eead-csic-compbio commented 4 years ago

Hello again @brettChapman, you can import our Dockerfile as is into your new Dockerfile. That should take care of all our dependencies just fine. It is within the new Dockerfile that you should describe how Slurm must be installed and configured. If you get that working we will ask you to adopt it in the future, if that's all right :-) R dependencies have been a problem for us recently, which is why we moved to binaries in https://hub.docker.com/r/csicunam/get_homologues. However, @pvinuesa recently downgraded a couple of scripts with R dependencies, and the original R repo (repos='https://cloud.r-project.org') would probably work again. Bruno

brettChapman commented 4 years ago

Hi Bruno

Thanks for the advice. I'm still getting my head around Docker and wasn't familiar with writing multistage Dockerfiles. I checked the https://cloud.r-project.org repo and it still won't install the dependencies.

I've put a Dockerfile together here:

FROM csicunam/get_homologues:latest AS GETHOMOLOGUES

FROM ubuntu:18.04
FROM rstudio/r-base:3.6.3-bionic

WORKDIR /root/

COPY --from=GETHOMOLOGUES / .

# Install dependencies from repos: GD for graphics, libidn11 for BLAST+
RUN apt-get update && apt-get install -y \
    slurm-wlm \
    bash-completion \
    build-essential \
    bc \
    curl \
    git \
    htop \
    libgd-gd2-perl \
    libidn11 \
    libpython2.7 \
    procps \
    wget \
    && rm -rf /var/lib/apt/lists/*

RUN curl -L http://cpanmin.us | perl - App::cpanminus
RUN cpanm Inline::C Inline::CPP

RUN cd get_homologues && echo "yes y | perl install.pl" > install.sh

RUN cd get_homologues && bash ./install.sh

It's managed to install OK into Docker, although I'm not sure whether R and all its dependencies were copied across correctly from the get_homologues Docker image.

I've also installed slurm-wlm to provide the slurmctld and slurmd binaries. In terms of configuring Slurm within the Docker image, should it be configured exactly the same way as on the host system, using the same slurm.conf and slurmdbd.conf files, IP addresses, access to slurmdbd binaries, MariaDB and so on (there's quite a lot involved)? Or does it simply need access to the Slurm binaries and a simplified slurm.conf file with the master and worker node IPs? If I need to replicate the Slurm configuration from the host system, do you know of a way I could get the files from the host system into the Docker image during its build? The only thing I can think of is setting up a private GitHub repo with my current Slurm configuration files to pull down.

Thanks.
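
On the question of getting host files into the image at build time: Docker's COPY instruction only reads from the build context (the directory passed to docker build), so one option is to stage the files there first rather than pulling them from a private repo. The following is a minimal sketch under assumptions: the host keeps its Slurm files in /etc/slurm-llnl, slurm_conf is a hypothetical directory next to the Dockerfile, and the function name stage_slurm_conf is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: copy the host's Slurm *.conf files into the Docker build context so
# a Dockerfile line such as `COPY slurm_conf/ /etc/slurm-llnl/` can pick them up.
# Both directory names are assumptions; adjust them to your own setup.

stage_slurm_conf() {
    local src_dir="$1"      # e.g. /etc/slurm-llnl on the host (assumed location)
    local context_dir="$2"  # e.g. ./slurm_conf next to the Dockerfile (hypothetical)
    local f copied=0

    mkdir -p "$context_dir" || return 1
    for f in "$src_dir"/*.conf; do
        [ -e "$f" ] || continue              # the glob matched nothing
        cp -- "$f" "$context_dir"/ || return 1
        copied=$((copied + 1))
    done
    echo "$copied"                           # number of files staged
}
```

A possible use would be `stage_slurm_conf /etc/slurm-llnl ./slurm_conf` followed by `docker build .`. Note that slurmdbd.conf may hold database credentials, so it is worth considering whether it belongs in an image at all.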

brettChapman commented 4 years ago

I've slightly updated the Dockerfile:

FROM csicunam/get_homologues:latest AS GETHOMOLOGUES

FROM ubuntu:18.04
FROM rstudio/r-base:3.6.3-bionic

RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

COPY --from=GETHOMOLOGUES / .

# Install dependencies from repos: GD for graphics, libidn11 for BLAST+
RUN apt-get update && apt-get install -y \
    slurm-wlm \
    bash-completion \
    build-essential \
    bc \
    curl \
    git \
    htop \
    libgd-gd2-perl \
    libidn11 \
    libpython2.7 \
    procps \
    wget \
    && rm -rf /var/lib/apt/lists/*

RUN curl -L http://cpanmin.us | perl - App::cpanminus
RUN cpanm Inline::C Inline::CPP

RUN cd get_homologues && echo "yes y | perl install.pl" > install.sh

RUN cd get_homologues && bash ./install.sh

# add version name to image
ARG version
LABEL version=$version
RUN echo $version

# prepare user env
USER you

WORKDIR /home/you
ENV PATH="/get_homologues:${PATH}"
ENV PATH="/get_phylomarkers:${PATH}"

When running in cluster mode from within Docker, does Get_homologues-est build other Docker images on the other compute nodes?

I'm trying to set up Slurm in the image, but I just realised that my install on the host system is Slurm 19.05.5 under Ubuntu 20.04, while Get_homologues-est is running on Ubuntu 18.04 with Slurm 17.11.2. Would different Slurm versions on the host and in the Docker image be an issue? Or, as I asked above, does Docker deploy the same Get_homologues-est image on all compute nodes as on the master node? Thanks.

brettChapman commented 4 years ago

I've run into a problem when configuring the hosts for my Slurm configuration. I pull down all my configuration files for setting up Slurm from a private GitHub repo. I'm trying to add the list of IPs to /etc/hosts within my Docker build, but I get this error:

Step 10/21 : RUN git clone https://${TOKEN}@github.com/brettChapman/get_homologues-est_slurm.git
 ---> Using cache
 ---> 46c8582d57f6
Step 11/21 : RUN mv /get_homologues-est_slurm/hosts /etc/hosts
 ---> Running in 99bad5e699bf
mv: cannot move '/get_homologues-est_slurm/hosts' to '/etc/hosts': Device or resource busy

Any ideas on getting around this?

Thanks.
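
The error is most likely because Docker mounts its own /etc/hosts into the container, including during a build, so the file cannot be replaced with mv. One possible workaround is to leave /etc/hosts alone in the Dockerfile and pass the entries at run time with docker run's --add-host option. The sketch below converts a hosts-format file into those options. The function name hosts_to_add_host_flags is hypothetical, and it assumes plain IPv4 entries of the form "IP name [alias...]".

```shell
#!/usr/bin/env bash
# Sketch: turn a hosts-format file ("IP name [alias...]") into
# --add-host=name:IP options for `docker run`, one option per line.
# Assumes IPv4 addresses; the function name is made up for this example.

hosts_to_add_host_flags() {
    local hosts_file="$1"
    awk '
        /^[[:space:]]*(#|$)/ { next }        # skip comment and blank lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break         # stop at a trailing comment
                printf "--add-host=%s:%s\n", $i, $1
            }
        }
    ' "$hosts_file"
}
```

With bash, the options could then be collected into an array, for example `mapfile -t flags < <(hosts_to_add_host_flags hosts)`, and passed as `docker run "${flags[@]}" ...`.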

eead-csic-compbio commented 4 years ago

Hi @brettChapman , I always thought this could be done installing slurm on a single container. I haven't explored this though, so I guess you are on your own unless some of the following 18.04/centos pointers help:

https://ubuntuforums.org/showthread.php?t=2404746
https://www.linuxwave.info/2019/10/installing-slurm-workload-manager-job.html
https://stackoverflow.com/questions/58623954/cant-run-parallel-jobs-with-slurm-on-ubuntu-18-04-on-same-machine
https://ecotoxxplorer.github.io/galaxyserver/
https://github.com/jafreck/docker-ubuntu-slurm

Thanks for your efforts, Bruno

brettChapman commented 4 years ago

Hi @brunocontrerasmoreira I had a look at those links, and the last one about docker-ubuntu-slurm is something I've looked at before. It runs slurm on two slurmd containers, but it doesn't appear to run on multiple machines and is mostly for development purposes and not large pipelines.

I've come up with an alternative. I usually run tools on my cluster through Singularity. After pulling my local get_homologues-est Docker image into Singularity, I tried running get_homologues-est.pl on 1 node and it works fine. What I could do is send off multiple stages of the get_homologues-est pipeline either in parallel or sequentially, depending on the different stages and their input/output dependencies. Does each script pull output from other scripts in the cds_est_homologues/ folder which it generates? If so, I could simply submit multiple different jobs to each node in my cluster. I'll just need to know which jobs depend on other jobs before running. From the looks of it, I could simply run multiple get_homologues-est.pl jobs in parallel, and then run downstream jobs such as plot_matrix_heatmap.sh and parse_pangenome_matrix.pl, as it looks like they would depend on completion of the get_homologues-est.pl jobs.

In my study, I'm working on the barley pangenome, much like what you have done before (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5306281/). I have 20 genomes, 3 of which have annotations which have been mapped to all the other genomes, so using their CDS and protein sequences in get_homologues-est would likely be limited. I also have a limited source of RNA-seq data I could try to assemble, if I can manage it with our compute resources, and I could also include sequences from your barley study to expand on the pangenome.

I've decided to start with the CDS and protein sequences annotated from all 20 genomes, to see what it looks like, and then add transcripts from other sources later.

I've been following the test_barley example and tutorial from here: http://eead-csic-compbio.github.io/get_homologues/tutorial/pangenome_tutorial.html#41_example_protocol

These are the steps from the tutorial:

get_homologues-est.pl -d cds -D -m cluster -o &> log.cds.pfam
get_homologues-est.pl -d cds -M -t 0 -m cluster &> log.cds
get_homologues-est.pl -d cds -M -t 3 -m cluster &> log.cds.t3
get_homologues-est.pl -d cds -M -t 10 -m cluster -A -e &> log.cds.t10.e
get_homologues-est.pl -d cds -c -z -I cds/leaf.list -M -t 3 -m cluster &> log.cds.leaf.t3.c
plot_matrix_heatmap.sh -i cds_est_homologues/Alexis_10taxa_algOMCL_e1_Avg_identity.tab \
  -H 10 -W 15 -t "ANI of single-copy transcripts (occupancy > 9)" -N -o pdf
compare_clusters.pl -d cds_est_homologues/Alexis_0taxa_algOMCL_e0_ -o clusters_cds -m -n &> log.compare_clusters.cds
compare_clusters.pl -d cds_est_homologues/Alexis_3taxa_algOMCL_e0_ -o clusters_cds_t3 -m -n &> log.compare_clusters.cds.t3

parse_pangenome_matrix.pl -m clusters_cds_t3/pangenome_matrix_t0.tab -s &> log.parse_pangenome_matrix.cds.t3
plot_pancore_matrix.pl -i cds_est_homologues/core_genome_leaf.list_algOMCL.tab -f core_both &> log.core.plots
plot_pancore_matrix.pl -i cds_est_homologues/pan_genome_leaf.list_algOMCL.tab -f pan &> log.pan.plots

parse_pangenome_matrix.pl -m clusters_cds_t3/pangenome_matrix_t0.tab -A cds/SBCC073.list -B cds/ref.list -g &> log.acc.SBCC073
mv clusters_cds_t3/pangenome_matrix_t0__pangenes_list.txt clusters_cds_t3/SBCC073_pangenes_list.txt
parse_pangenome_matrix.pl -m clusters_cds_t3/pangenome_matrix_t0.tab -A cds/Scarlett.list -B cds/ref.list -g &> log.acc.Scarlett
mv clusters_cds_t3/pangenome_matrix_t0__pangenes_list.txt clusters_cds_t3/Scarlett_pangenes_list.txt
parse_pangenome_matrix.pl -m clusters_cds_t3/pangenome_matrix_t0.tab -A cds/spontaneum.list -B cds/ref.list -g &> log.acc.spontaneum
mv clusters_cds_t3/pangenome_matrix_t0__pangenes_list.txt \
  clusters_cds_t3/spontaneum_pangenes_list.txt
pfam_enrich.pl -d cds_est_homologues -c clusters_cds -n \
    -x clusters_cds_t3/pangenome_matrix_t0__core_list.txt -e -p 1 \
    -r SBCC073 > SBCC073_core.pfam.enrich.tab
pfam_enrich.pl -d cds_est_homologues -c clusters_cds -n \
    -x clusters_cds_t3/pangenome_matrix_t0__core_list.txt -e -p 1 \
    -r SBCC073 -t less > SBCC073_core.pfam.deplet.tab
pfam_enrich.pl -d cds_est_homologues -c clusters_cds -n \
    -x clusters_cds_t3/SBCC073_pangenes_list.txt -e -p 1 -r SBCC073 \
    -f SBCC073_accessory.fna > SBCC073_accessory.pfam.enrich.tab

pfam_enrich.pl -d cds_est_homologues -c clusters_cds -n \
    -x clusters_cds_t3/Scarlett_pangenes_list.txt -e -p 1 -r Scarlett \
    -f Scarlett_accessory.fna > Scarlett_accessory.pfam.enrich.tab
pfam_enrich.pl -d cds_est_homologues -c clusters_cds -n \
    -x clusters_cds_t3/spontaneum_pangenes_list.txt -e -p 1 -r Hs_ \
    -f spontaneum_accessory.fna > spontaneum_accessory.pfam.enrich.tab
perl suppl_scripts/_add_Pfam_domains.pl > accessory_stats.tab
perl -lane 'print if($F[0] >= 5 || $F[1] >= 5 || $F[2] >= 5)' \
  accessory_stats.tab  > accessory_stats_min5.tab
Rscript suppl_scripts/_plot_heatmap.R

In my cds folder I have the following files:

ubuntu@node-0:/data/pangenome_cluster_analysis$ ls -lhtr cds
total 1.3G
-rwxrwxrwx 1 ubuntu ubuntu  47M Jul  7 05:57 Akashinriki.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  47M Jul  7 05:57 B1K-04-12.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  49M Jul  7 05:57 Barke.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  45M Jul  7 05:57 Golden_Promise.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  49M Jul  7 05:57 HOR_10350.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  47M Jul  7 05:57 HOR_13821.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  47M Jul  7 05:57 HOR_13942.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  47M Jul  7 05:57 HOR_21599.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  48M Jul  7 05:57 HOR_3081.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  48M Jul  7 05:57 HOR_3365.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  47M Jul  7 05:57 HOR_7552.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  48M Jul  7 05:57 HOR_8148.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  48M Jul  7 05:57 HOR_9043.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  50M Jul  7 05:57 Hockett.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  48M Jul  7 05:57 Igri.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  49M Jul  7 05:57 Morex.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  47M Jul  7 05:57 OUN333.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  48M Jul  7 05:57 RGT_Planet.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  47M Jul  7 05:57 ZDM01467.cds.fna
-rwxrwxrwx 1 ubuntu ubuntu  48M Jul  7 05:57 ZDM02064.cds.fna
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 Akashinriki.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 B1K-04-12.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 Barke.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  16M Aug 19 03:54 Golden_Promise.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 HOR_10350.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 HOR_13821.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 HOR_13942.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 HOR_21599.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 HOR_3081.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 HOR_3365.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 HOR_7552.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 HOR_8148.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 HOR_9043.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  18M Aug 19 03:54 Hockett.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 Igri.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  18M Aug 19 03:54 Morex.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 OUN333.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 RGT_Planet.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 ZDM01467.protein.faa
-rwxrwxr-x 1 ubuntu ubuntu  17M Aug 19 03:54 ZDM02064.protein.faa

However, when I run this command:

singularity exec --bind ${PWD}:${PWD} /data/get_homologues_builds/get_homologues.sif get_homologues-est.pl -n 8 -d cds -D -m local -o

I get this error:

# /get_homologues/get_homologues-est.pl -d cds -o 1 -i 40 -e 0 -r 0 -t all -c 0 -z 0 -I 0 -m local -n 8 -M 0 -C 75 -S 95 -E 1e-05 -F 1.5 -b 0 -s 0 -D 1 -R 0 -L 0 -A 0 -P 0

# version 31072020
# results_directory=/data/pangenome_cluster_analysis/cds_est_homologues
# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=5000 BATCHSIZE=1000 MINSEQLENGTH=20 MAXSEQLENGTH=25000

# checking input files...
# Akashinriki.cds.fna 44446 
# WARNING: -D Pfam domain scans cannot be performed without input protein sequences (please check the manual)

If I remove -D it starts running. When the -D parameter is specified it should start to find Pfam domains, and I've provided the corresponding protein sequences in the cds/ folder as *.faa. Should I specify the protein sequences with a different parameter? I see get_homologues.pl has a -i parameter for amino acid sequences, but get_homologues-est.pl doesn't.

I also have a question about the -t parameter for the number of taxa. In your example you run with 0, then 3, then 10, and then 3 again when calculating the average nucleotide identity. Is your workflow unique to your dataset? Should I use -t 20 considering I have 20 varieties? I'm trying to understand each step, how I can relate it to my study, and which additional parameters I should use when I bring in more transcripts (either ones I've assembled myself, or ones included from your barley study). I can understand using -t 0 as a control, but then jumping to -t 3 and then -t 10 must be based on your selection of varieties. I also have a phylogenetic tree of the varieties that I prepared using mashtree, if that helps in determining the -t parameter and any other parameters to use (see attached).

Also I saw mention from the parameters list that the taxa name should be in the sequence headers. My sequence headers in the .fna and .faa files look like this:

Horvu_AKASHIN_1H01G000100.1

I was wondering if the taxa name needs to be separated by a space from the sequence name, in which case I'll modify the headers.

Thanks for your help.

[Attached image: pangenome_tree_bootstrap]

brunocontrerasmoreira commented 4 years ago

Hi, let me know how it goes with Singularity. I am afraid you can't split a single get_homologues-est.pl job across several script calls. Instead, you should use -m cluster or -n for parallel BLAST/HMMER jobs.

Your CDS and protein files should have matching names. In your case, these could be Akashinriki.cds.fna and Akashinriki.cds.faa.

Note the sequences in those twin files should be in the same order. If I recall correctly, taxon names can be indicated in FASTA files in square brackets:

Horvu_AKASHIN_1H01G000100.1 [AKASHIN]

Hope this helps
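
The naming advice above could be applied to the files listed earlier with something like the sketch below. It is only an illustration: the function names are hypothetical, it assumes the protein files are named NAME.protein.faa as in the ls output, and the order check assumes the CDS and protein files use identical sequence IDs as the first word of each header.

```shell
#!/usr/bin/env bash
# Sketch: rename NAME.protein.faa to NAME.cds.faa so each protein file is the
# "twin" of NAME.cds.fna, then check that a twin pair lists the same IDs in
# the same order. Function names are made up for this example.

rename_twin_proteins() {
    local dir="$1" faa twin renamed=0
    for faa in "$dir"/*.protein.faa; do
        [ -e "$faa" ] || continue                 # the glob matched nothing
        twin="${faa%.protein.faa}.cds.faa"
        [ -e "$twin" ] && continue                # never overwrite an existing twin
        mv -- "$faa" "$twin" || return 1
        renamed=$((renamed + 1))
    done
    echo "$renamed"                               # number of files renamed
}

same_sequence_order() {
    local ids_fna ids_faa
    # First word of every FASTA header, in file order.
    ids_fna=$(grep '^>' "$1" | awk '{print $1}')
    ids_faa=$(grep '^>' "$2" | awk '{print $1}')
    [ "$ids_fna" = "$ids_faa" ]                   # exit status 0 if identical
}
```

For example, `rename_twin_proteins cds` followed by `same_sequence_order cds/Morex.cds.fna cds/Morex.cds.faa && echo "order matches"`.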

brettChapman commented 4 years ago

Hi, thanks for your help. I didn't mean splitting a single script across multiple nodes. I meant submitting different jobs (steps 1 to 6) all at the same time on different nodes.

I've managed to get it running with Singularity using the following workflow on my 20 genomes:

## Step 1: calculate protein domain frequencies (Pfam)
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${GET_HOMOLOGUES_IMAGE} get_homologues-est.pl -d cds -D -n ${SLURM_NTASKS_PER_NODE} -o &> log.cds.pfam

## Step 2: calculate 'control' cds clusters
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${GET_HOMOLOGUES_IMAGE} get_homologues-est.pl -d cds -M -t 0 -n ${SLURM_NTASKS_PER_NODE} &> log.cds

## Step 3: get non-cloud clusters
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${GET_HOMOLOGUES_IMAGE} get_homologues-est.pl -d cds -M -t 3 -n ${SLURM_NTASKS_PER_NODE} &> log.cds.t3

## Step 4: single-copy clusters with high occupancy & Average Nucleotide Identity [Note that flag -e leaves out clusters with inparalogues]
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${GET_HOMOLOGUES_IMAGE} get_homologues-est.pl -d cds -M -t 10 -n ${SLURM_NTASKS_PER_NODE} -A -e &> log.cds.t10.e

## Step 5: clusters for dN/dS calculations
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${GET_HOMOLOGUES_IMAGE} get_homologues-est.pl -d cds -e -M -t 4 -n ${SLURM_NTASKS_PER_NODE} &> log.cds.t20.e

## Step 6: pangenome growth simulations with soft-core
srun -n 1 singularity exec --bind ${PWD}:${PWD} ${GET_HOMOLOGUES_IMAGE} get_homologues-est.pl -d cds -c -z -M -t 3 -n ${SLURM_NTASKS_PER_NODE} &> log.cds.t3.c

I first tried to run all 6 jobs (steps) on separate nodes, but it appears get_homologues-est won't run a job while it's running another one, so I've had to add dependencies like so:

job1=$(sbatch submit_get_homologous-est_step1.sh | cut -d ' ' -f 4)
job2=$(sbatch --dependency=afterok:$job1 submit_get_homologous-est_step2.sh | cut -d ' ' -f 4)
job3=$(sbatch --dependency=afterok:$job2 submit_get_homologous-est_step3.sh | cut -d ' ' -f 4)
job4=$(sbatch --dependency=afterok:$job3 submit_get_homologous-est_step4.sh | cut -d ' ' -f 4)
job5=$(sbatch --dependency=afterok:$job4 submit_get_homologous-est_step5.sh | cut -d ' ' -f 4)
sbatch --dependency=afterok:$job5 submit_get_homologous-est_step6.sh
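
The same chain can be written as a loop. The sketch below uses sbatch's --parsable option, which prints only the job ID (optionally followed by ";cluster"), instead of cutting the fourth word out of "Submitted batch job N". The function name submit_chain is hypothetical, and the script names are whatever is passed in.

```shell
#!/usr/bin/env bash
# Sketch: submit scripts one after another, each depending on the successful
# completion (afterok) of the previous one. submit_chain is a made-up name.

submit_chain() {
    local prev="" script jobid
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency="afterok:$prev" "$script") || return 1
        fi
        jobid=${jobid%%;*}            # drop the ";cluster" suffix if present
        echo "$script -> job $jobid"
        prev=$jobid
    done
}
```

It would be called with the six step scripts in order, e.g. `submit_chain submit_get_homologous-est_step1.sh submit_get_homologous-est_step2.sh` and so on.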

Is this the same way it would run if I had -m cluster instead of -n 16, i.e. each step is run in that order from step 1 to 6, or would they all run concurrently on the cluster?

In regards to the -t parameter taxa, is that related to "occupancy" referred to in your paper? Does increasing the -t value basically increase the minimum number of varieties allowed in a cluster, with -t 0 removing that limit? i.e. limiting the core gene sets to x number of varieties? Why is it important to have different values of -t for different jobs? I feel like I'll likely have to play around with the -t value to get a feel for how it impacts the results.

Thanks.

brunocontrerasmoreira commented 4 years ago

Hi @brettChapman, -t can be used to indicate occupancy, as you correctly guessed. With -t 0 there is no restriction, and thus you can get clusters with any occupancy, including singletons found in only one genome/proteome. If you have 20 genomes, -t 20 makes sure you only get strict core clusters with sequences found in all of them.
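
To make the idea of occupancy concrete, here is a toy filter over a small tab-separated matrix (one row per cluster, one column per genome, cells holding sequence counts). This is not how get_homologues-est implements -t, and the matrix layout and the function name filter_by_occupancy are assumptions made for illustration. It only mimics the effect described above: a cluster is kept when it has sequences from at least the given number of genomes.

```shell
#!/usr/bin/env bash
# Toy sketch: keep clusters whose occupancy (number of genomes with a
# non-zero count) is at least min_taxa. Input layout is hypothetical:
# header row, then "cluster<TAB>count_g1<TAB>count_g2...".

filter_by_occupancy() {
    local min_taxa="$1" matrix="$2"
    awk -F '\t' -v min="$min_taxa" '
        NR == 1 { next }                          # skip the header row
        {
            occ = 0
            for (i = 2; i <= NF; i++) if ($i > 0) occ++
            if (occ >= min) print $1 "\t" occ     # cluster name and occupancy
        }
    ' "$matrix"
}
```

With a minimum of 0 every cluster passes, including singletons. With a minimum equal to the number of genome columns, only clusters present in all genomes remain.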

eead-csic-compbio commented 4 years ago

Closed this after https://github.com/eead-csic-compbio/get_homologues/issues/67