Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure

Docker Image for Snakemake Not Being Downloaded to Compute Nodes #268

Closed. markpearl closed this issue 5 years ago.

markpearl commented 5 years ago

I have come across an issue where my tasks submitted using the Batch Shipyard command line tools are stuck in the Preparing state because the Docker image they depend on is not present on the Batch compute node.

After dealing with the issue for quite some time, I created an admin user on the compute node and logged in via SSH. I manually pulled my Docker image, which is hosted in a private registry, and then ran my task again. This time the task failed again and hung.

Provided are my configuration files:

[screenshots: config.yaml and the other configuration files]

Dockerfile used to create the image:

FROM biocontainers/biocontainers:latest

RUN conda install trimmomatic

WORKDIR /home/mjpearl/fileshare

CMD ["trimmomatic"]

This image is stored in my container registry under the trimmomatic repository with the tag trimmomatic (i.e. trimmomatic:trimmomatic).
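A typical way to produce such a reference would be roughly the following (a sketch only; <registry> is a placeholder, since the actual registry host is not named at this point in the thread):

docker build -t <registry>.azurecr.io/trimmomatic/trimmomatic:trimmomatic .   # <registry> is a placeholder
docker push <registry>.azurecr.io/trimmomatic/trimmomatic:trimmomatic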

As per the azure-hpc example, my snakemake workflow file creates a shell script. I create the pool with the following command:

mjpearl@controlvm:~/fileshare/snakemake$ $SHIPYARD/shipyard pool add --configdir $FILESHARE/snakemake/azurebatch

The pool gets created successfully, and as soon as I see that the pool reaches the "starting" state I abort the command.

When I run the snakemake workflow the job just hangs:

[screenshot: snakemake job hanging]

For the batch account, the start task for the dedicated node seems to fail:

[screenshot: failed start task on the dedicated node]

Also provided are the stderr.txt and stdout.txt, respectively:

Stderr.txt: mount error(2): No such file or directory Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)

Stdout.txt: Linux 7bdca5853a074102b5577ae7ba27e2e1000000 4.18.0-1013-azure #13~18.04.1-Ubuntu SMP Thu Feb 28 23:48:47 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux 2019-03-18T04:51:49,180846313+00:00 - INFO - Prep start

Configuration: Custom image: 0 Native mode: 0 OS Distribution: ubuntu 18.04 Batch Shipyard version: 3.7.0 Blobxfer version: 1.7.0 Singularity version: User mountpoint: /mnt Mount path: /mnt/batch/tasks/mounts Batch Insights: 0 Prometheus: NE=, CA=, Network optimization: 0 Encryption cert thumbprint: Install Kata Containers: 0 Default container runtime: runc Install BeeGFS BeeOND: 0 Storage cluster mount: Custom mount: Install LIS: GPU: Azure Blob: 0 Azure File: 1 GlusterFS on compute: 0 HPN-SSH: 0 Enable Azure Batch group for Docker access: Fallback registry: Docker image preload delay: 0 Cascade via container: 1 P2P: 0 Block on images: trimmomatic/trimmomatic#

2019-03-18T04:51:49,296618240+00:00 - INFO - LIS installation not required 2019-03-18T04:51:49,305637489+00:00 - INFO - Mounting Azure File

Provided is the snakemake file:

# Snakefile for the RNA-Seq analysis pipeline using test data from zebrafish
# You should not need to edit this file unless you are changing the programs in the pipeline

configfile: "config_zebrafish.yaml"
SAMPLES = config['samples']

R1_suffix = config['input_file_R1_suffix']
R2_suffix = config['input_file_R2_suffix']
genome_fasta_file = config['genome_fasta_file']
genome_index_base = config['genome_index_base']
merged_transcripts_file = config['merged_transcripts_file']

rule all:
    input: abundance_table=expand("4_transcript_abundances/{sample}/{sample}.abundance_table.txt", sample=SAMPLES)

rule trim_and_qc_all:
    input: html=expand("1_fastqc_reports/{sample}_R1.trimmed_paired_fastqc.html", sample=SAMPLES)

rule assemble_all:
    input: gtf=expand("3_assembled_transcripts/{sample}.gtf", sample=SAMPLES)

rule map_all:
    input: bam=expand("2_mapped_reads/{sample}.sorted.bam", sample=SAMPLES)

rule trim_reads:
    input:
        R1_reads="/home/mjpearl/fileshare/data/{sample}" + R1_suffix,
        R2_reads="/home/mjpearl/fileshare/data/{sample}" + R2_suffix
    output:
        "1_trimmed_reads/{sample}_R1.trimmed_paired.fastq",
        "1_trimmed_reads/{sample}_R1.trimmed_unpaired.fastq",
        "1_trimmed_reads/{sample}_R2.trimmed_paired.fastq",
        "1_trimmed_reads/{sample}_R2.trimmed_unpaired.fastq"
    threads: config['threads']
    params:
        run_params=config['trimmomatic_params']
    shell:
        "echo -e \"#!/usr/bin/env bash\ncd $FILESHARE/snakemake;\n trimmomatic PE -threads {threads} {input.R1_reads} {input.R2_reads} {output} {params.run_params}\" > $FILESHARE/snakemake/jobrun.sh ;\n $SHIPYARD/shipyard jobs add --configdir $FILESHARE/snakemake/azurebatch --tail stderr.txt\n"

jobrun.sh gets created in the snakemake directory and contains the following command:

trimmomatic PE -threads 1 /home/mjpearl/fileshare/data/zebrafish_6h_1.fastq /home/mjpearl/fileshare/data/zebrafish_6h_2.fastq 1_trimmed_reads/zebrafish_6h_R1.trimmed_paired.fastq 1_trimmed_reads/zebrafish_6h_R1.trimmed_unpaired.fastq 1_trimmed_reads/zebrafish_6h_R2.trimmed_paired.fastq 1_trimmed_reads/zebrafish_6h_R2.trimmed_unpaired.fastq HEADCROP:10 SLIDINGWINDOW:4:20 MINLEN:36

Any help would be appreciated.

alfpark commented 5 years ago

You have two issues here as far as I can tell:

  1. The compute node preparation task is failing when mounting your Azure File Share. Please ensure that the file share named fileshare exists in storage account agcanrnaseqdiag. When the compute node mounts the specified file share, it must exist in Azure Storage (as per documentation).
  2. Your Docker image reference is incorrect. The Docker image must be fully qualified as per the documentation. As-is in your config.yaml and jobs.yaml files, the image pull attempt would happen against Docker Hub, which is certainly not what you want in your case. You need to modify the image name in both YAML files to be agcanregistry.azurecr.io/trimmomatic/trimmomatic (see the sketch below).
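As a minimal sketch of where the fully qualified name would go (field names follow the Batch Shipyard configuration docs as I understand them and should be checked against your version; the job id is a made-up example):

# config.yaml (excerpt)
global_resources:
  docker_images:
    - agcanregistry.azurecr.io/trimmomatic/trimmomatic

# jobs.yaml (excerpt)
job_specifications:
  - id: trimmomatic-job          # example id
    tasks:
      - docker_image: agcanregistry.azurecr.io/trimmomatic/trimmomatic
        command: trimmomatic
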
markpearl commented 5 years ago

Thanks Fred. I've amended the config YAML to the following:

[screenshot: updated config]

For the snakemake pipeline, I mounted two directories onto the VM where snakemake is installed. When clicking on agcanfileshare below, you can see them:

[screenshots: contents of agcanfileshare]

Is it okay to specify the top-most directory for the fileshare, or will I need to be more specific and specify /agcanfileshare/data? What is the purpose of the azurefilevol in the overall execution of the job?

Inside my jobs.yaml, the command being run references a directory path on the physical virtual machine where snakemake and shipyard are installed, not on the compute node. Would I need to change the command to reference a directory from the fileshare?

alfpark commented 5 years ago

The azure_file_share_name must be the file share name and not any subdirectories. Those subdirectories will appear as normal directories once the share is mounted. So in your example, while the task/container is running, you would have:

/agcanfileshare
  |- data
  |- genome

I can't answer what the purpose for this fileshare is, as that is something your program/task logic dictates. Presumably you're mounting the Azure File Share for your task to access requisite data.

Finally, the task command specified in jobs.yaml is always executed in the context of the running container, which is on a compute node in the Batch pool.
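
As a sketch of how such a shared data volume is wired up (assuming the volume is named azurefilevol as mentioned above; the storage_account_settings link name and exact field names come from the Batch Shipyard docs as I recall them and should be verified against your version):

# config.yaml (excerpt)
global_resources:
  volumes:
    shared_data_volumes:
      azurefilevol:
        volume_driver: azurefile
        storage_account_settings: mystorageaccount   # link name in credentials.yaml (assumed)
        azure_file_share_name: agcanfileshare        # the share itself, not a subdirectory
        container_path: /agcanfileshare              # where the share appears inside the container
        mount_options:
          - filemode=0777
          - dirmode=0777

# jobs.yaml (excerpt): reference the volume from the task
job_specifications:
  - id: trimmomatic-job
    tasks:
      - shared_data_volumes:
          - azurefilevol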

markpearl commented 5 years ago

Hi Fred,

I seem to have made progress compared to the last attempt. Looking at the stderr.txt and stdout.txt, it seems as though the Docker image is getting pulled successfully.

Stderr.txt:

Warning: apt-key output should not be parsed (stdout is not a terminal) Warning: Stopping docker.service, but it can still be activated by: docker.socket Synchronizing state of docker.service with SysV service script with /lib/systemd/systemd-sysv-install. Executing: /lib/systemd/systemd-sysv-install disable docker WARNING: API is accessible on http://127.0.0.1:2375 without encryption. Access to the remote API is equivalent to root access on the host. Refer to the 'Docker daemon attack surface' section in the documentation for more information: https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface WARNING: No swap limit support WARNING: Published ports are discarded when using host network mode

Stdout.txt:

Linux a3168ddbca384fe79eab806d892c3978000000 4.18.0-1013-azure #13~18.04.1-Ubuntu SMP Thu Feb 28 23:48:47 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux 2019-03-18T15:45:45,215905303+00:00 - INFO - Prep start Configuration:

Custom image: 0 Native mode: 0 OS Distribution: ubuntu 18.04 Batch Shipyard version: 3.7.0 Blobxfer version: 1.7.0 Singularity version: User mountpoint: /mnt Mount path: /mnt/batch/tasks/mounts Batch Insights: 0 Prometheus: NE=, CA=, Network optimization: 0 Encryption cert thumbprint: Install Kata Containers: 0 Default container runtime: runc Install BeeGFS BeeOND: 0 Storage cluster mount: Custom mount: Install LIS: GPU: Azure Blob: 0 Azure File: 1 GlusterFS on compute: 0 HPN-SSH: 0 Enable Azure Batch group for Docker access: Fallback registry: Docker image preload delay: 0 Cascade via container: 1 P2P: 0 Block on images: agcancregistry.azurecr.io/trimmomatic/trimmomatic#

2019-03-18T15:45:45,337860866+00:00 - INFO - LIS installation not required 2019-03-18T15:45:45,347078216+00:00 - INFO - Mounting Azure File Shares 2019-03-18T15:45:46,070012124+00:00 - DEBUG - Installing Docker Host Engine Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB] Hit:2 http://azure.archive.ubuntu.com/ubuntu bionic InRelease Get:3 http://azure.archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB] Get:4 http://azure.archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB] Get:5 http://azure.archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [556 kB] Get:6 http://azure.archive.ubuntu.com/ubuntu bionic-updates/main Translation-en [207 kB] Get:7 http://azure.archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [744 kB] Get:8 http://azure.archive.ubuntu.com/ubuntu bionic-updates/universe Translation-en [193 kB] Get:9 http://azure.archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [6,384 B] Get:10 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [282 kB] Get:11 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [127 kB] Get:12 http://security.ubuntu.com/ubuntu bionic-security/universe Translation-en [71.8 kB] Get:13 http://security.ubuntu.com/ubuntu bionic-security/multiverse amd64 Packages [3,748 B] Fetched 2,443 kB in 3s (824 kB/s) Reading package lists... Reading package lists... Building dependency tree... Reading state information... ca-certificates is already the newest version (20180409). curl is already the newest version (7.58.0-2ubuntu3.6). software-properties-common is already the newest version (0.96.24.32.7). The following NEW packages will be installed: apt-transport-https gnupg2 0 upgraded, 2 newly installed, 0 to remove and 10 not upgraded. Need to get 6,360 B of archives. After this operation, 205 kB of additional disk space will be used. Get:1 http://azure.archive.ubuntu.com/ubuntu bionic-updates/universe amd64 apt-transport-https all 1.6.8 [1,692 B] Get:2 http://azure.archive.ubuntu.com/ubuntu bionic-updates/universe amd64 gnupg2 all 2.2.4-1ubuntu1.2 [4,668 B] Fetched 6,360 B in 0s (160 kB/s) Selecting previously unselected package apt-transport-https. (Reading database ... (Reading database ... 5% (Reading database ... 10% (Reading database ... 15% (Reading database ... 20% (Reading database ... 25% (Reading database ... 30% (Reading database ... 35% (Reading database ... 40% (Reading database ... 45% (Reading database ... 50% (Reading database ... 55% (Reading database ... 60% (Reading database ... 65% (Reading database ... 70% (Reading database ... 75% (Reading database ... 80% (Reading database ... 85% (Reading database ... 90% (Reading database ... 95% (Reading database ... 100% (Reading database ... 55599 files and directories currently installed.) Preparing to unpack .../apt-transport-https_1.6.8_all.deb ... Unpacking apt-transport-https (1.6.8) ... Selecting previously unselected package gnupg2. Preparing to unpack .../gnupg2_2.2.4-1ubuntu1.2_all.deb ... Unpacking gnupg2 (2.2.4-1ubuntu1.2) ... Setting up apt-transport-https (1.6.8) ... Setting up gnupg2 (2.2.4-1ubuntu1.2) ... Processing triggers for man-db (2.8.3-2ubuntu0.1) ... 
OK Hit:1 http://azure.archive.ubuntu.com/ubuntu bionic InRelease Hit:2 http://azure.archive.ubuntu.com/ubuntu bionic-updates InRelease Hit:3 http://azure.archive.ubuntu.com/ubuntu bionic-backports InRelease Hit:4 http://security.ubuntu.com/ubuntu bionic-security InRelease Get:5 https://download.docker.com/linux/ubuntu bionic InRelease [64.4 kB] Get:6 https://download.docker.com/linux/ubuntu bionic/stable amd64 Packages [5,195 B] Fetched 69.6 kB in 2s (41.2 kB/s) Reading package lists... Hit:1 http://security.ubuntu.com/ubuntu bionic-security InRelease Hit:2 http://azure.archive.ubuntu.com/ubuntu bionic InRelease Hit:3 http://azure.archive.ubuntu.com/ubuntu bionic-updates InRelease Hit:4 http://azure.archive.ubuntu.com/ubuntu bionic-backports InRelease Hit:5 https://download.docker.com/linux/ubuntu bionic InRelease Reading package lists... Reading package lists... Building dependency tree... Reading state information... The following additional packages will be installed: containerd.io docker-ce-cli Recommended packages: aufs-tools cgroupfs-mount | cgroup-lite pigz libltdl7 The following NEW packages will be installed: containerd.io docker-ce docker-ce-cli 0 upgraded, 3 newly installed, 0 to remove and 10 not upgraded. Need to get 50.5 MB of archives. After this operation, 242 MB of additional disk space will be used. Get:1 https://download.docker.com/linux/ubuntu bionic/stable amd64 containerd.io amd64 1.2.4-1 [19.9 MB] Get:2 https://download.docker.com/linux/ubuntu bionic/stable amd64 docker-ce-cli amd64 5:18.09.3~3-0~ubuntu-bionic [13.1 MB] Get:3 https://download.docker.com/linux/ubuntu bionic/stable amd64 docker-ce amd64 5:18.09.2~3-0~ubuntu-bionic [17.4 MB] Fetched 50.5 MB in 4s (11.8 MB/s) Selecting previously unselected package containerd.io. (Reading database ... (Reading database ... 5% (Reading database ... 10% (Reading database ... 15% (Reading database ... 20% (Reading database ... 25% (Reading database ... 30% (Reading database ... 35% (Reading database ... 40% (Reading database ... 45% (Reading database ... 50% (Reading database ... 55% (Reading database ... 60% (Reading database ... 65% (Reading database ... 70% (Reading database ... 75% (Reading database ... 80% (Reading database ... 85% (Reading database ... 90% (Reading database ... 95% (Reading database ... 100% (Reading database ... 55610 files and directories currently installed.) Preparing to unpack .../containerd.io_1.2.4-1_amd64.deb ... Unpacking containerd.io (1.2.4-1) ... Selecting previously unselected package docker-ce-cli. Preparing to unpack .../docker-ce-cli_5%3a18.09.3~3-0~ubuntu-bionic_amd64.deb ... Unpacking docker-ce-cli (5:18.09.3~3-0~ubuntu-bionic) ... Selecting previously unselected package docker-ce. Preparing to unpack .../docker-ce_5%3a18.09.2~3-0~ubuntu-bionic_amd64.deb ... Unpacking docker-ce (5:18.09.2~3-0~ubuntu-bionic) ... Setting up containerd.io (1.2.4-1) ... Created symlink /etc/systemd/system/multi-user.target.wants/containerd.service → /lib/systemd/system/containerd.service. Processing triggers for ureadahead (0.100.0-20) ... Processing triggers for systemd (237-3ubuntu10.15) ... Processing triggers for man-db (2.8.3-2ubuntu0.1) ... Setting up docker-ce-cli (5:18.09.3~3-0~ubuntu-bionic) ... Setting up docker-ce (5:18.09.2~3-0~ubuntu-bionic) ... update-alternatives: using /usr/bin/dockerd-ce to provide /usr/bin/dockerd (dockerd) in auto mode Created symlink /etc/systemd/system/multi-user.target.wants/docker.service → /lib/systemd/system/docker.service. 
Created symlink /etc/systemd/system/sockets.target.wants/docker.socket → /lib/systemd/system/docker.socket. Processing triggers for ureadahead (0.100.0-20) ... Processing triggers for systemd (237-3ubuntu10.15) ... ● docker.service - Docker Application Container Engine Loaded: loaded (/lib/systemd/system/docker.service; disabled; vendor preset: enabled) Active: active (running) since Mon 2019-03-18 15:47:38 UTC; 40ms ago Docs: https://docs.docker.com Main PID: 4862 (dockerd) Tasks: 8 CGroup: /system.slice/docker.service └─4862 /usr/bin/dockerd

Mar 18 15:47:38 a3168ddbca384fe79eab806d892c3978000000 dockerd[4862]: time="2019-03-18T15:47:38.060819467Z" level=warning msg="Your kernel does not support cgroup blkio weight_device" Mar 18 15:47:38 a3168ddbca384fe79eab806d892c3978000000 dockerd[4862]: time="2019-03-18T15:47:38.062170280Z" level=info msg="Loading containers: start." Mar 18 15:47:38 a3168ddbca384fe79eab806d892c3978000000 dockerd[4862]: time="2019-03-18T15:47:38.210840542Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address" Mar 18 15:47:38 a3168ddbca384fe79eab806d892c3978000000 dockerd[4862]: time="2019-03-18T15:47:38.305518678Z" level=info msg="Loading containers: done." Mar 18 15:47:38 a3168ddbca384fe79eab806d892c3978000000 dockerd[4862]: time="2019-03-18T15:47:38.400762162Z" level=warning msg="Not using native diff for overlay2, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled" storage-driver=overlay2 Mar 18 15:47:38 a3168ddbca384fe79eab806d892c3978000000 dockerd[4862]: time="2019-03-18T15:47:38.401301107Z" level=info msg="Docker daemon" commit=6247962 graphdriver(s)=overlay2 version=18.09.2 Mar 18 15:47:38 a3168ddbca384fe79eab806d892c3978000000 dockerd[4862]: time="2019-03-18T15:47:38.401569029Z" level=info msg="Daemon has completed initialization" Mar 18 15:47:38 a3168ddbca384fe79eab806d892c3978000000 systemd[1]: Started Docker Application Container Engine. Mar 18 15:47:38 a3168ddbca384fe79eab806d892c3978000000 dockerd[4862]: time="2019-03-18T15:47:38.475514227Z" level=info msg="API listen on 127.0.0.1:2375" Mar 18 15:47:38 a3168ddbca384fe79eab806d892c3978000000 dockerd[4862]: time="2019-03-18T15:47:38.475772649Z" level=info msg="API listen on /var/run/docker.sock" Containers: 0 Running: 0 Paused: 0 Stopped: 0 Images: 0 Server Version: 18.09.2 Storage Driver: overlay2 Backing Filesystem: extfs Supports d_type: true Native Overlay Diff: false Logging Driver: json-file Cgroup Driver: cgroupfs Plugins: Volume: local Network: bridge host macvlan null overlay Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog Swarm: inactive Runtimes: runc Default Runtime: runc Init Binary: docker-init containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e runc version: 6635b4f0c6af3810594d2770f662f34ddc15b40d init version: fec3683 Security Options: apparmor seccomp Profile: default Kernel Version: 4.18.0-1013-azure Operating System: Ubuntu 18.04.2 LTS OSType: linux Architecture: x86_64 CPUs: 1 Total Memory: 1.636GiB Name: a3168ddbca384fe79eab806d892c3978000000 ID: PKWG:VHNZ:BNBE:73G4:RAI7:QDGB:BNI7:3J3C:SFSY:OUM5:VUBD:SRI5 Docker Root Dir: /mnt/docker Debug Mode (client): false Debug Mode (server): false Registry: https://index.docker.io/v1/ Labels: Experimental: false Insecure Registries: 127.0.0.0/8 Live Restore Enabled: false Product License: Community Engine

2019-03-18T15:47:38,722885762+00:00 - INFO - Docker Host Engine installed 2019-03-18T15:47:38+00:00 - DEBUG - Logging into 1 Docker registry servers... 2019-03-18T15:47:38+00:00 - DEBUG - Logging into Docker registry: agcancregistry.azurecr.io with user: agcancregistry WARNING! Using --password via the CLI is insecure. Use --password-stdin. WARNING! Your password will be stored unencrypted in /mnt/batch/tasks/startup/wd/.docker/config.json. Configure a credential helper to remove this warning. See https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded 2019-03-18T15:47:39+00:00 - INFO - Docker registry logins completed. 2019-03-18T15:47:39+00:00 - WARNING - No Singularity registry servers found. 2019-03-18T15:47:39,398210717+00:00 - INFO - Batch Insights disabled. 2019-03-18T15:47:39,399162996+00:00 - INFO - Prometheus node exporter disabled. 2019-03-18T15:47:39,400156279+00:00 - INFO - Prometheus cAdvisor disabled. 2019-03-18T15:47:39,408025730+00:00 - INFO - IB device not found 2019-03-18T15:47:39,408999711+00:00 - DEBUG - Pulling Docker Image: alfpark/blobxfer:1.7.0 (fallback: 0) 1.7.0: Pulling from alfpark/blobxfer 6c40cc604d8e: Pulling fs layer 88045d3327a8: Pulling fs layer 6c40cc604d8e: Verifying Checksum 6c40cc604d8e: Download complete 88045d3327a8: Verifying Checksum 88045d3327a8: Download complete 6c40cc604d8e: Pull complete 88045d3327a8: Pull complete Digest: sha256:4332406cb7d813647d8a280bbeab36d860672e75a2ff2d4c6425c7d2e662ac13 Status: Downloaded newer image for alfpark/blobxfer:1.7.0 2019-03-18T15:47:57,722279901+00:00 - DEBUG - Pulling Docker Image: alfpark/batch-shipyard:3.7.0-cargo (fallback: 0) 3.7.0-cargo: Pulling from alfpark/batch-shipyard 6c40cc604d8e: Already exists 41d85d378931: Pulling fs layer d552aeff627a: Pulling fs layer 791fd7b1f7f6: Pulling fs layer 41d85d378931: Verifying Checksum 41d85d378931: Download complete 791fd7b1f7f6: Verifying Checksum 791fd7b1f7f6: Download complete 41d85d378931: Pull complete d552aeff627a: Verifying Checksum d552aeff627a: Download complete d552aeff627a: Pull complete 791fd7b1f7f6: Pull complete Digest: sha256:fff249857898740791c9a28bd7fad689d66cc840ec51ce52cfb79d93b1b876d3 Status: Downloaded newer image for alfpark/batch-shipyard:3.7.0-cargo 2019-03-18T15:48:16,058781096+00:00 - WARNING - Singularity version not specified, not installing 2019-03-18T15:48:16,065904388+00:00 - DEBUG - Kata containers not flagged for install 2019-03-18T15:48:16,066965347+00:00 - DEBUG - BeeGFS BeeOND not flagged for install 2019-03-18T15:48:16,217103621+00:00 - DEBUG - Pulling Docker Image: alfpark/batch-shipyard:3.7.0-cascade (fallback: 0) 3.7.0-cascade: Pulling from alfpark/batch-shipyard 6c40cc604d8e: Already exists a84b0b53c5f6: Pulling fs layer 0e4a1c953f65: Pulling fs layer acf81bf43e92: Pulling fs layer 1697c3605d3d: Pulling fs layer 1d3c53136279: Pulling fs layer 1697c3605d3d: Waiting 1d3c53136279: Waiting acf81bf43e92: Verifying Checksum acf81bf43e92: Download complete 0e4a1c953f65: Verifying Checksum 0e4a1c953f65: Download complete a84b0b53c5f6: Verifying Checksum a84b0b53c5f6: Download complete 1697c3605d3d: Verifying Checksum 1697c3605d3d: Download complete a84b0b53c5f6: Pull complete 1d3c53136279: Verifying Checksum 1d3c53136279: Download complete 0e4a1c953f65: Pull complete acf81bf43e92: Pull complete 1697c3605d3d: Pull complete 1d3c53136279: Pull complete Digest: sha256:120c498a1544b0caecd3cef35799526541610fd2536f45bfd6a8ec3ea0c2b8ae Status: Downloaded newer image for alfpark/batch-shipyard:3.7.0-cascade 2019-03-18T15:49:35,249137648+00:00 - DEBUG - Starting Cascade 2019-03-18T15:49:50UTC - DEBUG - Logging into 1 Docker registry servers... 2019-03-18T15:49:50UTC - DEBUG - Logging into Docker registry: agcancregistry.azurecr.io with user: agcancregistry WARNING! Using --password via the CLI is insecure. Use --password-stdin. WARNING! Your password will be stored unencrypted in /root/.docker/config.json. Configure a credential helper to remove this warning. See https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded 2019-03-18T15:49:50UTC - INFO - Docker registry logins completed. 2019-03-18T15:49:50UTC - WARNING - No Singularity registry servers found.

However, when the snakemake workflow gets to the trimmomatic command, the job just hangs. I was anticipating that since the Docker image is pulled successfully the command would run, but based on your feedback it would seem that jobrun.sh should be referencing a container on a compute node in the Batch pool.

How would I be able to use this Docker image from the Batch pool effectively while keeping the shell script located on the physical VM where Batch Shipyard and snakemake are installed?

alfpark commented 5 years ago

According to your logs, the node has not completed startup. There will be a line at the end of the log stating as such which is missing above. You can see if the state of the node is indeed idle and also check/paste the cascade.log file which will be in the startup/wd directory.

In order to share files between your local VM and the compute node using the file share, you will need to mount the Azure File Share on both your local VM and the compute node, save your scripts in the file share, and reference them in the task.
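
For example, mounting the share on the control VM might look roughly like this (a sketch only; the storage account and share names are the ones mentioned earlier in this thread, and <storage-account-key> must be replaced with the real key):

sudo mkdir -p /mnt/agcanfileshare
sudo mount -t cifs //agcanrnaseqdiag.file.core.windows.net/agcanfileshare /mnt/agcanfileshare \
    -o vers=3.0,username=agcanrnaseqdiag,password=<storage-account-key>,dir_mode=0777,file_mode=0777,serverino

Scripts written under /mnt/agcanfileshare on the control VM would then be visible to tasks under the container_path configured for the shared data volume.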

markpearl commented 5 years ago

Thanks Fred. Here's the output of the cascade.log:

2019-03-18 15:49:52,051.051Z INFO cascade.py::_setup_logger:159 21:MainThread logger initialized, log file: /mnt/batch/tasks/startup/wd/cascade.log 2019-03-18 15:49:52,060.060Z INFO cascade.py::main:1313 21:MainThread max concurrent downloads: 10 2019-03-18 15:49:52,060.060Z DEBUG cascade.py::main:1343 21:MainThread ip address: 10.0.0.4 2019-03-18 15:49:52,389.389Z DEBUG cascade.py::_direct_download_resources_async:758 21:MainThread blob lease 39978934-49df-4c80-8310-50a5c46043c7 acquired for resource docker: agcancregistry.azurecr.io/trimmomatic/trimmomatic 2019-03-18 15:49:52,389.389Z INFO cascade.py::_pull_and_save:522 21:Thread-1 pulling docker image agcancregistry.azurecr.io/trimmomatic/trimmomatic 2019-03-18 15:49:52,775.775Z ERROR cascade.py::run:458 21:Thread-1 docker pull failed: stdout=Using default tag: latest stderr=Error response from daemon: manifest for agcancregistry.azurecr.io/trimmomatic/trimmomatic:latest not found Traceback (most recent call last): File "cascade.py", line 456, in run self._pull_and_save() File "cascade.py", line 549, in _pull_and_save grtype, stdout, stderr)) RuntimeError: docker pull failed: stdout=Using default tag: latest stderr=Error response from daemon: manifest for agcancregistry.azurecr.io/trimmomatic/trimmomatic:latest not found

2019-03-18 15:49:52,790.790Z DEBUG cascade.py::run:478 21:Thread-1 blob lease released for docker:agcancregistry.azurecr.io/trimmomatic/trimmomatic 2019-03-18 15:49:53,392.392Z CRITICAL cascade.py::download_monitor_async:1113 21:MainThread Thread exceptions encountered, terminating 2019-03-18 15:49:53,392.392Z ERROR cascade.py::flush:141 21:MainThread <main.StandardStreamLogger object at 0x7f2092415c50> 2019-03-18 15:49:53,392.392Z ERROR cascade.py::write:137 21:MainThread Traceback (most recent call last):

2019-03-18 15:49:53,392.392Z ERROR cascade.py::write:137 21:MainThread File "cascade.py", line 1377, in

2019-03-18 15:49:53,393.393Z ERROR cascade.py::write:137 21:MainThread 2019-03-18 15:49:53,393.393Z ERROR cascade.py::write:137 21:MainThread main() 2019-03-18 15:49:53,393.393Z ERROR cascade.py::write:137 21:MainThread File "cascade.py", line 1353, in main

2019-03-18 15:49:53,394.394Z ERROR cascade.py::write:137 21:MainThread 2019-03-18 15:49:53,394.394Z ERROR cascade.py::write:137 21:MainThread distribute_global_resources(loop, blob_client, table_client, ipaddress) 2019-03-18 15:49:53,394.394Z ERROR cascade.py::write:137 21:MainThread File "cascade.py", line 1285, in distribute_global_resources

2019-03-18 15:49:53,395.395Z ERROR cascade.py::write:137 21:MainThread 2019-03-18 15:49:53,395.395Z ERROR cascade.py::write:137 21:MainThread loop, blob_client, table_client, ipaddress, nentities)) 2019-03-18 15:49:53,395.395Z ERROR cascade.py::write:137 21:MainThread File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete

2019-03-18 15:49:53,396.396Z ERROR cascade.py::write:137 21:MainThread 2019-03-18 15:49:53,403.403Z ERROR cascade.py::write:137 21:MainThread return future.result() 2019-03-18 15:49:53,403.403Z ERROR cascade.py::write:137 21:MainThread File "cascade.py", line 1115, in download_monitor_async

2019-03-18 15:49:53,403.403Z ERROR cascade.py::write:137 21:MainThread 2019-03-18 15:49:53,403.403Z ERROR cascade.py::write:137 21:MainThread raise _THREAD_EXCEPTIONS[0] 2019-03-18 15:49:53,404.404Z ERROR cascade.py::write:137 21:MainThread File "cascade.py", line 456, in run

2019-03-18 15:49:53,404.404Z ERROR cascade.py::write:137 21:MainThread 2019-03-18 15:49:53,404.404Z ERROR cascade.py::write:137 21:MainThread self._pull_and_save() 2019-03-18 15:49:53,404.404Z ERROR cascade.py::write:137 21:MainThread File "cascade.py", line 549, in _pull_and_save

2019-03-18 15:49:53,404.404Z ERROR cascade.py::write:137 21:MainThread 2019-03-18 15:49:53,405.405Z ERROR cascade.py::write:137 21:MainThread grtype, stdout, stderr)) 2019-03-18 15:49:53,405.405Z ERROR cascade.py::write:137 21:MainThread RuntimeError 2019-03-18 15:49:53,405.405Z ERROR cascade.py::write:137 21:MainThread : 2019-03-18 15:49:53,405.405Z ERROR cascade.py::write:137 21:MainThread docker pull failed: stdout=Using default tag: latest stderr=Error response from daemon: manifest for agcancregistry.azurecr.io/trimmomatic/trimmomatic:latest not found

2019-03-18 15:49:53,405.405Z ERROR cascade.py::flush:141 21:MainThread <main.StandardStreamLogger object at 0x7f2092415c50> 2019-03-18 15:49:53,405.405Z ERROR cascade.py::flush:141 21:MainThread <main.StandardStreamLogger object at 0x7f2092415c50>

It seems that because I tagged the image trimmomatic instead of latest, the pull in the cascade log fails: with no tag specified, the pull defaults to latest, and that manifest does not exist.

Will look into the feasibility of mounting on both the vm and the control node.

Mark

markpearl commented 5 years ago

Also, can you please re-open this? It says that I closed it, but I don't remember doing that.

Regards,

Mark


alfpark commented 5 years ago

As per the cascade.log file, you need to ensure that the specified Docker image exists. Please double-check the fully qualified Docker image name; you should be able to issue docker pull on the specified image.
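
For example, from a machine that can log into the registry, a pull with the explicit tag used when the image was pushed should succeed (the tag trimmomatic below matches how the image was described earlier in this thread; if the image references in config.yaml and jobs.yaml carry no tag, Docker defaults to :latest, which is exactly the manifest the cascade.log reports as missing):

docker login agcancregistry.azurecr.io
docker pull agcancregistry.azurecr.io/trimmomatic/trimmomatic:trimmomatic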

markpearl commented 5 years ago

It seems the node is now launching successfully. The problem now seems to be with the command in the jobs.yaml file:

[screenshot: jobs.yaml command]

I've mounted the snakemake folder onto the fileshare, so I'm able to access it on the control node when I go to the mounts folder:

[screenshot: mounted snakemake folder under the mounts directory]

I can see the shell script getting generated, but it doesn't seem to be able to open the file on the physical VM where snakemake and shipyard are installed. Is the shell command in jobs.yaml expecting me to reference a different path? I don't understand how the azure-hpc example got this working while keeping the execution of the shell script on the physical VM itself.

markpearl commented 5 years ago

Another helpful option would be figuring out how to reference the input and output files once they've been mounted on the compute node. Even just being able to run the trimmomatic command by manually specifying the full command in jobs.yaml would be great.

When I look in the compute node, I'm able to see the files where they've been mounted:

[screenshot: files mounted on the compute node]

But it says "file not found" when I try to reference the files directly in the command in jobs.yaml:

"trimmomatic PE -threads 1 root/mounts/azfile-agcanrnaseqdiag-agcanfileshare/data/zebrafish_6h_1.fastq root/mounts/azfile-agcanrnaseqdiag-agcanfileshare/data/zebrafish_6h_2.fastq root/mounts/azfile-agcanrnaseqdiag-agcanfileshare/1_trimmed_reads/zebrafish_6h_R1.trimmed_paired.fastq root/mounts/azfile-agcanrnaseqdiag-agcanfileshare/1_trimmed_reads/zebrafish_6h_R1.trimmed_unpaired.fastq root/mounts/azfile-agcanrnaseqdiag-agcanfileshare/1_trimmed_reads/zebrafish_6h_R2.trimmed_paired.fastq root/mounts/azfile-agcanrnaseqdiag-agcanfileshare/1_trimmed_reads/zebrafish_6h_R2.trimmed_unpaired.fastq HEADCROP:10 SLIDINGWINDOW:4:20 MINLEN:36"

If you could point me in the right direction on how to run the command it will be a lot of help.

alfpark commented 5 years ago

The container_path for the shared_data_volume in config.yaml is where your Azure File Share will show up in the Docker container when the task is executed. I don't know what you have it set to now, but from above that would be /agcanfileshare and not /home/mjpearl/fileshare.

markpearl commented 5 years ago

Thanks, that seems to help out!

One thing that's still giving me an issue is having the command in jobs.yaml reference a shell script in the fileshare rather than having to hardcode the full command. When I hardcode the full command in the jobs.yaml file like this, it works:

command: "trimmomatic PE -threads 1 /agcanfileshare/data/zebrafish_6h_1.fastq /agcanfileshare/data/zebrafish_6h_2.fastq /agcanfileshare/snakemake/1_trimmed_reads/zebrafish_6h_R1.trimmed_paired.fastq /agcanfileshare/snakemake/1_trimmed_reads/zebrafish_6h_R1.trimmed_unpaired.fastq /agcanfileshare/snakemake/1_trimmed_reads/zebrafish_6h_R2.trimmed_paired.fastq /agcanfileshare/snakemake/1_trimmed_reads/zebrafish_6h_R2.trimmed_unpaired.fastq HEADCROP:10 SLIDINGWINDOW:4:20 MINLEN:36"

I would rather the "shell" command of the Snakemake file create a file with this command contained in it which I can execute from the jobs.yaml file.

This is what I have thus far in my snakemake file:

shell: "echo -e \"trimmomatic PE -threads {threads} {input.R1_reads} {input.R2_reads} {output} {params.run_params}\" > $FILESHARE/snakemake/jobrun.sh ;\n $SHIPYARD/shipyard jobs add --configdir $FILESHARE/snakemake/azurebatch/trimmomatic --tail stderr.txt\n"

It will write this whole command to a file in the snakemake directory, but every time I try to reference that file in the command, it's not able to find it. I'm not sure if it's just a syntax issue with Azure Batch.

Any suggestions on how I can do this?

alfpark commented 5 years ago

Sorry, I'm not sure I can help further here, as this has evolved beyond the scope of the original issue. It now looks to be more of an implementation detail of Snakemake than a Batch Shipyard issue. Please read the jobs configuration documentation for more information on environment_variables, which may help here.
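
One possible pattern, sketched here only as an illustration and not confirmed in this thread: have Snakemake write jobrun.sh into the mounted file share, then have the jobs.yaml command invoke it via the path the share has inside the container. The field names and paths below reuse names from earlier in this thread and would need adjusting to the actual configuration:

# jobs.yaml (excerpt)
job_specifications:
  - id: trimmomatic-job
    tasks:
      - docker_image: agcancregistry.azurecr.io/trimmomatic/trimmomatic:trimmomatic
        shared_data_volumes:
          - azurefilevol
        command: /bin/bash /agcanfileshare/snakemake/jobrun.sh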

Closing as the original issue has been resolved.