
OSError: [Errno 30] Read-only file system #99

Closed alexlenail closed 5 years ago

alexlenail commented 5 years ago

Describe the bug

I'm hoping to run Cromwell in server mode, configured to dispatch singularity jobs to SLURM. I have successfully run this pipeline with singularity without SLURM, but when I change the configuration to use SLURM, the workflow fails at the trim_adapter (first) step.

Working config (singularity without SLURM, modeled from backend.conf):

    singularity {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        script-epilogue = "sleep 5 && sync"
        concurrent-job-limit = 10
        runtime-attributes = """
          String singularity_container
          String? singularity_bindpath
        """
        submit = """
          ls ${singularity_container} $(echo ${singularity_bindpath} | tr , ' ') 1>/dev/null && (chmod u+x ${script} && SINGULARITY_BINDPATH=$(echo ${cwd} | sed 's/cromwell-executions/\n/g' | head -n1),${singularity_bindpath} singularity exec --home ${cwd} ${singularity_container} ${script} & echo $! && disown)
        """
        job-id-regex = "(\\d+)"
        check-alive = "ps -ef | grep -v grep | grep ${job_id}"
        kill = "kill -9 ${job_id}"
      }
    }

Config I'm switching to (also modeled from backend.conf):

    slurm_singularity {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        script-epilogue = "sleep 30"
        concurrent-job-limit = 32
        runtime-attributes = """
          Int cpu = 1
          Int? time
          Int? memory_mb
          String singularity_container
          String? singularity_bindpath
        """
        submit = """
          ls ${singularity_container} $(echo ${singularity_bindpath} | tr , ' ') 1>/dev/null && (sbatch \
          --export=ALL \
          -J ${job_name} \
          -D ${cwd} \
          -o ${out} \
          -e ${err} \
          ${"-t " + time*60} \
          -n 1 \
          --ntasks-per-node=1 \
          ${"--cpus-per-task=" + cpu} \
          ${"--mem=" + memory_mb} \
          --wrap "chmod u+x ${script} && SINGULARITY_BINDPATH=$(echo ${cwd} | sed 's/cromwell-executions/\n/g' | head -n1),${singularity_bindpath} singularity exec --home ${cwd} ${singularity_container} ${script}")
        """
        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }

trim_adapter command being generated:

 ls ~/.singularity/atac-seq-pipeline-v1.1.7.simg $(echo  | tr , ' ') 1>/dev/null && (sbatch \
--export=ALL \
-J cromwell_fb9af287_trim_adapter \
-D /pool/data/cromwell-aals/cromwell-executions/atac/fb9af287-90ec-4459-8be5-2f3d74593213/call-trim_adapter/shard-0 \
-o /pool/data/cromwell-aals/cromwell-executions/atac/fb9af287-90ec-4459-8be5-2f3d74593213/call-trim_adapter/shard-0/execution/stdout \
-e /pool/data/cromwell-aals/cromwell-executions/atac/fb9af287-90ec-4459-8be5-2f3d74593213/call-trim_adapter/shard-0/execution/stderr \
-t 1440 \
-n 1 \
--ntasks-per-node=1 \
--cpus-per-task=2 \
--mem=12000 \
--wrap "chmod u+x /pool/data/cromwell-aals/cromwell-executions/atac/fb9af287-90ec-4459-8be5-2f3d74593213/call-trim_adapter/shard-0/execution/script && SINGULARITY_BINDPATH=$(echo /pool/data/cromwell-aals/cromwell-executions/atac/fb9af287-90ec-4459-8be5-2f3d74593213/call-trim_adapter/shard-0 | sed 's/cromwell-executions/\n/g' | head -n1), singularity exec --home /pool/data/cromwell-aals/cromwell-executions/atac/fb9af287-90ec-4459-8be5-2f3d74593213/call-trim_adapter/shard-0 ~/.singularity/atac-seq-pipeline-v1.1.7.simg /pool/data/cromwell-aals/cromwell-executions/atac/fb9af287-90ec-4459-8be5-2f3d74593213/call-trim_adapter/shard-0/execution/script")

Error:

 Traceback (most recent call last):
  File "/software/atac-seq-pipeline/src/encode_trim_adapter.py", line 265, in <module>
    main()
  File "/software/atac-seq-pipeline/src/encode_trim_adapter.py", line 168, in main
    pool = multiprocessing.Pool(num_process)
  File "/usr/lib/python2.7/multiprocessing/__init__.py", line 232, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 138, in __init__
    self._setup_queues()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 234, in _setup_queues
    self._inqueue = SimpleQueue()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 354, in __init__
    self._rlock = Lock()
  File "/usr/lib/python2.7/multiprocessing/synchronize.py", line 147, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1)
  File "/usr/lib/python2.7/multiprocessing/synchronize.py", line 75, in __init__
    sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
OSError: [Errno 30] Read-only file system
ln: failed to access 'R1/*.fastq.gz': No such file or directory
ln: failed to access 'R2/*.fastq.gz': No such file or directory

My suspicion is that the SLURM task doesn't have the same privileges as cromwell does when SLURM runs the singularity container, but I'm not sure how to fix this.
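One way to test that suspicion, reusing the image from the failing command above, is to submit a tiny write probe through sbatch. Python's multiprocessing locks are backed by files under /dev/shm on Linux, so if a SLURM-launched container cannot write there, you would see exactly this kind of failure. This is only a sketch; the probe filenames are arbitrary.

    # probe: can a SLURM-launched singularity process write to /dev/shm?
    sbatch -o ~/probe.out -e ~/probe.err \
      --wrap "singularity exec ~/.singularity/atac-seq-pipeline-v1.1.7.simg touch /dev/shm/probe"
    # after the job finishes, an empty probe.err means the write succeeded
    cat ~/probe.err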

OS/Platform and dependencies

leepc12 commented 5 years ago

How did you submit jobs (via the REST API) to the cromwell server? Did you specify the workflow options JSON (workflow_opts/singularity.json) when you POSTed?

alexlenail commented 5 years ago

@leepc12 yes:

job submission:

curl -X POST --header "Accept: application/json" -v "0.0.0.0:8000/api/workflows/v1/batch" \
  -F workflowSource=@../../encode3-pipelines/atac-seq-pipeline/atac.wdl \
  -F workflowInputs=@atac_trial.json \
  -F workflowOptions=@../../encode3-pipelines/atac-seq-pipeline/workflow_opts/singularity.json

singularity.json:

{
    "default_runtime_attributes" : {
        "singularity_container" : "~/.singularity/atac-seq-pipeline-v1.1.7.simg"
    }
}
leepc12 commented 5 years ago

@zfrenchee: Thanks. Please post your atac_trial.json too. Also, please read item 11 in this doc. Data file directories must be defined in singularity.json to be bound to singularity.

alexlenail commented 5 years ago

@leepc12 thanks for your reply.

Could you clarify the format for "singularity_bindpath"? The docs say:

        "singularity_bindpath" : "/your/,YOUR_OWN_DATA_DIR1,YOUR_OWN_DATA_DIR1,..."

atac_trial.json:

[
    {
        "atac.title" : "epigenomics/1_fastq/6_protocol_selection/diMN32/ALS-0BUU_diMN32_rep1",
        "atac.description" : "",

        "atac.pipeline_type" : "atac",
        "atac.paired_end" : true,

        "atac.genome_tsv" : "/pool/data/cromwell-aals/encode3-pipelines/genome/hg38.tsv",

        "atac.fastqs_rep1_R1" : [ "/pool/data/globus/epigenomics/1_fastq/6_protocol_selection/diMN32/ALS-0BUU_diMN32_rep1_1.fastq" ],
        "atac.fastqs_rep1_R2" : [ "/pool/data/globus/epigenomics/1_fastq/6_protocol_selection/diMN32/ALS-0BUU_diMN32_rep1_2.fastq" ],

        "atac.multimapping" : 4,

        "atac.auto_detect_adapter" : true,

        "atac.smooth_win" : 73,

        "atac.enable_idr" : true,
        "atac.idr_thresh" : 0.05,

        "atac.enable_xcor" : true

    }
]
leepc12 commented 5 years ago

It's SINGULARITY_BINDPATH (https://singularity.lbl.gov/docs-mount#specifying-bind-paths), which is a comma-separated list of directories to be bound to the container. Use /pool/data in your case.

{
    ...
    "singularity_bindpath" : "/pool/data"
    ...
}
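For what it's worth, SINGULARITY_BINDPATH is equivalent to passing -B on the command line, so the bind can be checked outside the pipeline (image path as in the failing command above; both forms should list the host directory's contents from inside the container):

    SINGULARITY_BINDPATH=/pool/data singularity exec ~/.singularity/atac-seq-pipeline-v1.1.7.simg ls /pool/data
    singularity exec -B /pool/data ~/.singularity/atac-seq-pipeline-v1.1.7.simg ls /pool/data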
alexlenail commented 5 years ago

Adding the "singularity_bindpath" doesn't fix the OSError:

2019-03-18 17:15:15,010 cromwell-system-akka.dispatchers.backend-dispatcher-150 INFO  - DispatchedConfigAsyncJobExecutionActor [UUID(6884f955)atac.trim_adapter:0:1]: executing: ls ~/.singularity/atac-seq-pipeline-v1.1.7.simg $(echo /pool/data | tr , ' ') 1>/dev/null && (sbatch \
--export=ALL \
-J cromwell_6884f955_trim_adapter \
-D /pool/data/cromwell-aals/cromwell-executions/atac/6884f955-006d-4211-b5c9-a084278c4691/call-trim_adapter/shard-0 \
-o /pool/data/cromwell-aals/cromwell-executions/atac/6884f955-006d-4211-b5c9-a084278c4691/call-trim_adapter/shard-0/execution/stdout \
-e /pool/data/cromwell-aals/cromwell-executions/atac/6884f955-006d-4211-b5c9-a084278c4691/call-trim_adapter/shard-0/execution/stderr \
-t 1440 \
-n 1 \
--ntasks-per-node=1 \
--cpus-per-task=2 \
--mem=12000 \
--wrap "chmod u+x /pool/data/cromwell-aals/cromwell-executions/atac/6884f955-006d-4211-b5c9-a084278c4691/call-trim_adapter/shard-0/execution/script && SINGULARITY_BINDPATH=$(echo /pool/data/cromwell-aals/cromwell-executions/atac/6884f955-006d-4211-b5c9-a084278c4691/call-trim_adapter/shard-0 | sed 's/cromwell-executions/\n/g' | head -n1),/pool/data singularity exec --home /pool/data/cromwell-aals/cromwell-executions/atac/6884f955-006d-4211-b5c9-a084278c4691/call-trim_adapter/shard-0 ~/.singularity/atac-seq-pipeline-v1.1.7.simg /pool/data/cromwell-aals/cromwell-executions/atac/6884f955-006d-4211-b5c9-a084278c4691/call-trim_adapter/shard-0/execution/script")

(note /pool/data in SINGULARITY_BINDPATH)

This gives the same error as above.

workflow_opts/singularity.json:

{
    "default_runtime_attributes" : {
        "singularity_container" : "~/.singularity/atac-seq-pipeline-v1.1.7.simg",
        "singularity_bindpath" : "/pool/data"
    }
}
alexlenail commented 5 years ago

I think I solved this by changing the slurm_singularity configuration:

    slurm_singularity {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        script-epilogue = "sleep 30"
        concurrent-job-limit = 32
        runtime-attributes = """
          Int cpu = 1
          Int? time
          Int? memory_mb
          String singularity_container
          String? singularity_bindpath
        """
        submit = """
          ls ${singularity_container} $(echo ${singularity_bindpath} | tr , ' ') 1>/dev/null && (sbatch \
          --export=ALL \
          -J ${job_name} \
          -D ${cwd} \
          -o ${out} \
          -e ${err} \
          ${"-t " + time*60} \
          -n 1 \
          --ntasks-per-node=1 \
          ${"--cpus-per-task=" + cpu} \
          ${"--mem=" + memory_mb} \
          --wrap "chmod u+x ${script} && SINGULARITY_BINDPATH=$(echo ${cwd} | sed 's/cromwell-executions/\n/g' | head -n1),${singularity_bindpath}:rw singularity exec --home ${cwd} ${singularity_container} ${script}")
        """
        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }

Specifically, in the command, see:

SINGULARITY_BINDPATH=$(echo ${cwd} | sed 's/cromwell-executions/\n/g' | head -n1),${singularity_bindpath}:rw

in particular, ${singularity_bindpath}:rw

It seems I needed to specify rw on my bind mount. It might be appropriate to either include this in backends/backend.conf by default, or to leave a comment about it there.

(I may not have solved this, but the pipeline has been running for a while...)

leepc12 commented 5 years ago

I think your pipeline is just hanging for a while, since SINGULARITY_BINDPATH=/pool/data:rw is not valid syntax. Did your pipeline pass the read_genome_tsv task (check if its status is Done)? What is your singularity version?

$ singularity --version

If it's >= 3.1, check if singularity exec --help has a --writable-tmpfs flag. If so, please edit the backend as follows.

Replace

singularity exec --home

with

singularity exec --writable-tmpfs --home
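A quick standalone check of that flag (assuming Singularity >= 3.1 on your side): --writable-tmpfs overlays a small writable tmpfs on the container root, so a write to a container path should succeed where it would otherwise fail with a read-only error.

    # expected to succeed only when --writable-tmpfs is supported and in effect
    singularity exec --writable-tmpfs ~/.singularity/atac-seq-pipeline-v1.1.7.simg touch /writable_probe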
alexlenail commented 5 years ago

Thanks for your reply, @leepc12 !

You're right, it seems like the job was hanging with :rw, but it seems to be hanging now as well with --writable-tmpfs.

My slurm cluster also seems to be idling:

~ » sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
galaxy*      up   infinite      1   idle answer

But the status on my job is Running:

{"status":"Running","id":"68b4ec12-9f48-4547-8a4e-3b66d267e0a5"}

I'm now trying with just --writable instead of --writable-tmpfs...

leepc12 commented 5 years ago

I think it's related to this issue. Can you try to mount /dev/shm as rw somehow in our container? Here is a test script:

$ touch /dev/shm/1
$ singularity exec ~/.singularity/atac-seq-pipeline-v1.1.7.simg ls -l /dev/shm/1

$ singularity exec ~/.singularity/atac-seq-pipeline-v1.1.7.simg touch /dev/shm/2
$ ls -l /dev/shm/2
alexlenail commented 5 years ago

Thanks for your response, @leepc12,

root@answer:~# touch /dev/shm/1
root@answer:~# singularity exec ~/.singularity/atac-seq-pipeline-v1.1.7.simg ls -l /dev/shm/1
ls: cannot access '/dev/shm/1': Too many levels of symbolic links
root@answer:~# singularity exec ~/.singularity/atac-seq-pipeline-v1.1.7.simg touch /dev/shm/2
touch: cannot touch '/dev/shm/2': Too many levels of symbolic links
root@answer:~# ls -l /dev/shm/2
ls: cannot access /dev/shm/2: No such file or directory
root@answer:~#
leepc12 commented 5 years ago

If you have super-user privileges on your system, try enabling the overlay file system by editing the singularity configuration file (https://singularity.lbl.gov/docs-config#enable-overlay-boolean-defaultno).

If --writable or --writable-tmpfs doesn't work, there is no workaround that can be done on the pipeline side. Can you also try the Conda method instead of singularity?

alexlenail commented 5 years ago

@leepc12 I'm deeply grateful for this support.

I think I need to reverse myself on what I said in my comment a few days ago (https://github.com/ENCODE-DCC/atac-seq-pipeline/issues/99#issuecomment-474113747)

It seems like adding "singularity_bindpath" causes the OSError to go away but the pipeline to hang, now even without --writable or --writable-tmpfs. So, based on my most recent set of tests:

no "singularity_bindpath" -> OSError
"singularity_bindpath" without --writable or --writable-tmpfs -> hangs
"singularity_bindpath" with --writable-tmpfs -> hangs
"singularity_bindpath" with --writable -> hangs

In my previous comment I said that "singularity_bindpath" without --writable or --writable-tmpfs gave the OSError, but I can no longer reproduce that.


If you have 3 replicates but defined NUM_CONCURRENT_TASK=2 then cromwell will hold bowtie2 for rep3 until rep1 and rep2 are done.

The job I'm trying to schedule has no replicates; see the whole JSON I'm submitting here: https://github.com/ENCODE-DCC/atac-seq-pipeline/issues/99#issuecomment-474052536


How can I debug what's going on? Cromwell isn't logging anything, as far as I can tell, when the job hangs. Are there other log files which might have clues?
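A few places that usually do have clues (directory layout taken from the sbatch command above; the workflow id and SLURM job id below are placeholders):

    # per-task files written for each call
    ls /pool/data/cromwell-aals/cromwell-executions/atac/<workflow-id>/call-trim_adapter/shard-0/execution/
    # expect script, stdout, stderr, and an rc file once the task has actually finished
    squeue -u $USER                    # is the SLURM job pending or running?
    scontrol show job <slurm-job-id>   # why, and on which node?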


I do have sudo access on my system, but I'm not sure why I want --overlay. Why aren't the bind mount and --writable solving it? Is there a way to check that the bind mount is really being mounted?

If we try the overlay option, the steps are (a sketch of step 1 follows the list):

  1. configure singularity to allow overlays
  2. make a directory for the overlays?
  3. add a --overlay to the backend.conf submit script?
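A sketch of step 1, assuming root access and the stock config location; steps 2-3 depend on whether a directory or an ext3 image is used as the overlay, so they are left out here:

    # flip the overlay setting in the global config (needs root), then verify
    sudo sed -i 's/^enable overlay = .*/enable overlay = yes/' /etc/singularity/singularity.conf
    grep '^enable overlay' /etc/singularity/singularity.conf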
leepc12 commented 5 years ago

@zfrenchee: A Cromwell server is actually a job manager, so running a Cromwell server on a SLURM cluster means running another job manager on top of a job manager. There are too many layers that can affect this problem, so we don't recommend running a Cromwell server on HPCs. You can simply sbatch a shell script with the singularity backend. Please see this doc for details.

leepc12 commented 5 years ago

Did you use cromwell ver 34 (cromwell-34.jar) to run the server?

alexlenail commented 5 years ago

@leepc12 yes

leepc12 commented 5 years ago

Okay. I was just wondering if you used cromwell-38. It's buggy and doesn't work with our slurm_singularity backend.

alexlenail commented 5 years ago

@leepc12

A Cromwell server is actually a job manager, so running a Cromwell server on a SLURM cluster means running another job manager on top of a job manager. There are too many layers that can affect this problem, so we don't recommend running a Cromwell server on HPCs.

The singularity backend and the slurm_singularity backend seem different, in that the slurm_singularity backend includes resource management which the singularity backend does not:

https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/56c70869bebb40886ef6f9880651f12bfa3d74af/backends/backend.conf#L70-L74

Isn't this why you include slurm_singularity in your backend.conf, to run cromwell this way?

I'd prefer not to try conda, because the dependency management is less clean than a container-based system like singularity. But if we can't get this to work, I'll have to try it.

leepc12 commented 5 years ago

@zfrenchee: Yes, the two backends (singularity and slurm_singularity) are different. slurm_singularity is still there in the backend.conf file; we don't recommend slurm_singularity on HPCs but kept it in the backend file for other purposes.

Yes, a container-based method is much better than Conda; Conda is not very clean and does not cover all OS/platforms. But singularity also has problems on some platforms (like old CentOS) because of its directory binding.

alexlenail commented 5 years ago

@leepc12

I found an error message when the containers hang:

FATAL:   container creation failed: unabled to /pool/data/cromwell-aals to mount list: destination /pool/data/cromwell-aals is already in the mount point list

I then changed the slurm_singularity config submit from

https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/ab4143402df33b10142d19c56e6ec899f9ce2687/backends/backend.conf#L98

to

--wrap "chmod u+x ${script} && SINGULARITY_BINDPATH=${singularity_bindpath} singularity exec --home ${cwd} ${singularity_container} ${script}")

(i.e. removing $(echo ${cwd} | sed 's/cromwell-executions/\n/g' | head -n1), from SINGULARITY_BINDPATH). I submit with workflow opts:

{
    "default_runtime_attributes" : {
        "singularity_container" : "~/.singularity/atac-seq-pipeline-v1.1.7.simg",
        "singularity_bindpath" : "/pool/data"
    }
}

Removing $(echo ${cwd} | sed 's/cromwell-executions/\n/g' | head -n1), in the way I show above gets me back to OSError: [Errno 30] Read-only file system. What does $(echo ${cwd} | sed 's/cromwell-executions/\n/g' | head -n1), evaluate to anyhow?

What should I try next?

alexlenail commented 5 years ago

I also just noticed your commit involving LD_LIBRARY_PATH: https://github.com/ENCODE-DCC/atac-seq-pipeline/commit/9730ff02b83d2bd0286e89d9ca34d4f792249de9

Should I update to use that as well? (I am using singularity 3.1)

alexlenail commented 5 years ago

If I add back --writable-tmpfs, I still get the OSError.

If I add back --writable, I get:

WARNING: no overlay partition found
/.singularity.d/actions/exec: 9: exec: /pool/data/cromwell-aals/cromwell-executions/atac/eb83fe65-e2f9-46a6-912b-8e417e1a9881/call-trim_adapter/shard-0/execution/script: not found
leepc12 commented 5 years ago

That LD_LIBRARY_PATH fix is not relevant to your case.

I re-tested slurm_singularity on my SLURM cluster and it worked fine but I am not sure if this is helpful for you.

starting a server (on a login node)

java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=slurm_singularity ~/cromwell-38.jar server

submitting a job (on a login node)

java -jar ~/cromwell-38.jar submit test_backend.wdl -o test_backend.wo.json

monitoring job

            JOBID          PARTITION            NAME     USER ST       TIME  NODES NODELIST(REASON)
          39736340           akundaje cromwell_fc81f7  leepc12 PD       0:00      1 (Priority)

test_backend.wdl

workflow test_backend {
        call t1 {input: a = 'a'}
}

task t1 {
        String a
        command {
                echo test > test.txt
        }
        output {
                File out = 'test.txt'
        }
        runtime {
                cpu : 2
                memory : "2000 MB"
                time : 4
        }
}

test_backend.wo.json

{
    "default_runtime_attributes" : {
        "singularity_container" : "/home/groups/cherry/encode/pipeline_singularity_images/atac-seq-pipeline-v1.1.7.simg",
        "slurm_partition" : "akundaje"
    }
}

server log

2019-03-27 10:43:33,524 cromwell-system-akka.dispatchers.api-dispatcher-32 INFO  - Unspecified type (Unspecified version) workflow fc81f708-8a1c-4418-851f-87d88d371783 submitted
2019-03-27 10:43:43,167 cromwell-system-akka.dispatchers.engine-dispatcher-36 INFO  - 1 new workflows fetched
2019-03-27 10:43:43,167 cromwell-system-akka.dispatchers.engine-dispatcher-101 INFO  - WorkflowManagerActor Starting workflow UUID(fc81f708-8a1c-4418-851f-87d88d371783)
2019-03-27 10:43:43,167 cromwell-system-akka.dispatchers.engine-dispatcher-101 INFO  - WorkflowManagerActor Successfully started WorkflowActor-fc81f708-8a1c-4418-851f-87d88d371783
2019-03-27 10:43:43,168 cromwell-system-akka.dispatchers.engine-dispatcher-101 INFO  - Retrieved 1 workflows from the WorkflowStoreActor
2019-03-27 10:43:43,187 cromwell-system-akka.dispatchers.engine-dispatcher-37 INFO  - MaterializeWorkflowDescriptorActor [UUID(fc81f708)]: Parsing workflow as WDL draft-2
2019-03-27 10:43:43,202 cromwell-system-akka.dispatchers.engine-dispatcher-37 INFO  - MaterializeWorkflowDescriptorActor [UUID(fc81f708)]: Call-to-Backend assignments: test_backend.t1 -> slurm_singularity
2019-03-27 10:43:45,496 cromwell-system-akka.dispatchers.engine-dispatcher-80 INFO  - WorkflowExecutionActor-fc81f708-8a1c-4418-851f-87d88d371783 [UUID(fc81f708)]: Starting test_backend.t1
2019-03-27 10:43:46,343 cromwell-system-akka.dispatchers.engine-dispatcher-89 INFO  - Assigned new job execution tokens to the following groups: fc81f708: 1
2019-03-27 10:43:46,545 cromwell-system-akka.dispatchers.backend-dispatcher-59 INFO  - DispatchedConfigAsyncJobExecutionActor [UUID(fc81f708)test_backend.t1:NA:1]: `echo test > test.txt`
2019-03-27 10:43:46,618 cromwell-system-akka.dispatchers.backend-dispatcher-59 INFO  - DispatchedConfigAsyncJobExecutionActor [UUID(fc81f708)test_backend.t1:NA:1]: executing: ls /home/groups/cherry/encode/pipeline_singularity_images/atac-seq-pipeline-v1.1.7.simg $(echo  | tr , ' ') 1>/dev/null && (sbatch \
--export=ALL \
-J cromwell_fc81f708_t1 \
-D /oak/stanford/groups/akundaje/leepc12/code/atac-seq-pipeline/cromwell-executions/test_backend/fc81f708-8a1c-4418-851f-87d88d371783/call-t1 \
-o /oak/stanford/groups/akundaje/leepc12/code/atac-seq-pipeline/cromwell-executions/test_backend/fc81f708-8a1c-4418-851f-87d88d371783/call-t1/execution/stdout \
-e /oak/stanford/groups/akundaje/leepc12/code/atac-seq-pipeline/cromwell-executions/test_backend/fc81f708-8a1c-4418-851f-87d88d371783/call-t1/execution/stderr \
-t 240 \
-n 1 \
--ntasks-per-node=1 \
--cpus-per-task=2 \
--mem=2000 \
-p akundaje \
 \
 \
 \
--wrap "chmod u+x /oak/stanford/groups/akundaje/leepc12/code/atac-seq-pipeline/cromwell-executions/test_backend/fc81f708-8a1c-4418-851f-87d88d371783/call-t1/execution/script && LD_LIBRARY_PATH=:$LD_LIBRARY_PATH SINGULARITY_BINDPATH=$(echo /oak/stanford/groups/akundaje/leepc12/code/atac-seq-pipeline/cromwell-executions/test_backend/fc81f708-8a1c-4418-851f-87d88d371783/call-t1 | sed 's/cromwell-executions/\n/g' | head -n1), singularity exec --home /oak/stanford/groups/akundaje/leepc12/code/atac-seq-pipeline/cromwell-executions/test_backend/fc81f708-8a1c-4418-851f-87d88d371783/call-t1  /home/groups/cherry/encode/pipeline_singularity_images/atac-seq-pipeline-v1.1.7.simg /oak/stanford/groups/akundaje/leepc12/code/atac-seq-pipeline/cromwell-executions/test_backend/fc81f708-8a1c-4418-851f-87d88d371783/call-t1/execution/script")
2019-03-27 10:43:49,049 cromwell-system-akka.dispatchers.backend-dispatcher-134 INFO  - DispatchedConfigAsyncJobExecutionActor [UUID(fc81f708)test_backend.t1:NA:1]: job id: 39736340
2019-03-27 10:43:49,117 cromwell-system-akka.dispatchers.backend-dispatcher-134 INFO  - DispatchedConfigAsyncJobExecutionActor [UUID(fc81f708)test_backend.t1:NA:1]: Cromwell will watch for an rc file but will *not* double-check whether this job is actually alive (unless Cromwell restarts)
2019-03-27 10:43:49,119 cromwell-system-akka.dispatchers.backend-dispatcher-61 INFO  - DispatchedConfigAsyncJobExecutionActor [UUID(fc81f708)test_backend.t1:NA:1]: Status change from - to Running
2019-03-27 10:46:46,906 cromwell-system-akka.dispatchers.backend-dispatcher-133 INFO  - DispatchedConfigAsyncJobExecutionActor [UUID(fc81f708)test_backend.t1:NA:1]: Status change from Running to Done
2019-03-27 10:46:48,122 cromwell-system-akka.dispatchers.engine-dispatcher-58 INFO  - WorkflowExecutionActor-fc81f708-8a1c-4418-851f-87d88d371783 [UUID(fc81f708)]: Workflow test_backend complete. Final Outputs:
{
  "test_backend.t1.out": "/oak/stanford/groups/akundaje/leepc12/code/atac-seq-pipeline/cromwell-executions/test_backend/fc81f708-8a1c-4418-851f-87d88d371783/call-t1/execution/test.txt"
}
2019-03-27 10:46:48,149 cromwell-system-akka.dispatchers.engine-dispatcher-47 INFO  - WorkflowManagerActor WorkflowActor-fc81f708-8a1c-4418-851f-87d88d371783 is in a terminal state: WorkflowSucceededState
leepc12 commented 5 years ago

Here is a singularity configuration file on my cluster.

$ cat /etc/singularity/singularity.conf
# SINGULARITY.CONF
# This is the global configuration file for Singularity. This file controls
# what the container is allowed to do on a particular host, and as a result
# this file must be owned by root.

# ALLOW SETUID: [BOOL]
# DEFAULT: yes
# Should we allow users to utilize the setuid program flow within Singularity?
# note1: This is the default mode, and to utilize all features, this option
# must be enabled.  For example, without this option loop mounts of image
# files will not work; only sandbox image directories, which do not need loop
# mounts, will work (subject to note 2).
# note2: If this option is disabled, it will rely on unprivileged user
# namespaces which have not been integrated equally between different Linux
# distributions.
allow setuid = yes

# MAX LOOP DEVICES: [INT]
# DEFAULT: 256
# Set the maximum number of loop devices that Singularity should ever attempt
# to utilize.
max loop devices = 256

# ALLOW PID NS: [BOOL]
# DEFAULT: yes
# Should we allow users to request the PID namespace? Note that for some HPC
# resources, the PID namespace may confuse the resource manager and break how
# some MPI implementations utilize shared memory. (note, on some older
# systems, the PID namespace is always used)
allow pid ns = yes

# CONFIG PASSWD: [BOOL]
# DEFAULT: yes
# If /etc/passwd exists within the container, this will automatically append
# an entry for the calling user.
config passwd = yes

# CONFIG GROUP: [BOOL]
# DEFAULT: yes
# If /etc/group exists within the container, this will automatically append
# group entries for the calling user.
config group = yes

# CONFIG RESOLV_CONF: [BOOL]
# DEFAULT: yes
# If there is a bind point within the container, use the host's
# /etc/resolv.conf.
config resolv_conf = yes

# MOUNT PROC: [BOOL]
# DEFAULT: yes
# Should we automatically bind mount /proc within the container?
mount proc = yes

# MOUNT SYS: [BOOL]
# DEFAULT: yes
# Should we automatically bind mount /sys within the container?
mount sys = yes

# MOUNT DEV: [yes/no/minimal]
# DEFAULT: yes
# Should we automatically bind mount /dev within the container? If 'minimal'
# is chosen, then only 'null', 'zero', 'random', 'urandom', and 'shm' will
# be included (the same effect as the --contain options)
mount dev = yes

# MOUNT DEVPTS: [BOOL]
# DEFAULT: yes
# Should we mount a new instance of devpts if there is a 'minimal'
# /dev, or -C is passed?  Note, this requires that your kernel was
# configured with CONFIG_DEVPTS_MULTIPLE_INSTANCES=y, or that you're
# running kernel 4.7 or newer.
mount devpts = yes

# MOUNT HOME: [BOOL]
# DEFAULT: yes
# Should we automatically determine the calling user's home directory and
# attempt to mount it's base path into the container? If the --contain option
# is used, the home directory will be created within the session directory or
# can be overridden with the SINGULARITY_HOME or SINGULARITY_WORKDIR
# environment variables (or their corresponding command line options).
mount home = yes

# MOUNT TMP: [BOOL]
# DEFAULT: yes
# Should we automatically bind mount /tmp and /var/tmp into the container? If
# the --contain option is used, both tmp locations will be created in the
# session directory or can be specified via the  SINGULARITY_WORKDIR
# environment variable (or the --workingdir command line option).
mount tmp = yes

# MOUNT HOSTFS: [BOOL]
# DEFAULT: no
# Probe for all mounted file systems that are mounted on the host, and bind
# those into the container?
mount hostfs = no

# BIND PATH: [STRING]
# DEFAULT: Undefined
# Define a list of files/directories that should be made available from within
# the container. The file or directory must exist within the container on
# which to attach to. you can specify a different source and destination
# path (respectively) with a colon; otherwise source and dest are the same.
#bind path = /etc/singularity/default-nsswitch.conf:/etc/nsswitch.conf
#bind path = /opt
#bind path = /scratch
bind path = /etc/localtime
bind path = /etc/hosts

# USER BIND CONTROL: [BOOL]
# DEFAULT: yes
# Allow users to influence and/or define bind points at runtime? This will allow
# users to specify bind points, scratch and tmp locations. (note: User bind
# control is only allowed if the host also supports PR_SET_NO_NEW_PRIVS)
user bind control = yes

# ENABLE OVERLAY: [yes/no/try]
# DEFAULT: try
# Enabling this option will make it possible to specify bind paths to locations
# that do not currently exist within the container.  If 'try' is chosen,
# overlayfs will be tried but if it is unavailable it will be silently ignored.
enable overlay = try

# ENABLE UNDERLAY: [yes/no]
# DEFAULT: yes
# Enabling this option will make it possible to specify bind paths to locations
# that do not currently exist within the container even if overlay is not
# working.  If overlay is available, it will be tried first.
enable underlay = yes

# MOUNT SLAVE: [BOOL]
# DEFAULT: yes
# Should we automatically propagate file-system changes from the host?
# This should be set to 'yes' when autofs mounts in the system should
# show up in the container.
mount slave = yes

# SESSIONDIR MAXSIZE: [STRING]
# DEFAULT: 16
# This specifies how large the default sessiondir should be (in MB) and it will
# only affect users who use the "--contain" options and don't also specify a
# location to do default read/writes to (e.g. "--workdir" or "--home").
sessiondir max size = 16

# LIMIT CONTAINER OWNERS: [STRING]
# DEFAULT: NULL
# Only allow containers to be used that are owned by a given user. If this
# configuration is undefined (commented or set to NULL), all containers are
# allowed to be used. This feature only applies when Singularity is running in
# SUID mode and the user is non-root.
#limit container owners = gmk, singularity, nobody

# LIMIT CONTAINER GROUPS: [STRING]
# DEFAULT: NULL
# Only allow containers to be used that are owned by a given group. If this
# configuration is undefined (commented or set to NULL), all containers are
# allowed to be used. This feature only applies when Singularity is running in
# SUID mode and the user is non-root.
#limit container groups = group1, singularity, nobody

# LIMIT CONTAINER PATHS: [STRING]
# DEFAULT: NULL
# Only allow containers to be used that are located within an allowed path
# prefix. If this configuration is undefined (commented or set to NULL),
# containers will be allowed to run from anywhere on the file system. This
# feature only applies when Singularity is running in SUID mode and the user is
# non-root.
#limit container paths = /scratch, /tmp, /global

# ALLOW CONTAINER ${TYPE}: [BOOL]
# DEFAULT: yes
# This feature limits what kind of containers that Singularity will allow
# users to use (note this does not apply for root).
allow container squashfs = yes
allow container extfs = yes
allow container dir = yes

# AUTOFS BUG PATH: [STRING]
# DEFAULT: Undefined
# Define list of autofs directories which produces "Too many levels of symbolink links"
# errors when accessed from container (typically bind mounts)
#autofs bug path = /nfs
#autofs bug path = /cifs-share

# ALWAYS USE NV ${TYPE}: [BOOL]
# DEFAULT: no
# This feature allows an administrator to determine that every action command
# should be executed implicitely with the --nv option (useful for GPU only
# environments).
always use nv = no

# ROOT DEFAULT CAPABILITIES: [full/file/no]
# DEFAULT: no
# Define default root capability set kept during runtime
# - full: keep all capabilities (same as --keep-privs)
# - file: keep capabilities configured in ${prefix}/etc/singularity/capabilities/user.root
# - no: no capabilities (same as --no-privs)
root default capabilities = full

# MEMORY FS TYPE: [tmpfs/ramfs]
# DEFAULT: tmpfs
# This feature allow to choose temporary filesystem type used by Singularity.
# Cray CLE 5 and 6 up to CLE 6.0.UP05 there is an issue (kernel panic) when Singularity
# use tmpfs, so on affected version it's recommended to set this value to ramfs to avoid
# kernel panic
memory fs type = tmpfs

# CNI CONFIGURATION PATH: [STRING]
# DEFAULT: Undefined
# Defines path from where CNI configuration files are stored
#cni configuration path =

# CNI PLUGIN PATH: [STRING]
# DEFAULT: Undefined
# Defines path from where CNI executable plugins are stored
#cni plugin path =

# MKSQUASHFS PATH: [STRING]
# DEFAULT: Undefined
# This allows the administrator to specify the location for mksquashfs if it is not
# installed in a standard system location
# mksquashfs path =

# SHARED LOOP DEVICES: [BOOL]
# DEFAULT: no
# Allow to share same images associated with loop devices to minimize loop
# usage and optimize kernel cache (useful for MPI)
shared loop devices = no
alexlenail commented 5 years ago
  1. Our singularity configs are identical.
  2. I can also run the test.wdl without any issues -- i/o and permissions against the filesystem seem to be the issue (OSError: [Errno 30] Read-only file system).
  3. What is https://github.com/ENCODE-DCC/atac-seq-pipeline/commit/5a8915354d76913e39755cdb56042a28f04955d5 "fix for issue #99 (using built-in bash background process in cromwell for singularity backends)"? Will it really fix my issue?
alexlenail commented 5 years ago

@leepc12

  1. What does $(echo ${cwd} | sed 's/cromwell-executions/\n/g' | head -n1), evaluate to anyhow?

  2. Why does adding --writable cause WARNING: no overlay partition found?

    Perhaps it has something to do with this? How do I bind /pool/data:/pool/data as writable in a singularity container?

  3. What version of singularity are you using? (I'm using 3.1)

leepc12 commented 5 years ago

1 and 3. No, there are more commits to fix the hanging problem for the slurm_singularity backend. You can test with this PR (https://github.com/ENCODE-DCC/atac-seq-pipeline/pull/102). I actually don't have a solution to that Read-only file system problem; it looks somehow related to the python and singularity configuration. Please try the above PR and see if that fixes it.

For your follow-up questions:

  1. That extracts cromwell_root, which is cromwell-executions/ by default, and adds it to SINGULARITY_BINDPATH (see the worked example below).

  2. We actually assume that pipelines run on an overlay file system. Please don't use mapping (A:B); use A only.

  3. I have both 2.5.1 and 3.1, so I am trying to make our pipelines compatible with both.
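To make the cromwell_root extraction in item 1 concrete, here is what that shell fragment evaluates to for a shortened version of the shard path from earlier (WORKFLOW_ID is a placeholder; the \n replacement assumes GNU sed):

    # the first "cromwell-executions" becomes a newline; head -n1 keeps everything
    # before it, i.e. the root directory that needs to be bound into the container
    echo /pool/data/cromwell-aals/cromwell-executions/atac/WORKFLOW_ID/call-trim_adapter/shard-0 \
      | sed 's/cromwell-executions/\n/g' | head -n1
    # prints: /pool/data/cromwell-aals/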

alexlenail commented 5 years ago

Thanks for your answers, @leepc12

I copied over the changes to backend.conf from #102 but am still getting the OSError.

Could this have something specifically to do with python multiprocessing? After all, the error is:

Job atac.trim_adapter:0:1 exited with return code 1
...
  File "/software/atac-seq-pipeline/src/encode_trim_adapter.py", line 265, in <module>
    main()
  File "/software/atac-seq-pipeline/src/encode_trim_adapter.py", line 168, in main
    pool = multiprocessing.Pool(num_process)
  File "/usr/lib/python2.7/multiprocessing/__init__.py", line 232, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 138, in __init__
    self._setup_queues()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 234, in _setup_queues
    self._inqueue = SimpleQueue()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 354, in __init__
    self._rlock = Lock()
  File "/usr/lib/python2.7/multiprocessing/synchronize.py", line 147, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1)
  File "/usr/lib/python2.7/multiprocessing/synchronize.py", line 75, in __init__
    sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
OSError: [Errno 30] Read-only file system
alexlenail commented 5 years ago

When I skip using cromwell, and manually just do:

$ sudo sbatch \
    --export=ALL \
    -D /pool/data/cromwell-aals/cromwell-executions/atac/5e0d8c63-f0ab-466b-9d26-6b49cc41dcee/call-trim_adapter/shard-0 \
    -o ~/stdout.txt \
    -e ~/stderr.txt \
    -n 1 \
    --ntasks-per-node=1 \
    --wrap "singularity exec \
            --cleanenv \
            -B /pool/data \
            --home /pool/data/cromwell-aals/cromwell-executions/atac/5e0d8c63-f0ab-466b-9d26-6b49cc41dcee/call-trim_adapter/shard-0 \
            /home/lenail/.singularity/atac-seq-pipeline-v1.1.7.simg \
            /bin/bash /pool/data/cromwell-aals/cromwell-executions/atac/5e0d8c63-f0ab-466b-9d26-6b49cc41dcee/call-trim_adapter/shard-0/execution/script"

I get:

~ » cat stderr.txt
ln: failed to access 'R1/*.fastq.gz': No such file or directory
ln: failed to access 'R2/*.fastq.gz': No such file or directory

which must be coming from the ln commands in the script:

( ln -L R1/*.fastq.gz /pool/data/cromwell-aals/cromwell-executions/atac/5e0d8c63-f0ab-466b-9d26-6b49cc41dcee/call-trim_adapter/shard-0/execution/glob-cf395bba00b93cc4a5f238577ff98973 2> /dev/null ) || ( ln R1/*.fastq.gz /pool/data/cromwell-aals/cromwell-executions/atac/5e0d8c63-f0ab-466b-9d26-6b49cc41dcee/call-trim_adapter/shard-0/execution/glob-cf395bba00b93cc4a5f238577ff98973 )

What is the step that is supposed to localize my files to R1/*.fastq.gz?

Here's how I specify the fastq's in atac_trial.json:

        "atac.fastqs_rep1_R1" : [ "/pool/data/globus/epigenomics/1_fastq/6_protocol_selection/diMN32/ALS-0BUU_diMN32_rep1_1.fastq" ],
        "atac.fastqs_rep1_R2" : [ "/pool/data/globus/epigenomics/1_fastq/6_protocol_selection/diMN32/ALS-0BUU_diMN32_rep1_2.fastq" ],

Note that they are not gzipped when I submit them, and therefore will not be accessible via *.fastq.gz. I'm using v1.1.7, so I expect this should not be an issue?

alexlenail commented 5 years ago

The solution turned out to be fairly "deep": -B /pool/data,/run

This came from this singularity thread, because of this debian "bug".
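A minimal way to confirm that bind set fixes the original failure, reusing the same image: the one-liner below reproduces the Lock() call from the traceback. On Debian-derived images /dev/shm is typically a symlink into /run/shm (which is what the linked thread and "bug" refer to), so binding /run alongside the data directory makes the semaphore directory reachable and writable inside the container.

    # should print "ok"; without the /run bind the same SemLock call fails
    singularity exec -B /pool/data,/run ~/.singularity/atac-seq-pipeline-v1.1.7.simg \
      python -c "import multiprocessing as mp; mp.Lock(); print('ok')"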