broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Add support for Singularity images #2177

Open antonkulaga opened 7 years ago

antonkulaga commented 7 years ago

In many use cases (especially HPC and systems where you cannot run as root), Singularity ( http://singularity.lbl.gov/about ) is a better fit than Docker. It would be nice to see it supported in Cromwell.

vsoch commented 7 years ago

@katevoss I'm one of the developers of Singularity and I would like to +1 this request! I don't know Scala, but if it comes down to making an equivalent folder like this one for Docker I can give a first stab at it. Or if it's more helpful I can give complete examples for all the steps to working with Singularity images. We have a registry (Singularity Hub) that is hooked up to the singularity command line client to work with images. So, to integrate into Cromwell you could either just run the container via a singularity command, or implement your own connection to our API to download the image. Please let me know how I might be helpful, and I'd gladly help. If you want me to give Scala a go, I would just ask for your general workflow to compile and test functionality.

katevoss commented 7 years ago

Hi @vsoch, thanks for chiming in! Supporting Singularity is definitely on our to-do list but it will take some exploration so it would be great to have your assistance as we plan our approach. I can't be sure when we'll start but we'll definitely keep you in the loop. Thanks again!

vsoch commented 7 years ago

awesome! Yeah just ping on here when you are ready, and I'll be glad to help :)

katevoss commented 7 years ago

@geoffjentry I know I've heard Singularity come up fairly regularly, do you know if there are users who are insisting on using Singularity in order to use Cromwell?

geoffjentry commented 7 years ago

@katevoss I don't know of anyone who has said "I need this" or "If it existed I'd use Cromwell". Rather it's a topic which is building more steam across the space and I'm suggesting it'd be nice to be leaders and not followers here.

katevoss commented 7 years ago

As a user with images in Singularity, I want Cromwell to support using Singularity images (either via Singularity Hub and the command line, or connecting via API), so that I can use Singularity images and not have to duplicate them in Docker.

vsoch commented 7 years ago

I can also offer to help, in whatever form is useful! If you just need to use / pull, then Singularity image support via installing it should fit the bill. Users can use Github to host images via Singularity Hub. If you want to host your own registry, then Singularity Registry is the way to go! Let me know if I can help, etc.

geoffjentry commented 6 years ago

@katevoss I've been thinking about trying to tackle this as my holiday break project. If nothing else I should have a better idea of what's involved on our side.

katevoss commented 6 years ago

👍 🎁 🎄 💯 🕎 🕯

rhpvorderman commented 6 years ago

I'm checking out WDL/Cromwell at the moment and this feature would make Cromwell definitely more interesting. It would make it much easier to run reproducible pipelines without relying on docker. (Docker is a no go on our cluster because it gives users root access.)

ps-account commented 6 years ago

I just found out that Cromwell-Singularity integration will be on the agenda on Winter Codefest 2018, starting tomorrow! See https://docs.google.com/document/d/1RlDUWRFqMcy4V2vvkA1_ENsVo6TXge2wIO_Nf73Itk0/edit#heading=h.xg79ql4rt605

You can join in (also remotely) by checking this file: https://docs.google.com/spreadsheets/d/1o4xDUgl2iu_CgFuDpB1swtG8XVZK3aifvKlhh5qagyI/edit#gid=0

geoffjentry commented 6 years ago

@pimpim just a heads up that I threw that on there as a suggestion so it relies on people sharing the interest :)

We do expect to have udocker support soon via work being done by another group - I've heard rumors that one can run singularity via udocker so that might be another approach

ps-account commented 6 years ago

I also encountered the udocker-singularity route in the discussion on cwltool singularity integration. Maybe it is an idea to take a closer look at the udocker-singularity implementation as a starting point for workflow tool singularity usage.

Or maybe not, because you will lose HPC friendly singularity features this way!

ps-account commented 6 years ago

@geoffjentry In case this is accessible, can you point me to the udocker singularity work you mentioned?

ps-account commented 6 years ago

With udocker running in proot vs Singularity running in chroot, some HPC performance/IB/GPU capability issues might occur with this route.

oskarvid commented 6 years ago

I just want to chime in and say that support for Singularity would be useful, it's nice to see that you are working on it!

jim-bo commented 6 years ago

I support this as well.

abdulrauf commented 6 years ago

@geoffjentry is there any update on udocker support, or does it already work with some tricks?

I have the same question for Singularity as well.

vsoch commented 6 years ago

hey I noticed that you guys use Google Cloud? http://cromwell.readthedocs.io/en/develop/wf_options/Google/ I have a builder that runs here, so there might be some synthesis between the two, although I'm not super familiar with Cromwell. If you just need to use Singularity containers your best bet is to do a singularity pull (and wrap these commands into your workflow functions, allowing the user to specify the container uri). if there is more of a service that someone is running with cromwell and you want to dip into the storage directly (and would use the API en masse) then we could try this --> https://cloud.google.com/storage/docs/requester-pays
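A minimal sketch of the "wrap `singularity pull` and let the user pass the container URI" idea, in shell. The function name `pull_container` and the `DRY_RUN` switch are hypothetical and only there for illustration; the only real CLI call is a plain `singularity pull <uri>`, with no extra flags assumed.

```shell
# Hypothetical helper: pull a container given a user-supplied URI.
# Set DRY_RUN=1 to print the command instead of running it.
pull_container() {
  local uri="$1"
  case "$uri" in
    shub://*|docker://*) ;;   # URI schemes that appear in this thread
    *) echo "unsupported container uri: $uri" >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "singularity pull $uri"
  else
    singularity pull "$uri"
  fi
}
```

A workflow step would then call something like `pull_container shub://vsoch/hello-world` before running the image. Whether the pulled file lands where the next step expects it depends on the Singularity version and site setup, so treat that part as an open detail.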

ps-account commented 6 years ago

As far as I remember, one issue with some workflow managers concerned the naming of containers in the workflow format. E.g. CWL had/has Docker hard-coded into it. Some attention has been given at the last biohackathon, please check this link: https://twitter.com/biocrusoe/status/954738513475448835

vsoch commented 6 years ago

hey friends! Just wanted to poke here again that this is still badly wanted / needed / desired / dreamed of / prayed for / sacrificial lambs... (you get the idea :P) Any updates? Can I help?

geoffjentry commented 6 years ago

Hi @vsoch - the first problem to solve is how to represent the usage of singularity in one's WDL (not sure how CWL does it, will need to look). This is being discussed in the OpenWDL group so if you have thoughts here that'd be very welcome.

For instance, is there a way to express "run this container" but not be locking a downstream WDL user into Singularity vs Docker?

vsoch commented 6 years ago

I'm not great / experienced with Cromwell, and to be honest I'm not sure what native support would mean. What I was trying is to just treat a singularity container like an executable, and add it as a Local backend, sort of like this --> https://github.com/vsoch/wgbs-pipeline/pull/1/files#diff-f6baca157827c4888c394eab694e000c

That works to run the analysis step (in a singularity container) just using singularity like any executable. I don't totally understand the job_id so there is a bug, but my colleague @bek is going to take a look! The container is run to produce the output, so that's a good start at least (and probably I'm missing something huge here).

So to answer your question... in my wdl at least, I'm just using the same local commands. It looks the same as it would running any Local backend configuration.

vsoch commented 6 years ago

Yeah doesn't it come down to:

# singularity
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=singularity cromwell-34.jar run runners/test.wdl -i data/TEST-YEAST/inputs.json -o workflow_opts/singularity.json

vs

# docker
$ java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=docker cromwell-34.jar run runners/test.wdl -i data/TEST-YEAST/inputs.json -o workflow_opts/docker.json

so you use the same *.wdl but just choose a different backend / and workflow opts?

?
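To make the comparison concrete, here is a small shell sketch that builds both invocations from one template. The paths and jar name are taken from the two commands above; the function name `cromwell_cmd` and the `CROMWELL_JAR` override are hypothetical. It only prints the command, it does not run Cromwell.

```shell
# Sketch: same WDL and inputs, only the backend name and the options file vary.
cromwell_cmd() {
  local backend="$1"   # e.g. "singularity" or "docker"
  echo "java -jar -Dconfig.file=backends/backend.conf -Dbackend.default=${backend}" \
       "${CROMWELL_JAR:-cromwell-34.jar} run runners/test.wdl" \
       "-i data/TEST-YEAST/inputs.json -o workflow_opts/${backend}.json"
}
```

Calling `cromwell_cmd singularity` and `cromwell_cmd docker` prints the two command lines shown above, which is the point: the WDL itself is untouched.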

geoffjentry commented 6 years ago

Oh interesting, so in this way it's more similar to the udocker hacks that people have used (which I can't find an example of right now, but they exist). That could certainly get us most of the way there, if not all of the way there.

I think I've been viewing Singularity & Docker as more of an "either/or" in that perhaps a task would require a singularity container vs a docker container - but if that's not really the case I've definitely been overcomplicating the matter. I'll admit that I've never been comfortable in my understanding of Singularity.

@vsoch you're obviously well versed in all things Singularity - do you see any utility to defining the use of a Singularity container in the WDL (ie no matter what this task should always use Singularity) or is it going to be more of a site specific situation, like what you're showing here?

vsoch commented 6 years ago

> I think I've been viewing Singularity & Docker as more of an "either/or" in that perhaps a task would require a singularity container vs a docker container - but if that's not really the case I've definitely been overcomplicating the matter. I'll admit that I've never been comfortable in my understanding of Singularity.

If you are using a container, it definitely is an "either / or" in the sense that getting one working inside the other is pretty challenging. The reason a Dockerized cromwell doesn't work on a host (to submit jobs to other docker or singularity containers) is because of having the docker/singularity submit come from inside the container. We don't really want to do that anyway, because there is a double dependency. But on the other hand, we want to provide reproducible solutions, meaning that things are container based. In an ideal setup, I would have some (still container based) cromwell acting as more of a docker-compose setup, and issuing commands to other containers. Ideally there would be one maintained Docker container for a step in a pipeline, and then if it's run on an HPC resource (where you can't have docker) it would just be dumped into singularity (docker://<username>/<reponame>)
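The "one maintained Docker container, dumped into Singularity on HPC" idea mostly comes down to prefixing the image reference with `docker://`. A tiny shell sketch of that mapping follows; the function name is hypothetical and the image names in the usage note are placeholders.

```shell
# Hypothetical helper: turn a Docker image reference into the docker:// URI
# form that Singularity accepts. References that already carry the prefix
# are passed through unchanged.
docker_to_singularity_uri() {
  local image="$1"
  case "$image" in
    docker://*) echo "$image" ;;
    *)          echo "docker://$image" ;;
  esac
}
```

On a host with Singularity installed, something like `singularity exec "$(docker_to_singularity_uri ubuntu:16.04)" cat /etc/os-release` would then run the same image that Docker users run, assuming the registry is reachable from that host.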

But this case is a little different - I'm just talking about the cromwell "plugin". I don't actually understand why this is necessary, at least given that singularity containers can act like executable. If I want to run a python script, I run it in the command section, as an executable. I don't require a python plugin. Now given that Singularity changes so that we want to take advantage of more of the instance commands (e.g., we can start, stop, get a status) this might make it more like docker and warrant a plugin. But for now, it's not quite there, and making a plugin would just be a really fancy interface to run an executable. Does this make sense?

> @vsoch you're obviously well versed in all things Singularity - do you see any utility to defining the use of a Singularity container in the WDL (ie no matter what this task should always use Singularity) or is it going to be more of a site specific situation, like what you're showing here?

I don't think it would be site specific (if the container is singularity, it would largely be the same, a container_uri and then some args to it). The only reason I have two sections is because I was trying out two ways to do it. Neither of them fully work (at least according to cromwell) because I don't know what that job_id business is :)

geoffjentry commented 6 years ago

Hi @vsoch - to be clear, what I mean is this ...

If I'm writing a WDL and I want to put some container in the runtime block, should I be opinionated as to if it's singularity or docker or should that be up to the person running the WDL? I used to view it as the former, but now I think it's the latter?

vsoch commented 6 years ago

Wouldn't it be up to the person running the wdl? If it's not up to me, how am I empowered to say I am using slurm vs a container environment like kubernetes? To be clear, I've only used Cromwell a day and a half so I'm not the right person to answer this question. I'm trying to understand how Singularity would fit in beyond being a binary executable (that might work in several environments). I think @bek might be able to weigh in?

geoffjentry commented 6 years ago

Had a convo w/ Seth yesterday and looked into a few similar things (e.g. cwltool's support). I think the proper plan is as follows:

vsoch commented 6 years ago

Aye aye! I don't know scala, but I found the developer docs and I know how to use GitHub, so I'm ready to go, lol. I likely won't start this weekend (I have a few projects I'm working on!) but next week for sure. I'll put updates, troubles, and other musings here - thanks in advance for your help :)

geoffjentry commented 6 years ago

Hi @vsoch - it shouldn't require too much of a deep dive into the scala, we know that it already can be made to work with udocker by just changing the configuration like you've done. Let me know if you've not seen the udocker example and I'll track it down for you.

vsoch commented 6 years ago

hey @geoffjentry ! Yeah I was just about to start on this! If you have another example handy that you think would be useful to see, I would definitely appreciate it.

geoffjentry commented 6 years ago

Hi @vsoch - here's what I have. A word of warning that I found it in an email thread where a user was saying it didn't work for them, but it came from someone for whom it did work so YMMV. I'm going to try this out myself later, although it'll take me a while before I get time to install udocker and such

backend {
  # Override the default backend.
  #default = "LocalExample"

  # The list of providers.
  providers {

    # The local provider is included by default in the reference.conf. This is an example.

    # Define a new backend provider.
    Local {
      # The actor that runs the backend. In this case, it's the Shared File System (SFS) ConfigBackend.
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"

      # The backend custom configuration.
      config {

        # Optional limits on the number of concurrent jobs
        #concurrent-job-limit = 5

        # If true submits scripts to the bash background using "&". Only useful for dispatchers that do NOT submit
        # the job and then immediately return a scheduled job id.
        run-in-background = true

        # `temporary-directory` creates the temporary directory for commands.
        #
        # If this value is not set explicitly, the default value creates a unique temporary directory, equivalent to:
        # temporary-directory = "$(mktemp -d \"$PWD\"/tmp.XXXXXX)"
        #
        # The expression is run from the execution directory for the script. The expression must create the directory
        # if it does not exist, and then return the full path to the directory.
        #
        # To create and return a non-random temporary directory, use something like:
        # temporary-directory = "$(mkdir -p /tmp/mydir && echo /tmp/mydir)"

        # `script-epilogue` configures a shell command to run after the execution of every command block.
        #
        # If this value is not set explicitly, the default value is `sync`, equivalent to:
        # script-epilogue = "sync"
        #
        # To turn off the default `sync` behavior set this value to an empty string:
        # script-epilogue = ""

        # The list of possible runtime custom attributes.
        runtime-attributes = """
        String? docker
        String? docker_name
        """

        # Submit string when there is no "docker" runtime attribute.
        submit = "/bin/bash ${script}"

        # Submit string when there is a "docker" runtime attribute.
        submit-docker = """
        chmod u+x ${cwd}/execution/script && \
        docker run --rm \
          -v ${cwd}:${docker_cwd} \
          ${docker_name} /bin/bash -c ${script}
        """

        # Root directory where Cromwell writes job results.  This directory must be
        # visible and writeable by the Cromwell process as well as the jobs that Cromwell
        # launches.
        root = "cromwell-executions"

        # File system configuration.
        filesystems {

          # For SFS backends, the "local" configuration specifies how files are handled.
          local {

            # Try to hard link (ln), then soft-link (ln -s), and if both fail, then copy the files.
            localization: [
              "hard-link", "soft-link", "copy"
            ]

            # Call caching strategies
            caching {
              # When copying a cached result, what type of file duplication should occur. Attempted in the order listed below:
              duplication-strategy: [
                "hard-link", "soft-link", "copy"
              ]

              # Possible values: file, path
              # "file" will compute an md5 hash of the file content.
              # "path" will compute an md5 hash of the file path. This strategy will only be effective if the duplication-strategy (above) is set to "soft-link",
              # in order to allow for the original file path to be hashed.
              hashing-strategy: "file"

              # When true, will check if a sibling file with the same name and the .md5 extension exists, and if it does, use the content of this file as a hash.
              # If false or the md5 does not exist, will proceed with the above-defined hashing strategy.
              check-sibling-md5: false
            }
          }
        }

        # The defaults for runtime attributes if not provided.
        default-runtime-attributes {
          failOnStderr: false
          continueOnReturnCode: 0
        }
      }
    }
  }
}

vsoch commented 6 years ago

Cool thanks! So just to verify - I don't actually need to touch any scala, this is just a custom backend.conf for singularity (most of which I've already got a good start on?) This would simplify things quite a bit! Is this then provided in the workflow / pipeline or with cromwell here?
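As a rough sketch of what "just a custom backend.conf for singularity" could mean in practice: the `submit-docker` string in the config above is a shell command, so a Singularity variant would swap `docker run -v` for `singularity exec --bind`. The function below only prints the command it would build. Its name and argument order are hypothetical, the arguments stand in for Cromwell's `${cwd}`, `${docker_cwd}` and `${script}` placeholders plus whichever runtime attribute names the image, and nothing here has been tested against a real Cromwell run.

```shell
# Sketch of the command a Singularity flavoured submit string might expand to.
# Arguments: host working dir, in-container working dir, image name, script path.
singularity_submit_cmd() {
  local cwd="$1" docker_cwd="$2" image="$3" script="$4"
  # --bind plays the role that -v plays in the docker run line of the config
  echo "singularity exec --bind ${cwd}:${docker_cwd} docker://${image} /bin/bash ${script}"
}
```

For example, `singularity_submit_cmd /data/run1 /cromwell-executions/run1 ubuntu:16.04 /cromwell-executions/run1/execution/script` prints a single `singularity exec` line that could be pasted into the submit string for a quick manual check.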

geoffjentry commented 6 years ago

@vsoch That's the theory. Let me know if that doesn't seem to be working for you and we can go from there.

The idea is that this would be in the Cromwell configuration and not per-workflow (but see below). In general that makes sense because a lot of the HPC-style use cases we see people never want to use actual Docker.

However, there are a few buts to the above ....

vsoch commented 6 years ago

gotcha! I'll dig into this and try for a first shot, will send back an update when I break, I mean, dip in my toe a bit more.

vsoch commented 6 years ago

Which file in the Cromwell repo defines the backend like this? I haven't found a contender yet with my grepping!

danbills commented 6 years ago

This is the standard way to configure cromwell, to provide your own .conf file. Best to start here

shuang-luo commented 6 years ago

I am trying to use udocker to run gatk4-germline-snps-indels locally, using the WDL and JSON files from its GitHub page: https://github.com/gatk-workflows/gatk4-germline-snps-indels. Running it locally with Docker seems to work well so far, with no errors reported. The udocker introduction says it can pull and run Docker images (though I may have misunderstood). What do I need to do to use udocker in place of Docker for this task?

vsoch commented 6 years ago

okay sorry I was confused then - @geoffjentry suggested that the backend.conf was part of the cromwell base:

> The idea is that this would be in the Cromwell configuration and not per-workflow (but see below). In general that makes sense because a lot of the HPC-style use cases we see people never want to use actual Docker.

As opposed to a workflow or pipeline that uses it. For example, here is the pipeline that I was working on that has a backend.conf that runs Singularity:

https://github.com/vsoch/wgbs-pipeline/pull/1/files#diff-f6baca157827c4888c394eab694e000c

But this is not a part of cromwell, or relevant to this repo - it's just a configuration file provided with the workflow. I was under the impression that we wanted to write something that would be integrated into cromwell to interact with Singularity, and not a configuration file provided with a particular pipeline (such as the wgbs in the example above). Do you mean that there is a template folder (or some other docs) where the "suggested singularity backend" would be provided? Something different? What am I missing?

geoffjentry commented 6 years ago

@vsoch Hi - I think what you're looking for is cromwell.examples.conf which is where we put examples like this. So if there's a configuration for a backend which works we'd put it in there so we could point people at it.

Does that make sense?

vsoch commented 6 years ago

Ah gotcha! To summarize:

So I just need to write that example :) Did I get that right this time?

geoffjentry commented 6 years ago

@vsoch if in fact this scheme is going to work you have it right. It might not be feasible but we can cross that bridge if/when we get there

vsoch commented 6 years ago

Yep sounds good. I'll do this PR after the requested changes to add the Dockerfile development testing go through. Thanks for the clarification!

vsoch commented 6 years ago

hey @geoffjentry in case it's unseen, there were requested changes for my first PR, and I did them --> https://github.com/broadinstitute/cromwell/pull/4015 so we are waiting on this guy. No rush, I'm a very happy dinosaur :)

vsoch commented 6 years ago

Hey everyone!

I've been thinking more about this and testing, and I want to offer my thoughts here. I think overall my conclusions are:

TLDR let's not try shoving a dog into a cat hole because the ears look similar. They are two different technologies, the latter (Singularity) is probably going to do great things for Cromwell because it is a single binary (and not a collection of tarballs) but we need that version 3.0 with OCI compliance to really have a well formulated language for Cromwell to talk to, period.

I can go into more detail. First, let's define the parties involved:

Definitions

What does Singularity + Cromwell look like?

People keep saying these two together, and I've been struggling to figure it out. I've been doing a lot of work trying to do that. What does it mean for Singularity to be a part of Cromwell? I first logically thought it would mean a backend, because the basic exec / run commands for Singularity don't change much (but arguments do!). But it doesn't fit well here because it's missing that API to make it a fully fledged service. To those familiar with Singularity, this is the instance command group (and not running containers as images). Then I thought it was really more of a workflow executable. But if this is the case, why is it special at all? It doesn't really fit because there is still going to be a lot of redundancy in specifying the "singularity run" bit over and over again. So I think (eventually) all these use cases could fit into cromwell,

but for now, without a clean API for services, only the first two really make sense. Singularity is not special. It's just a binary.

Why has it been so confusing?

We get Singularity confused with Docker, because they are both containers. Same thing right? Sort of, but not exactly. Docker is a container technology, but actually it's older and has had time to develop a full API for services. It meets the criteria for both a backend and an executable, and this is because it can be conceptualized as both "a thing that you run" and "the thing that is the container you run in." But it's confusing. The distinction is that although Singularity is also a container, Singularity is not like Docker because it doesn't have the fully developed services API (yet!). This problem is hard because the language for Singularity containers communicating between one another, and even to the host, is not completely implemented yet. This comes down to OCI compliance, and having a way for some host to manage all of its Singularity containers. Right now we just have start and stop, but we can't connect containers, define ports, or even easily get a PID. It could (sort of?) be hacked, but we would be better off waiting for that nice standard.

Reproducible Binary (Workflow Step) vs. Environment

There is also a distinction that I haven't completely wrapped my head around. Docker is very commonly used as an environment - you put a bunch of software (e.g., samtools, bwa aligner, etc.) and then issue commands to the container with custom things. Singularity, in my mind, to be truly a reproducible thing is more of the workflow step or script. It will have the software inside, but better should have those same commands represented with internal modularity. I could arguably completely do away with the external workflow dependency if a single binary told me how to run itself, and then had more than one entrypoint defined for each step. I wouldn't need to care about the software or components inside because my host just needs to run Singularity. A container should almost be more like a hard coded binary step instead of a "come into the environment and play around, the water's fine!" It's a little bit like the ICD 10 decision to give a unique id to every combination of things (e.g., "got hit on the road by a chicken") instead of combinations of them, eg. ("got hit" + "by chicken"). The first is harder because you represent more things (more containers), but the second isn't reproducible because if you lose "by chicken" you've lost the entire workflow. Does that make sense?

What can/should we do now?

So there are two things to think about. With the current representation of a workflow, we would want Singularity to be OCI compliant, and I would propose a plan to move forward is to expect this, and contribute to Singularity itself with the mindset of "I want this to plug into AWS" or "I want this to plug into Kubernetes," etc. The backends for HPC are going to be good to go with just a SLURM or SGE backend, and then commands to load and run/exec a Singularity container. When the time comes and Singularity supports services, then we can start to develop (I think) the singularity backend configuration for cromwell, with clean commands to get statuses, start and stop, and otherwise integrate into the software. You guys seem pretty busy, so likely your best bet would be to just wait, because the community is going in that direction anyway.
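For the HPC case described above ("a SLURM or SGE backend, and then commands to load and run/exec a Singularity container"), a submit line might look roughly like the output of this shell sketch. It only prints the command. The function name and arguments are hypothetical, and any site-specific step for making Singularity available (for example an environment module) is deliberately left out because it varies per cluster.

```shell
# Sketch: wrap a `singularity exec` call in an sbatch submission.
# Arguments: job name, working directory, image path or URI, script path.
slurm_singularity_cmd() {
  local job_name="$1" cwd="$2" image="$3" script="$4"
  # -J sets the job name, -D the working directory, --wrap the command to run
  echo "sbatch -J ${job_name} -D ${cwd} --wrap \"singularity exec ${image} /bin/bash ${script}\""
}
```

The scheduler does the queueing and status tracking here; Singularity is just the executable that the batch job runs, which matches the "it's just a binary" view in this comment.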

The other representation is to rethink this. An approach that I like is to move away from micro managing the workflow / software, and to set requirements for the data. If you set standard formats (meaning everything from the organization of files down to the headers of a data file) on the data itself, then the software gets built around that. A researcher can have confidence that the data he is collecting will work with software because it's validated to the format. The developers can have confidence their tools will work with data because of that same format. A new graduate student knows how to develop a new tool because there are nicely defined rules. A good example is BIDS (the Brain Imaging Data Structure), which has several file formats under it and is revolutionizing how brain imaging analysis is done (e.g., take a look at https://www.openneuro.org).

Development of my Thinking

Finally, I want to share how I came to the thinking above. Here are the steps that I've taken in the last few weeks, and resulting thoughts from them. I started with this issue board actually, and a general goal to "Add Singularity to Cromwell." Ok.

Question 1: How do I develop Cromwell?

It first was hard for me to know where to start to develop Cromwell, because the docs just went into how to compile it on a host. So it made sense to make it easy for the developer to develop Cromwell so I made a Dockerfile to do that:

Woohoo merged! We needed to have tests too, so I followed up on that:

But unfortunately it was decided that CircleCI was too new / needed to learn stuff (this is ok!) so it's going to be closed.

Question 2: How do we add a Singularity backend?

But this is actually ok, because we realize that we don't need to add Singularity to Cromwell proper, it can just be a backend! But I didn't understand wdl, or any of the formats, so my crew in Cherry lab gave me a solid repo to start with, and then it started to click!

I was waiting for the Dockerfile test PR to pass, but realized it probably wouldn't, so I jumped on adding the example backend workflows (still without totally understanding what/why/how, but figuring out as I went):

Question 3: But what about Cromwell+Singularity on Travis?

I got confused again when there were requests for additional tests (and something entirely different) that it made me step back. I had this growing feeling that started to solidify that there are too many layers. I am developing things and I still don't understand (or think Singularity is ready yet) to be any kind of backend. I'm forcing a dog into a cat shaped hole just because this is the hole I'm supposed to fill. Is that a good idea? I've lost sight of what the tool is trying to do. Cromwell is trying to make it easy to run a Singularity container. But if that's the case, then why has this command:

singularity run shub://vsoch/hello-world

turned into needing Cromwell (java and the jar), an inputs json file, a wdl specification, a backend configuration, and a runtime command that I can't seem to remember, and then the entire thing takes much longer than an instant to echo a tiny Rawwwwr! If this is the goal we are going for, is this making life easier for the scientist? If I'm a programmer person, and this is the minimum I am allowed for this to just run a simple container, what happens when it gets harder? I realized that without a proper services API, singularity is no more special than python, bash, samtools, it's just a binary.

And I realize also that it's easy to get caught up in details like "Should we use Travis or Circle?" Does it work on Amazon with this kind of input? And there will always be bugs! But I think the forest is being a bit lost for the trees.

Question 4: What is the direction to go in?

You can probably take what I'm saying with a grain of salt because I'm new to this entire universe, and there is so much invested there is no turning back or rethinking. But all of this seems too complicated, and too hard. What is needed is a solution that is just really stupid and simple. You have a container that understands its data. You point the container at a dataset and run it. You outsource the workflow part to the technologies that big players are building already.

This definitely isn't a "throw hands in the air" sort of deal, because most of this stuff is working already it seems? I don't know if this perspective is useful, but as a new person (outsider) I wanted to offer it because if I'm confused and find this hard, probably others are too. And minimally it's good for awareness and discussion? I'm definitely happy to help however I can! But I'd really like to not try shoving dogs into cat holes, it's a very messy business. :cat: :dog: :hole: :sos:

oneillkza commented 5 years ago

Just a note, since it didn't seem to come up in this conversation:

Some of us interested in running Cromwell are based in environments where, for CAP/ISO 27001/etc compliance reasons, we can't use Docker, but can use Singularity. In this context, it doesn't really matter what form the containers take, as long as we can tell Cromwell to use Singularity to run them.

(It looks like geoffjentry underscored this in #4039 , just thought I'd add this here.)

ps-account commented 5 years ago

I fully agree Vanessa!!! I don't think this is surrendering, it's finding the solution that has been standing in plain sight all the time.

At some point in the future Singularity could have a role as a backend for workflow systems, but it's ineffective to take that idea as a starting point. I really agree that it's best to lay that idea to rest and focus on the biggest impact / low-hanging fruit.

To be honest, Singularity as a workflow component is exactly the way I've been using Singularity in real life, whereas the idea to use it as a workflow backbone always remained ... just an idea. This is not because Singularity lacks potential there, but mostly because workflow backbones have complex requirements, and trying to fit a new tool to them that wasn't made for it in the first place is not trivial.

Moving Singularity out of the role of the backend and into the role of a workflow component, more specifically a container that understands its data, also introduces the room to give it its own subfunctions, variables, metadata, tags, etc.

This makes the starting point plainly obvious. You can just take the place where you give the location of the executable, and put the wrapper for your Singularity image there. I bet this is what most people do anyway.
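The "put a wrapper where the executable goes" idea can be sketched in shell. This is only a minimal sketch and was not posted in the thread: the function name `in_container`, the `DRY_RUN` switch, and the image path and tool in the usage line are all hypothetical. The only real CLI form used is `singularity exec <image> <command>`.

```shell
# Hypothetical helper: run any command inside a Singularity image, so a
# workflow step can call it the same way it would call a plain binary.
# Usage: in_container IMAGE COMMAND [ARGS...]
in_container() {
  image="$1"
  shift
  if [ -n "${DRY_RUN:-}" ]; then
    # Print the command instead of running it (useful on machines
    # where Singularity is not installed).
    printf '%s\n' "singularity exec $image $*"
  else
    singularity exec "$image" "$@"
  fi
}

# Example with a placeholder image path and tool:
# in_container /path/to/tools.sif samtools --version
```

With something like this in place, the workflow step names the wrapper and the image, and nothing else in the workflow needs to know a container is involved.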

A next step would be to give it its own section within the workflow components. Maybe the comment of oneillkza is a high impact one, just define Singularity as a CAP/ISOblablabla compliant workflow component within Cromwell.

Another take (and not per se mutually exclusive from the take mentioned above) would be to, again, fix Singularity as a workflow component, and create a set of options and functions around it that focus on abstraction of data access etcetera.

Very curious where this will go, and thanks so much Vanessa for rethinking the approach!

Gr. Pim

tbenst commented 4 years ago

Just wondering what the latest is on this? I'm trying to figure out how to run Singularity containers with Cromwell. The documentation presumes that I want to run a docker container with Singularity, but I already have a .sif file.

tbenst commented 4 years ago

Ok, I got singularity working, although I'm new to cromwell so please let me know if there's a better way!

hello.wdl:

task hello {
  command {
    echo 'Hello world!'
  }

  runtime {
    image: "~/test.sif"
  }

  output {
    File response = stdout()
  }
}

workflow test {
  call hello
}

local.conf:

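include required(classpath("application"))
backend {
  # Make the provider defined below the default backend.
  default = singularity
  providers {
    singularity {
      # Generic config-driven backend: the commands below are shell templates.
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"

      config {
        job-shell="/bin/sh"
        run-in-background = true
        # Custom runtime attribute, set per task in the WDL runtime block.
        runtime-attributes = """
            String? image
        """
        # Run the generated task script inside the image.
        submit = """
            singularity exec ${image} ${job_shell} ${script}
        """
      }
    }
  }
}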
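
To make the `submit` template above a bit more concrete, here is a rough illustration of what it turns into once the placeholders are filled in. This is a sketch only: `render_submit` is a hypothetical helper and not part of Cromwell, and the script path is a placeholder for whatever script Cromwell generates for the task.

```shell
# Illustration only: mimic how the placeholders in the submit template
# are filled. render_submit is a hypothetical helper, not a Cromwell command.
render_submit() {
  image="$1"      # "image" runtime attribute from the WDL task
  job_shell="$2"  # job-shell value from the backend config
  script="$3"     # path to the task script (placeholder here)
  printf '%s\n' "singularity exec $image $job_shell $script"
}

render_submit "~/test.sif" /bin/sh /path/to/execution/script
# prints: singularity exec ~/test.sif /bin/sh /path/to/execution/script
```

An absolute image path in the WDL avoids depending on tilde expansion in whatever shell ends up running the submit command.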