broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Engine function to get workflow identification #1575

Open geoffjentry opened 7 years ago

geoffjentry commented 7 years ago

There was a user request from @delocalizer on Gitter: he wanted to be able to reference the Cromwell workflow ID within his WDL.

While we'd want to keep the Cromwellian implementation of a workflow ID concept separate from WDL, it seems fair to have a function in WDL which returns an engine-specific identifier for that workflow; Cromwell would just happen to implement it by dropping in the workflow's UUID.
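
For illustration, a minimal sketch of what that might look like from the WDL side. The function name `workflow_id()` is purely hypothetical; nothing like it exists in the spec today, and the WDL version/syntax details are incidental:

```wdl
version 1.0

workflow example {
  # Hypothetical engine function; Cromwell would return the workflow's UUID here.
  String run_id = workflow_id()

  call record_run { input: id = run_id }
}

task record_run {
  input {
    String id
  }
  command <<<
    echo "workflow id: ~{id}"
  >>>
  output {
    String out = read_string(stdout())
  }
}
```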

cjllanwarne commented 7 years ago

NB currently all engine functions are deterministic w.r.t. inputs. We should consider how much we care that call caching might go wrong if a declaration or command string contains this function

mcovarr commented 7 years ago

how should this work wrt subworkflows?

cjllanwarne commented 7 years ago

IMO this should always return the root workflow ID (because that's the only user-facing one)

katevoss commented 6 years ago

@geoffjentry Why would a user want to reference the Cromwell workflow ID in a WDL? What's the use case?

geoffjentry commented 6 years ago

From gitter: "The use case is a workflow with a final task that queries the cromwell API for metadata about itself - for updating a database with a record of what was done."

I've seen other people ask for this fwiw
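
For concreteness, a hedged sketch of such a final task, assuming the workflow ID can be passed in somehow (the very feature requested here) and that a Cromwell server is reachable at the assumed address; the call uses Cromwell's metadata endpoint, `GET /api/workflows/v1/{id}/metadata`:

```wdl
version 1.0

task fetch_metadata {
  input {
    String workflow_id
    # Assumed server address; point this at the actual Cromwell server.
    String cromwell_host = "http://localhost:8000"
  }
  command <<<
    curl -s "~{cromwell_host}/api/workflows/v1/~{workflow_id}/metadata" > metadata.json
  >>>
  output {
    File metadata = "metadata.json"
  }
}
```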

katevoss commented 6 years ago

As a workflow runner, I want to be able to reference Cromwell's workflow ID in a WDL, so that I can programmatically query for metadata about that workflow.

patmagee commented 6 years ago

@katevoss I previously asked about this as well. Our use case had more to do with tracking individual costs associated with jobs initiated within a task but not managed by Cromwell at all (i.e. adding labels to a Google API call from within a WDL task).

The response I received back was that it would be considered, but that it creates a non-deterministic task which will be different each time you call it with the same parameters.

I wonder if certain variables should be labelled as volatile so that they always invalidate a cache hit?
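
As a sketch of that idea, a task-level opt-out from the call cache could look roughly like the following; Cromwell documents a `volatile` flag in the task `meta` section along these lines, but treat the exact spelling as something to verify against the current docs:

```wdl
version 1.0

task always_rerun {
  input {
    String id
  }
  meta {
    # Marks the task as never eligible for a call-cache hit (verify against docs).
    volatile: true
  }
  command <<<
    echo "tagging externally-managed jobs for ~{id}"
  >>>
  output {
    String out = read_string(stdout())
  }
}
```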

geoffjentry commented 6 years ago

@katevoss It'd be easy to do. It'd be useful for some folks, as you can see by @patmagee chiming in. It's not without issue - e.g. the call caching issue that @cjllanwarne pointed out (potentially solvable by syntax a la @patmagee).

Another thing to consider is whether we're starting to blend implementation with syntax - by doing this you're requiring a WDL implementation to have a notion of workflow ID. I can't imagine one not having that, but I think we need to be careful not to be overly prescriptive about implementation.

patmagee commented 6 years ago

Another option would be a Cromwell-specific thing. With Cromwell, you could reserve a variable at the level of the workflow called String cromwell_workflow_id. When Cromwell is resolving inputs, it would basically set this to the value of the current Cromwell ID.

We would not need to change syntax, just document which variable names are reserved for Cromwell.
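
A minimal sketch of that proposal, with the reserved name exactly as suggested above; note this is not current Cromwell behaviour, the engine would have to inject the value itself:

```wdl
version 1.0

workflow tracked {
  input {
    # Hypothetically populated by Cromwell during input resolution.
    String cromwell_workflow_id
  }
  call label_job { input: id = cromwell_workflow_id }
}

task label_job {
  input {
    String id
  }
  command <<<
    echo "tagging external resources for workflow ~{id}"
  >>>
}
```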

geoffjentry commented 6 years ago

@patmagee to be clear, I'm not too worried about that implementation statement I made. It's just that I think this particular concept is trending in that direction, so I'd like us to be careful.

EvanTheB commented 5 years ago

In case someone is looking to achieve this: the submit script that runs as part of the backend configuration (not the WDL task itself) has access to 'job_name'. I have passed this through as an environment variable:

```
submit-docker = """
    docker run \
      --entrypoint ${job_shell} \
      -e CROMWELL_JOB_NAME=${job_name} \
      ${docker} ${docker_script}
"""
```

Side note: it would be nice if the variables available to that conf script were documented - I am reverse-engineering from the example configs.
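
For completeness, a minimal sketch of a task reading that environment variable; `CROMWELL_JOB_NAME` matches the `-e` flag used in the snippet above:

```wdl
version 1.0

task report_job_name {
  command <<<
    # Injected by the submit-docker wrapper above, not by WDL itself.
    echo "running as Cromwell job: $CROMWELL_JOB_NAME"
  >>>
  output {
    String job_name = read_string(stdout())
  }
}
```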

EvanTheB commented 5 years ago

Further note: in current Cromwell, job_name is a truncated UUID, and it is shared between all the 'calls' of a 'workflow'. I do not know about subworkflows.

guma44 commented 4 years ago

@katevoss I would like to add our use case for the ability to access the ID of a workflow/subworkflow.

Our users want Cromwell's output to be copied to their local locations, and they complain that they cannot read the directory structure - I agree with them: they are data scientists who want something useful after the pipeline finishes, and as we provide the service we are responsible for giving them that. One idea was to use croo to achieve this. It is a really useful solution, but it requires manual intervention, knowledge of the pipeline IDs, etc. Thus I thought I could split the workflow into a root workflow and two sub-workflows: do-the-job and copy-files. However, to achieve this, copy-files would need access to the do-the-job sub-workflow ID, or at least the root workflow ID, in order to query for the metadata. I agree it is not deterministic and it should not be; such a task cannot be cached either.
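
A rough sketch of that split; file names, call names, and outputs are illustrative, and the `workflow_id` input is exactly the piece there is currently no supported way to fill in:

```wdl
version 1.0

import "do_the_job.wdl" as jobs
import "copy_files.wdl" as copier

workflow root {
  call jobs.do_the_job
  call copier.copy_files {
    input:
      files = do_the_job.outputs,
      workflow_id = "???"  # no supported way to obtain this today
  }
}
```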

guma44 commented 4 years ago

So, for anybody who is interested in how I do it now: I just parse the path of any file produced by the workflow and extract the first Cromwell workflow ID in the path as the root ID. That is a bit of a hack IMHO, because it requires knowledge of the directory structure, but it works. I did not put it into an environment variable because I was afraid of polluting the cluster environment.

EDIT: As I mentioned in the previous entry, the cache needs to be invalidated for the particular task for this to work, and until #1695 is done (or some other solution is reached) this solution in principle rules out restarts.
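
For reference, a hedged sketch of that workaround as a WDL task; it relies on Cromwell's default `cromwell-executions/<name>/<root-id>/...` directory layout, so it breaks if the output directory structure changes:

```wdl
version 1.0

task extract_root_id {
  input {
    # Path of any file produced by the workflow, passed as a String so the
    # original path (which contains the workflow IDs) is preserved.
    String any_output_path
  }
  command <<<
    echo "~{any_output_path}" \
      | grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' \
      | head -n 1
  >>>
  output {
    String root_workflow_id = read_string(stdout())
  }
}
```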