DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Support filename coercion back to string #5004

Open stxue1 opened 5 days ago

stxue1 commented 5 days ago

The WDL spec expects that a File can be coerced back to a String, for example to alter its filename extension: https://github.com/openwdl/wdl/blob/9c0b9cf4586508a9e6260cc5c5e562e21f625aac/SPEC.md?plain=1#L6521

We virtualize files into our own representation, so coercing a File back to a String gives us a toilfile URI instead of the original filename:

data_file = "toilfile:4%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-kfz8ammc%2Ffile-c91abfb5298d4a83bad4bb00d53beb55%2Ffoo.data/e946ed9a-784f-47d3-bf74-e84a23d9056d/foo.data"
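Judging only from the example value above (an assumption, not a documented format), the last `/`-separated segment of the URI looks like the file's basename. A minimal sketch of pulling it out follows; `basename_from_toilfile_uri` is a hypothetical helper, not a Toil function:

```python
from urllib.parse import unquote

# The coerced value from the example above.
data_file = "toilfile:4%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-kfz8ammc%2Ffile-c91abfb5298d4a83bad4bb00d53beb55%2Ffoo.data/e946ed9a-784f-47d3-bf74-e84a23d9056d/foo.data"


def basename_from_toilfile_uri(uri: str) -> str:
    """Return the last path segment of the URI, percent-decoded.

    Assumption: the final "/"-separated segment is the basename, as it
    appears to be in the example. This is not a documented guarantee.
    """
    return unquote(uri.rsplit("/", 1)[-1])


print(basename_from_toilfile_uri(data_file))  # foo.data
```

This recovers only the basename, not the string that was originally coerced to a File (which may have included a directory part), so it would not be enough on its own to satisfy the spec.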

Issue is synchronized with Jira Story TOIL-1609.

adamnovak commented 5 days ago

Yeah, this is going to be tricky to implement. It looks like at least in task output sections they expect to be dealing with the same string that got coerced to a File, rather than something processed.

How will we make this work at workflow scope? We might have one decl that turns a String into a File, then several decls and tasks that depend on that File and need to use its virtualized representation, and then another decl that expects to operate on the File coerced back to the same original string, possibly in the same expression as working with the File's contents.

Could we move virtualization to the boundaries of tasks, so that we virtualize a version of the File to pass to tasks and leave a non-virtualized version around for use in decls?
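A rough sketch of what virtualizing only at the task boundary could look like. Everything here is hypothetical: `FileValue` is a stand-in for the real WDL File value class, and `to_uri` stands in for whatever actually imports a file and returns its toilfile URI.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class FileValue:
    """Hypothetical stand-in for a WDL File value (not the real class)."""
    value: str


Value = Union[str, FileValue]


def virtualize_for_task(
    bindings: Dict[str, Value],
    to_uri: Callable[[str], str],
) -> Dict[str, Value]:
    """Build the inputs handed to a task, virtualizing Files on the way out.

    The workflow-scope bindings are left untouched, so decls that coerce a
    File back to a String still see the original, non-virtualized string.
    """
    return {
        name: FileValue(to_uri(val.value)) if isinstance(val, FileValue) else val
        for name, val in bindings.items()
    }


# Usage with a fake virtualizer (hypothetical URI shape).
workflow_env: Dict[str, Value] = {"data_file": FileValue("foo.data"), "label": "x"}
task_inputs = virtualize_for_task(workflow_env, lambda p: "toilfile:fake/" + p)
```

This does not address decls at workflow scope that need to read the File's contents, which would still need some way to reach the virtualized copy.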

Or we could tack the original non-virtualized path of the File onto it with setattr() so we can get at it later for string coercion.
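A minimal sketch of the `setattr()` idea, under stated assumptions: `FileValue` is a stand-in for the real WDL File value class, and the attribute name `original_path` and both helper functions are made up for illustration.

```python
from typing import Callable


class FileValue:
    """Hypothetical stand-in for a WDL File value (not the real class)."""

    def __init__(self, value: str) -> None:
        self.value = value


# Hypothetical attribute name used to remember the pre-virtualization string.
ORIGINAL_ATTR = "original_path"


def virtualize(file: FileValue, to_uri: Callable[[str], str]) -> FileValue:
    """Return a virtualized copy that still remembers its original string."""
    virtualized = FileValue(to_uri(file.value))
    # If the file was already virtualized once, keep the earliest original.
    original = getattr(file, ORIGINAL_ATTR, file.value)
    setattr(virtualized, ORIGINAL_ATTR, original)
    return virtualized


def coerce_to_string(file: FileValue) -> str:
    """File-to-String coercion that prefers the remembered original."""
    return getattr(file, ORIGINAL_ATTR, file.value)


# Usage with a fake virtualizer (hypothetical URI shape).
v = virtualize(FileValue("foo.data"), lambda p: "toilfile:fake/" + p)
print(v.value)              # toilfile:fake/foo.data
print(coerce_to_string(v))  # foo.data
```

The attribute would have to survive whatever copying or serialization the File goes through between jobs, which this sketch does not cover.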