broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
994 stars 360 forks source link

Allow basename() and sub() to work on optional files #6840

Closed aofarrel closed 2 years ago

aofarrel commented 2 years ago

basname() and sub() appear to only work on non-optional types, even if they are being used to set an optional variable. Nothing in the spec seems to imply that basename() and sub() should only work on requireds, and indeed, other implementations of WDL do not have this restriction. As a result, it seems I have to make Cromwell-specific workarounds involving select_first() which shouldn't really be necessary.

Example using basename()

This code should work just fine on Cromwell, but doesn't.

task reference_prepare {
    input {
        # You need to define either this...
        File? reference_fa_file

        # Or both of these.
        File?   reference_zipped_directory
        String? reference_fa_filename_in_zipped_directory
    }

    # get the basename of the reference
    String? basename_reference = basename(reference_zipped_directory)

    command <<<
        set -eux -o pipefail

        if [[ ! "~{reference_zipped_directory}" = "" ]]
        then
            cp ~{reference_zipped_directory} .
            unzip ~{basename_reference}
        fi

        # do other things here
    >>>
}

workflow foo {
    call reference_prepare
}

womtool validate output

Results in a fatal error.

Failed to process task definition 'reference_prepare' (reason 1 of 1): Failed to process expression 'basename(reference_zipped_directory)' (reason 1 of 1): Invalid parameter 'IdentifierLookup(reference_zipped_directory)'. Expected 'File' but got 'File?'

miniwdl check output (for comparison)

Results in some warnings.

refprep.wdl workflow foo call reference_prepare task reference_prepare (Ln 24, Col 3) UnusedDeclaration, nothing references File? reference_fa_file (Ln 28, Col 3) UnusedDeclaration, nothing references String? reference_fa_filename_in_zipped_directory (Ln 32, Col 2) UnnecessaryQuantifier, unnecessary optional quantifier (?) for non-input String? basename_reference

(The actual workflow also runs successfully on miniwdl.)

Workaround

Because all of our variables are optional, there isn't a required variable we can fall back on. But because we only use basename_reference if the variable it relies on exists anyway, there's no point in having a fallback value that actually means anything. Therefore, the best workaround seems to be:

String? basename_reference = basename(select_first([reference_zipped_directory, "my hovercraft is full of eels"]))

Proposed implementation

I think continuing to disallow this makes sense:

String basename_reference = basename(reference_zipped_directory)

Other executors do allow that, but I think it makes sense for a required variable to not be allowed to rely on an optional variable. What I'm asking for is when basename() and sub() are being used to set an optional variable, that they can act upon optional variables, eg, this should work:

String? basename_reference = basename(reference_zipped_directory)

pshapiro4broad commented 2 years ago

I think select_first is the right approach although I would use "" for the default value. If you know that the caller won't provide a value at all when calling your task (vs passing through an optional value from its caller), you could use a default value:

task reference_prepare {
    input {
        # You need to define either this...
        File? reference_fa_file

        # Or both of these.
        File?   reference_zipped_directory
        String reference_fa_filename_in_zipped_directory = ""
    }
    ...
}

Regarding the difference in behavior of cromwell vs miniwdl, I don't think miniwdl's support for this form is backed by the WDL spec. As I see it, the spec is explicit with regard to optional parameters to Standard Library functions. For example, the signature for size() is Float size(File?|Array[File?], [String]). Since the signature for basename() is String basename(String|File, [String]), it doesn't look like it should support String? as an input.

aofarrel commented 2 years ago

If you know that the caller won't provide a value at all when calling your task (vs passing through an optional value from its caller), you could use a default value:

Although this would prevent me from using select_first() as often, in the actual workflow (which I didn't post, as it's monstrous compared to the toy example), I also have to use defined() to build the path of an optional TSV file which is either going to be in the zipped directory, or passed in directory, or not used at all. Setting an default value means I now have to check for equality with an empty string instead of using defined. In the end it'd be just as verbose and probably a little harder to debug then select_first().

Regarding the difference in behavior of cromwell vs miniwdl, I don't think miniwdl's support for this form is backed by the WDL spec. As I see it, the spec is explicit with regard to optional parameters to Standard Library functions. For example, the signature for size() is Float size(File?|Array[File?], [String]). Since the signature for basename() is String basename(String|File, [String]), it doesn't look like it should support String? as an input.

I'm a little confused, where are you seeing this? sub() is titled String sub(String, String, String) and size() is titled Float size(File, [String]) in the 1.0 spec. I get that sub() has examples with compound types but I took that to mean "here's some examples with arrays plus an optional for comparison" since File? isn't a compound type (if I am understanding this correctly). If size() accepts optionals in spite of the spec continuously saying File instead of File? except in one example, I don't see why basename() and sub() cannot.

aednichols commented 2 years ago

That's interesting, I did not know that size() returns 0.0 when you supply it a nonexistent File?.

In fairness, basename() does not have an extra "Acceptable compound input types" section like size() does.

aofarrel commented 2 years ago

I saw that, but since File? isn't a compound type, I'm under the impression that example was for comparison to the actual compound types. That's the inconsistency I'm not groking -- since File? isn't an example of a compound type, it seems that example's existence implies that something that accepts a File should also accept a File?, which is indeed the case with Cromwell's integration for size() but not basename() or sub().

If we relied entirely on what the spec's headings and examples said as being the only acceptable inputs, then basename() wouldn't work on File at all because the spec says it actually takes in a String, not a File, and has no File examples. Since basename() works on File it seems Cromwell is already going beyond what the 1.0 spec explicitly says.

mlin commented 2 years ago

Sorry @aofarrel we need to look into this further, but on a brief first glance I'd bet this is an oversight in the miniwdl type checker, elaborated in the linked issue. thanks @pshapiro4broad for the slack ping

aofarrel commented 2 years ago

Perhaps the issue is the spec itself? There does seem to be a lot of small internal inconsistencies that lead to this "this function doesn't say it takes in File but it works anyway, but not File?, and also this other function doesn't say it can take in File? but it can take in File?" situation.

As a side note, the 1.1 spec seems to be less open to interpretation. @pshapiro4broad Is there any timeline on Cromwell support for WDL 1.1?

aednichols commented 2 years ago

A conservative interpretation of the spec is that only File works unless noted otherwise... which it is noted for size(), only.

1.1 support is a top request and is heard!

aofarrel commented 2 years ago

At the risk of derailing this into a 1.1 thread, I have heard that WDL 1.1 adds support for directory outputs (which would completely sidestep like 50% of the issues I currently have when writing WDLs, including this one involving basename/sub/select_first) but I don't see that on the 1.1 spec -- is that a 1.1 feature or a development (1.2?) feature?

pshapiro4broad commented 2 years ago

I'm a little confused, where are you seeing this?

@aofarrel Sorry, I was looking at the 1.1 spec. The section I was looking at is Float size(File?|Array[File?], [String]).

Is there any timeline on Cromwell support for WDL 1.1?

I work on other things at the Broad, not on cromwell. I'd like to see 1.1 support in cromwell too!

aofarrel commented 2 years ago

So I've run across into an even more explicit scenario where File? probably should be accepted (or coerced). This doesn't work:

String? tsv_arg = if defined(tsv_file_input) then basename(tsv_file_input) else ""

basename() is only going to be run if File? tsv_file_input is defined. Wouldn't it make sense for that if-block to effectively coerce the File? into a File in the context of the then-block? Generally speaking when I write a program that says "if variable is defined, do something with variable" to just work, rather than throwing an error when the variable isn't defined, much less throwing an error when the variable might not be defined (which is what appears to be happening here).

pshapiro4broad commented 2 years ago

Interesting, this looks like another case where miniwdl diverges from cromwell (and the WDL spec too?). As far as I know the WDL spec doesn't support support type promotion of X? to X from if defined(x) ... x expressions, even though it could be done. @mlin

The usual way I've done this in WDL is to use select_first(). Something like this should work:

String? tsv_arg_maybe = if defined(tsv_file_input) then basename(tsv_file_input) else ""
String tsv_arg = select_first([tsv_arg_maybe])
aofarrel commented 2 years ago

I'm not sure how miniwdl handles this, I'm just testing this one on Cromwell for now. But putting aside the spec, miniwdl, and Cromwell here, when I write some sort of code in any particular language, I always expect at least one of these to work:

In Cromwell-flavored WDL (perhaps all WDL?), it doesn't seem you can do any of those. The first one will throw an "Expected X but got X?" error and the other two don't seem to have equivalents. Compare that to Python, where I can explicitly do the second or third one, and implicitly do the first one. In Python, if I try to do_something() on a variable that isn't defined, Python throws a Name Error, but in Cromwell!WDL trying to do_something() on an optional throws a "Expected X but got X?" error that doesn't tell you if X is actually defined or not.

The fact that the WDL spec doesn't explicitly say that defined() can coerce a variable doesn't really matter -- I would wager that most people would expect this sort of thing to work. It seems to be a logical conclusion that if a file exists, you can do something to that file, without having to call a totally different function to create a new variable.

I did check your workaround, but it throws the same error. So, my understanding is the only way to do this in Cromwell is this:

String basename_tsv = basename(select_first([tsv_file_input, "bogus fallback value"]))
String arg_tsv = if(basename_tsv == "bogus fallback value") then "" else basename_tsv

...which just isn't intuitive. It shouldn't be so complicated to get the basename of an optional file.

Even less intuitive is the fact that if you separate out the select_first() part into a new variable, my workaround doesn't work anymore.

String maybe_tsv = select_first([tsv_file_input, "bogus fallback value"])
String basename_tsv = basename(maybe_tsv)
String arg_tsv = if(basename_tsv == "bogus fallback value") then "" else basename_tsv

_Failed to process task definition 'parse_terratable' (reason 1 of 1): Failed to process expression 'select_first([basename_tsv, basename(maybe_tsv)])' (reason 1 of 1): Invalid parameter 'IdentifierLookup(maybetsv)'. Expected 'File' but got 'String?'

In other words -- it is unnecessarily complicated to work with optionals, and the workarounds that do exist appear to be inconsistent. It seems this could all be sidestepped by having "if X exists, do something to X" logic.

(I want to make clear I'm coming at this from a perspective of someone who uses WDL heavily, mostly in the context of Terra/Cromwell, and wants wider adoption. From my experience, oddities like this are serious obstacles for both newcomers and heavy users like myself.)

aednichols commented 2 years ago

Thanks for your feedback. Since this is more of a WDL spec item than a Cromwell one, I suggest creating a new Issue, PR, or Discussion over at OpenWDL.

pshapiro4broad commented 2 years ago

Sorry, I put the select_first in the wrong place. This works:

  String tsv_arg = if defined(tsv_file_input) then basename(select_first([tsv_file_input])) else ""

My test.wdl:

version 1.0

workflow W {
  input { File? tsv_file_input }

  String tsv_arg = if defined(tsv_file_input) then basename(select_first([tsv_file_input])) else ""

  call T { input: s = tsv_arg }
  output { String out = T.out }
}

task T {
  input { String s }
  command { echo ~{s} }
  output { String out = read_string(stdout()) }
}

I checked and this works with both miniwdl run and java -jar cromwell.jar run.

aednichols commented 2 years ago

Nice, handy to have this written down for future Google searchers.

mlin commented 2 years ago

In early miniwdl development we did explore recognizing that x is non-optional within the if (defined(x)) consequent, but I believe we dropped that after some spec clarifications. That implementation is not too complicated but you can get into questions of how deeply to statically analyze the if expression, e.g. if it's a compound boolean logic that maybe you can get into and prove implies defined(x). And the select_first([x]) idiom, as illustrated in @pshapiro4broad corrected version, does work but I agree it's a very common roadbump for all WDLers.

aofarrel commented 2 years ago

@pshapiro4broad Unfortunately I'm still having issues with select_first() (apparently?) acting inconsistently.

This passes miniwdl and Cromwell:

version 1.0

task T {
    input {
        File? tsv_file_input
        String tsv_arg = if defined(tsv_file_input) then basename(select_first([tsv_file_input, "/path/to/file.txt"])) else ""
    }

    command <<<
        echo ~{tsv_arg}
    >>>

}

workflow W {
    input {
        File? tsv_file_input
    }

    call T {
        input:
            tsv_file_input = tsv_file_input
    }
}

This passes miniwdl, but fails Cromwell:

version 1.0

task T {
    input {
        File? tsv_file_input
        String foo = select_first([tsv_file_input, "/path/to/file.txt"])
        String tsv_arg = if defined(tsv_file_input) then basename(foo) else ""
    }

    command <<<
        echo ~{tsv_arg}
    >>>

}

workflow W {
    input {
        File? tsv_file_input
    }

    call T {
        input:
            tsv_file_input = tsv_file_input
    }
}

Cromwell's error is:

14:27:13.383 [main] ERROR io.dockstore.client.cli.ArgumentUtility - Problem parsing WDL file: Failed to process task definition 'T' (reason 1 of 1): Failed to process expression 'select_first([tsv_arg, if defined(tsv_file_input) then basename(foo) else ""])' (reason 1 of 1): Invalid parameter 'IdentifierLookup(foo)'. Expected 'File' but got 'String?' 14:27:13.385 [main] ERROR io.dockstore.client.cli.ArgumentUtility - wdl.draft3.parser.WdlParser$SyntaxError: Failed to process task definition 'T' (reason 1 of 1): Failed to process expression 'select_first([tsv_arg, if defined(tsv_file_input) then basename(foo) else ""])' (reason 1 of 1): Invalid parameter 'IdentifierLookup(foo)'. Expected 'File' but got 'String?'

To me, that seems to indicate that Cromwell's implentation of select_first() isn't consistently returning the same type, depending on where it is being used. It looks like in the first example it is correctly returning a String but in the second example it's returning a String?, but I don't see a meaningful difference between the two.