broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Add limits to read_X() functions #1762

Closed geoffjentry closed 7 years ago

geoffjentry commented 7 years ago

Finalized Ticket

For the read_X() functions, limit the size of data which can be read in. Check the file size prior to attempting the read, since reading can also incur network charges. Any attempt to read an oversized file will result in a failure. See Geraldine's notes below about how to phrase the error.

- read_bool() - 5 chars
- read_int() - 19 chars
- read_float() - 50 chars
- read_string() - 128K
- read_lines() - 128K
- read_json() - 128K
- read_[tsv|map|object]() - 1MB
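The finalized limits above could be enforced with a size check before any bytes are read. The sketch below is illustrative only (Cromwell itself is written in Scala, and the function and table names here are hypothetical); the key point is that the check happens against file metadata, before the read:

```python
# Hypothetical sketch of per-function read limits (NOT Cromwell's actual
# implementation). The file's size is checked before any bytes are read.
import os

# Spec-mandated minimum limits from the ticket, in bytes/chars.
READ_LIMITS = {
    "read_bool": 5,             # "false"
    "read_int": 19,             # digits of a 64-bit signed integer
    "read_float": 50,
    "read_string": 128 * 1024,  # 128K
    "read_lines": 128 * 1024,
    "read_json": 128 * 1024,
    "read_tsv": 1024 * 1024,    # 1MB
    "read_map": 1024 * 1024,
    "read_object": 1024 * 1024,
}

def checked_read(function_name: str, path: str) -> str:
    """Fail before reading if the file exceeds the limit for this function."""
    limit = READ_LIMITS[function_name]
    size = os.path.getsize(path)  # metadata lookup, no data read
    if size > limit:
        raise ValueError(
            f"{function_name}: file '{path}' is {size} bytes, which exceeds "
            f"the {limit}-byte limit for this function."
        )
    with open(path) as f:
        return f.read()
```

Note that the oversized case raises an error rather than truncating, matching the resolution later in this thread.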

Original text for posterity:

Note to @kcibul: this is a request for both the WDL spec and Cromwell's implementation, not just the latter.

Our read_X() functions (e.g. read_int, read_string, read_lines) have a flaw in that they'll dutifully read in the entire file. The problem with this is that a malicious and/or less than careful user could take down Cromwell or another WDL implementing engine (unless it was written in Erlang!) by reading in an enormous file. For instance a careless user might read_string(SomeEnormousBam).

The point of these functions is convenience; if users are trying to sling around huge chunks of data they should be passing files around. Functions like read_boolean() are particularly egregious, as the result can only ever be true or false. Similarly, it seems unlikely that someone has a valid use case for reading the first 9 billion digits of Pi into a Float.

I propose that we cap how much data we will read at some reasonable amount (e.g. CWL uses 64 KiB, which seems a little too small to me).

eddiebroad commented 7 years ago

@geoffjentry One time I crashed like this (my WDL submission I guess.....did it crash cromwell?) in FC because the number of lines in the file was very big. I ended up re-working things so that read_lines("my_file.dat") was turned into an array of files.

kcibul commented 7 years ago

This seems like an excellent idea -- do you have a suggestion for how big this should be? I'd vote for several-x the size of anything the joint calling pipeline does (which might be our biggest user of this?)

geoffjentry commented 7 years ago

I talked to the greens a little bit today and I'm thinking the way to go about this is per-function limits.

- read_bool() - 5 chars ("false")
- read_int() - 19 chars (if I did my sleuthing & counting correctly for MAX_INT)
- read_float() - 50 chars (made up, but maxint plus a lot of decimal places)
- read_string() - 128K (because it's 2x what CWL gives you ;) )

Where it gets tricky is with the larger objects. I'm less certain what to do here:

- read_lines() - I'm thinking this should be the same as read_string()
- read_json() - Same? I think?
- read_[tsv|map|object]() - No idea, but from talking to the greens there are certainly reasonable use cases which would be much larger than the above. It sounded like if we capped this at 1MB it'd easily cover everything they did. From a workbench safety-from-users perspective this also seems like the class of functions most likely to be abused.

cjllanwarne commented 7 years ago

Curious about the "should be in the WDL spec too" part. IMO this would be most obviously a Cromwell config file option

cjllanwarne commented 7 years ago

I could even see an argument that this ought to be backend specific. So what if I'm trying to do silly things locally... But yes, absolutely we should have some protection against DOS-scale read_string()s in (e.g.) Firecloud

geoffjentry commented 7 years ago

Disagree; the point of a WDL is to provide the same result. This could radically alter a result if it were configurable, and that's exactly the reason it is specified in CWL.

geoffjentry commented 7 years ago

@cjllanwarne note that what I just said in #1777 could be used against my disagreement although I think handling all the different bifurcations here would be trickier than the one over there.

At the end of the day the number one priority is that a WDL should always produce the exact same result regardless of where it is run (possibly excepting docker, and that's part of what rubs me the wrong way about runtime attrs). The number two priority should be that a user can't possibly take down a server w/ a rogue wdl.

After that, anything can be tunable in terms of how a server achieves the first while still guaranteeing the second.

katevoss commented 7 years ago

Prioritized as Medium: this seems like low-hanging fruit (low effort, higher impact) rather than an impending need.

geoffjentry commented 7 years ago

One key question is what should happen if the file is too large? Just silently continue? Error? Provide some form of feedback to the user? Emitting to the cromwell server log seems useless for most of our user personas.

kcibul commented 7 years ago

Error... not reading the whole file probably will not produce the right behavior in the pipeline being run


geoffjentry commented 7 years ago

@kcibul That's my take but to give an alternative, the motivating use case for this in CWL was that it allows someone to e.g. read a SAM header, which by definition wouldn't be the full file. IOW it is seen as feature, not bug, that one can read a partial file.

geoffjentry commented 7 years ago

Although now that I look the CWL analogy doesn't hold up. They have a max-64K chunk attached to their File object for similar purposes as this (input into expressions) but not to enable further typing like we do.

Error it is.

geoffjentry commented 7 years ago

Ideally size would be checked prior to doing anything.

For instance: If there's a 1T file sitting on GCS and we're trying to read_bool() on it, it'd be nice to know that we shouldn't even bother downloading the file.
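The "check before downloading" idea above can be sketched with a minimal interface (the interface and names here are hypothetical, not Cromwell's actual path-handling code). The essential property: the size comes from a cheap metadata call, e.g. an HTTP HEAD or a GCS objects.get, so an oversized file is rejected without transferring any bytes:

```python
# Hypothetical sketch: reject oversized files using only a metadata lookup,
# never downloading the object. Not Cromwell's real filesystem abstraction.
from abc import ABC, abstractmethod

class SizedSource(ABC):
    @abstractmethod
    def size(self) -> int:
        """Cheap metadata lookup (e.g. HEAD request); no data transfer."""

    @abstractmethod
    def read(self) -> bytes:
        """Potentially expensive full download."""

def read_with_limit(source: SizedSource, limit: int) -> bytes:
    # Failing here means a 1T object on GCS is rejected before any
    # download (or network charges) happen.
    if source.size() > limit:
        raise ValueError(
            f"refusing to read {source.size()} bytes (limit is {limit})"
        )
    return source.read()
```

A test double can verify that the oversized path never triggers a download at all, which is the behavior this comment is asking for.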

cjllanwarne commented 7 years ago

Given that this is being considered an error (I'd prefer "Failure") rather than truncation, I'm going to try to push one more time the idea that this should be a configuration option rather than some hard-coded magic number in WDL (which will inevitably need to be raised at some point, and then will different WDL versions have different read_string limits even when run on the same Cromwell?)

If we were truncating then the worry about different results would be correct, end of story. But what we could be saying now is "Failure. Your WDL engine isn't able to process this workflow because the file is too large. Try increasing resources or restructuring your WDL"... not really any different from "Failure. You didn't have enough memory to run that scatter so wide...", or indeed any other resource constraint.

This is just me thinking out loud, of course... but it does seem a lot more elegant to me than having a special case for file sizes written in the WDL spec while every other resource limit is implicit and left up to the engine.

vdauwera commented 7 years ago

I think I agree with Chris re: the limit should be configurable

geoffjentry commented 7 years ago

As long as it is pitched as a cromwell feature and not part of WDL, I agree.

kcibul commented 7 years ago

Geraldine, how does that align with workflows being portable? If a user takes advantage of that on one Cromwell, their workflow then dies on DNAnexus because it has a different setting (or implementation!).

What about having something in the spec about a minimum that must be supported? Therefore if users move above that they know their workflow might not be compliant with other servers. We can still have the knob of course in cromwell


geoffjentry commented 7 years ago

FWIW, while CWL has maximums, TES has required minimums like what you suggest.

I see both sides of it. Portability is helpful, but if someone has a workflow which only succeeds because their server has juiced limits and they don't realize that, it isn't super helpful.

vdauwera commented 7 years ago

I would support spec-mandated minimums, supplemented by knobs in Cromwell.

For the user experience, a key thing you can do is write really clear error messages. I.e., don't make it die with just "File was too big"; add a note about where to get more info and what can be done to get past this. @katevoss can help with this; she has strong feelings about microcopy, as I'm sure you know by now ;)

geoffjentry commented 7 years ago

To summarize: these will be the spec-mandated minimums. There will also be configuration parameters in Cromwell to tune these higher if one wants. Cromwell will attempt to check file size prior to reading it, or prior to pulling it across the network for cloud filesystems. Error messages should be very clear and run past @katevoss.

- read_bool() - 5 chars
- read_int() - 19 chars
- read_float() - 50 chars
- read_string() - 128K
- read_lines() - 128K
- read_json() - 128K
- read_[tsv|map|object]() - 1MB
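The "spec minimums plus Cromwell knobs" design might look something like the following in Cromwell's HOCON configuration. This is an illustrative sketch only; the key names and defaults here are hypothetical, not necessarily what was ultimately implemented:

```hocon
# Hypothetical config sketch (key names are illustrative, not Cromwell's
# actual schema). Each value may be raised above the spec-mandated minimum
# for this deployment, but never lowered below it.
system {
  input-read-limits {
    bool   = 5
    int    = 19
    float  = 50
    string = 131072   # 128 KiB
    lines  = 131072
    json   = 131072
    tsv    = 1048576  # 1 MiB
    map    = 1048576
    object = 1048576
  }
}
```

This keeps the portability floor in the spec (a workflow that stays under the minimums runs anywhere) while letting an operator raise the ceiling for a specific deployment.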

Horneth commented 7 years ago

read_json gets a 1K bonus ?

geoffjentry commented 7 years ago

It's all the extra braces.

Will fix when not on my phone