Closed geoffjentry closed 7 years ago
@geoffjentry One time I crashed like this (my WDL submission I guess.....did it crash cromwell?) in FC because the number of lines in the file was very big. I ended up re-working things so that read_lines("my_file.dat") was turned into an array of files.
This seems like an excellent idea -- do you have a suggestion for how big this should be? I'd vote for several-x the size of anything the joint calling pipeline does (which might be our biggest user of this?)
I talked to the greens a little bit today and I'm thinking the way to go about this is per-function limits.
read_bool()
- 5 chars ("false")
read_int()
- 19 chars (if I did my sleuthing & counting correctly for MAX_INT)
read_float()
- 50 chars (made up, but maxint plus a lot of decimal places)
read_string()
- 128K (because it's 2x what CWL gives you ;) )
Where it gets tricky is with the larger objects. I'm less certain what to do here:
read_lines()
- I'm thinking this should be the same as read_string()
read_json()
- Same? I think?
read_[tsv|map|object]()
- No idea but from talking to the greens there are certainly reasonable use cases which would be much larger than the above. It sounded like if we capped this at 1MB it'd easily cover everything they did. From a workbench safety-from-users perspective this seems like the class of functions most likely to be abused as well.
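The per-function limits proposed above could be sketched as a simple lookup plus a pre-read check. To be clear, everything here (names, structure, error text) is illustrative and not any real Cromwell code; only the byte counts come from this thread:

```python
# Hypothetical sketch of the proposed per-function read limits.
# Function names mirror the WDL stdlib; the limits are the numbers
# floated above, not part of any actual implementation.

READ_LIMITS = {
    "read_bool": 5,             # len("false")
    "read_int": 19,             # digits in 2**63 - 1 (64-bit max) == 19
    "read_float": 50,           # generous: max int plus many decimal places
    "read_string": 128 * 1024,  # 128K, 2x CWL's 64K
    "read_lines": 128 * 1024,
    "read_json": 128 * 1024,
    "read_tsv": 1024 * 1024,    # 1MB for the structured readers
    "read_map": 1024 * 1024,
    "read_object": 1024 * 1024,
}

def check_read_size(fn_name: str, file_size_bytes: int) -> None:
    """Fail fast if a read_X() call targets an oversized file."""
    limit = READ_LIMITS[fn_name]
    if file_size_bytes > limit:
        raise ValueError(
            f"{fn_name}(): input is {file_size_bytes} bytes, "
            f"which exceeds the {limit}-byte limit for this function"
        )
```

As a sanity check on the 19-char figure: the largest 64-bit signed integer, 9223372036854775807, is exactly 19 digits (a leading minus sign would add one more).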
Curious about the "should be in the WDL spec too" part. IMO this would be most obviously a Cromwell config file option
I could even see an argument that this ought to be backend specific. So what if I'm trying to do silly things locally...
But yes, absolutely we should have some protection against DOS-scale `read_string()`s in (e.g.) Firecloud
Disagree; the point of a WDL is to provide the same result. This could radically alter a result if it were configurable, and that's exactly the reason it is specified in CWL
@cjllanwarne note that what I just said in #1777 could be used against my disagreement although I think handling all the different bifurcations here would be trickier than the one over there.
At the end of the day the number one priority is that a WDL should always produce the exact same result regardless of where it is run (possibly excepting docker, and that's part of what rubs me the wrong way about runtime attrs). The number two priority should be that a user can't possibly take down a server w/ a rogue wdl.
After that anything can be tunable in terms of how a server achieves the first one whilst not allowing the second one.
Prioritized as Medium as this seems like low-hanging fruit (low effort, higher impact), rather than an impending need.
One key question is what should happen if the file is too large? Just silently continue? Error? Provide some form of feedback to the user? Emitting to the cromwell server log seems useless for most of our user personas.
Error... not reading the whole file probably will not produce the right behavior in the pipeline being run
@kcibul That's my take but to give an alternative, the motivating use case for this in CWL was that it allows someone to e.g. read a SAM header, which by definition wouldn't be the full file. IOW it is seen as feature, not bug, that one can read a partial file.
Although now that I look the CWL analogy doesn't hold up. They have a max-64K chunk attached to their File object for similar purposes as this (input into expressions) but not to enable further typing like we do.
Error it is.
Ideally size would be checked prior to doing anything.
For instance: if there's a 1T file sitting on GCS and we're trying to `read_bool()` on it, it'd be nice to know that we shouldn't even bother downloading the file.
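The "check size before doing anything" idea might look like the sketch below. The local case is a plain stat call; for GCS/S3 the analogous cheap operation would be an object-metadata (HEAD) request, which is elided here. The function names are hypothetical, not Cromwell's:

```python
import os

def size_without_reading(path: str) -> int:
    """Return a local file's size in bytes without reading its contents.

    For a cloud filesystem (GCS, S3), the equivalent would be a
    metadata/HEAD request -- no file content is transferred either way.
    """
    return os.stat(path).st_size

def guarded_read(path: str, limit: int) -> str:
    """Read a file only after confirming it is under `limit` bytes,
    so an oversized file fails before any download/read cost."""
    size = size_without_reading(path)
    if size > limit:
        raise ValueError(
            f"refusing to read {path}: {size} bytes exceeds limit of {limit}"
        )
    with open(path) as f:
        return f.read()
```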
Given that this is being considered an error (I'd prefer "Failure") rather than truncation, I'm going to try to push one more time the idea that this should be a configuration option rather than some hard-coded magic number in WDL (which will inevitably need to be raised at some point, and then will different WDL versions have different `read_string` limits even when run on the same Cromwell?)
If we were truncating then the worry about different results would be correct, end of story. But what we could be saying now is "Failure. Your WDL engine isn't able to process this workflow because the file is too large. Try increasing resources or restructuring your WDL"... not really any different from "Failure. You didn't have enough memory to run that scatter so wide...", or indeed any other resource constraint.
This is just me TOL of course... but it does seem a lot more elegant to me than having a special case for file sizes written in the WDL spec but every other resource limit being implicit and left up to the engine.
I think I agree with Chris re: the limit should be configurable
As long as it is pitched as a cromwell feature and not part of WDL, I agree.
Geraldine, how does that align with workflows being portable? If a user takes advantage of that on one Cromwell, and then it dies in DNAnexus because they have a different setting (or implementation!)
What about having something in the spec about a minimum that must be supported? Then if users move above that, they know their workflow might not be compliant with other servers. We can still have the knob in Cromwell, of course
Fwiw, while CWL has maximums, TES has required minimums like what you suggest.
I see both sides on it. Portability is helpful, but if someone has a workflow which only succeeds because their server has juiced limits and they don't realize that, it isn't super helpful
I would support spec-mandated minimums, supplemented by knobs in Cromwell.
For the user experience, a key thing you can do is write really clear error messages. Ie don't make it die with just "File was too big"; add a note in there about where to get more info/what can be done to get past this. @katevoss can help with this; she has strong feelings about microcopy as I'm sure you know by now ;)
To summarize, these will be the spec-mandated minimums. There'll also be configuration parameters in Cromwell to tune these higher if one wants. Cromwell will attempt to check file size prior to reading it or pulling it across the network for cloud filesystems. Error messages should be very clear and run past @katevoss
read_bool()
- 5 chars
read_int()
- 19 chars
read_float()
- 50 chars
read_string()
- 128K
read_lines()
- 128K
read_json()
- 128K
read_[tsv|map|object]()
- 1MB
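If the Cromwell-side knobs were exposed via configuration, they might look something like the HOCON fragment below (Cromwell configs are HOCON). The key names and structure here are an illustrative assumption, not a confirmed Cromwell option:

```hocon
# Illustrative sketch only -- key names are assumptions, not confirmed config.
system {
  input-read-limits {
    bool   = 5        # bytes; enough for "false"
    int    = 19
    float  = 50
    string = 131072   # 128K
    lines  = 131072
    json   = 131072
    tsv    = 1048576  # 1MB
    map    = 1048576
    object = 1048576
  }
}
```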
`read_json` gets a 1K bonus?
It's all the extra braces.
Will fix when not on my phone
Finalized Ticket
For the `read_X()` functions, limit the size of data which can be read in. Check the file size prior to attempting the read, as reading can also incur network charges. Any attempt to read an oversized file will result in a failure. See notes from Geraldine down below about how to phrase the error.
read_bool()
- 5 chars
read_int()
- 19 chars
read_float()
- 50 chars
read_string()
- 128K
read_lines()
- 128K
read_json()
- 128K
read_[tsv|map|object]()
- 1MB
Original text for posterity
Note to @kcibul - this is both a request for the WDL spec and Cromwell's implementation, not just the latter.
Our read_X() functions (e.g. read_int, read_string, read_lines) have a flaw in that they'll dutifully read in the entire file. The problem with this is that a malicious and/or less than careful user could take down Cromwell or another WDL implementing engine (unless it was written in Erlang!) by reading in an enormous file. For instance a careless user might
`read_string(SomeEnormousBam)`. The point of these functions is more of a convenience; if users are trying to sling around huge chunks of data, they should be passing files around. In particular, things like `read_boolean()` are particularly egregious, as it will only interpret `true` or `false`. Similarly, it seems unlikely that someone would have a valid use case to read in the first 9 billion digits of Pi into a `Float`. I propose that we place a cap on how much data we will read to some reasonable amount (e.g. CWL uses 64 KiB, which seems a little excessively small to me).