ga4gh / task-execution-schemas

Apache License 2.0
80 stars 28 forks source link

TaskParameter.contents typed as bytes #70

Closed buchanae closed 6 years ago

buchanae commented 7 years ago

Requiring UTF-8 input content seems restrictive. Bytes is less restrictive.

adamstruck commented 7 years ago

Making this field bytes will render this field unreadable in API responses. If we expect the task message to be human readable then I think this should remain a string.

buchanae commented 7 years ago

Also, while protobuf is used to document the spec, some think that TES should not require protobuf nor gRPC, and the lowest common denominator should be HTTP/JSON. In that case, using a "bytes" type might not work. Worst case, the user would need to encode/decode bytes into a string.

buchanae commented 7 years ago

Also, while protobuf is used to document the spec, some think that TES should not require protobuf nor gRPC, and the lowest common denominator should be HTTP/JSON. In that case, using a "bytes" type might not work. Worst case, the user would need to encode/decode bytes into a string.

Nevermind. Protobuf uses base64 encoding for mapping bytes to JSON. Defining the "contents" type as protobuf bytes still makes sense.

https://developers.google.com/protocol-buffers/docs/proto3#json

briandoconnor commented 6 years ago

Why is UTF-8 restrictive? Is that because you think people will need to post binary data blobs to TES rather than commands?

buchanae commented 6 years ago

I filed this in response to feedback from Aaron Kemp. Not sure what his github account is. I actually haven't found a need for this yet. I think probably the motivation is that bytes is the most flexible option and allows all possible uses (such as a binary blob).

On the other hand, utf-8 covers most cases and handles the case the common case where you'd like to view that string in a dashboard (bytes could be difficult to decode?) Also, in the case where binary is needed, using a file instead of this field would be the workaround.

I've pinged Aaron to see if he wants to make the argument here.

geoffjentry commented 6 years ago

Pinging @writeonlymemory

kemp-google commented 6 years ago

@geoffjentry @buchanae this is my Google github handle :)

As for bytes vs string: if you want to use a string, this should probably be called something like 'plain_text' not 'content'. 'content' implies to me I can put binary data (eg, a BAM index, a gzip archive, etc) here, but if implementations are treating it as UTF8 strings, it will be unfortunate.

Up to you to decide whether the ability to include arbitrary file content outweighs a desire to make the message human readable.

buchanae commented 6 years ago

In short, I think binary data in this field is an edge case which isn't worth the tradeoff in dev. effort and complexity.


The original use case for content was as a shortcut for simple scripts (bash, python, etc). It isn't good for arbitrary file transfer. Personally, I've always argued against using this field at all.

Switching to bytes would take more work to handle and render as human-readable, and it would make creating a task message more complex in the common case (for example, the creator would need to base64 encode their string). I agree it'd be more flexible, perhaps even more "correct", but I don't think the tradeoff is worth it.

If someone wants to put binary content here, they could still accomplish it by having their client encode the data as base64 and have their executor decode it to binary. So, at least it's not impossible.