ga4gh / tool-registry-service-schemas

APIs for discovering genomics tools, their metadata and their containers
Apache License 2.0

TRS should be able to serve binary files #235

Open · coverbeck opened this issue 1 year ago

coverbeck commented 1 year ago

A workflow could have binary files, such as compiled code, images, etc.

TRS does not have a way to serve binaries; FileWrapper is returned as a JSON object with a string content field, so binary data would need to be escaped or encoded to fit into it.
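For illustration, here is a minimal sketch of the limitation, assuming a hypothetical TRS server at `BASE`; the base64 round-trip at the end is purely illustrative and not something the current spec defines:

```python
import base64
import json
from urllib.request import urlopen

BASE = "https://trs.example.org/ga4gh/trs/v2"  # hypothetical TRS server

# Text descriptors work fine today: FileWrapper carries the source as a
# plain JSON string in its "content" field.
url = f"{BASE}/tools/my-tool/versions/1.0/WDL/descriptor"
wrapper = json.load(urlopen(url))
print(wrapper["content"][:80])

# A binary file (compiled code, an image, ...) would have to be escaped or
# encoded to survive inside that JSON string, e.g. base64 -- but the spec
# does not say how a client would know to decode it.
binary_payload = b"\x7fELF\x02\x01\x01...compiled-something"
hypothetical_wrapper = {
    "content": base64.b64encode(binary_payload).decode("ascii"),
    "url": None,
    "checksum": [],
}
assert base64.b64decode(hypothetical_wrapper["content"]) == binary_payload
```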

Some brainstorming:

Issue is synchronized with Jira story TRS-64 (project: Zzz-ARCHIVE GA4GH tool-registry-service).

uniqueg commented 1 year ago

Good point. But I would like to point out that the current specs are at least consistent in that binary files appear to be deliberately unsupported:

So with this in mind, let me address your points:

What to do if one of the PLAIN types, e.g., PLAIN_WDL, is requested? An error response, or return the data with the content type set accordingly?

I may be wrong, but I know of no workflow engine that supports executing workflows described in binary format for any of the currently enumerated DescriptorTypes. And even if one existed, that would probably constitute highly engine-specific behavior. So even if we were able to pull such binaries, how could we be sure they are in a format that a given engine supports? Shouldn't TRS then specify the binarization mechanism (e.g., a compression type) to be useful? Of course, the case may be different for other workflow types that are currently not explicitly supported by TRS (i.e., not enumerated in the DescriptorType schema).

A different idea is that the only way to fetch binary data is by fetching the workflow as a bundle. In practice, I think most clients should be using the zip endpoint, but it does seem incomplete to only be able to fetch and know about binaries this way.

While I generally agree that there is an inconsistency here, the inability to fetch individual files is not restricted to binary files, as described above. No file of ToolFile.file_type OTHER can currently be retrieved individually, outside of the GET /tools/{id}/versions/{version_id}/{type}/files endpoint with content type application/zip.

So, personally, I think we should not worry about supporting binary files unless we have a real-world use case that demonstrates a strong need for binary descriptor, test, or container recipe files. Otherwise, the route of fetching a ZIP archive of all files seems a perfectly acceptable way of dealing with files of any type, including binary, to me. Perhaps we could make it clearer, though, both in the documentation and the specs, that clients are expected to use this mechanism if they want to fetch a workflow for actual execution, and that the other endpoints are designed more for quick introspection of workflow descriptors, container recipes, etc.
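To make that concrete, a minimal client-side sketch of the ZIP route (the server URL and tool/version IDs are invented; the content negotiation via Accept: application/zip is the mechanism referred to above):

```python
import io
import zipfile
from urllib.request import Request, urlopen

BASE = "https://trs.example.org/ga4gh/trs/v2"  # hypothetical TRS server

# Request the whole version as a ZIP archive from the files endpoint.
req = Request(
    f"{BASE}/tools/my-tool/versions/1.0/WDL/files",
    headers={"Accept": "application/zip"},
)
archive = zipfile.ZipFile(io.BytesIO(urlopen(req).read()))

# Descriptors, test files, and any binary helpers arrive together as archive
# entries, so no JSON escaping is needed for the binary ones.
for name in archive.namelist():
    print(name)
archive.extractall("workflow")
```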

denis-yuen commented 1 year ago

I may be wrong, but I know of no workflow engine that supports executing workflows described in binary format for any of the currently enumerated DescriptorTypes. And even if one existed, that would probably constitute highly engine-specific behavior.

We're pondering Nextflow, which allows a workflow to ship custom scripts and tools that may be binaries: https://www.nextflow.io/docs/latest/faq.html#how-do-i-invoke-custom-scripts-and-tools

Otherwise, the route of fetching a ZIP archive of all files seems a perfectly acceptable way of dealing with files of any type, including binary, to me. Perhaps we could make it clearer, though, both in the documentation and the specs, that clients are expected to use this mechanism if they want to fetch a workflow for actual execution, and that the other endpoints are designed more for quick introspection of workflow descriptors, container recipes, etc.

That said, the ZIP archive plus improved documentation route makes sense to me.

uniqueg commented 1 year ago

But this is not about descriptor files, @denis-yuen. Surely workflows in the wild rely on scripts or executables being shipped alongside the workflow in the way described in the Nextflow documentation. A similar mechanism exists for Snakemake, and probably for other workflow engines, including support for executing these executables in the indicated containers. So by all means, if you fetch a workflow and expect to be able to run it, you will need to make sure to fetch files of type OTHER as well.

By the way, if we go the documentation improvement route, we might use the opportunity to specify more clearly that files of type TEST_FILE can be JSON only (unless we want to extend this to include YAML as well). Otherwise people may think that, e.g., input files for tests (arguably also test files), which may very well be in binary format, should be TEST_FILEs, too.
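To make the suggested clarification concrete, here is a hypothetical files-endpoint listing (the tool and paths are invented) showing where each kind of file would land under that reading:

```python
# Hypothetical JSON listing from GET .../tools/{id}/versions/{version_id}/{type}/files.
# Under the clarified wording: TEST_FILE would mean JSON (or possibly YAML)
# parameter files only, while binary test *inputs* and shipped executables
# would be OTHER (and hence only reachable via the ZIP archive).
tool_files = [
    {"path": "main.nf",           "file_type": "PRIMARY_DESCRIPTOR"},
    {"path": "modules/align.nf",  "file_type": "SECONDARY_DESCRIPTOR"},
    {"path": "tests/params.json", "file_type": "TEST_FILE"},   # JSON parameters
    {"path": "tests/reads.bam",   "file_type": "OTHER"},       # binary test input
    {"path": "bin/summarize",     "file_type": "OTHER"},       # compiled helper
    {"path": "Dockerfile",        "file_type": "CONTAINERFILE"},
]

for f in tool_files:
    if f["file_type"] == "TEST_FILE":
        assert f["path"].endswith((".json", ".yaml", ".yml")), f["path"]
```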

coverbeck commented 1 year ago

Thanks for the thoughtful response @uniqueg .

There is no endpoint to allow fetching individual files of ToolFile.file_type OTHER

Ah, good point, I hadn't paid attention and wasn't aware of that. That does simplify things.

So then, I'm good with fetching via archive only and improving the doc.

A digression that this discussion about archives triggers for me: I suspect file permissions may be important in some cases for running the workflow, and we may want to make tar an additional format option to handle that.
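A quick sketch of why permissions matter here, assuming a workflow that ships an executable helper script: with Python's standard library, extracting a tar archive restores the mode bits, while extracting a zip does not, so the executable bit of the helper can be lost.

```python
import os
import stat
import tarfile
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
script = os.path.join(workdir, "run_step.sh")
with open(script, "w") as fh:
    fh.write("#!/bin/sh\necho hello\n")
os.chmod(script, 0o755)  # executable helper shipped with the workflow

# Pack the same file into both archive formats.
with tarfile.open(os.path.join(workdir, "wf.tar"), "w") as tar:
    tar.add(script, arcname="run_step.sh")
with zipfile.ZipFile(os.path.join(workdir, "wf.zip"), "w") as zf:
    zf.write(script, arcname="run_step.sh")

# Extract each archive and compare the resulting permissions.
tar_out, zip_out = os.path.join(workdir, "t"), os.path.join(workdir, "z")
with tarfile.open(os.path.join(workdir, "wf.tar")) as tar:
    tar.extractall(tar_out)
with zipfile.ZipFile(os.path.join(workdir, "wf.zip")) as zf:
    zf.extractall(zip_out)

print(stat.filemode(os.stat(os.path.join(tar_out, "run_step.sh")).st_mode))  # -rwxr-xr-x
print(stat.filemode(os.stat(os.path.join(zip_out, "run_step.sh")).st_mode))  # typically -rw-r--r--
```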

denis-yuen commented 1 year ago

A digression that this discussion about archives triggers for me: I suspect file permissions may be important in some cases for running the workflow, and we may want to make tar an additional format option to handle that.

:thinking: