ga4gh / tool-registry-service-schemas

APIs for discovering genomics tools, their metadata and their containers
Apache License 2.0

TRS should be able to serve binary files #235

Open · coverbeck opened this issue 1 year ago

coverbeck commented 1 year ago

A workflow could have binary files, such as compiled code, images, etc.

TRS does not have a way to serve binaries; FileWrapper is returned as a JSON object with a string content field, so binary data would need to be escaped or encoded to fit into it.
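For illustration, here is a minimal sketch of the limitation, assuming a hypothetical TRS server at `BASE`; the base64 round-trip at the end is purely illustrative and not something the current spec defines:

```python
import base64
import json
from urllib.request import urlopen

BASE = "https://trs.example.org/ga4gh/trs/v2"  # hypothetical TRS server

# Text descriptors work fine today: FileWrapper carries the source as a
# plain JSON string in its "content" field.
url = f"{BASE}/tools/my-tool/versions/1.0/WDL/descriptor"
wrapper = json.load(urlopen(url))
print(wrapper["content"][:80])

# A binary file (compiled code, an image, ...) would have to be escaped or
# encoded to survive inside that JSON string, e.g. base64 -- but the spec
# does not say how a client would know to decode it.
binary_payload = b"\x7fELF\x02\x01\x01...compiled-something"
hypothetical_wrapper = {
    "content": base64.b64encode(binary_payload).decode("ascii"),
    "url": None,
    "checksum": [],
}
assert base64.b64decode(hypothetical_wrapper["content"]) == binary_payload
```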

Some brainstorming:

Issue is synchronized with Jira story TRS-64 (project: Zzz-ARCHIVE GA4GH tool-registry-service).

uniqueg commented 1 year ago

Good point. But I would like to point out that the current specs are at least consistent in that binary files appear to be deliberately unsupported:

So with this in mind, let me address your points:

What to do if one of the PLAIN types, e.g., PLAIN_WDL, is requested? An error response, or return the data with the content type set accordingly?

I may be wrong, but I know of no workflow engine that supports executing workflows described in binary format for any of the currently enumerated DescriptorTypes. And even if one existed, that would probably constitute highly engine-specific behavior. So even if we were able to pull such binaries, how could we be sure they are in a format that a given engine supports? Shouldn't TRS then specify the binarization mechanism (e.g., a compression type) to be useful? Of course, the case may be different for other workflow types that are currently not explicitly supported by TRS (i.e., not enumerated in the DescriptorType schema).

A different idea is that the only way to fetch binary data is by fetching the workflow as a bundle. In practice, I think most clients should be using the zip endpoint, but it does seem incomplete to only be able to fetch and know about binaries this way.

While I generally agree that there is an inconsistency here, the inability to fetch individual files is not restricted to binary files, as described above. No file of ToolFile.file_type OTHER can currently be retrieved individually, outside of the GET /tools/{id}/versions/{version_id}/{type}/files endpoint with content type application/zip.

So, personally, I think we should not worry about supporting binary files unless we have a real-world use case that demonstrates a strong need for binary descriptor, test, or container recipe files. Otherwise, the route of fetching a ZIP archive of all files seems a perfectly acceptable way of dealing with files of any type, including binary, to me. Perhaps we could make it clearer, though, both in the documentation and the specs, that clients are expected to use this mechanism if they want to fetch a workflow for actual execution, and that the other endpoints are designed more for quick introspection of workflow descriptors, container recipes, etc.
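To make that concrete, a minimal client-side sketch of the ZIP route (the server URL and tool/version IDs are invented; the content negotiation via Accept: application/zip is the mechanism referred to above):

```python
import io
import zipfile
from urllib.request import Request, urlopen

BASE = "https://trs.example.org/ga4gh/trs/v2"  # hypothetical TRS server

# Request the whole version as a ZIP archive from the files endpoint.
req = Request(
    f"{BASE}/tools/my-tool/versions/1.0/WDL/files",
    headers={"Accept": "application/zip"},
)
archive = zipfile.ZipFile(io.BytesIO(urlopen(req).read()))

# Descriptors, test files, and any binary helpers arrive together as archive
# entries, so no JSON escaping is needed for the binary ones.
for name in archive.namelist():
    print(name)
archive.extractall("workflow")
```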

denis-yuen commented 1 year ago

I may be wrong, but I know of no workflow engine that supports executing workflows described in binary format for any of the currently enumerated DescriptorTypes. And even if one existed, that would probably constitute highly engine-specific behavior.

We're pondering Nextflow, which allows a workflow to ship custom scripts and tools that may be binaries: https://www.nextflow.io/docs/latest/faq.html#how-do-i-invoke-custom-scripts-and-tools

Otherwise, the route of fetching a ZIP archive of all files seems a perfectly acceptable way of dealing with files of any type, including binary, to me. Perhaps we could make it clearer, though, both in the documentation and the specs, that clients are expected to use this mechanism if they want to fetch a workflow for actual execution, and that the other endpoints are designed more for quick introspection of workflow descriptors, container recipes, etc.

That said, the ZIP archive plus improved documentation route makes sense to me.

uniqueg commented 1 year ago

But this is not about descriptor files, @denis-yuen. Surely workflows in the wild rely on scripts or executables being shipped alongside the workflow in the way described in the Nextflow documentation. A similar mechanism exists for Snakemake, and probably for other workflow engines, including support for executing these executables in the indicated containers. So by all means, if you fetch a workflow and expect to be able to run it, you will need to make sure to fetch files of type OTHER as well.

By the way, if we go the documentation improvement route, we might use the opportunity to specify more clearly that files of type TEST_FILE can be JSON only (unless we want to extend this to include YAML as well). Otherwise people may think that, e.g., input files for tests (arguably also test files), which may very well be in binary format, should be TEST_FILEs, too.
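To make the suggested clarification concrete, here is a hypothetical files-endpoint listing (the tool and paths are invented) showing where each kind of file would land under that reading:

```python
# Hypothetical JSON listing from GET .../tools/{id}/versions/{version_id}/{type}/files.
# Under the clarified wording: TEST_FILE would mean JSON (or possibly YAML)
# parameter files only, while binary test *inputs* and shipped executables
# would be OTHER (and hence only reachable via the ZIP archive).
tool_files = [
    {"path": "main.nf",           "file_type": "PRIMARY_DESCRIPTOR"},
    {"path": "modules/align.nf",  "file_type": "SECONDARY_DESCRIPTOR"},
    {"path": "tests/params.json", "file_type": "TEST_FILE"},   # JSON parameters
    {"path": "tests/reads.bam",   "file_type": "OTHER"},       # binary test input
    {"path": "bin/summarize",     "file_type": "OTHER"},       # compiled helper
    {"path": "Dockerfile",        "file_type": "CONTAINERFILE"},
]

for f in tool_files:
    if f["file_type"] == "TEST_FILE":
        assert f["path"].endswith((".json", ".yaml", ".yml")), f["path"]
```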

coverbeck commented 1 year ago

Thanks for the thoughtful response @uniqueg .

There is no endpoint to allow fetching individual files of ToolFile.file_type OTHER

Ah, good point, I hadn't paid attention and wasn't aware of that. That does simplify things.

So then, I'm good with fetching via archive only and improving the doc.

A digression that this discussion about archives triggers for me: I suspect file permissions may be important in some cases for running the workflow, and we may want to make tar an additional format option to handle that.
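A quick sketch of why permissions matter here, assuming a workflow that ships an executable helper script: with Python's standard library, extracting a tar archive restores the mode bits, while extracting a zip does not, so the executable bit of the helper can be lost.

```python
import os
import stat
import tarfile
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
script = os.path.join(workdir, "run_step.sh")
with open(script, "w") as fh:
    fh.write("#!/bin/sh\necho hello\n")
os.chmod(script, 0o755)  # executable helper shipped with the workflow

# Pack the same file into both archive formats.
with tarfile.open(os.path.join(workdir, "wf.tar"), "w") as tar:
    tar.add(script, arcname="run_step.sh")
with zipfile.ZipFile(os.path.join(workdir, "wf.zip"), "w") as zf:
    zf.write(script, arcname="run_step.sh")

# Extract each archive and compare the resulting permissions.
tar_out, zip_out = os.path.join(workdir, "t"), os.path.join(workdir, "z")
with tarfile.open(os.path.join(workdir, "wf.tar")) as tar:
    tar.extractall(tar_out)
with zipfile.ZipFile(os.path.join(workdir, "wf.zip")) as zf:
    zf.extractall(zip_out)

print(stat.filemode(os.stat(os.path.join(tar_out, "run_step.sh")).st_mode))  # -rwxr-xr-x
print(stat.filemode(os.stat(os.path.join(zip_out, "run_step.sh")).st_mode))  # typically -rw-r--r--
```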

denis-yuen commented 1 year ago

A digression that this discussion about archives triggers for me: I suspect file permissions may be important in some cases for running the workflow, and we may want to make tar an additional format option to handle that.

:thinking: