Open coverbeck opened 1 year ago
Good point. But I would like to point out that the current specs are at least consistent, in that binary files appear not to be supported on purpose:
- `ToolFile.file_type` choices, all except for `OTHER`, should generally (always?) be in human-readable format for the supported workflow types.
- There is no endpoint to allow fetching individual files of `ToolFile.file_type` `OTHER`, whereas such endpoints exist for the other types, specifically:
  - `PRIMARY_DESCRIPTOR`: `GET /tools/{id}/versions/{version_id}/{type}/descriptor`
  - `SECONDARY_DESCRIPTOR`: `GET /tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path}`
  - `TEST_FILE`: `GET /tools/{id}/versions/{version_id}/{type}/tests`
  - `CONTAINERFILE`: `GET /tools/{id}/versions/{version_id}/containerfile`
- The `FileWrapper` schema is only used in the endpoints mentioned above, i.e., there is no such annotation to be retrieved for files of type `OTHER`, binary or not.
- The `FileWrapper` schema's description seems to suggest that it is only defined for `ToolFile.file_type`s other than `OTHER`:
  > A file provides content for one of - A tool descriptor is a metadata document that describes one or more tools. - A tool document that describes how to test with one or more sample test JSON. - A containerfile is a document that describes how to build a particular container image. Examples include Dockerfiles for creating Docker images and Singularity recipes for Singularity images
- `PLAIN_` types are only described for descriptors, in schema `DescriptorTypeWithPlain`, and `text/plain` content types are only (though somewhat inconsistently, see https://github.com/ga4gh/tool-registry-service-schemas/issues/162 & https://github.com/ga4gh/tool-registry-service-schemas/issues/229) defined for responses in endpoints that return human-readable files. (Personally, though, I think this is quite a messy area of the specs anyway, and I would strongly support getting rid of the `PLAIN_` types altogether, as argued here: https://github.com/ga4gh/tool-registry-service-schemas/issues/210)
- The description of the `GET /tools/{id}/versions/{version_id}/{type}/files` endpoint to retrieve all files subtly implies that the use of the endpoint with the `application/json` format is meant for retrieving workflow descriptors only:
  > Get a list of objects that contain the relative path and file type. The descriptors are intended for use with the /tools/{id}/versions/{version_id}/{type}/descriptor/{relative_path} endpoint. Returns a zip file of all files when format=zip is specified.
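To make the JSON form of that endpoint concrete: it returns a list of `ToolFile` objects, each with a relative `path` and a `file_type`. A client that only cares about descriptors might filter the list like this (a hedged sketch; the sample response below is made up, only the two field names come from the `ToolFile` schema):

```python
import json

# Hypothetical JSON body from GET /tools/{id}/versions/{version_id}/{type}/files
# (fields per the ToolFile schema: "path" and "file_type").
response = json.loads("""[
    {"path": "main.nf", "file_type": "PRIMARY_DESCRIPTOR"},
    {"path": "bin/helper.py", "file_type": "OTHER"},
    {"path": "test.json", "file_type": "TEST_FILE"}
]""")

# Only descriptor paths can be fed back into the per-file descriptor endpoint;
# the OTHER entry has no individual retrieval endpoint at all.
descriptor_paths = [f["path"] for f in response
                    if f["file_type"].endswith("_DESCRIPTOR")]
print(descriptor_paths)  # -> ['main.nf']
```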
So with this in mind, let me address your points:
> What to do if one of the PLAIN types, e.g., `PLAIN_WDL` is requested? An error response, or return the data with the content-type set accordingly?
I may be wrong, but I know of no workflow engine supporting the execution of workflows defined in one of the currently enumerated `DescriptorType`s that allows executing workflows described in binary format. And even if one did, that would probably constitute highly engine-specific behavior. So if we were able to pull such binaries, how could we be sure they are in a format that a given engine supports? Shouldn't TRS then specify the binarization mechanism (e.g., a compression type) to be useful? Of course, the case may be different for other workflow types that are currently not explicitly supported by TRS (i.e., not enumerated in the `DescriptorType` schema).
> A different idea is that the only way to fetch binary data is by fetching the workflow as a bundle. In practice, I think most clients should be using the zip endpoint, but it does seem incomplete to only be able to fetch and know about binaries this way.
While I generally agree that there is an inconsistency here, the inability to fetch individual files is not restricted to binary files, as described above. No file of `ToolFile.file_type` `OTHER` can currently be retrieved individually, outside of the `GET /tools/{id}/versions/{version_id}/{type}/files` endpoint with content type `application/zip`.
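That retrieval asymmetry can be sketched as a small helper that maps a `ToolFile.file_type` to the endpoint serving it individually (a hypothetical helper, not part of any TRS client library; `OTHER` deliberately yields `None`):

```python
# Sketch: map a ToolFile.file_type to the TRS endpoint that can fetch the
# file individually. OTHER maps to nothing -- no such endpoint exists.
def file_endpoint(tool_id, version_id, desc_type, file_type, relative_path=None):
    base = f"/tools/{tool_id}/versions/{version_id}/{desc_type}"
    endpoints = {
        "PRIMARY_DESCRIPTOR": f"{base}/descriptor",
        "SECONDARY_DESCRIPTOR": f"{base}/descriptor/{relative_path}",
        "TEST_FILE": f"{base}/tests",
        "CONTAINERFILE": f"/tools/{tool_id}/versions/{version_id}/containerfile",
    }
    return endpoints.get(file_type)  # None for OTHER

print(file_endpoint("t1", "v1", "CWL", "PRIMARY_DESCRIPTOR"))
# -> /tools/t1/versions/v1/CWL/descriptor
print(file_endpoint("t1", "v1", "CWL", "OTHER"))
# -> None
```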
So, personally, I think we should not worry about supporting binary files unless we have a real-world use case that demonstrates a strong need for binary descriptor, test or container recipe files. Otherwise the route of fetching a ZIP archive of all files seems a perfectly acceptable way of dealing with files of any type, including binary, to me. Perhaps we could make it clearer though, both in the documentation and the specs, that clients are expected to make use of this mechanism if they want to fetch a workflow for actual execution, and that the other endpoints are more designed for allowing quick introspection of workflow descriptors, container recipes etc.
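The "fetch everything as one archive" route could look like the sketch below: the in-memory archive stands in for the body of a `GET .../{type}/files?format=zip` response, so no real registry or HTTP request is involved (file names are made up):

```python
import io
import zipfile

# Stand-in for the response body of GET .../{type}/files?format=zip
# (content type application/zip); built in memory instead of fetched.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("main.nf", "workflow { }")
    zf.writestr("bin/helper", b"\x7fELF...binary...")  # a file_type OTHER binary

# A client that wants to *execute* the workflow unpacks everything at once,
# binaries included -- no per-file requests needed.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    names = sorted(zf.namelist())
print(names)  # -> ['bin/helper', 'main.nf']
```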
> I may be wrong, but I know of no workflow engine supporting the execution of workflows defined in one of the currently enumerated `DescriptorType`s that allows executing workflows described in binary format. And even if one did, that would probably constitute highly engine-specific behavior.
We're pondering Nextflow, which has the option of binary files: https://www.nextflow.io/docs/latest/faq.html#how-do-i-invoke-custom-scripts-and-tools
> Otherwise the route of fetching a ZIP archive of all files seems a perfectly acceptable way of dealing with files of any type, including binary, to me. Perhaps we could make it clearer though, both in the documentation and the specs, that clients are expected to make use of this mechanism if they want to fetch a workflow for actual execution, and that the other endpoints are more designed for allowing quick introspection of workflow descriptors, container recipes etc.
That said, the zip archive/improving documentation route makes sense to me.
But this is not about descriptor files @denis-yuen. Surely workflows in the wild rely on scripts or executables being shipped with the workflows in the way described in the Nextflow documentation. A similar mechanism exists for Snakemake as well, and probably other workflow engines, including support for executing these executables in the indicated containers. So by all means, if you fetch a workflow and expect to be able to run it, you will need to make sure to fetch files of type `OTHER` as well.
By the way, if we go the doc improvement route, we might use the opportunity to specify more clearly that files of type `TEST_FILE` can be JSON only (unless we want to extend this to include YAML as well). Otherwise people may think that, e.g., input files for tests (arguably also test files), which may very well be in binary format, should be `TEST_FILE`s, too.
Thanks for the thoughtful response @uniqueg .
> There is no endpoint to allow fetching individual files of `ToolFile.file_type` `OTHER`
Ah, good point, I hadn't paid attention and wasn't aware of that. That does simplify things.
So then, I'm good with fetching via archive only and improving the doc.
A digression that this discussion about archives triggers for me... I suspect file permissions may be important in some cases for running the workflow, and we may want to make `tar` an additional format option to handle that.
> A digression that this discussion about archives triggers for me... I suspect file permissions may be important in some cases for running the workflow and we may want to make `tar` an additional format option to handle that.

:thinking:
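The permissions concern is easy to demonstrate: tar entries carry a Unix mode that Python's `tarfile` restores on extraction, whereas `zipfile.extractall` ignores the mode stored in the archive, so an executable bit on a bundled script is lost (a minimal sketch, assuming a POSIX filesystem; the `run.sh` script is made up):

```python
import io
import os
import stat
import tarfile
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Create an executable helper script, like one a workflow might ship.
    script = os.path.join(tmp, "run.sh")
    with open(script, "w") as f:
        f.write("#!/bin/sh\necho hi\n")
    os.chmod(script, 0o755)

    # Pack the same file into a tar and a zip archive, in memory.
    tar_buf, zip_buf = io.BytesIO(), io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as t:
        t.add(script, arcname="run.sh")
    with zipfile.ZipFile(zip_buf, "w") as z:
        z.write(script, arcname="run.sh")

    # Extract both and compare the resulting file modes.
    tar_out, zip_out = os.path.join(tmp, "from_tar"), os.path.join(tmp, "from_zip")
    with tarfile.open(fileobj=io.BytesIO(tar_buf.getvalue())) as t:
        t.extractall(tar_out)
    with zipfile.ZipFile(io.BytesIO(zip_buf.getvalue())) as z:
        z.extractall(zip_out)

    tar_mode = stat.S_IMODE(os.stat(os.path.join(tar_out, "run.sh")).st_mode)
    zip_mode = stat.S_IMODE(os.stat(os.path.join(zip_out, "run.sh")).st_mode)
    print(oct(tar_mode), oct(zip_mode))  # execute bit survives only in the tar copy
```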
A workflow could have binary files, such as compiled code, images, etc.
TRS does not have a way to serve up binaries; the `FileWrapper` object returns a JSON object with a string `content` field -- if the data were binary, it would need to be escaped.
Some brainstorming:
- A flag indicating whether the `content` field is encoded or not, so the client will know whether to decode it.
- What to do if one of the `PLAIN_` types, e.g., `PLAIN_WDL`, is requested? An error response, or return the data with the content-type set accordingly?
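The brainstormed encoding flag could be sketched like this (the `encoded` field name is hypothetical and not part of the current `FileWrapper` schema; base64 is just one obvious choice of encoding):

```python
import base64
import json

# Hypothetical FileWrapper-like object with an added flag telling the
# client whether `content` is base64-encoded binary data or plain text.
binary = b"\x89PNG\r\n\x1a\n...image bytes..."
wrapper = {
    "content": base64.b64encode(binary).decode("ascii"),
    "encoded": True,  # hypothetical field, not in the current schema
}
payload = json.dumps(wrapper)  # now safe to ship inside a JSON string

# The client inspects the flag to decide whether to decode.
received = json.loads(payload)
data = (base64.b64decode(received["content"])
        if received["encoded"] else received["content"].encode())
assert data == binary  # lossless round trip
```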