Open mwalker174 opened 11 months ago
Thank you, @mwalker174, for putting this together; this is extremely useful.
A few thoughts:
index is needed
Is it safe to interpret this as meaning that the tool will fail if the index is not provided?
In the task-level inputs section, is it accurate to assume that an optional index improves performance when provided, and that the tool still works without it, only more slowly? If so, making the index required could be a reasonable design decision, since it guarantees the input needed for best performance is always present.
- If the file is an intermediate, meaning it is produced within the pipeline either within the workflow or by a previous module, then the index input should be omitted, and the presence of the index should be assumed to be ${FILE_URI}.tbi. This will cut down substantially on the number of inputs to keep track of, and shouldn't cause issues for users in the vast majority of use cases. There are less common scenarios where this could cause errors, such as a user manually copying files and forgetting to include indexes, but this is a power-user case and really not needed for running the pipeline as prescribed.
I think here we can run into a "perspective" issue: A file is "intermediate" if generated in a parent workflow and passed to a child workflow and not referenced in the output of the parent workflow; however, this file is not "intermediate" if the child workflow is executed independently from the parent workflow.
Many file types, such as VCF, CRAM, BED, and evidence data files, have companion index files that are sometimes required for fast retrieval over specific genomic intervals.
Overview:
Some WDLs require that the file be passed in explicitly. For example:
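A minimal sketch of the explicit style might look like the following. The task name, input names, and the `bcftools` command are illustrative assumptions, not taken from an existing WDL in this repository.

```wdl
version 1.0

# Hypothetical task: the index is a declared input, so the backend
# localizes it together with the VCF.
task SubsetVcf {
  input {
    File vcf
    File vcf_index   # explicit companion index, e.g. the .tbi of a bgzipped VCF
    String region    # e.g. "chr1:1-1000000"
    String docker    # assumed image with bcftools installed
  }
  command <<<
    set -euo pipefail
    # Region queries (-r) need the index; bcftools looks for it next to the VCF.
    bcftools view -r "~{region}" -O z -o subset.vcf.gz "~{vcf}"
  >>>
  output {
    File subset_vcf = "subset.vcf.gz"
  }
  runtime {
    docker: docker
  }
}
```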
In some cases, its path is inferred from the file itself. For example,
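As a sketch of the inferred style, assuming the index sits next to the file under the conventional `.tbi` name (the workflow and variable names here are hypothetical):

```wdl
version 1.0

workflow InferIndexExample {
  input {
    File vcf
  }

  # No index input is declared; its path is derived from the VCF path.
  # This only works if the index really exists at that location.
  File vcf_index = vcf + ".tbi"

  output {
    File inferred_index = vcf_index
  }
}
```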
And in other cases, the index file may not be declared at all.
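One situation where omitting the index is harmless is a task that reads the file sequentially. The sketch below is an assumed example (hypothetical task name and image), not a task from the pipeline:

```wdl
version 1.0

task CountRecords {
  input {
    File vcf         # gzipped or bgzipped VCF; no index input is declared
    String docker    # assumed image providing zcat and awk
  }
  command <<<
    set -euo pipefail
    # zcat streams the whole file, so no index lookup takes place.
    zcat "~{vcf}" | awk '!/^#/ {n++} END {print n+0}' > count.txt
  >>>
  output {
    Int n_records = read_int("count.txt")
  }
  runtime {
    docker: docker
  }
}
```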
Considerations:
There are several different scenarios, depending on a number of factors, that affect the decision about how to handle the indexes:
- [...] zcat does not)
- [...] ${FILE_URI}.tbi, and therefore an explicit declaration of the index is not needed.
Other design factors to consider:
- [...] cp, mv, or ln) unless the particular tool being used supports passing in the index path explicitly (most don't). Usually we use cp or ln, as mv can create problems on shared filesystems.
Proposal:
For the sake of development and UX, we should enforce consistent conventions regarding companion index files in the WDLs.
I'd propose the following rules to balance the above issues:
Task-level inputs:
- [...] localization_optional=true will still localize files if unsupported by the backend.
Workflow-level inputs:
- [...] File? file_index) and documentation on its usage must be present.
- If the file is an intermediate, meaning it is produced within the pipeline either within the workflow or by a previous module, then the index input should be omitted, and the presence of the index should be assumed to be ${FILE_URI}.tbi. This will cut down substantially on the number of inputs to keep track of, and shouldn't cause issues for users in the vast majority of use cases. There are less common scenarios where this could cause errors, such as a user manually copying files and forgetting to include indexes, but this is a power-user case and really not needed for running the pipeline as prescribed.
Task calls:
Note that explicitly declared indexes, from the workflow inputs or a previous task call, should be used instead when possible.
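One way to express this preference in a task call is sketched below, assuming a workflow-level optional index and a fallback to the conventional path. The workflow, task, and input names are hypothetical, and `tabix` is only an example of an index-dependent tool.

```wdl
version 1.0

workflow QueryWithIndex {
  input {
    File vcf
    File? vcf_index   # optional workflow-level index
    String region
    String docker     # assumed image with tabix installed
  }

  # Prefer the explicitly declared index; otherwise assume <vcf>.tbi.
  File resolved_index = select_first([vcf_index, vcf + ".tbi"])

  call QueryRegion {
    input:
      vcf = vcf,
      vcf_index = resolved_index,
      region = region,
      docker = docker
  }

  output {
    File region_vcf = QueryRegion.region_vcf
  }
}

task QueryRegion {
  input {
    File vcf
    File vcf_index
    String region
    String docker
  }
  command <<<
    set -euo pipefail
    # tabix looks for the index next to the VCF (<vcf>.tbi).
    tabix -h "~{vcf}" "~{region}" > region.vcf
  >>>
  output {
    File region_vcf = "region.vcf"
  }
  runtime {
    docker: docker
  }
}
```

If an explicitly provided index lives in a different directory from its file, the task may also need to link it next to the file before running the tool.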
Outputs:
Resource storage:
- [...] gs://gcp-public-data--broad-references.
- [...] inputs/values/resources_hg38.json to facilitate resource mirroring.
Please feel free to discuss so we can finalize these conventions.