arrabito opened 5 days ago
Apart from the considerations in the previous comment, I find that the proposed implementation already has most of the desired features.
I also agree to separate the last step of the workflow, which does the data upload, into a 'post_process' function.
> Does it mean that metadata would not be used at all? Or just that they would not be used to link 2 steps within the same workflow? To put it differently, how is this implementation supposed to cover the use case of a workflow that has to process input data already present in the catalog (bkk)? Should one specify the input as a path rather than as an input_query?
After a discussion with @natthan-pigoux, I discovered my use case was too simple and that it would not work with a workflow that has to process input data already present in the catalog. So I am preparing an update that supports metadata :slightly_smiling_face:
The idea is the following: if no metadata is provided (e.g. for an MCSimulation), then the CWL input values are used to construct the output query. Example:

```
# parameters provided to the Dirac client
dirac transformation submit <cwl path with its input inside the document>

# json submitted to the Dirac router
{
    "task": <cwl content>,
    "description":
    {
        "type": "MCSimulation"
    }
}

# the Dirac router then relies on a `MCSimulationMetadata` class that knows
# how to construct an output query based on the input parameters
class MCSimulationMetadata(IMetadata):
    ...
    def get_output_query(self, output_name):
        if output_name == <MCSimulation output parameter name>:
            return <file catalog path>/<MCSimulation input value 1>/<MCSimulation input value 2>
```
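To make the idea concrete, here is a minimal runnable sketch of such a class. `IMetadata`, the parameter names (`particle`, `energy`, `sim_output`) and the catalog layout are hypothetical stand-ins for illustration, not the actual Dirac interfaces:

```python
from abc import ABC, abstractmethod


class IMetadata(ABC):
    """Hypothetical interface implemented once per transformation type."""

    @abstractmethod
    def get_output_query(self, output_name: str) -> str:
        """Return the file catalog path where outputs should be registered."""


class MCSimulationMetadata(IMetadata):
    """Builds the output query directly from the CWL input values."""

    def __init__(self, inputs: dict):
        # `inputs` stands for the CWL input values of the submitted document.
        self.inputs = inputs

    def get_output_query(self, output_name: str) -> str:
        if output_name == "sim_output":  # hypothetical output parameter name
            # Catalog path derived from two of the CWL input values.
            return "/catalog/MCSimulation/{particle}/{energy}".format(**self.inputs)
        raise KeyError(f"unknown output parameter: {output_name}")


meta = MCSimulationMetadata({"particle": "gamma", "energy": "1TeV"})
print(meta.get_output_query("sim_output"))  # -> /catalog/MCSimulation/gamma/1TeV
```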
If metadata is provided (e.g. for an MCProcessing step that consumes MCSimulation outputs), a metadata file is passed alongside the CWL document. Example:

```
# parameters provided to the Dirac client
dirac transformation submit <cwl path with its input inside the document> --metadata-path <metadata path>

# metadata content
group_size:
- input1: 10
query_params:
- <input param of MCSimulation 1>: <value1>
- <input param of MCSimulation 2>: <value2>
```
```
# json submitted to the Dirac router
{
    "task": <cwl content>,
    "description":
    {
        "type": "MCProcessing"
    },
    "metadata":
    {
        "group_size":
        {
            "input1": 10
        },
        "query_params":
        {
            "<input param of MCSimulation 1>": <value1>,
            "<input param of MCSimulation 2>": <value2>
        }
    }
}

# the Dirac router then relies on a `MCProcessingMetadata` class that knows
# how to construct an input query based on metadata and an output query based
# on the input parameters
class MCProcessingMetadata(IMetadata):
    ...
    def get_input_query(self, input_name):
        if input_name == <MCProcessing input parameter name>:
            return MCSimulationMetadata(<metadata.query_params>).get_output_query(input_name)

    def get_output_query(self, output_name):
        if output_name == <MCProcessing output parameter name>:
            return <file catalog path>/<metadata.query_params value 1>/<metadata.query_params value 2>
```
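A runnable sketch of this second case, showing how `MCProcessingMetadata` could derive its input query by delegating to the parent type's output query. The class layout mirrors the pseudo-code above, but the parameter names and the `metadata` dictionary shape are assumptions for illustration:

```python
class MCSimulationMetadata:
    """Hypothetical parent type: output query built from query parameters."""

    def __init__(self, query_params: dict):
        self.query_params = query_params

    def get_output_query(self, output_name: str) -> str:
        if output_name == "sim_output":  # hypothetical parameter name
            return "/catalog/MCSimulation/{particle}/{energy}".format(**self.query_params)
        raise KeyError(output_name)


class MCProcessingMetadata:
    """Hypothetical child type: input query delegated to the parent's output query."""

    def __init__(self, metadata: dict):
        # `metadata` is the parsed "metadata" field of the submitted payload.
        self.metadata = metadata

    def get_input_query(self, input_name: str) -> str:
        # The processing input is the simulation output: reuse the parent
        # class to rebuild the same catalog path from the query parameters.
        if input_name == "sim_output":
            parent = MCSimulationMetadata(self.metadata["query_params"])
            return parent.get_output_query(input_name)
        raise KeyError(input_name)

    def get_output_query(self, output_name: str) -> str:
        if output_name == "proc_output":
            return "/catalog/MCProcessing/{particle}/{energy}".format(
                **self.metadata["query_params"]
            )
        raise KeyError(output_name)


meta = MCProcessingMetadata(
    {"group_size": {"input1": 10},
     "query_params": {"particle": "gamma", "energy": "1TeV"}}
)
print(meta.get_input_query("sim_output"))    # -> /catalog/MCSimulation/gamma/1TeV
print(meta.get_output_query("proc_output"))  # -> /catalog/MCProcessing/gamma/1TeV
```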
Note: Dirac only supports linear workflows for now, whereas CWL is much more powerful and allows creating diamond workflows, for instance. So I try to support the case where a transformation (i) can get inputs from multiple parents, which means one input query and one group size per input parameter coming from a parent; and (ii) can generate multiple outputs, one (or more) per child transformation, which means one output query per output parameter.
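Under that design, the metadata of a transformation with several parents would carry one group size and one set of query parameters per parent-fed input. The shape below is only an illustration of this note, with made-up input parameter names:

```python
# Hypothetical metadata for a transformation fed by two parent transformations.
metadata = {
    # one group size per input parameter coming from a parent
    "group_size": {"showers": 10, "calibration": 1},
    # one set of query parameters per parent-fed input parameter
    "query_params": {
        "showers": {"particle": "gamma", "energy": "1TeV"},
        "calibration": {"run": "2024A"},
    },
}

# One input query would be derived per entry; here we just list them.
for input_name in metadata["query_params"]:
    size = metadata["group_size"][input_name]
    print(f"{input_name}: group size {size}")
```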
I've looked at the implementation of the DIRAC production example in CWL.
Here below are some comments/questions about the inputs/outputs description.
In this CWL production example, the links between the inputs and outputs of the workflow steps are described using the CWL input/output features instead of input/output_query as in the current DIRAC. Does it mean that metadata would not be used at all? Or just that they would not be used to link 2 steps within the same workflow? To put it differently, how is this implementation supposed to cover the use case of a workflow that has to process input data already present in the catalog (bkk)? Should one specify the input as a path rather than as an input_query?

In my opinion it's important to keep the use of input_queries at least to specify the inputs of the 'parent-level' workflow step, in order to keep all the advantages of the data-driven behaviour of the TS (i.e. files dynamically added to the transformations as soon as they are available in the catalog). However, I think that the use of input/output_queries to link the steps of a workflow could eventually be abandoned in favour of CWL inputs/outputs.

Taking the example of a simple workflow composed of 2 linked steps, i.e. simulation and processing, I see the following main advantages/disadvantages for the use of input/output_queries:
Advantages:
It makes it possible to specify the inputs of the 2nd step as a subset of the outputs of the 1st step.
I guess that it's possible to obtain the same behaviour using CWL inputs/outputs, but it seems clearer to me to use metadata instead of file names or expressions.
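As an illustration of this advantage (the metadata fields below are made up, not actual catalog keys): with metadata-based queries, the input query of the 2nd step can simply be the output query of the 1st step plus an extra constraint that selects the subset:

```python
# Metadata under which the simulation step registers its output files.
output_query = {"type": "MCSimulation", "particle": "gamma", "energy": "1TeV"}

# Input query of the processing step: the same metadata plus an extra filter,
# so only a subset of the simulation outputs (e.g. good-quality files) match.
input_query = {**output_query, "quality": "good"}

print(input_query)
```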
Disadvantages:
Advantages/Disadvantages: