aldbr / dirac-cwl-proto

GNU General Public License v3.0

Comments on the production example #5

Open arrabito opened 5 days ago

arrabito commented 5 days ago

I've looked at the implementation of the DIRAC production example in CWL.

Below are some comments/questions about the description of inputs and outputs.

In this CWL production example, the links between the inputs and outputs of the workflow steps are described using native CWL input/output features instead of input/output_query as in the current DIRAC. Does this mean that metadata would not be used at all, or just that they would not be used to link two steps within the same workflow? To put it differently, how is this implementation supposed to cover the use case of a workflow that has to process input data already present in the catalog (bkk)? Should one specify the input as a path rather than as an input_query?

In my opinion it's important to keep the use of input_queries, at least to specify the inputs of the 'parent-level' workflow step. This preserves the advantages of the data-driven behaviour of the TS (i.e. files are dynamically added to the transformations as soon as they become available in the catalog). However, I think that the use of input/output_queries to link the steps of a workflow could eventually be abandoned in favour of CWL inputs/outputs.

Taking the example of a simple workflow composed of 2 linked steps, i.e. simulation and processing, I see the following main advantages/disadvantages for the use of input/output_queries:

Advantages:

It makes it possible to specify the inputs of the 2nd step as a subset of the outputs of the 1st step. For instance, with the outputs of the 1st step registered under the metadata values key1=value1 and key1=value2:

output_query of the 1st step: key1 in (value1, value2)
input_query of the 2nd step: key1 = value1

I guess it's possible to obtain the same behaviour using CWL inputs/outputs, but it seems clearer to me to use metadata instead of file names or expressions (see the sketch after this list).

Disadvantages:

Advantages/Disadvantages:
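
To make the subset mechanism above concrete, here is a minimal, self-contained Python sketch; the catalog content and the run_query helper are hypothetical illustrations, not part of the DIRAC API:

# Minimal sketch of metadata-based step linking (hypothetical helper names,
# not actual DIRAC code). The 1st step registers its outputs with metadata;
# the 2nd step's input_query then selects a subset of them.

# Hypothetical in-memory stand-in for the file catalog (bkk).
catalog = [
    {"lfn": "/prod/sim/file_001", "key1": "value1"},
    {"lfn": "/prod/sim/file_002", "key1": "value2"},
]

def run_query(query):
    """Return the LFNs whose metadata match every condition of the query.

    A tuple value means 'in (...)', a scalar value means equality.
    """
    return [
        entry["lfn"]
        for entry in catalog
        if all(
            entry.get(key) in (value if isinstance(value, tuple) else (value,))
            for key, value in query.items()
        )
    ]

# output_query of the 1st step: key1 in (value1, value2)
print(run_query({"key1": ("value1", "value2")}))  # both files
# input_query of the 2nd step: key1 = value1 -> a subset of the above
print(run_query({"key1": "value1"}))              # only /prod/sim/file_001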

arrabito commented 5 days ago

Apart from the considerations in the previous comment, I find that the proposed implementation already has most of the desired features.

arrabito commented 5 days ago

I also agree with separating the last step of the workflow, the one doing the data upload, into a 'post_process' function.
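
As an illustration, a hedged sketch of what that separation could look like; the function names and the upload client are assumptions for the example, not the actual prototype API:

# Hypothetical sketch: the data-upload step moved out of the CWL workflow
# into a post_process hook run after all CWL steps have finished.

def post_process(outputs, destination_se):
    """Upload and register the workflow outputs (hypothetical signature)."""
    for local_path in outputs:
        upload_file(local_path, destination_se)

def upload_file(local_path, destination_se):
    # Placeholder for whatever upload client the real implementation uses:
    # copy to the storage element, then register the file in the catalog.
    print(f"uploading {local_path} to {destination_se}")

# Usage: run the CWL workflow first, then hand its outputs to post_process.
post_process(["Gauss.sim"], "SOME-STORAGE-ELEMENT")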

aldbr commented 2 days ago

Does this mean that metadata would not be used at all, or just that they would not be used to link two steps within the same workflow? To put it differently, how is this implementation supposed to cover the use case of a workflow that has to process input data already present in the catalog (bkk)? Should one specify the input as a path rather than as an input_query?

After a discussion with @natthan-pigoux, I realised that my use case was too simple and would not work for a workflow that has to process input data already present in the catalog. So I am preparing an update that supports metadata 🙂

The idea is the following:

# parameters provided to the Dirac client
dirac transformation submit <cwl path with its input inside the document>
# json submitted to the Dirac router
{
  "task": <cwl content>,
  "description":
  {
    "type": "MCSimulation"
  }
}
# the Dirac router then relies on a `MCSimulationMetadata` class that knows how to construct an output query based on the input parameters
class MCSimulationMetadata(IMetadata):
    ...
    def get_output_query(self, output_name):
        if output_name == <MCSimulation output parameter name>:
            return <file catalog path>/<MCSimulation input value 1>/<MCSimulation input value 2>
# parameters provided to the Dirac client
dirac transformation submit <cwl path with its input inside the document> --metadata-path <metadata path>
# metadata content
group_size:
    input1: 10
query_params:
    <input param of MCSimulation 1>: <value1>
    <input param of MCSimulation 2>: <value2>

# json submitted to the Dirac router
{
  "task": <cwl content>,
  "description":
  {
    "type": "MCProcessing"
  },
  "metadata":
  {
    "group_size":
    {
      "input1": 10
    },
    "query_params":
    {
      "<input param of MCSimulation 1>": <value1>,
      "<input param of MCSimulation 2>": <value2>
    }
  }
}
# the Dirac router then relies on a `MCProcessingMetadata` class that knows how to construct an input query based on metadata and an output query based on the input parameters
class MCProcessingMetadata(IMetadata):
    ...
    def get_input_query(self, input_name):
        if input_name == <MCProcessing input parameter name>:
            return MCSimulationMetadata(<metadata.query_params>).get_output_query(input_name)

    def get_output_query(self, output_name):
        if output_name == <MCProcessing output parameter name>:
            return <file catalog path>/<metadata.query_params value 1>/<metadata.query_params value 2>
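
To check that the chaining works end to end, here is a runnable variant of the two classes above, with the <...> placeholders replaced by example values; the parameter names, the catalog layout and the shape of IMetadata are assumptions, not the actual prototype code:

from abc import ABC, abstractmethod

class IMetadata(ABC):
    """Assumed shape of the metadata interface (hypothetical)."""

    def __init__(self, query_params=None):
        self.query_params = query_params or {}

    @abstractmethod
    def get_output_query(self, output_name): ...

class MCSimulationMetadata(IMetadata):
    # Hypothetical parameter names and catalog layout.
    def get_output_query(self, output_name):
        if output_name == "sim_output":
            return f"/catalog/mc/{self.query_params['energy']}/{self.query_params['particle']}"
        return None

class MCProcessingMetadata(IMetadata):
    def get_input_query(self, input_name):
        if input_name == "sim_input":
            # Reuse the parent's output query so the two transformations chain up.
            return MCSimulationMetadata(self.query_params).get_output_query("sim_output")
        return None

    def get_output_query(self, output_name):
        if output_name == "processed_output":
            return f"/catalog/processed/{self.query_params['energy']}/{self.query_params['particle']}"
        return None

meta = MCProcessingMetadata({"energy": "100GeV", "particle": "gamma"})
print(meta.get_input_query("sim_input"))          # /catalog/mc/100GeV/gamma
print(meta.get_output_query("processed_output"))  # /catalog/processed/100GeV/gamma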

Note: Dirac only supports linear workflows for now, whereas CWL is much more powerful and allows, for instance, diamond-shaped workflows. So I am trying to support the case where a transformation (i) can get inputs from multiple parents, which means one input query and one group size per input parameter coming from a parent; and (ii) can generate multiple outputs, one (or more) per child transformation, which means one output query per output parameter.
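
Reusing the hypothetical classes from the previous sketch, the bottom node of such a diamond could then look like this (one input query and one group size per parent input, one output query per output; all names are again illustrative, not prototype code):

class MergeMetadata(IMetadata):
    """Hypothetical transformation with two parents (bottom of a diamond)."""

    def get_input_query(self, input_name):
        # One input query per input parameter coming from a parent.
        queries = {
            "sim_input": MCSimulationMetadata(self.query_params).get_output_query("sim_output"),
            "processed_input": MCProcessingMetadata(self.query_params).get_output_query("processed_output"),
        }
        return queries.get(input_name)

    def get_group_size(self, input_name):
        # One group size per input parameter: how many files each task consumes.
        return {"sim_input": 10, "processed_input": 1}.get(input_name)

    def get_output_query(self, output_name):
        # One output query per output parameter, e.g. one per child transformation.
        if output_name == "merged_output":
            return f"/catalog/merged/{self.query_params['energy']}"
        return None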