aldbr / dirac-cwl-proto

GNU General Public License v3.0

Comments on the production example #5

Open arrabito opened 5 days ago

arrabito commented 5 days ago

I've looked at the implementation of the DIRAC production example in CWL.

Below are some comments/questions about the description of inputs and outputs.

In this CWL production example, the links between the inputs and outputs of the workflow steps are described using native CWL input/output features instead of input/output_query as in the current DIRAC. Does this mean that metadata would not be used at all, or just that they would not be used to link two steps within the same workflow? To put it differently, how is this implementation supposed to cover the use case of a workflow that has to process input data already present in the catalog (bkk)? Should one specify the input as a path rather than as an input_query?

In my opinion it's important to keep the use of input_queries, at least to specify the inputs of the 'parent-level' workflow step. This preserves the advantages of the data-driven behaviour of the TS (i.e. files are dynamically added to the transformations as soon as they become available in the catalog). However, I think that the use of input/output_queries to link the steps of a workflow could eventually be abandoned in favour of CWL inputs/outputs.

Taking the example of a simple workflow composed of 2 linked steps, i.e. simulation and processing, I see the following main advantages/disadvantages for the use of input/output_queries:

Advantages:

It makes it possible to specify the inputs of the 2nd step as a subset of the outputs of the 1st step. For instance, with the outputs of the 1st step registered under the metadata values key1=value1 and key1=value2:

output_query of the 1st step: key1 in (value1, value2)
input_query of the 2nd step: key1 = value1

I guess it's possible to obtain the same behaviour using CWL inputs/outputs, but it seems clearer to me to use metadata instead of file names or expressions (see the sketch after this list).

Disadvantages:

Advantages/Disadvantages:
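
To make the subset mechanism above concrete, here is a minimal, self-contained Python sketch; the catalog content and the run_query helper are hypothetical illustrations, not part of the DIRAC API:

# Minimal sketch of metadata-based step linking (hypothetical helper names,
# not actual DIRAC code). The 1st step registers its outputs with metadata;
# the 2nd step's input_query then selects a subset of them.

# Hypothetical in-memory stand-in for the file catalog (bkk).
catalog = [
    {"lfn": "/prod/sim/file_001", "key1": "value1"},
    {"lfn": "/prod/sim/file_002", "key1": "value2"},
]

def run_query(query):
    """Return the LFNs whose metadata match every condition of the query.

    A tuple value means 'in (...)', a scalar value means equality.
    """
    return [
        entry["lfn"]
        for entry in catalog
        if all(
            entry.get(key) in (value if isinstance(value, tuple) else (value,))
            for key, value in query.items()
        )
    ]

# output_query of the 1st step: key1 in (value1, value2)
print(run_query({"key1": ("value1", "value2")}))  # both files
# input_query of the 2nd step: key1 = value1 -> a subset of the above
print(run_query({"key1": "value1"}))              # only /prod/sim/file_001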

arrabito commented 5 days ago

Apart from the considerations in the previous comment, I find that the proposed implementation already has most of the desired features.

arrabito commented 5 days ago

I also agree with separating the last step of the workflow, the one doing the data upload, into a 'post_process' function.
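
As an illustration, a hedged sketch of what that separation could look like; the function names and the upload client are assumptions for the example, not the actual prototype API:

# Hypothetical sketch: the data-upload step moved out of the CWL workflow
# into a post_process hook run after all CWL steps have finished.

def post_process(outputs, destination_se):
    """Upload and register the workflow outputs (hypothetical signature)."""
    for local_path in outputs:
        upload_file(local_path, destination_se)

def upload_file(local_path, destination_se):
    # Placeholder for whatever upload client the real implementation uses:
    # copy to the storage element, then register the file in the catalog.
    print(f"uploading {local_path} to {destination_se}")

# Usage: run the CWL workflow first, then hand its outputs to post_process.
post_process(["Gauss.sim"], "SOME-STORAGE-ELEMENT")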

aldbr commented 2 days ago

Does this mean that metadata would not be used at all, or just that they would not be used to link two steps within the same workflow? To put it differently, how is this implementation supposed to cover the use case of a workflow that has to process input data already present in the catalog (bkk)? Should one specify the input as a path rather than as an input_query?

After a discussion with @natthan-pigoux, I realised that my use case was too simple and would not work for a workflow that has to process input data already present in the catalog. So I am preparing an update that supports metadata 🙂

The idea is the following:

# parameters provided to the Dirac client
dirac transformation submit <cwl path with its input inside the document>
# json submitted to the Dirac router
{
  "task": <cwl content>,
  "description":
  {
    "type": "MCSimulation"
  }
}
# the Dirac router then relies on a `MCSimulationMetadata` class that knows how to construct an output query based on the input parameters
class MCSimulationMetadata(IMetadata):
    ...
    def get_output_query(self, output_name):
        if output_name == <MCSimulation output parameter name>:
            return <file catalog path>/<MCSimulation input value 1>/<MCSimulation input value 2>
# parameters provided to the Dirac client
dirac transformation submit <cwl path with its input inside the document> --metadata-path <metadata path>
# metadata content
group_size:
    input1: 10
query_params:
    <input param of MCSimulation 1>: <value1>
    <input param of MCSimulation 2>: <value2>

# json submitted to the Dirac router
{
  "task": <cwl content>,
  "description":
  {
    "type": "MCProcessing"
  },
  "metadata":
  {
    "group_size":
    {
      "input1": 10
    },
    "query_params":
    {
      "<input param of MCSimulation 1>": <value1>,
      "<input param of MCSimulation 2>": <value2>
    }
  }
}
# the Dirac router then relies on a `MCProcessingMetadata` class that knows how to construct an input query based on metadata and an output query based on the input parameters
class MCProcessingMetadata(IMetadata):
    ...
    def get_input_query(self, input_name):
        if input_name == <MCProcessing input parameter name>:
            return MCSimulationMetadata(<metadata.query_params>).get_output_query(input_name)

    def get_output_query(self, output_name):
        if output_name == <MCProcessing output parameter name>:
            return <file catalog path>/<metadata.query_params value 1>/<metadata.query_params value 2>
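
To check that the chaining works end to end, here is a runnable variant of the two classes above, with the <...> placeholders replaced by example values; the parameter names, the catalog layout and the shape of IMetadata are assumptions, not the actual prototype code:

from abc import ABC, abstractmethod

class IMetadata(ABC):
    """Assumed shape of the metadata interface (hypothetical)."""

    def __init__(self, query_params=None):
        self.query_params = query_params or {}

    @abstractmethod
    def get_output_query(self, output_name): ...

class MCSimulationMetadata(IMetadata):
    # Hypothetical parameter names and catalog layout.
    def get_output_query(self, output_name):
        if output_name == "sim_output":
            return f"/catalog/mc/{self.query_params['energy']}/{self.query_params['particle']}"
        return None

class MCProcessingMetadata(IMetadata):
    def get_input_query(self, input_name):
        if input_name == "sim_input":
            # Reuse the parent's output query so the two transformations chain up.
            return MCSimulationMetadata(self.query_params).get_output_query("sim_output")
        return None

    def get_output_query(self, output_name):
        if output_name == "processed_output":
            return f"/catalog/processed/{self.query_params['energy']}/{self.query_params['particle']}"
        return None

meta = MCProcessingMetadata({"energy": "100GeV", "particle": "gamma"})
print(meta.get_input_query("sim_input"))          # /catalog/mc/100GeV/gamma
print(meta.get_output_query("processed_output"))  # /catalog/processed/100GeV/gamma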

Note: Dirac only supports linear workflows for now, whereas CWL is much more powerful and allows, for instance, diamond-shaped workflows. So I am trying to support the case where a transformation (i) can get inputs from multiple parents, which means one input query and one group size per input parameter coming from a parent; and (ii) can generate multiple outputs, one (or more) per child transformation, which means one output query per output parameter.
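
Reusing the hypothetical classes from the previous sketch, the bottom node of such a diamond could then look like this (one input query and one group size per parent input, one output query per output; all names are again illustrative, not prototype code):

class MergeMetadata(IMetadata):
    """Hypothetical transformation with two parents (bottom of a diamond)."""

    def get_input_query(self, input_name):
        # One input query per input parameter coming from a parent.
        queries = {
            "sim_input": MCSimulationMetadata(self.query_params).get_output_query("sim_output"),
            "processed_input": MCProcessingMetadata(self.query_params).get_output_query("processed_output"),
        }
        return queries.get(input_name)

    def get_group_size(self, input_name):
        # One group size per input parameter: how many files each task consumes.
        return {"sim_input": 10, "processed_input": 1}.get(input_name)

    def get_output_query(self, output_name):
        # One output query per output parameter, e.g. one per child transformation.
        if output_name == "merged_output":
            return f"/catalog/merged/{self.query_params['energy']}"
        return None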