VForWaTer / tool-specs

Repository to manage issues and discussions regarding the entire [toolbox-runner](https://github.com/hydrocode-de/tool-runner) stack.
https://vforwater.github.io/tool-specs
MIT License

Add array or regex to data #23

Open mmaelicke opened 3 weeks ago

mmaelicke commented 3 weeks ago

In the data section, there may be cases in which an input dataset is associated not with a single file but with a list of files. In these cases we can either:

  1. allow an array similar to the parameters
  2. allow regular expressions as an attribute inside the data section

In both cases, the specs cannot handle an arbitrary number of datasets. For that, the developer has to fall back to specifying a directory in the parameters instead of the data section.

An example of the multi-file case: a tool takes a netCDF that is chunked into many files.

An example of the multi-dataset (multi-file) case: an aggregator or viewer tool takes a folder as input that contains data folders, similar to what the data loader creates. I would argue that this is an edge case and that tools can usually specify the data they need.

I would prefer setting e.g. a multi=True flag on a data spec, which effectively allows wildcards in the path.
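To make the flag idea a bit more concrete, here is a minimal sketch of how a runner might resolve a data entry. It is only an illustration: the function name `resolve_data_entry`, the dictionary keys `path` and `multi`, and the `base_dir` argument are all assumptions and not part of the current specs.

```python
import glob
import os


def resolve_data_entry(entry, base_dir="."):
    """Return the list of files a data entry refers to.

    Hypothetical helper: `entry` is assumed to be a dict with a `path`
    key and an optional boolean `multi` key.
    """
    pattern = entry["path"]
    has_wildcard = any(ch in pattern for ch in "*?[")

    if entry.get("multi", False):
        # With multi=True the path is treated as a glob pattern.
        matches = sorted(glob.glob(os.path.join(base_dir, pattern)))
        if not matches:
            raise FileNotFoundError(f"no files match {pattern!r}")
        return matches

    # Without the flag, wildcards are rejected so the spec stays explicit.
    if has_wildcard:
        raise ValueError(f"wildcard in {pattern!r} requires multi=True")
    return [os.path.join(base_dir, pattern)]
```

With this behavior, a single-file entry always resolves to exactly one path, and a wildcard is only accepted when the spec explicitly declares that several files are expected.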

@Ash-Manoj @AlexDo1 do you have any comments on this? I am not entirely sure how to do that and comments are welcome

AlexDo1 commented 3 weeks ago

Hm, good question.

I like the multi flag, as it also states quite clearly that there can be more than one data file. Always allowing wildcards could be confusing, as it would not be clear from the specification whether multiple data files are allowed.

At the moment I'm also in favor of allowing wildcards then, as this allows being stricter in defining the file names (e.g. in/precipitation/preciptitation_*.nc for in/precipitation/preciptitation_2011.nc, in/precipitation/preciptitation_2012.nc, in/precipitation/preciptitation_2013.nc).
But the wildcard would also allow taking everything inside a folder as input data, even when the file names are not that structured, e.g. in/data/* for in/data/air_temperature.nc, in/data/discharge.csv, in/data/catchment.geojson (it would probably be a bad implementation to have that as input data, but I think it demonstrates what I mean).
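The two patterns can be checked against the example file names with Python's `fnmatch` module. This is just a quick illustration of shell-style matching on strings, not a proposal for how the runner should implement it; the variable names are made up for the example.

```python
from fnmatch import fnmatchcase

# File names taken from the examples above (spelling kept as written).
files = [
    "in/precipitation/preciptitation_2011.nc",
    "in/precipitation/preciptitation_2012.nc",
    "in/precipitation/preciptitation_2013.nc",
    "in/data/air_temperature.nc",
    "in/data/discharge.csv",
    "in/data/catchment.geojson",
]

strict_pattern = "in/precipitation/preciptitation_*.nc"
catch_all_pattern = "in/data/*"

# The strict pattern only accepts the yearly precipitation chunks.
strict_matches = [f for f in files if fnmatchcase(f, strict_pattern)]

# The catch-all pattern accepts anything under in/data/, whatever the format.
catch_all_matches = [f for f in files if fnmatchcase(f, catch_all_pattern)]

print(strict_matches)
print(catch_all_matches)
```

One thing to keep in mind: in `fnmatch`, `*` also matches path separators, so `in/data/*` would match files in nested subfolders as well, whereas `glob.glob` without `recursive=True` would not descend into them.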

So I like the flexibility of the wildcard together with the clarity of the multi flag.

Ash-Manoj commented 3 weeks ago

I also like the flag idea. We could test this on the catflow generator tool, where I think multiple TIFF files have to be read in as input.