Opinion request for BETYdb & Shapefile (geojson, etc) support with Sci

Chris-Schnaufer commented 4 years ago

Task to do What's the best way to have plot-clipping support in SciF use BETYdb or Shapefile (and its ilk, such as geojson)?

Reason There are several approaches to handling this, determining the tradeoffs

Result A working approach

Original question https://cyverse.slack.com/archives/DKV587DRU/p1592328675011800

I have a question for you: in your opinion, what's the best way to have plot-clipping support in SciF use BETYdb or Shapefile (and its ilk, such as geojson)?

For BETYdb the user needs to specify URL and KEY - typically through environment variables (security reasons)
For shapefile, geojson, etc. the filename needs to be specified on the command line Only one of these would be specified/used: specifying BETYdb variables along with a shapefile would cause the shapefile to be used

If no shapefile, and ilk, are specified, it's expected BETYdb variables would be specified Note: this is mostly coming about because shapefiles are specified on the plotclip command line

I have thought of 4 approaches to handling this with SciF docker:

Have separate apps (w/ jx-workflows) for each: one for BETYdb and one for file-based (potential for proliferation of files & duplication)
Have .jx file figure out if it needs to specify a parameter to plotclip, or not (not JX's strong point)
Have Scif recipe figure out (in shell) what to do depending upon what's specified (more complexity)
Have plotclip check for shapefile (and ilk) existence, or lack thereof, and default to BETYdb approach if not found

julianpistorius commented 4 years ago

@Chris-Schnaufer do you have a feeling for how difficult it would be to change the specific approach if we change our minds later?

Chris-Schnaufer commented 4 years ago

I don't think it's too hard to code the changes to any of these approaches. The time is spent testing them out. That's the big reason I'm leaning towards option 4. - the change is very lightweight and no other changes would be needed (aside from the initial tiny fixup of the JX file)

julianpistorius commented 4 years ago

My opinion is to:

Keep the individual SCIF apps as simple as possible (strike against 3)
Keep logic like this out of the transformers (it increases hidden complexity) (strike against 4)
Keep calls to external databases like BETYdb out of the main pipeline (strike against all four, or at least three and-a-half options)
Reduce duplication and file proliferation (strike against 1)

Most importantly: We should wrap a 'stateful' layer around a pure, deterministic core.

The benefits include: Increased reliability, performance, 'debuggability', and reproducibility.

We've already started addressing this very issue here: See https://github.com/AgPipeline/issues-and-projects/issues/29

I think we should do the equivalent of your solution of 'Option 1' in issue 29. See the steps under '(Draft Solution)':

An initialization step (pre-workflow-workflow?) to generate text (JSON) files or databases (sqlite or pglite)

Use these generated files as one of the inputs to the pipeline (along with current input like image files, etc.)

Modify steps which call out to external systems for input: Take the local files created above as their input

Modify steps which push results out to external systems: Produce local files as output

A finalization step (post-workflow-workflow?) to push the local generated result files to one or more external systems.

That would imply a modified version of your option 1 above. Implement the steps in the following order:

Make all the steps/apps/transformers which constitute the 'middle' or core of the pipeline operate exclusively on uniform, plain files.
Create a step/app which queries live databases and generates the appropriate input files for the core pipeline
Create a step/app which takes plain result files and pushes them to a live database

There are at least two ways to then combine these building blocks, and we can discuss those separately:

Separate pre- and post- workflows to generate input data and push output data
Sub-workflows which are completely skipped by Makeflow if there are already input files of the right name

Chris-Schnaufer commented 4 years ago

@dlebauer Are you OK with not including BETYdb in the workflows in the SciF docker container?

julianpistorius commented 4 years ago

Are you OK with not including BETYdb in the workflows in the SciF docker container?

To clarify, BETYdb (or any other dynamic data source) can be in a workflow, but it properly belongs at 'the edges' of the workflow, not in the middle.

Everything in the middle just deals with plain files.

Chris-Schnaufer commented 4 years ago

I'm not clear how this would work with automation. Are you suggesting that there's an app within the Scif container that fetches the BETYdb information, or the user gets the information beforehand in a potentially non-automated fashion, or something else?

julianpistorius commented 4 years ago

Are you suggesting that there's an app within the Scif container that fetches the BETYdb information...

Yes. Assuming somebody wants to build a pipeline which uses BETYdb for input data, there will be a SCIF app which grabs the data from BETYdb and creates plain files.

The BETYdb credentials could be passed to the SCIF app via environment variables or jx-args.json (I lean towards the latter.)

This SCIF app could be the first step in the Makeflow pipeline, specifying the plain files as output.

Any following steps in the pipeline should not know or care about BETYdb - they just deal with plain files.

This same mechanism also transparently supports this scenario:

...the user gets the information beforehand in a potentially non-automated fashion

If the correctly named, plain files are dropped in the right place, the first (BETYdb) step will just be skipped.

These plain files could come from:

A previous run which queried BETYdb and produced these files (i.e. BETYdb will not be queried again unless these files are removed manually or with makeflow --clean.)
A manual or semi-automatic process which pre-generates or pre-fetches input data from BETYdb or some other source
A data archive where pipeline creators can deposit pipeline artifacts and input data so other people can reproduce their workflow/publication without needing access to BETYdb

Chris-Schnaufer commented 4 years ago

How do you envision BETYdb getting called as part of the workflow? One workflow that handles shapefiles, geojson, etc, and another that calls BETYdb first as needed? If using only one workflow, where does the logic exist to call BETYdb, or not? There are multiple ways of doing this, I'm trying to understand that complete picture you have

julianpistorius commented 4 years ago

We might be using the same words to refer to different things. Let me try to ask some clarifying questions:

Is it possible to save BETYdb query results to a plain file?
If so, in the context of your original question ("plot-clipping"), what kind(s) of data would this plain file contain? (e.g. boundaries?)
What possible file formats can the BETYdb results be saved in? (e.g. CSV, JSON, GeoJSON, etc.)
Is the data contained in the BETYdb results file equivalent to (or a superset of) the kind of Shapefile or GeoJSON the pipeline user might choose to use (in the absence of BETYdb)?
Can the BETYdb results file be converted to the equivalent Shapefile or GeoJSON file?

Chris-Schnaufer commented 4 years ago

Yes, with some additional processing - it returns JSON
Yes, plot boundaries to clip against
JSON by default
Conversion is pretty straightforward - talking with David now, it should be geojson ( or other OGR compliant)
Yes

dlebauer commented 4 years ago

expect geojson
if shp or betydb is provided, first step is to convert to geojson
the geopandas library does the conversion https://stackoverflow.com/a/61697020/199217
ogr2ogr https://gis.stackexchange.com/a/300808/1239

julianpistorius commented 4 years ago

Great. Thanks! That makes sense.

Here is a rough proof-of-concept that demonstrates how to achieve what David describes:

https://github.com/julianpistorius/makeflow-subworkflow-example

It uses a single main workflow. The first step of the workflow is automatically selected from three different sub-workflows depending on the configured input type (GeoJSON, Shapefile, or Bety).

After the first step the remainder of the workflow only needs to care about GeoJSON.

Check out the configuration files:

jx-args-bety.json
jx-args-geojson.json
jx-args-shapefile.json

I hope that helps clarify what I meant.

julianpistorius commented 4 years ago

Update: Missing from this example is how to optionally push the final output to external databases like Bety, IRODS, etc.

This is what I would do to update the example to show this:

Move all the steps in workflow.jx except the first step into a new workflow called workflow-core.jx
Specify workflow-core.jx as a sub-workflow in the second step in workflow.jx that takes input.geojson as input
Add a third step to workflow.jx that does something with the final results produced by workflow-core.jx
This last step is also a sub-workflow, configurable using jx-args.json like the input workflow, e.g.
- workflow-output-bety.jx to post results to Bety
- workflow-output-s3.jx to upload to S3 bucket
- workflow-output-irods.jx to upload to CyVerse Data Store

This way you get the pure, deterministic, 'core' workflow which deals only with plain, static files (workflow-core.jx) wrapped in a stateful, dynamic, configurable workflow (workflow.jx) that is allowed to talk to the outside world in order to pull input and push output.

This also means it is possible to run the three workflow stages (input, core, output) independently (useful for development, testing, debugging, etc.)

julianpistorius commented 4 years ago

P.S. This workflow as described in comment above ☝️ would be a minimal implementation of the 'ideal' architecture for a scientific pipeline which I described eight months ago:

https://gist.github.com/julianpistorius/28950e4d60d194c198660ae843751b3b

dlebauer commented 4 years ago

@julianpistorius I think that this is a good idea but to stay on target w/ this issue might want to move discussion of architecture.. do workflows and sub-workflows differ in functionality with respect to job orchestration? i.e. can a main workflow distribute sub-workflows to different computing resources?

(perhaps minor point?) is that GET requests to metadata tables in BETYdb like sites don't require authentication.

Chris-Schnaufer commented 4 years ago

I think that the approach that Julian has suggested is enough to proceed with. We can refine the implementation of this approach as the solution is developed

julianpistorius commented 4 years ago

might want to move discussion of architecture.

Agreed.

do workflows and sub-workflows differ in functionality with respect to job orchestration?

No. Makeflow first creates a single workflow from the main workflow and any sub-workflows before executing it.

i.e. can a main workflow distribute sub-workflows to different computing resources?

Yes. In fact, each rule (step) in a workflow could theoretically be run on a different computing resource. This is handled in Makeflow by specifying resource requirements:

Resource requirements for a rule

You can create 'categories' which specify environment variables and resource requirements. Multiple rules can belong to each category. A category could be something like 'machine-learning' (for rules which need GPUs):

Categories

When Makeflow distributes jobs to Work Queue the jobs will be allocated to the appropriate Work Queue instance according to each job's requirements.

(perhaps minor point?) is that GET requests to metadata tables in BETYdb like sites don't require authentication.

Ah. Good to know. Thank you. That makes this particular case even simpler.

I'll leave it as a reference for how to query an external database which does need credentials. Getting images from private S3 bucket with a secret token for example.

dlebauer commented 4 years ago

I believe this is resolved - opinions are recorded. Next step is for @Chris-Schnaufer to summarize in documentation

julianpistorius commented 4 years ago

@Chris-Schnaufer says he will be going forward with a version of option 3:

Have Scif recipe figure out (in shell) what to do depending upon what's specified (more complexity)

This means a more complex SCIF app, and more complex (dynamic) runtime behavior (See Cyclomatic Complexity).

I'm not too worried about this as long as:

This complex (dynamic) behavior occurs only at the beginning of the pipeline
The rest of the pipeline has low cyclomatic complexity (no dynamic if-else forking logic)
The rest of the pipeline is 'static' and operates deterministically (plain input files, no calls to external databases)

To that end I still recommend applying the following pattern I mentioned:

Specify workflow-core.jx as a sub-workflow in the second step in workflow.jx that takes input.geojson as input

Move all the steps in workflow.jx except the first step into a new workflow called workflow-core.jx

Chris-Schnaufer commented 4 years ago

@Chris-Schnaufer Comment on the approach taking

Chris-Schnaufer commented 4 years ago

The approach we're taking is to have each action as a separate Makeflow workflow. It's expected that users of the resulting Docker image (or when not using a Docker image) will patch the steps together in a fashion that makes sense to them.

AgPipeline / issues-and-projects

Opinion request for BETYdb & Shapefile (geojson, etc) support with Sci #288