@Chris-Schnaufer do you have a feeling for how difficult it would be to change the specific approach if we change our minds later?
I don't think it's too hard to code the changes for any of these approaches; the time is spent testing them out. That's the big reason I'm leaning towards option 4 - the change is very lightweight and no other changes would be needed (aside from the initial tiny fixup of the JX file).
My opinion is as follows:
Most importantly: We should wrap a 'stateful' layer around a pure, deterministic core.
The benefits include: Increased reliability, performance, 'debuggability', and reproducibility.
We've already started addressing this very issue: https://github.com/AgPipeline/issues-and-projects/issues/29
I think we should do the equivalent of your 'Option 1' solution in issue 29. See the steps under '(Draft Solution)':
- An initialization step (pre-workflow-workflow?) to generate text (JSON) files or databases (sqlite or pglite)
- Use these generated files as one of the inputs to the pipeline (along with current input like image files, etc.)
- Modify steps which call out to external systems for input: Take the local files created above as their input
- Modify steps which push results out to external systems: Produce local files as output
- A finalization step (post-workflow-workflow?) to push the local generated result files to one or more external systems.
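To make the shape of this concrete, here is a minimal Makeflow JX sketch of the three phases. The SCIF app names (fetch-plot-boundaries, plotclip, push-results) and the file names are hypothetical placeholders, not a proposal for the actual interface:

```
{
    "rules": [
        {
            "command": "scif run fetch-plot-boundaries plots.geojson",
            "inputs": [],
            "outputs": ["plots.geojson"]
        },
        {
            "command": "scif run plotclip plots.geojson scan.tif clipped.tar",
            "inputs": ["plots.geojson", "scan.tif"],
            "outputs": ["clipped.tar"]
        },
        {
            "command": "scif run push-results clipped.tar && touch push.done",
            "inputs": ["clipped.tar"],
            "outputs": ["push.done"]
        }
    ]
}
```

The first rule is the 'pre-workflow' step that talks to the outside world and leaves behind plain files, the middle rule only touches local files, and the last rule is the 'post-workflow' step that pushes the local results out.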
That would imply a modified version of your option 1 above. Implement the steps in the following order:
There are at least two ways to then combine these building blocks, and we can discuss those separately:
@dlebauer Are you OK with not including BETYdb in the workflows in the SciF docker container?
Are you OK with not including BETYdb in the workflows in the SciF docker container?
To clarify, BETYdb (or any other dynamic data source) can be in a workflow, but it properly belongs at 'the edges' of the workflow, not in the middle.
Everything in the middle just deals with plain files.
I'm not clear how this would work with automation. Are you suggesting that there's an app within the Scif container that fetches the BETYdb information, or the user gets the information beforehand in a potentially non-automated fashion, or something else?
Are you suggesting that there's an app within the Scif container that fetches the BETYdb information...
Yes. Assuming somebody wants to build a pipeline which uses BETYdb for input data, there will be a SCIF app which grabs the data from BETYdb and creates plain files.
The BETYdb credentials could be passed to the SCIF app via environment variables or jx-args.json
(I lean towards the latter.)
This SCIF app could be the first step in the Makeflow pipeline, specifying the plain files as output.
Any following steps in the pipeline should not know or care about BETYdb - they just deal with plain files.
This same mechanism also transparently supports this scenario:
...the user gets the information beforehand in a potentially non-automated fashion
If the correctly named, plain files are dropped in the right place, the first (BETYdb) step will just be skipped.
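As a rough sketch of what that first step could look like (the SCIF app name fetch-betydb-plots and the variable names BETYDB_URL / BETYDB_KEY are made up for illustration), jx-args.json would hold the connection details:

```
{
    "BETYDB_URL": "https://bety.example.org",
    "BETYDB_KEY": "replace-with-real-key"
}
```

and the first rule in the workflow would splice those values in, keeping the key out of the command line via the rule's environment:

```
{
    "rules": [
        {
            "command": "scif run fetch-betydb-plots " + BETYDB_URL + " plots.geojson",
            "environment": {"BETYDB_KEY": BETYDB_KEY},
            "inputs": [],
            "outputs": ["plots.geojson"]
        }
    ]
}
```

(If per-rule environment settings turn out not to be available in the Makeflow version used, the key could simply be passed as another command argument.) Since the rule's only job is to produce plots.geojson, a correctly named file dropped in place leaves Makeflow with nothing to do for this step, which is the 'skipped' scenario described above.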
These plain files could come from:
makeflow --clean
How do you envision BETYdb getting called as part of the workflow? One workflow that handles shapefiles, geojson, etc., and another that calls BETYdb first as needed? If using only one workflow, where does the logic exist to call BETYdb, or not? There are multiple ways of doing this; I'm trying to understand the complete picture you have.
We might be using the same words to refer to different things. Let me try to ask some clarifying questions:
- expect geojson
- if shp or BETYdb is provided, the first step is to convert it to geojson
- the geopandas library does the conversion: https://stackoverflow.com/a/61697020/199217 (see the sketch below)
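For reference, the conversion from the linked answer could be wrapped as a single Makeflow rule along these lines; the plots.* file names are placeholders, and a real step would probably live in a small script or SCIF app rather than a python3 -c one-liner:

```
{
    "rules": [
        {
            "command": "python3 -c \"import geopandas as gpd; gpd.read_file('plots.shp').to_file('plots.geojson', driver='GeoJSON')\"",
            "inputs": ["plots.shp", "plots.shx", "plots.dbf", "plots.prj"],
            "outputs": ["plots.geojson"]
        }
    ]
}
```

Listing the shapefile sidecar files (.shx, .dbf, .prj) as inputs ensures Makeflow stages them alongside the .shp when the rule runs on a remote worker.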
Great. Thanks! That makes sense.
Here is a rough proof-of-concept that demonstrates how to achieve what David describes:
https://github.com/julianpistorius/makeflow-subworkflow-example
It uses a single main workflow. The first step of the workflow is automatically selected from three different sub-workflows depending on the configured input type (GeoJSON, Shapefile, or Bety).
After the first step the remainder of the workflow only needs to care about GeoJSON.
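For illustration (not necessarily how the example repository implements it), JX's sub-workflow support makes that selection quite compact. Assuming a made-up INPUT_TYPE variable in jx-args.json and sub-workflow files named workflow-input-geojson.jx, workflow-input-shapefile.jx, and workflow-input-bety.jx:

```
{
    "INPUT_TYPE": "bety"
}
```

```
{
    "rules": [
        {
            "workflow": "workflow-input-" + INPUT_TYPE + ".jx",
            "outputs": ["input.geojson"]
        }
    ]
}
```

Whichever sub-workflow is selected, it ends by producing input.geojson, so every later rule just declares input.geojson as an input and never knows where it came from.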
Check out the configuration files in that repository.
I hope that helps clarify what I meant.
Update: Missing from this example is how to optionally push the final output to external databases like Bety, IRODS, etc.
This is what I would do to update the example to show this:
- Move all the steps in workflow.jx except the first step into a new workflow called workflow-core.jx
- Specify workflow-core.jx as a sub-workflow in the second step in workflow.jx that takes input.geojson as input
- Add a step to workflow.jx that does something with the final results produced by workflow-core.jx
- Make that step configurable in jx-args.json like the input workflow, e.g.:
  - workflow-output-bety.jx to post results to Bety
  - workflow-output-s3.jx to upload to an S3 bucket
  - workflow-output-irods.jx to upload to the CyVerse Data Store

This way you get the pure, deterministic, 'core' workflow which deals only with plain, static files (workflow-core.jx) wrapped in a stateful, dynamic, configurable workflow (workflow.jx) that is allowed to talk to the outside world in order to pull input and push output.
This also means it is possible to run the three workflow stages (input, core, output) independently (useful for development, testing, debugging, etc.)
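Put together, a workflow.jx built along these lines might look roughly like the following; INPUT_TYPE, OUTPUT_TYPE, results.tar, and output.done are illustrative names only, and the exact sub-workflow keys should be checked against the Makeflow JX documentation:

```
{
    "rules": [
        {
            "workflow": "workflow-input-" + INPUT_TYPE + ".jx",
            "outputs": ["input.geojson"]
        },
        {
            "workflow": "workflow-core.jx",
            "inputs": ["input.geojson"],
            "outputs": ["results.tar"]
        },
        {
            "workflow": "workflow-output-" + OUTPUT_TYPE + ".jx",
            "inputs": ["results.tar"],
            "outputs": ["output.done"]
        }
    ]
}
```

with jx-args.json choosing the edges, e.g.:

```
{
    "INPUT_TYPE": "bety",
    "OUTPUT_TYPE": "s3"
}
```

Swapping where input comes from or where output goes is then just a matter of editing jx-args.json.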
P.S. This workflow as described in comment above ☝️ would be a minimal implementation of the 'ideal' architecture for a scientific pipeline which I described eight months ago:
https://gist.github.com/julianpistorius/28950e4d60d194c198660ae843751b3b
@julianpistorius I think that this is a good idea, but to stay on target with this issue we might want to move the discussion of architecture elsewhere. Do workflows and sub-workflows differ in functionality with respect to job orchestration? I.e., can a main workflow distribute sub-workflows to different computing resources?
One (perhaps minor?) point is that GET requests to metadata tables in BETYdb, like sites, don't require authentication.
I think that the approach that Julian has suggested is enough to proceed with. We can refine the implementation of this approach as the solution is developed
might want to move discussion of architecture.
Agreed.
do workflows and sub-workflows differ in functionality with respect to job orchestration?
No. Makeflow first creates a single workflow from the main workflow and any sub-workflows before executing it.
i.e. can a main workflow distribute sub-workflows to different computing resources?
Yes. In fact, each rule (step) in a workflow could theoretically be run on a different computing resource. This is handled in Makeflow by specifying resource requirements:
You can create 'categories' which specify environment variables and resource requirements. Multiple rules can belong to each category. A category could be something like 'machine-learning' (for rules which need GPUs):
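A hedged sketch of what that could look like; the app name, environment variable, and resource numbers are made up, and memory/disk figures are in MB:

```
{
    "categories": {
        "machine-learning": {
            "environment": {"MODEL_DIR": "/srv/models"},
            "resources": {"cores": 4, "memory": 16384, "disk": 20480, "gpus": 1}
        },
        "lightweight": {
            "resources": {"cores": 1, "memory": 2048, "disk": 4096}
        }
    },
    "rules": [
        {
            "command": "scif run detect-plants input.geojson detections.json",
            "inputs": ["input.geojson"],
            "outputs": ["detections.json"],
            "category": "machine-learning"
        }
    ]
}
```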
When Makeflow distributes jobs to Work Queue the jobs will be allocated to the appropriate Work Queue instance according to each job's requirements.
One (perhaps minor?) point is that GET requests to metadata tables in BETYdb, like sites, don't require authentication.
Ah. Good to know. Thank you. That makes this particular case even simpler.
I'll leave it as a reference for how to query an external database which does need credentials - getting images from a private S3 bucket with a secret token, for example.
I believe this is resolved - opinions are recorded. Next step is for @Chris-Schnaufer to summarize in documentation
@Chris-Schnaufer says he will be going forward with a version of option 3:
- Have Scif recipe figure out (in shell) what to do depending upon what's specified (more complexity)
This means a more complex SCIF app, and more complex (dynamic) runtime behavior (See Cyclomatic Complexity).
I'm not too worried about this as long as:
To that end I still recommend applying the following pattern I mentioned:
- Specify workflow-core.jx as a sub-workflow in the second step in workflow.jx that takes input.geojson as input
- Move all the steps in workflow.jx except the first step into a new workflow called workflow-core.jx
@Chris-Schnaufer Comment on the approach being taken
The approach we're taking is to have each action as a separate Makeflow workflow. It's expected that users of the resulting Docker image (or when not using a Docker image) will patch the steps together in a fashion that makes sense to them.
Task to do: What's the best way to have plot-clipping support in SciF use BETYdb or Shapefile (and its ilk, such as geojson)?
Reason: There are several approaches to handling this, and the tradeoffs need to be determined
Result: A working approach
Original question: https://cyverse.slack.com/archives/DKV587DRU/p1592328675011800
I have a question for you: in your opinion, what's the best way to have plot-clipping support in SciF use BETYdb or Shapefile (and its ilk, such as geojson)?
If no shapefile (or its ilk) is specified, it's expected that BETYdb variables would be specified. Note: this is mostly coming about because shapefiles are specified on the plotclip command line.
I have thought of 4 approaches to handling this with SciF docker: