StackStorm / st2

StackStorm (aka "IFTTT for Ops") is event-driven automation for auto-remediation, incident responses, troubleshooting, deployments, and more for DevOps and SREs. Includes rules engine, workflow, 160 integration packs with 6000+ actions (see https://exchange.stackstorm.org) and ChatOps. Installer at https://docs.stackstorm.com/install/index.html
https://stackstorm.com/
Apache License 2.0

Feature Request - Load/store data/artifacts/binaries from external content source #4588

Open nmaludy opened 5 years ago

nmaludy commented 5 years ago
SUMMARY

Currently, in a distributed StackStorm deployment, the node an action runs on is effectively random. This causes headaches when dealing with files or artifacts, for example when implementing an ETL workflow or a CI/CD workflow.

ETL:

CI/CD:

The way this works now is:

ETL (Database query)

CI/CD (Files and Binaries)

ISSUE TYPE
IDEAS

Another workflow tool that I found has an interesting concept of artifacts that can be passed between steps in a workflow:

This spawned some thinking and relates to an idea I had in https://github.com/StackStorm/st2/issues/4343

It would be cool if we could pass in "artifacts" as inputs/outputs associated with a task in a workflow. The task would perform some pre/post work to load/store the artifact around the action run.

In pseudocode, it could look something like what I had in my other request.

ETL - Database

This would retrieve a database artifact from MySQL, do some processing, then publish the results back to MySQL.

vars:
  sql_connection: "{{ st2kv.system.sql_connection }}"

tasks:
  task1:
    action: transaction.place_orders
    input_artifact:
      mysql:
        connection: "{{ ctx().sql_connection }}"
        # by default this returns a list of dicts
        query: "SELECT id,name,date FROM orders ORDER BY date DESC;"
    input:
      data: "{{ input_artifact().mysql.result }}"
    next:
      - when: "{{ succeeded() }}"
        publish_artifact:
          mysql:
            connection: "{{ ctx().sql_connection }}"
            insert:
              # name of the table
              table: "history"
              # list of dicts to insert
              values: "{{ result() }}"
CI/CD - Files and Binaries

This would run a build process that checks out a git repo, builds the thing, uploads the RPM to a Yum repo and uploads the build log to an S3 bucket.

vars:
  sql_connection: "{{ st2kv.system.sql_connection }}"

tasks:
  build:
    action: cicd.build
    input_artifact:
      git:
        # downloads the repo to a local path on the actionrunner
        repo: https://github.com/org/repo.git
    input:
      path: "{{ input_artifact().git.path }}"
    next:
      - when: "{{ succeeded() }}"
        publish_artifact:
          nexus:
            path: "{{ result().rpm_path }}"
            upload: rpm
            url: "{{ st2kv.system.nexus.rpm_upload_url }}"
            username: "{{ st2kv.system.nexus.username }}"
            password: "{{ st2kv.system.nexus.password | decrypt_kv }}"
          s3:
            path: "{{ result().build_log_path }}"
            endpoint: storage.googleapis.com
            bucket: my-bucket-name
            key: path/in/my/bucket
            accesskey: "{{ st2kv.system.s3.accesskey | decrypt_kv }}"
            secretkey: "{{ st2kv.system.s3.accesskey | decrypt_kv }}"
Reusing existing packs

Ideally it would be great if packs could plug into this "artifact" architecture and provide input/output artifact actions that could be run. This would give us pluggability without reinventing the wheel or pulling the code complexity of integrations into StackStorm core itself.
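
As a very rough sketch of that pluggability (nothing below is an existing API), an artifact handler could simply be an ordinary pack action that the workflow engine runs before or after the main task; the pack/action name, parameters, and return shape are all hypothetical:

# Hypothetical "input_artifact" handler implemented as a normal pack action.
# The idea is that the workflow engine would run it before the main task and
# expose its result as input_artifact().git in the workflow context.
import subprocess
import tempfile

from st2common.runners.base_action import Action


class CloneArtifact(Action):
    def run(self, repo):
        # Clone the repo into a temp directory on this actionrunner.
        path = tempfile.mkdtemp(prefix="st2-artifact-")
        subprocess.check_call(["git", "clone", repo, path])
        # The returned dict would become the artifact payload, e.g.
        # {{ input_artifact().git.path }} in the workflow example above.
        return {"path": path}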

Long story short, this is just a cool thing I saw and I wanted to write down my thoughts / use case before I forgot it.

cognifloyd commented 5 years ago

Cool idea. So, then maybe there would be a new artifact plugin where you can register actions in a pack as artifact handlers?

I'm planning to set up a Pulp project server (v3) to host a bunch of artifacts like release archives, RPMs, and wheels. That will involve writing a new pack to include it in my workflows. So, if an artifact plugin registers actions, then maybe that would be all that is needed.

But if we wanted some special handling for files, maybe a more direct integration with Pulp would be good for StackStorm. Pulp is written in Python and built as a distributed architecture. At a glance, maybe some of the Pulp components/nodes could be added to StackStorm to provide an artifact repository for workflows.

cognifloyd commented 5 years ago

Plus, it would be nice for sensors to be able to have access to some kind of artifact repository too, so that the key-value store isn't the only officially supported way to store intermediate sensor data in between sensor polls.
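
For context, a minimal sketch of what sensors do today: intermediate state between polls is usually persisted via the existing sensor_service datastore helpers (get_value/set_value/dispatch are real APIs; the sensor, trigger, and key names below are made up), which is exactly the part an artifact repository could complement for larger data:

# Minimal polling sensor sketch: today, state between polls typically lives in
# the key-value store via self.sensor_service. Names here are illustrative only.
from st2reactor.sensor.base import PollingSensor


class LastSeenSensor(PollingSensor):
    def setup(self):
        pass

    def poll(self):
        # get_value/set_value are the existing datastore helpers for sensors.
        last_seen = self.sensor_service.get_value("last_seen_id") or "0"
        new_items = self._fetch_items_since(last_seen)  # hypothetical helper
        for item in new_items:
            self.sensor_service.dispatch(trigger="examples.new_item", payload=item)
        if new_items:
            self.sensor_service.set_value("last_seen_id", new_items[-1]["id"])

    def cleanup(self):
        pass

    def add_trigger(self, trigger):
        pass

    def update_trigger(self, trigger):
        pass

    def remove_trigger(self, trigger):
        pass

    def _fetch_items_since(self, last_seen_id):
        # Placeholder for polling whatever external system the sensor watches.
        return []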

arm4b commented 5 years ago

Noticed that interesting concept of passing artifacts within a workflow in Argo when we looked at it a few weeks ago. This is a good feature request, and the use cases listed make perfect sense too :+1:

guzzijones commented 4 years ago

I would also like this feature. I also like the publish_artifact idea: essentially, write the artifact to disk with a unique hash as the filename, then store the hash in the key-value store linked to the original filename.
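
A minimal sketch of that idea from inside a Python runner action, assuming some shared storage location: the artifact directory and key prefix are invented, while action_service.set_value is the existing datastore helper:

# Sketch: content-address a file and record the mapping in the datastore.
# /opt/stackstorm/artifacts and the "artifact:" key prefix are assumptions.
import hashlib
import os
import shutil

from st2common.runners.base_action import Action

ARTIFACT_DIR = "/opt/stackstorm/artifacts"


class PublishArtifact(Action):
    def run(self, source_path):
        # Hash the file contents so the stored filename is unique.
        digest = hashlib.sha256()
        with open(source_path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                digest.update(chunk)
        file_hash = digest.hexdigest()

        # Store the file under its hash.
        os.makedirs(ARTIFACT_DIR, exist_ok=True)
        stored_path = os.path.join(ARTIFACT_DIR, file_hash)
        shutil.copyfile(source_path, stored_path)

        # Link the original filename to the hash in the key-value store.
        self.action_service.set_value(
            name="artifact:%s" % os.path.basename(source_path), value=file_hash
        )
        return {"hash": file_hash, "path": stored_path}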

guzzijones commented 3 years ago

I am probably going to start on this at some point soon. This is the last remaining piece of st2 that I see missing for the use cases on our end.

  1. The client will need to be able to upload files to a storage location. The file can be given a unique hash and stored in the key-value store as a lookup to the original file.
    1. Clients will need a special file input type that tells the client to upload to storage.
  2. A key-value pair can be saved in the datastore and the key can be passed into the workflow for tasks to read the file.
  3. Also add a self.publish_file for Python actions (sketched below).
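
To make item 3 concrete, here is a hedged sketch of how a hypothetical self.publish_file helper might look from a Python action; publish_file does not exist today, and its name, parameters, and return value are invented:

# Hypothetical usage only: publish_file is the proposed helper, not part of
# the current Python runner API.
from st2common.runners.base_action import Action


class BuildAndPublish(Action):
    def run(self, repo_path):
        log_path = "%s/build.log" % repo_path
        # ... run the build here, writing output to log_path ...

        # Proposed: hand the file to st2 and get back a key that other tasks
        # (or the client) could use to fetch it later.
        artifact_key = self.publish_file(local_path=log_path, name="build.log")
        return {"build_log_key": artifact_key}
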
cognifloyd commented 3 years ago

We might want to take inspiration from the Pulp project (not pulp2, pulp3), which uses the django-storages framework under the covers. Then such artifacts could be stored in whatever storage mechanism makes sense, e.g. Azure Blob Storage, GCP storage, S3, or even NFS, or, for an all-in-one install, the local file system.
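
Independent of django-storages itself, the underlying idea is a thin storage interface that multiple backends can implement; below is a minimal sketch with an invented interface and a local-filesystem backend (an S3, GCS, or Azure backend would implement the same two methods against its SDK):

# Sketch of a pluggable artifact-storage interface. The class names and method
# signatures are invented to illustrate the idea, not an actual st2 or
# django-storages API.
import abc
import os
import shutil


class ArtifactStorage(abc.ABC):
    @abc.abstractmethod
    def save(self, name, local_path):
        """Store the file under `name` and return a reference for later use."""

    @abc.abstractmethod
    def retrieve(self, name, local_path):
        """Fetch the artifact `name` into `local_path` on this node."""


class LocalFileSystemStorage(ArtifactStorage):
    """Backend for an all-in-one install: artifacts live on local disk."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, name, local_path):
        dest = os.path.join(self.root, name)
        shutil.copyfile(local_path, dest)
        return dest

    def retrieve(self, name, local_path):
        shutil.copyfile(os.path.join(self.root, name), local_path)
        return local_path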

guzzijones commented 3 years ago

I found this undocumented feature to upload an ASCII file, at least, and it solves my use case: use a @ in front of the parameter name (e.g. st2 run mypack.myaction @myparam=/path/to/file.txt, which makes the client read the file contents into that parameter). File upload: https://github.com/StackStorm/st2/blob/911e2e16d7a356df1bb3992bb9d06829db36ab05/st2client/st2client/commands/action.py#L831

arm4b commented 3 years ago

@guzzijones Could you please document your findings in the relevant https://docs.stackstorm.com/ section?

rush-skills commented 3 years ago

I am not sure it serves the original purpose, but being able to have an ETL-like connection + query at the start of a workflow would also make it possible to do initial data lookups from external databases (to get the list of action items for the workflow). That would reduce the overall data handled in the inputs/outputs of actions/workflows (if people decide to offload the heavy bits) and might therefore contribute to a speedup. This would add a great benefit to our use case as well.

chris3081 commented 2 years ago

Not sure where this has gone, but I can see a use case for installing packs; in fact, I have that exact use case myself. Is anyone working on this currently? If so, I'd be keen to assist so I can retire the current hack I have for installing from s3/https in stackstorm k8s with shared volumes.