Reactors must support a "usage" command

TACC-Cloud / python-reactors

Python SDK for working with Abaco Actors


Reactors must support a "usage" command #8

Open mwvaughn opened 3 years ago

mwvaughn commented 3 years ago

Part of the work to define context and schema validation is to make it possible to provide a minimally helpful usage command for a given Reactor that can be called directly from the Docker image: docker run -it <reactor-image> usage.

mwvaughn commented 3 years ago

Here's a worked example of what such a usage command might look like. We assume here that the reactors package supports multiple context and message schemas.

USAGE
*****

docker run -it IMAGE reactor.py

VARIABLES
=========

Common
------

* MES : Message interpreted by reactor.py
* TAPIS_API_URL : URL of Tapis API server
* TAPIS_ACCESS_TOKEN : Oauth2 access token for Tapis API

Parameters
----------

Parameter variables can be set in the following combinations:

Context 1
~~~~~~~~~
meep : description
merp : description

Context 2
~~~~~~~~~
beep : description
boop : description
meep : description

Messages
--------

JSON values for MES must follow one of the following JSON schemas:

* /message_schemas/message.jsonschema
* /message_schemas/other.jsonschema

Configuration
-------------

Setting values for these variables will override the corresponding values in /config.yml

_REACTOR_LOG_LEVEL
_REACTOR_LOG_KEY
_REACTOR_OTHER_KEY
...

shwetagopaul92 commented 3 years ago

Should this also print the variables set in secrets.json while deploying the reactor?

mwvaughn commented 3 years ago

You are right, and those are implied by the Configuration section above.

From within the container, we don't actually know which variables are set by the secrets.json mechanism. We only know the universe of possible variable names, which are derived from the namespace (_REACTOR_) plus the uppercased, underscore-delimited values of the first- and second-level keys in config.yml.
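As a rough illustration of the naming rule described above, here is a minimal sketch. The function name override_variable_names is hypothetical and not part of the reactors package, and the config is given as an already-parsed dict rather than loaded from config.yml. It makes no attempt to reproduce any filtering the real implementation may apply.

```python
def override_variable_names(config, namespace="_REACTOR_"):
    """Return the environment variable names implied by a config mapping.

    Hypothetical helper: only first- and second-level keys are considered.
    """
    names = []
    for key, value in config.items():
        if isinstance(value, dict):
            # Second-level keys: NAMESPACE + FIRST + "_" + SECOND
            for subkey in value:
                names.append(f"{namespace}{key}_{subkey}".upper())
        else:
            # First-level scalar key: NAMESPACE + FIRST
            names.append(f"{namespace}{key}".upper())
    return names


# Stand-in for a parsed config.yml
config = {
    "logs": {"file": None, "level": "DEBUG", "token": None},
    "slack": {"channel": "notifications", "webhook": None},
}

for name in override_variable_names(config):
    print(name)
```

Note that the names come only from the keys, so the list is the same whether or not a value was actually supplied at deploy time.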

mwvaughn commented 3 years ago

Here's an example of the "Configuration" section of a live-generated usage command.

Configuration
-------------

This function is configured via files found at:

  * /config.yml

The current union configuration is:

---
logger:
  client_key: F3VRMUNrPeaq84zp
  host: logger.sd2e.org
  path: /logger
  port: 31311
  proto: http
  uri: http://logger.sd2e.org:31311
logs:
  file: null
  level: DEBUG
  token: null
slack:
  channel: notifications
  webhook: null

First- or second level keys in this configuration can be overridden
by setting environment variables. These variables are supported:

  * _REACTOR_LOGGER_CLIENT_KEY
  * _REACTOR_LOGGER_HOST
  * _REACTOR_LOGGER_PATH
  * _REACTOR_LOGGER_PROTO
  * _REACTOR_LOGGER_URI
  * _REACTOR_LOGS_FILE
  * _REACTOR_LOGS_LEVEL
  * _REACTOR_LOGS_TOKEN
  * _REACTOR_SLACK_CHANNEL
  * _REACTOR_SLACK_WEBHOOK

Comments are welcome

mwvaughn commented 3 years ago

Here is the latest draft of a usage command output. This is the direct output of a working Reactor built using the current version of the code.

% python -m reactors.cli usage

USAGE
=====
This container image implements an Abaco function:
"This function prints HELLO WORLD using Reactor.logger"

It is runnable outside Abaco as follows:

  docker run -it --env var=val REPO

Parameters
----------

Abaco passes parameters into the function via URL parameters:

  curl -XPOST https://api.tacc.cloud/actors/v2/message?foo=bar

In this example, an environment variable 'FOO' will be set in the
container runtime with a value of 'bar'. To allow an Abaco function
to be run independently, this can be emulated by setting environment
variables when running the function container.

  docker run --env FOO=bar <container> <command>

A function developer may specify one or more valid sets of
parameters for use within the function. These parameter sets
can be validated or classified using built-in functions from
the Reactors module.

This function accepts the following environment variable sets:

Context schema.$id: RequiresUUID
File: /Users/mwvaughn/src/TACC-Cloud/python-reactors/context_schemas/uuid.jsonschema
Parameters:
  * UUID desc: None; type: string; required: True

Context schema.$id: Default
File: /Users/mwvaughn/src/TACC-Cloud/python-reactors/src/reactors/validation/context.jsonschema
Parameters:
  * MSG desc: Message received by the Actor; type: string; required: True
  * x-nonce desc: An Abaco nonce (API key); type: string; required: False

Please note that variable sets beyond 'Default' must also contain the
variables specified in 'Default', such as 'MSG'.

JSON Messages
-------------

Abaco accepts JSON-formatted messages that are transmitted to the
container runtime via the 'MSG' environment variable. They can,
in turn, be validated or classified using built-in methods from
the Reactors module.

  curl -XPOST -H "Content-Type: application/json" \
       -d '{"message": {"foo": "bar"}}' \
       https://api.tacc.cloud/actors/v2/messages

This function accepts and can validate JSON-formatted values
for 'MSG' that validate to the following JSON schemas:

  * Message schema.$id: AWS_SQS
    File: /Users/mwvaughn/src/TACC-Cloud/python-reactors/message_schemas/sqs.jsonschema

  * Message schema.$id: file:///Users/mwvaughn/src/TACC-Cloud/python-reactors/message_schemas/email-noid.jsonschema
    File: /Users/mwvaughn/src/TACC-Cloud/python-reactors/message_schemas/email-noid.jsonschema

  * Message schema.$id: DefaultJSON
    File: /Users/mwvaughn/src/TACC-Cloud/python-reactors/src/reactors/validation/message.jsonschema

Configuration
-------------

The Reactor object provided by this SDK and usable within the function
is configured via files found at:

  * /Users/mwvaughn/src/TACC-Cloud/python-reactors/src/reactors/config.yml

If this current function utilizes this feature of the SDK, its
current configuration is:

---
logger:
  client_key: F3VRMUNrPeaq84zp
  host: logger.sd2e.org
  path: /logger
  port: 31311
  proto: http
  uri: http://logger.sd2e.org:31311
logs:
  file: null
  level: DEBUG
  token: null
slack:
  channel: notifications
  webhook: null

First- or second level keys in the configuration can be overridden
by setting environment variables at run time. The following
variables are supported:

  * _REACTOR_LOGGER_CLIENT_KEY
  * _REACTOR_LOGGER_HOST
  * _REACTOR_LOGGER_PATH
  * _REACTOR_LOGGER_PROTO
  * _REACTOR_LOGGER_URI
  * _REACTOR_LOGS_FILE
  * _REACTOR_LOGS_LEVEL
  * _REACTOR_LOGS_TOKEN
  * _REACTOR_SLACK_CHANNEL
  * _REACTOR_SLACK_WEBHOOK

Tapis Client
------------

This function may require an active Tapis client. One is automatically
provided by Abaco but can be injected at run time by providing either
a credentials file or setting environment variables.

Credentials File
~~~~~~~~~~~~~~~~

A Tapis client may be configured by volume mounting a credentials file:

  docker run -it -v ${HOME}/.agave:/root/.agave REPO

Environment Variables
~~~~~~~~~~~~~~~~~~~~~

A Tapis client may be configured by passing these variables:

  * TAPIS_BASE_URL - API server URL
  * TAPIS_TOKEN - Oauth2 access token

ethho commented 3 years ago

Thanks, Matt. What do you think about moving some (most) of this to online documentation? The idea is to remove all the content that does not change between different reactors, and refer users to the docs for details on how the SDK functions in general (and how they could develop their own reactors, extend/modify others' reactors).

% python -m reactors.cli usage

USAGE
=====

@@@@ "minimum viable" docker run command here @@@@

This container image implements an Abaco function:
"This function prints HELLO WORLD using Reactor.logger"

Parameters
----------

Please refer to https://tacc-cloud.github.io/python-reactors/usage/parameters for details.

This function accepts the following environment variable sets:

Context schema.$id: RequiresUUID
File: /Users/mwvaughn/src/TACC-Cloud/python-reactors/context_schemas/uuid.jsonschema
Parameters:
  * UUID desc: None; type: string; required: True

Context schema.$id: Default
File: /Users/mwvaughn/src/TACC-Cloud/python-reactors/src/reactors/validation/context.jsonschema
Parameters:
  * MSG desc: Message received by the Actor; type: string; required: True
  * x-nonce desc: An Abaco nonce (API key); type: string; required: False

JSON Messages
-------------

Please refer to https://tacc-cloud.github.io/python-reactors/usage/messages for details.

This function accepts and can validate JSON-formatted values
for 'MSG' that validate to the following JSON schemas:

  * Message schema.$id: AWS_SQS
    File: /Users/mwvaughn/src/TACC-Cloud/python-reactors/message_schemas/sqs.jsonschema

  * Message schema.$id: file:///Users/mwvaughn/src/TACC-Cloud/python-reactors/message_schemas/email-noid.jsonschema
    File: /Users/mwvaughn/src/TACC-Cloud/python-reactors/message_schemas/email-noid.jsonschema

  * Message schema.$id: DefaultJSON
    File: /Users/mwvaughn/src/TACC-Cloud/python-reactors/src/reactors/validation/message.jsonschema

Configuration
-------------

Please refer to https://tacc-cloud.github.io/python-reactors/usage/config for details.

The current configuration is:

---
logger:
  client_key: F3VRMUNrPeaq84zp
  host: logger.sd2e.org
  path: /logger
  port: 31311
  proto: http
  uri: http://logger.sd2e.org:31311
logs:
  file: null
  level: DEBUG
  token: null
slack:
  channel: notifications
  webhook: null

The following variables can be overridden by setting environment variables at runtime:

  * _REACTOR_LOGGER_CLIENT_KEY
  * _REACTOR_LOGGER_HOST
  * _REACTOR_LOGGER_PATH
  * _REACTOR_LOGGER_PROTO
  * _REACTOR_LOGGER_URI
  * _REACTOR_LOGS_FILE
  * _REACTOR_LOGS_LEVEL
  * _REACTOR_LOGS_TOKEN
  * _REACTOR_SLACK_CHANNEL
  * _REACTOR_SLACK_WEBHOOK

Tapis Client
------------

Please refer to https://tacc-cloud.github.io/python-reactors/usage/tapis for details.

A Tapis client may be configured by passing these variables:

  * TAPIS_BASE_URL - API server URL
  * TAPIS_TOKEN - Oauth2 access token

The thought here is that the person running the usage command is most likely a user/consumer of the given custom reactor, not a developer (this point is up for debate). If I were a user and didn't have access to Abaco runtime, the first things I'd want to see here would be:

Metadata - Who wrote this reactor and how do I annoy them with questions: description (as you have here), author & e-mail, GitHub repo link, version, other setup.py-like metadata, etc.
"Minimum viable" command to run this reactor locally - We could auto-generate the minimum viable list of -e VAR=value options that should be passed to the docker run command. I haven't fully fleshed out this idea, so I'm not sure what value would be. Maybe defaults? Populated from 'default' field in the schemas?
Other reactor-specific details - schemas and configs, as you have here

Thoughts?

mwvaughn commented 3 years ago

We definitely could move the large blocks of explanatory text to online docs - it will make the rendered page more succinct.
Regarding metadata - This isn't feasible at present because we don't collect and store that metadata in the image. At one point, I thought about extending the project.ini format to include a general [metadata] section that would include fields such as author, help, license, etc. When the Docker image was built, these data could be included as tags. Unfortunately, this does not help very much because a containerized process cannot access those container image tags. I suppose we could just copy the contents of project.ini into the container at build time, though I am not supremely happy with that design.
Building a minimum viable command is harder than it looks at first glance because we support multiple contexts and multiple message schemas (in combination!). If we just picked the first context and first message schema, I suppose we could populate the environment variables with the default from the schemas. Ultimately, I think we can only get to a probably-working invocation unless we start capturing a lot more metadata at build time.
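For the metadata point above, reading a copied project.ini from inside the container could look roughly like the sketch below. The [metadata] section does not exist today. Its field names (author, license, help) follow the ones floated in this comment, while the values and the read_metadata helper are purely illustrative.

```python
import configparser


def read_metadata(ini_text):
    """Return the [metadata] section of a project.ini as a dict.

    Hypothetical helper: returns an empty dict when the section is absent,
    which is the situation for every existing project.ini.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section("metadata"):
        return {}
    return dict(parser["metadata"])


# Illustrative content only; a real project.ini has no such section yet.
PROJECT_INI = """
[metadata]
author = Jane Doe
license = BSD-3-Clause
help = https://example.org/help
"""

for field, value in read_metadata(PROJECT_INI).items():
    print(f"{field}: {value}")
```

The usage command would then print whatever fields happen to be present and silently skip the block otherwise.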

ethho commented 3 years ago

+1 on all three points here.

Replace the word "metadata" with "whatever metadata we can scrape from the container without much added effort". If at some point, we choose to add a [metadata] section, then we know where in the container-local CLI to put that metadata 😄
A generalized definition for "minimum viable command" is more of a goal/ideal than a tractable feature. Practically, this could flesh out as a succinct summary of the variables you will definitely need to provide in order to run locally, kind of like a function docstring:

Required environment variables
-------------------

* UUID (string) - No description provided for this variable. This variable is enforced by the schema: .Example: 34a1e1dc-a571-4f2b-9b62-d42d3b223059-007
x-nonce (string) - An Abaco nonce (API key)

NOTE: An asterisk (*) denotes required variables. There may be more required variables that are not listed above; please see reactor documentation for details.

Required message variables
-------------------

...and so on

This is similar to what you have under JSON Schemas. IFF we detect only one context.jsonschema and only one message.jsonschema (I suspect this will be the case for the vast majority of reactors), we expose any variables that are trivially parseable (strings, numbers, booleans). Despite my relatively limited knowledge of reactor use cases, I expect that most reactors will not use multiple context/message schemas, or implement anyOf-like behavior, especially if reactors are written atomically as we recommend. It's certainly important that we support these complex use cases in the SDK code, but for auto-generated documentation, we could just say "The schema(s) enforced for this reactor are too complicated for us to parse here, please read the docs."
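A minimal sketch of the idea above, assuming schemas arrive as already-parsed dicts. The function name summarize_variables and the SIMPLE_TYPES set are hypothetical, and the "too complicated" check here only counts schemas; it does not detect anyOf-like constructs.

```python
SIMPLE_TYPES = {"string", "number", "integer", "boolean"}

TOO_COMPLICATED = (
    "The schema(s) enforced for this reactor are too complicated "
    "for us to parse here, please read the docs."
)


def summarize_variables(schemas):
    """Return docstring-like lines for the variables of a single schema."""
    if len(schemas) != 1:
        # Zero or several schemas: give up and point at the docs.
        return [TOO_COMPLICATED]
    schema = schemas[0]
    required = set(schema.get("required", []))
    lines = []
    for name, spec in schema.get("properties", {}).items():
        var_type = spec.get("type")
        # Skip anything that is not a plain string/number/boolean.
        if not isinstance(var_type, str) or var_type not in SIMPLE_TYPES:
            continue
        marker = "* " if name in required else ""
        desc = spec.get("description", "No description provided for this variable.")
        lines.append(f"{marker}{name} ({var_type}) - {desc}")
    return lines


context_schema = {
    "type": "object",
    "properties": {
        "UUID": {"type": "string"},
        "x-nonce": {"type": "string", "description": "An Abaco nonce (API key)"},
    },
    "required": ["UUID"],
}

for line in summarize_variables([context_schema]):
    print(line)
```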

Okay that's enough rambling for this comment 😉

mwvaughn commented 3 years ago

After some prototyping, here's another go. I still have not implemented metadata, but I am able to generate a sensible run string for the case where there are 0-1 contexts and 0-1 message schemas.

For environment variables, I use the default from the context JSON schema, followed by the first value of examples, and fall back to <type> if neither of those exists. For the JSON message, I am currently using hypothesis_jsonschema to render an example JSON document.
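The lookup order for environment variables can be sketched as follows. This is not the prototype's code: example_value and env_args are hypothetical names, the "any" placeholder for a missing type is an assumption, and the schema properties are passed in as a plain dict.

```python
import shlex


def example_value(spec):
    """Pick a sample value: default, then first example, then a <type> placeholder."""
    if "default" in spec:
        return spec["default"]
    examples = spec.get("examples") or []
    if examples:
        return examples[0]
    # Assumption: use "any" when the schema does not declare a type.
    return f"<{spec.get('type', 'any')}>"


def env_args(properties):
    """Turn schema properties into a flat list of docker --env arguments."""
    args = []
    for name, spec in properties.items():
        args += ["--env", f"{name}={example_value(spec)}"]
    return args


properties = {
    "UUID": {"type": "string"},
    "MSG": {"type": "string", "description": "Message received by the Actor"},
}

command = ["docker", "run", "-it", *env_args(properties), "<NAMESPACE/REPO:TAG>"]
print(shlex.join(command))
```

Building the command as a list and joining it with shlex.join leaves shell quoting to the standard library, which matters once a value contains JSON.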

Regarding Hypothesis: The hypothesis_jsonschema package ignores default and examples in its faking strategy and is a little troubled by $ref elements that point to external URLs. However, it is very good at generating values from pattern and format properties.

USAGE: This function prints HELLO WORLD using Reactor.logger

% docker run -it --env UUID="<string>" MSG='{"Key":"https://A.xfinity"}' <NAMESPACE/REPO:TAG>

Environment Variables
---------------------

  * UUID (string) - None [None]
  * MSG (string) - Message received by the Actor [None]

JSON Message
------------

The function accepts a JSON message (passed as MSG) conforming to schema:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "AWS SQS",
  "$id": "AWS_SQS",
  "description": "An AWS-like SQS notification",
  "type": "object",
  "properties": {
    "Key": {
      "type": "string",
      "format": "uri",
      "description": "An object-store equivalent to a file path."
    }
  },
  "required": [
    "Key"
  ]
}

Example: {"Key":"https://A.xfinity"}

Tapis Client
------------

Please refer to https://tacc-cloud.github.io/python-reactors/usage/tapis for details.

A Tapis client may be configured by passing these variables:

  * TAPIS_BASE_URL - API server URL
  * TAPIS_TOKEN - Oauth2 access token

Configuration
-------------

Please refer to https://tacc-cloud.github.io/python-reactors/usage/config for details.

The current configuration is:

---
logger:
  client_key: F3VRMUNrPeaq84zp
  host: logger.sd2e.org
  path: /logger
  port: 31311
  proto: http
  uri: http://logger.sd2e.org:31311
logs:
  file: null
  level: DEBUG
  token: null
slack:
  channel: notifications
  webhook: null

Keys in this configuration can be overridden by setting environment
variables at run time. The following variables are supported:

  * _REACTOR_LOGGER_CLIENT_KEY
  * _REACTOR_LOGGER_HOST
  * _REACTOR_LOGGER_PATH
  * _REACTOR_LOGGER_PROTO
  * _REACTOR_LOGGER_URI
  * _REACTOR_LOGS_FILE
  * _REACTOR_LOGS_LEVEL
  * _REACTOR_LOGS_TOKEN
  * _REACTOR_SLACK_CHANNEL
  * _REACTOR_SLACK_WEBHOOK

ethho commented 3 years ago

This looks great IMO. My only question is why the JSON Schema section is using hypothesis but Environment Variables is not? Is this a constraint because of the way schemas are implemented/organized?

mwvaughn commented 3 years ago

At this point, it's just an experiment to see which approach works the best, and I can probably converge them when I refactor. I wanted to get the live-generated example out for comment.

ethho commented 3 years ago

Cool, I like the Environment Variables formatting more. Thanks, Matt!

shwetagopaul92 commented 3 years ago

This looks great. I was able to understand the usage command output easily, especially the example JSON message it prints out. This will make it easier for a user to very quickly figure out what he should be passing as the message.

For the rest of the JSON schema that is printed out apart from the properties and required, would the user need to know the other information on the schema as well?
If the reactor is hooked to a database, would that have to be set at the configuration level?

mwvaughn commented 3 years ago

Thanks for the feedback!

For question 1: We're just printing the preferred schema to the screen. We are forced to assume a little bit of familiarity with reading and interpreting JSON Schema on the part of the user. Note that we do generate an Example JSON document, though our ability to do so is constrained by the quality and detail level of the schema. I would say that the documentation for this feature should include a worked example of building and validating a JSON document from a JSON schema, since many users will not be that familiar with it.

For question 2: Configuration for a database connection would probably be specified by the developer of the Reactor using config.yml and the "Keys override" mechanism we leverage with the secrets.json file on the CLI. I imagine it might look something like so:

Configuration
-------------

Please refer to https://tacc-cloud.github.io/python-reactors/usage/config for details.

The current configuration is:

---
mongodb_uri: null
logger:
  client_key: F3VRMUNrPeaq84zp
...

Keys in this configuration can be overridden by setting environment
variables at run time. The following variables are supported:

  * _REACTOR_MONGODB_URI
  * _REACTOR_LOGGER_CLIENT_KEY
...

We don't have a good way to annotate the YAML configuration since PyYAML does not preserve comments when it loads a file. I think we just have to rely on well-named configuration keys in the config.yml file.

Now that I look at it, we might want to explicitly point out that the null values in the config need to be specified using environment variables. I would welcome suggested language for this.
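To make the point about null values concrete, here is a small sketch of applying overrides and then reporting which keys are still unset. It assumes the config is already parsed into a dict; apply_overrides and unset_keys are hypothetical names rather than SDK functions, and the MongoDB URI is a made-up value.

```python
import copy


def apply_overrides(config, environ, namespace="_REACTOR_"):
    """Return a copy of config with first- and second-level keys overridden."""
    merged = copy.deepcopy(config)
    for key, value in list(merged.items()):
        if isinstance(value, dict):
            for subkey in value:
                name = f"{namespace}{key}_{subkey}".upper()
                if name in environ:
                    value[subkey] = environ[name]
        else:
            name = f"{namespace}{key}".upper()
            if name in environ:
                merged[key] = environ[name]
    return merged


def unset_keys(config):
    """List dotted key paths whose value is still null (None)."""
    missing = []
    for key, value in config.items():
        if isinstance(value, dict):
            missing += [f"{key}.{sub}" for sub, v in value.items() if v is None]
        elif value is None:
            missing.append(key)
    return missing


config = {"mongodb_uri": None, "logs": {"level": "DEBUG", "token": None}}
environ = {"_REACTOR_MONGODB_URI": "mongodb://localhost:27017/example"}

merged = apply_overrides(config, environ)
print(unset_keys(merged))
```

Values taken from the environment stay strings in this sketch; no type coercion is attempted. The usage output could print the remaining null keys under a heading such as "must be provided via environment variables".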

shwetagopaul92 commented 3 years ago

Thanks a lot Matt.