data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (business users, analysts, and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Data.all CLI and SDK #950

Open anmolsgandhi opened 8 months ago

anmolsgandhi commented 8 months ago

Description:

Introduce a command-line interface (CLI) and a software development kit (SDK) for data.all, published as separate packages to PyPI through a publishing pipeline. This feature aims to give users a seamless, efficient means of interfacing with data.all programmatically, enabling automation, scripting, and integration into existing workflows. Distributing through PyPI not only eases distribution but also enhances version control, making it simpler for users to incorporate and manage the tools in their environments. The CLI/SDK should cover a comprehensive set of commands and functions, enabling users to perform operations on data.all features such as datasets, dataset shares, environments, organizations, and consumption roles through programmatic interfaces. This tooling, coupled with PyPI publishing, would significantly improve the overall user experience, empowering developers and administrators to leverage data.all functionality programmatically.

Details:

Command-Line Interface (CLI):

Software Development Kit (SDK):

PyPI Publishing Component:

Benefits:

@noah-paige

zsaltys commented 8 months ago

Could be useful, especially for those who want to run scripts to do a bunch of things, like importing a lot of datasets. Today, when I want to validate things, I query RDS directly; it would be nice not to have to do that and to query via the CLI or APIs instead. I would say authentication is a big one to figure out here, especially for IdP users. For example, we use Okta, and I'm not even sure I can register headless users in our IdP. Definitely think about how headless users who are not using pure Cognito will get access.

anmolsgandhi commented 8 months ago

Appreciate the feedback @zsaltys. This feature has been requested frequently and has been implemented by some in different capacities. We believe it adds value to various use cases, including the ones you highlighted. I agree that examining authentication aspects will be crucial. We intend to address this initiative in a phased approach. As we delve deeper and gain more insights, we'll use this issue/PR to document and manage feedback and suggestions.

noah-paige commented 6 months ago

Design Considerations

Repository Location

For the data.all SDK/CLI, we will host the SDK and CLI packages within a new repository (separate from data.all OS but within the same GitHub Org), with release cycles for each of the respective packages (repo name TBD).

A separate repository to host this code will provide the following benefits:

Repository Structure

GraphQL Schema File: The key information that powers the SDK/CLI utility will be the GQL schema file, which defines all of the supported API requests (i.e. queries and mutations) along with the input and output types for those requests. We will leverage this file to dynamically generate the methods exposed by the SDK/CLI client.

For reference in data.all OS today, we build this GQL schema already in the api_handler.py similar to the following code:

from dataall.base.api import bootstrap as bootstrap_schema, get_executable_schema
from graphql import print_schema

# Build the executable schema from data.all's registered modules
schema = get_executable_schema()
# Print the schema in GQL SDL form (the contents of schema.graphql)
print(print_schema(schema))
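
As a rough illustration of the dynamic generation idea (a minimal sketch, not the final implementation; the function names load_schema and attach_methods are hypothetical, and it assumes a base client exposing an execute(...) method per base_client.py in the layout below), the schema file can be parsed with graphql-core and one client method attached per query/mutation:

from graphql import build_schema

def load_schema(path="schema/schema.graphql"):
    # Parse the static GQL schema file into a GraphQLSchema object
    with open(path) as f:
        return build_schema(f.read())

def attach_methods(client_cls, schema):
    # Generate one method per supported query and mutation
    # (a real implementation would also snake_case the GQL field names)
    for operation, root in (("query", schema.query_type), ("mutation", schema.mutation_type)):
        if root is None:
            continue
        for field_name in root.fields:
            def method(self, _op=operation, _field=field_name, **variables):
                # Delegate to the base client, which builds and sends the GQL request
                return self.execute(_op, _field, **variables)
            setattr(client_cls, field_name, method)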

Repo Layout

The below is a high-level look at how the SDK/CLI data.all programmatic tools library will be organized:

schema/
    schema.graphql         # ---> GQL Schema File
    ...
dataall_core/
    loader.py              # ---> Load Schema GQL File
    dataall_client.py      # ---> Generate Methods Dynamically
    base_client.py         # ---> Class Where Methods Attached
    auth/                  # ---> Auth Classes (i.e. Cognito, Okta, etc.)
    ...
api/
    __init__.py
    Client.py
    utils/
        __init__.py
        stacks_utils.py
    ...
cli/
    __init__.py
    Client.py
    ...

The above file layout follows similar design principles to other widely used SDK/CLI tooling (such as botocore, boto3, and awscli). A more in-depth description of the above file structure follows:

Authentication Flows

Configuration & Profiles

Supporting Custom Auth (i.e. Okta, ...)

OR

da_client = dataall.client(profile="dataall-team", auth_strategy="Custom")
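
As a sketch of how pluggable auth could look behind either option (the class names here are illustrative, not a final API), every strategy only needs to produce a token for the request's Authorization header, so Cognito, Okta, or any other IdP can sit behind one common interface:

import abc

class AuthStrategy(abc.ABC):
    # Common interface: every auth strategy must be able to produce a bearer token
    @abc.abstractmethod
    def get_token(self) -> str:
        ...

class CognitoAuth(AuthStrategy):
    def __init__(self, client_id, username, password):
        self.client_id = client_id
        self.username = username
        self.password = password

    def get_token(self) -> str:
        # Placeholder: exchange the username/password for a JWT via Cognito
        raise NotImplementedError

class CustomAuth(AuthStrategy):
    def __init__(self, token_provider):
        # token_provider: a user-supplied callable for custom IdPs (e.g. Okta)
        self.token_provider = token_provider

    def get_token(self) -> str:
        return self.token_provider()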

Supporting AWS Compute Services (i.e. Lambda, Glue, etc.)

import dataall_sdk

def lambda_handler(event, context):
    # Set up the data.all client for UserA, resolving credentials
    # from the referenced Secrets Manager secret
    da_client = dataall_sdk.client(profile="dataall-team-lambda", secret_arn="<SECRET_ARN>")

    # List organizations - ensure the org was created
    list_org_response = da_client.list_organizations()
    print(list_org_response)
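
One way the secret_arn parameter could be handled internally (a sketch, assuming the secret stores the data.all credentials as a JSON string; the helper name is hypothetical):

import json
import boto3

def _resolve_credentials(secret_arn):
    # Fetch the stored data.all credentials from AWS Secrets Manager
    sm = boto3.client("secretsmanager")
    secret = sm.get_secret_value(SecretId=secret_arn)
    return json.loads(secret["SecretString"])
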
TejasRGitHub commented 5 months ago

Certainly, a great feature to have. Q: As there are plans to use AppSync for GQL, will this also work seamlessly with AppSync?

noah-paige commented 5 months ago

Certainly, a great feature to have. Q: As there are plans to use AppSync for GQL, will this also work seamlessly with AppSync?

Yes! It will work with AppSync for GQL seamlessly - no re-architecture needed between the two endpoint services.

zsaltys commented 5 months ago

Some feedback on design @noah-paige:

I like the idea of a separate repo. I like the idea of a python package that will provide both the SDK and the CLI utils.

1) Generating schema.graphql. When will this happen? Do we want to generate this automatically during the build of the SDK? If so, we need to somehow specify the version of data.all to know which data.all repo/tag to use to generate the schema. Or perhaps data.all should publish this file as part of its own build; then, during the SDK build, we store in configuration which version of data.all we're building against, and the build simply fetches the schema file from the data.all repo. Could you add some details on exactly how this will work: how developers of the SDK will get a version of schema.graphql, and how it will be included in the Python package during the build?

2) There's a plan to move to AppSync, and I believe this would mean the schema starts living in AppSync, with the schema defined in CDK. That means we would still need some way to generate the schema. Can you add some clarity in the design on how the schema file would work in the AppSync world, and how that would affect local development and building of the library?

3) Profiles are a nice idea for the CLI, but I'm struggling to make sense of how they would work for SDKs. In your example for Lambda with secret_arn, what is the purpose of passing a profile name there? Why is it needed?

4) I dislike the idea of '--auth-strategy Custom' or auth_strategy="Custom"; that doesn't look right to me. In my view, when you create a profile you should select what type of profile you're creating, cognito or custom_idp. If you select custom_idp, you provide all the extra things it needs, that creates a new profile for you, and then you use it as normal. This would also make sense for the Lambda example: if I'm connecting from a Lambda, my secret will hold the full profile, which can be either a cognito profile or a custom_idp profile, and the client will figure out what type it is.

Going a bit further, I can see two types of custom IdP profiles with OAuth2 flows. One is a basic one that uses what's called a username/password flow in OAuth2, where there's no authorization flow; this is suitable when you want to call data.all APIs from a Lambda via the SDK. But if, for example, you're a human user running CLI commands, you could create a profile that doesn't use a username/password, where the CLI instead opens your browser and guides you through an authorization screen (the Google CLI utils do that when you want to get credentials for GCP environments). You could start with basic username/password support, but that will make some users unhappy who do not use username/passwords with their IdP.
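
For reference, the username/password flow mentioned above is OAuth2's resource owner password credentials grant; a minimal sketch of the token request (the token URL and credentials are placeholders):

import requests

def get_token_password_grant(token_url, client_id, client_secret, username, password):
    # Resource owner password credentials grant: no browser interaction,
    # so it suits headless callers such as a Lambda using the SDK
    response = requests.post(
        token_url,
        data={
            "grant_type": "password",
            "client_id": client_id,
            "client_secret": client_secret,
            "username": username,
            "password": password,
        },
    )
    response.raise_for_status()
    return response.json()["access_token"]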

We can meet to discuss this last point a bit more..

TejasRGitHub commented 5 months ago

When authenticating the user with the CLI, the user information is also stored (~/.dataall_cli/config.yaml). Does that mean that when the token has expired, the same user information can be used to fetch another token? Or is it used for some other reason?

noah-paige commented 5 months ago

Thanks both for the great comments and questions above; here are some thoughts/responses.

Response to @zsaltys Comments:

With the above in mind, I propose we still keep a set of static schema files (using the latest release's static schema by default), with the user able to specify a particular data.all version's schema if they need a version earlier than the latest release (to handle version matching).
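
A sketch of how that version matching could work on the loader side (the directory naming and version string are hypothetical):

from pathlib import Path

SCHEMA_DIR = Path(__file__).parent / "schema"
LATEST = "v2.6.0"  # hypothetical: the latest bundled release

def load_schema_file(version=None):
    # Default to the latest release's static schema, or use the schema
    # bundled for an explicitly pinned earlier data.all version
    selected = version or LATEST
    schema_path = SCHEMA_DIR / selected / "schema.graphql"
    if not schema_path.exists():
        raise ValueError(f"No bundled schema for data.all {selected}")
    return schema_path.read_text()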

Additionally, if we move to AppSync to host the GQL endpoint, we can have the schema retrieved automatically on initialization of the SDK client by running introspection queries against the endpoint, meaning we can generate a schema file specific to that data.all deployment, removing any disabled modules and including any additional APIs developed (to handle modules enabled/disabled and custom development).
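
A minimal sketch of that introspection step using graphql-core's standard introspection query (the endpoint and auth header are placeholders, and the actual AppSync auth mode may differ):

import requests
from graphql import get_introspection_query, build_client_schema, print_schema

def fetch_deployment_schema(endpoint, token):
    # Run the standard GQL introspection query against this deployment's endpoint
    response = requests.post(
        endpoint,
        json={"query": get_introspection_query()},
        headers={"Authorization": token},
    )
    response.raise_for_status()
    # The resulting schema reflects exactly the modules enabled in this
    # deployment, including any custom APIs
    schema = build_client_schema(response.json()["data"])
    return print_schema(schema)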

Response to @TejasRGitHub Comments: