data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (business users, analysts, and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Data.all CLI and SDK #950

Open anmolsgandhi opened 8 months ago

anmolsgandhi commented 8 months ago

Description:

Introduce a command-line interface (CLI) and a software development kit (SDK) for data.all, published as separate packages to PyPI through a publishing pipeline. This feature aims to give users a seamless, efficient means of interfacing with data.all programmatically, enabling automation, scripting, and integration into existing workflows. Distributing through PyPI not only eases distribution but also enhances version control, making it simpler for users to incorporate and manage the tools in their environments. The CLI/SDK should cover a comprehensive set of commands and functions, enabling users to perform operations on data.all features such as datasets, dataset shares, environments, organizations, and consumption roles through programmatic interfaces. This tooling, coupled with PyPI publishing, would significantly improve the overall user experience, empowering developers and administrators to leverage data.all functionality programmatically.

Details:

Command-Line Interface (CLI):

Software Development Kit (SDK):

PyPI Publishing Component:

Benefits:

@noah-paige

zsaltys commented 8 months ago

Could be useful, especially for those who want to run scripts to do a bunch of things, like importing a lot of datasets. Today, when I want to validate things, I query RDS directly; it would be nice not to have to do that and to query via the CLI or APIs instead. I would say authentication is a big one to figure out here, especially for IdP users. For example, we use Okta, and I'm not even sure I can register headless users in our IdP. Definitely think about how headless users who are not using pure Cognito will get access.

anmolsgandhi commented 8 months ago

Appreciate the feedback @zsaltys. This feature has been requested frequently and has been implemented by some in different capacities. We believe it adds value to various use cases, including the ones you highlighted. I agree that examining authentication aspects will be crucial. We intend to address this initiative in a phased approach. As we delve deeper and gain more insights, we'll use this issue/PR to document and manage feedback and suggestions.

noah-paige commented 6 months ago

Design Considerations

Repository Location

For the data.all SDK/CLI, we will host the SDK and CLI packages within a new repository (separate from data.all OS but within the same GitHub Org), with release cycles for each of the respective packages (repo name TBD).

A separate repository to host this code will provide the following benefits:

Repository Structure

GraphQL Schema File: The key information that powers the SDK/CLI utility will be the GQL schema file, which defines all of the supported API requests (i.e. queries and mutations) along with the input and output types for those requests. We will leverage this file to dynamically generate the methods exposed by the SDK/CLI client.

For reference in data.all OS today, we build this GQL schema already in the api_handler.py similar to the following code:

from dataall.base.api import bootstrap as bootstrap_schema, get_executable_schema
from graphql import print_schema

# Build the executable schema from data.all's registered modules
schema = get_executable_schema()
# Print the schema in GQL SDL form (the contents of schema.graphql)
print(print_schema(schema))
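
As a rough illustration of the dynamic generation idea (a minimal sketch, not the final implementation; the function names load_schema and attach_methods are hypothetical, and it assumes a base client exposing an execute(...) method per base_client.py in the layout below), the schema file can be parsed with graphql-core and one client method attached per query/mutation:

from graphql import build_schema

def load_schema(path="schema/schema.graphql"):
    # Parse the static GQL schema file into a GraphQLSchema object
    with open(path) as f:
        return build_schema(f.read())

def attach_methods(client_cls, schema):
    # Generate one method per supported query and mutation
    # (a real implementation would also snake_case the GQL field names)
    for operation, root in (("query", schema.query_type), ("mutation", schema.mutation_type)):
        if root is None:
            continue
        for field_name in root.fields:
            def method(self, _op=operation, _field=field_name, **variables):
                # Delegate to the base client, which builds and sends the GQL request
                return self.execute(_op, _field, **variables)
            setattr(client_cls, field_name, method)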

Repo Layout

The below is a high-level look at how the SDK/CLI data.all programmatic tools library will be organized:

schema/
    schema.graphql         # ---> GQL Schema File
    ...
dataall_core/
    loader.py              # ---> Load Schema GQL File
    dataall_client.py      # ---> Generate Methods Dynamically
    base_client.py         # ---> Class Where Methods Attached
    auth/                  # ---> Auth Classes (i.e. Cognito, Okta, etc.)
    ...
api/
    __init__.py
    Client.py
    utils/
        __init__.py
        stacks_utils.py
    ...
cli/
    __init__.py
    Client.py
    ...

The above file layout follows similar design principles to other widely used SDK/CLI tooling (such as botocore, boto3, and awscli). A more in-depth description of the above file structure follows:

Authentication Flows

Configuration & Profiles

Supporting Custom Auth (i.e. Okta, ...)

OR

da_client = dataall.client(profile="dataall-team", auth_strategy="Custom")
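
As a sketch of how pluggable auth could look behind either option (the class names here are illustrative, not a final API), every strategy only needs to produce a token for the request's Authorization header, so Cognito, Okta, or any other IdP can sit behind one common interface:

import abc

class AuthStrategy(abc.ABC):
    # Common interface: every auth strategy must be able to produce a bearer token
    @abc.abstractmethod
    def get_token(self) -> str:
        ...

class CognitoAuth(AuthStrategy):
    def __init__(self, client_id, username, password):
        self.client_id = client_id
        self.username = username
        self.password = password

    def get_token(self) -> str:
        # Placeholder: exchange the username/password for a JWT via Cognito
        raise NotImplementedError

class CustomAuth(AuthStrategy):
    def __init__(self, token_provider):
        # token_provider: a user-supplied callable for custom IdPs (e.g. Okta)
        self.token_provider = token_provider

    def get_token(self) -> str:
        return self.token_provider()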

Supporting AWS Compute Services (i.e. Lambda, Glue, etc.)

import dataall_sdk

def lambda_handler(event, context):
    # Set up the data.all client for UserA, resolving credentials
    # from the referenced Secrets Manager secret
    da_client = dataall_sdk.client(profile="dataall-team-lambda", secret_arn="<SECRET_ARN>")

    # List organizations - ensure the org was created
    list_org_response = da_client.list_organizations()
    print(list_org_response)
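
One way the secret_arn parameter could be handled internally (a sketch, assuming the secret stores the data.all credentials as a JSON string; the helper name is hypothetical):

import json
import boto3

def _resolve_credentials(secret_arn):
    # Fetch the stored data.all credentials from AWS Secrets Manager
    sm = boto3.client("secretsmanager")
    secret = sm.get_secret_value(SecretId=secret_arn)
    return json.loads(secret["SecretString"])
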
TejasRGitHub commented 5 months ago

Certainly, a great feature to have. Q: As there are plans to use AppSync for GQL, will this also work seamlessly with AppSync?

noah-paige commented 5 months ago

Certainly, a great feature to have. Q: As there are plans to use AppSync for GQL, will this also work seamlessly with AppSync?

Yes! It will work with AppSync for GQL seamlessly - no re-architecture needed between the two endpoint services.

zsaltys commented 5 months ago

Some feedback on design @noah-paige:

I like the idea of a separate repo. I like the idea of a python package that will provide both the SDK and the CLI utils.

1) Generating schema.graphql. When will this happen? Do we want to generate this automatically during the build of the SDK? If so, we need to somehow specify the version of data.all to know which data.all repo/tag to use to generate the schema. Or perhaps data.all should publish this file as part of its own build; then, during the SDK build, we store in configuration which version of data.all we're building against, and the build simply fetches the schema file from the data.all repo. Could you add some details on exactly how this will work: how developers of the SDK will get a version of schema.graphql, and how it will be included in the Python package during the build?

2) There's a plan to move to AppSync, and I believe this would mean the schema starts living in AppSync, with the schema defined in CDK. That means we would still need some way to generate the schema. Can you add some clarity in the design on how the schema file would work in the AppSync world, and how that would affect local development and building of the library?

3) Profiles are a nice idea for the CLI, but I'm struggling to make sense of how they would work for SDKs. In your example for Lambda with secret_arn, what is the purpose of passing a profile name there? Why is it needed?

4) I dislike the idea of '--auth-strategy Custom' or auth_strategy="Custom"; that doesn't look right to me. In my view, when you create a profile you should select what type of profile you're creating, cognito or custom_idp. If you select custom_idp, you provide all the extra things it needs, that creates a new profile for you, and then you use it as normal. This would also make sense for the Lambda example: if I'm connecting from a Lambda, my secret will hold the full profile, which can be either a cognito profile or a custom_idp profile, and the client will figure out what type it is.

Going a bit further, I can see two types of custom IdP profiles with OAuth2 flows. One is a basic one that uses what's called a username/password flow in OAuth2, where there's no authorization flow; this is suitable when you want to call data.all APIs from a Lambda via the SDK. But if, for example, you're a human user running CLI commands, you could create a profile that doesn't use a username/password, where the CLI instead opens your browser and guides you through an authorization screen (the Google CLI utils do that when you want to get credentials for GCP environments). You could start with basic username/password support, but that will make some users unhappy who do not use username/passwords with their IdP.
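
For reference, the username/password flow mentioned above is OAuth2's resource owner password credentials grant; a minimal sketch of the token request (the token URL and credentials are placeholders):

import requests

def get_token_password_grant(token_url, client_id, client_secret, username, password):
    # Resource owner password credentials grant: no browser interaction,
    # so it suits headless callers such as a Lambda using the SDK
    response = requests.post(
        token_url,
        data={
            "grant_type": "password",
            "client_id": client_id,
            "client_secret": client_secret,
            "username": username,
            "password": password,
        },
    )
    response.raise_for_status()
    return response.json()["access_token"]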

We can meet to discuss this last point a bit more..

TejasRGitHub commented 5 months ago

When authenticating the user with the CLI, the user information is also stored (~/.dataall_cli/config.yaml). Does that mean that when the token has expired, the same user information can be used to fetch another token? Or is it used for some other reason?

noah-paige commented 5 months ago

Thanks both for the great comments and questions above; here are some thoughts/responses.

Response to @zsaltys Comments:

With the above in mind, I propose we still keep a set of static schema files (using the latest release's static schema by default), with the user able to specify a particular data.all version's schema if they need a version earlier than the latest release (to handle version matching).
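
A sketch of how that version matching could work on the loader side (the directory naming and version string are hypothetical):

from pathlib import Path

SCHEMA_DIR = Path(__file__).parent / "schema"
LATEST = "v2.6.0"  # hypothetical: the latest bundled release

def load_schema_file(version=None):
    # Default to the latest release's static schema, or use the schema
    # bundled for an explicitly pinned earlier data.all version
    selected = version or LATEST
    schema_path = SCHEMA_DIR / selected / "schema.graphql"
    if not schema_path.exists():
        raise ValueError(f"No bundled schema for data.all {selected}")
    return schema_path.read_text()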

Additionally, if we move to AppSync to host the GQL endpoint, we can have the schema retrieved automatically on initialization of the SDK client by running introspection queries against the endpoint, meaning we can generate a schema file specific to that data.all deployment, removing any disabled modules and including any additional APIs developed (to handle modules enabled/disabled and custom development).
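
A minimal sketch of that introspection step using graphql-core's standard introspection query (the endpoint and auth header are placeholders, and the actual AppSync auth mode may differ):

import requests
from graphql import get_introspection_query, build_client_schema, print_schema

def fetch_deployment_schema(endpoint, token):
    # Run the standard GQL introspection query against this deployment's endpoint
    response = requests.post(
        endpoint,
        json={"query": get_introspection_query()},
        headers={"Authorization": token},
    )
    response.raise_for_status()
    # The resulting schema reflects exactly the modules enabled in this
    # deployment, including any custom APIs
    schema = build_client_schema(response.json()["data"])
    return print_schema(schema)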

Response to @TejasRGitHub Comments: