anmolsgandhi opened 8 months ago
Could be useful, especially for those who want to run scripts to do a bunch of things, like importing a lot of datasets. Today, when I want to validate things, I query RDS directly; it would be nice not to have to do that and instead query via a CLI or APIs. I would say authentication is a big one to figure out here, especially for IdP users. For example, we use Okta, and I'm not even sure I can register headless users in our IdP. Definitely think about how headless users who are not using pure Cognito will get access.
Appreciate the feedback @zsaltys. This feature has been requested frequently and has been implemented by some in different capacities. We believe it adds value to various use cases, including the ones you highlighted. I agree that examining authentication aspects will be crucial. We intend to address this initiative in a phased approach. As we delve deeper and gain more insights, we'll use this issue/PR to document and manage feedback and suggestions.
For the data.all SDK/CLI, we will host the SDK and CLI packages within a new repository (separate from data.all OS but within the same GitHub Org) with release cycles for each of the respective packages (repo name TBD)
A separate repository to host this code will provide the following benefits:
GraphQL Schema File: The key information that powers the SDK/CLI utility will be the GQL schema file, which defines all of the supported API requests (i.e. queries and mutations) along with the input and output types for those requests. We will leverage this file to dynamically generate the methods exposed by the SDK/CLI.
For reference, in data.all OS today we already build this GQL schema in api_handler.py, similar to the following code:
```python
from dataall.base.api import bootstrap as bootstrap_schema, get_executable_schema
from graphql import print_schema

# Build the executable schema and print it in SDL form
schema = get_executable_schema()
print(print_schema(schema))
```
Repo Layout

The below is a high-level look at how the data.all SDK/CLI programmatic tools library will be organized:

```
schema/
    schema.graphql        # ---> GQL schema file
    ...
dataall_core/
    loader.py             # ---> Loads the schema GQL file
    dataall_client.py     # ---> Generates methods dynamically
    base_client.py        # ---> Class the methods are attached to
    auth/                 # ---> Auth classes (i.e. Cognito, Okta, etc.)
    ...
api/
    __init__.py
    Client.py
    utils/
        __init__.py
        stacks_utils.py
        ...
cli/
    __init__.py
    Client.py
    ...
```
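To illustrate the loader.py / dataall_client.py idea, here is a minimal, self-contained sketch of generating client methods dynamically from a GQL schema file. All names here are illustrative assumptions, not the actual implementation; a real loader would use a proper GraphQL parser rather than regexes:

```python
import re

# Illustrative stand-in for schema/schema.graphql
SCHEMA_SDL = """
type Query {
  listOrganizations(filter: OrganizationFilter): OrganizationSearchResult
  getDataset(datasetUri: String!): Dataset
}
"""

def parse_operation_names(sdl: str) -> list:
    # Crude parse: grab the field names declared inside the Query type.
    query_block = re.search(r"type Query \{(.*?)\}", sdl, re.S).group(1)
    return re.findall(r"^\s*(\w+)\(", query_block, re.M)

class BaseClient:
    """Class the generated methods are attached to (cf. base_client.py)."""
    def execute(self, operation, **variables):
        # In the real client this would POST a GQL request to the endpoint.
        return {"operation": operation, "variables": variables}

def build_client(sdl: str) -> BaseClient:
    client = BaseClient()
    for op in parse_operation_names(sdl):
        def method(self=client, _op=op, **variables):
            return self.execute(_op, **variables)
        setattr(client, op, method)
    return client

client = build_client(SCHEMA_SDL)
print(client.listOrganizations())
# {'operation': 'listOrganizations', 'variables': {}}
```

The payoff of this pattern (as in botocore) is that the Python package never hard-codes the API surface: regenerating or swapping the schema file updates the available methods automatically.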
The above file layout follows similar design principles to other widely used SDK/CLI tooling (such as botocore, boto3, and awscli). A more in-depth description of the above file structure follows:
CLI Command: dataall_cli configure sets up the client, collecting all user information (i.e. username, password, client_id, endpoint_url, and region):
```shell
dataall_cli client configure \
    --username "dataall-username" \
    --password "********" \
    --client_id "********" \
    --url "https://******.execute-api.<AWS_REGION>.amazonaws.com/prod/" \
    --region "<AWS_REGION>" \
    --profile "dataall-team"
```
Storing Credentials Locally
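A later comment in this thread references ~/.dataall_cli/config.yaml as the local store. One possible shape for that file is sketched below; the exact keys are assumptions for illustration, not a settled format:

```yaml
# ~/.dataall_cli/config.yaml (hypothetical layout)
profiles:
  dataall-team:
    auth_type: cognito          # or custom_idp
    client_id: "********"
    url: "https://******.execute-api.<AWS_REGION>.amazonaws.com/prod/"
    region: "<AWS_REGION>"
    username: "dataall-username"
    # tokens would be cached separately after first authentication
```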
Running the CLI

```shell
dataall_cli list_organizations --profile dataall-team
```
Running the SDK

```python
da_client = dataall.client(profile="dataall-team")
list_org_response = da_client.list_organizations()
```
Introduce an optional client-level parameter auth_strategy (default: 'Cognito') for data.all deployments with custom auth:
```shell
dataall_cli list_organizations --profile dataall-team --auth-strategy Custom
```

OR

```python
da_client = dataall.client(profile="dataall-team", auth_strategy="Custom")
```
Example: using the SDK from an AWS Lambda, with credentials resolved from Secrets Manager:

```python
import dataall_sdk

def lambda_handler(event, context):
    # Set up data.all client as UserA
    da_client = dataall_sdk.client(profile="dataall-team-lambda", secret_arn="<SECRET_ARN>")
    # List organizations - ensure the org was created
    list_org_response = da_client.list_organizations()
    print(list_org_response)
```
Certainly, a great feature to have. Q: As there are plans to use AppSync for GQL, will this also work seamlessly with AppSync?
Yes! It will work seamlessly with AppSync for GQL - no re-architecture needed between the two endpoint services.
Some feedback on design @noah-paige:
I like the idea of a separate repo. I like the idea of a python package that will provide both the SDK and the CLI utils.
1) Generating schema.graphql. When will this happen? Do we want to generate it automatically during the build of the SDK? If so, we need to somehow specify the data.all version so we know which data.all repo to use to generate the schema. Or perhaps data.all should publish this file as part of its own build; then, during the SDK build, we record in configuration which data.all version we're building against, and the build simply fetches the schema file from the data.all repo. Could you add some details on how exactly this will work: how developers of the SDK will get a version of schema.graphql, and how it will be included in the python package during the build?
2) There's a plan to move to AppSync, and I believe this means the schema starts living in AppSync, with the schema defined in CDK. It means we would still need some way to generate the schema. Can you add some clarity in the design on how the schema file would work in the AppSync world, and how that would affect local development and building of the library?
3) Profiles are a nice idea for the CLI, but I'm struggling to make sense of how they would work for SDKs. In your Lambda example with secret_arn, what is the purpose of passing a profile name there? Why is it needed?
4) I dislike the idea of '--auth-strategy Custom' or auth_strategy="Custom". That doesn't look right to me. In my view, when you create a profile you should select what type of profile you're creating: cognito or custom_idp. If you select custom_idp, you provide all the extra details it needs; that creates a new profile for you, which you then use as normal. This also makes sense for the Lambda example: if I'm connecting from a Lambda, my secret will hold the full profile, which can be either a cognito profile or a custom_idp profile, and the client will figure out which type it is.
Going a bit further, I can see two types of custom IdP profiles with OAuth2 flows. One is a basic one using what OAuth2 calls the username/password flow, where there's no authorization flow. This is suitable when you want to call data.all APIs from a Lambda via the SDK. But if, for example, you're a human user running CLI commands, you could create a profile that doesn't use a username/password; instead, the CLI opens your browser and guides you through an authorization screen (Google's CLI utils do this when you want to get credentials for GCP environments). You could start with basic username/password support, but that will make some users unhappy whose IdP does not use username/passwords.
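To make the headless case concrete, here is a rough sketch of what the username/password (resource owner password credentials) flow looks like at the HTTP level. The token URL and client_id are hypothetical placeholders, and real IdPs differ in required scopes and parameters; a browser-based authorization-code flow would instead open an IdP login page and receive a redirect callback:

```python
import urllib.parse
import urllib.request

def build_ropc_token_request(token_url: str, client_id: str,
                             username: str, password: str):
    """Build (but do not send) an OAuth2 ROPC token request.

    Hypothetical helper for illustration; suitable for headless callers
    such as a Lambda, where no browser is available.
    """
    body = urllib.parse.urlencode({
        "grant_type": "password",
        "client_id": client_id,
        "username": username,
        "password": password,
    }).encode()
    return urllib.request.Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

# Placeholder values only; a real call would use the IdP's token endpoint
req = build_ropc_token_request(
    "https://idp.example.com/oauth2/token", "my-client-id",
    "dataall-username", "********",
)
print(req.full_url, req.get_method())
```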
We can meet to discuss this last point a bit more..
When authenticating the user with the CLI, the user information is also stored (~/.dataall_cli/config.yaml). Does that mean that when the token expires, the same user information can be used to fetch another token? Or is it used for some other reason?
Thanks both for the great comments and questions above; here are some thoughts/responses.
Response to @zsaltys Comments:
With the above in mind, I propose we still keep a set of static schema files (using the latest release's static schema by default), with a user able to specify a particular data.all version's schema if they need a version earlier than the latest release (to handle version matching).
Additionally, if moving to AppSync to host the GQL endpoint, we can have the schema retrieved automatically on initialization of the SDK client by running introspection queries against the endpoint, meaning we can generate a schema file specific to that data.all deployment, removing any disabled modules and including any additional APIs developed (to handle enabled/disabled modules and custom development).
(2) When moving to AppSync, the creation of the schema file becomes quite easy. To retrieve the schema, we can either use the built-in boto3 API get_introspection_schema (link) or run introspection queries against the endpoint to discover the schema (link).
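As a sketch of the introspection-query option, the request can be as small as the following. The endpoint URL and Authorization header are placeholders; the boto3 route would instead call the AppSync get_introspection_schema API with the API id:

```python
import json
import urllib.request

# Minimal introspection query: list all type names plus the
# fields available on the root Query type.
INTROSPECTION_QUERY = """
{
  __schema {
    queryType { fields { name } }
    types { name }
  }
}
"""

def fetch_schema_summary(endpoint_url: str, token: str) -> dict:
    """POST an introspection query to a GQL endpoint (illustrative)."""
    req = urllib.request.Request(
        endpoint_url,
        data=json.dumps({"query": INTROSPECTION_QUERY}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": token,  # placeholder auth scheme
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Running a full introspection query of this shape at client initialization is what would let the SDK reflect exactly the modules enabled in a given deployment.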
(3) I think the SDK can be run either in local scripts or in AWS compute resources:
(4) Agreed - I have since removed the auth_strategy parameter from the client method and added it as an option selected when configuring a data.all user. Depending on the type of user being configured (via dataall_cli configure), the user will be prompted to provide the specific information required. After that, it is the same pattern to use either type of profile.
The last point is interesting - redirecting directly to an auth screen and having the user pass auth that way. Something I will look into with the Google CLI, and we can discuss further for sure.
Response to @TejasRGitHub Comments:
Description:
Introduce a command-line interface (CLI) and a software development kit (SDK) for data.all, and publish these tools as separate packages through a publishing pipeline to PyPI. This feature aims to provide users with a seamless and efficient means of interfacing with data.all programmatically, allowing for enhanced automation, scripting, and integration into existing workflows. This not only facilitates ease of distribution but also enhances version control, making it simpler for users to incorporate and manage the tools in their environments.

The CLI/SDK tool should encompass a comprehensive set of commands and functions, enabling users to perform operations related to data.all features such as datasets, dataset shares, environments, organizations, and consumption roles through programmatic interfaces. This enhanced CLI/SDK tool, coupled with PyPI publishing capabilities, would significantly improve the overall user experience, empowering developers and administrators to leverage data.all functionality programmatically.
Details:
Command-Line Interface (CLI):
Software Development Kit (SDK):
PyPI Publishing Component:
Benefits:
@noah-paige