dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

generate `dlt` resources from the openAPI specification #120

Closed · rudolfix closed this issue 4 months ago

rudolfix commented 1 year ago

Motivation

Most APIs have their openAPI definitions publicly accessible, and it is a common approach to generate Python clients from them automatically. We want to do the same and generate the dlt.resources corresponding to the API endpoints. The intention is to generate initial code for a dlt user that needs to access a new API. As the first test cases pick Pipedrive and Hubspot (which we know generate very nice datasets with dlt). For more openAPI specs look in rapidapi.

As mentioned, Python clients are already generated this way. Here's one that is template based: https://github.com/openapi-generators/openapi-python-client. We may be able to use the same approach.

Requirements

    • [ ] for all the GET endpoints generate corresponding dlt.resources. convert the path and query parameters into input arguments, generate typing and defaults (I believe you can copy this from the python client generating code); see the sketch after this list
    • [ ] try to figure out the authentication method from the openAPI spec. provide a helper method for each authentication type, pass the required authentication elements (i.e. bearer token, api key and secret, or whatever it is) to the dlt.resource methods and mark them with dlt.secrets.value so they are automatically injected.
    • [ ] most APIs define the returned documents with JSON schemas. convert the types into TTableSchema so we have correct types for table columns (let's discuss nested tables later).
    • [ ] when a list of objects is returned we should be able (optionally, via a flag when generating the code) to still yield items one by one.
    • [ ] the base URL for the API should be taken from the openapi servers property when available. when there are multiple servers, create a mapping and accept an argument/config prop that is either a server name (e.g. production) or a url like https://example.com/api
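A hypothetical sketch of what the generator could emit for a single GET endpoint, covering the points above; the endpoint, parameter names, header name and base URL are all made up for illustration:

import dlt
import requests
from typing import Iterator, Optional


@dlt.resource(table_name="deals", write_disposition="replace")
def deals(
    api_key: str = dlt.secrets.value,  # injected automatically, per the spec's auth scheme
    status: Optional[str] = None,      # query parameter from the spec, with its default
    limit: int = 100,                  # query parameter from the spec, with its default
) -> Iterator[dict]:
    response = requests.get(
        "https://example.com/api/v1/deals",  # base URL taken from the servers property
        headers={"X-API-KEY": api_key},      # the auth helper would normally build this
        params={"status": status, "limit": limit},
    )
    response.raise_for_status()
    # assuming the endpoint returns a JSON array at the root,
    # yield items one by one so dlt normalizes them into rows
    yield from response.json()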

Additional Heuristics

    • [ ] most APIs let the user filter for lists of object ids and then provide an endpoint to get those objects' details. ideally we would create dlt.transformers that allow pipelining data from the lists to those "enriching" endpoints. let's find a way to do it at least in some of the cases.
    • [ ] it would be cool to figure out which of the fields in the returned data are unique identifiers and add primary_key hints (see the sketch after this list)
    • [ ] it would be cool to figure out how data is paginated when a list of items is requested and generate paginator code and dlt.resources that can take N items or all items across all pages. there are only a few pagination types. alternatively we could pass options like: this is pagination of type x and this is the next_page element, etc.
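For the primary_key heuristic, the hint could be emitted directly on the resource decorator, which also enables merge loading. A minimal sketch (the endpoint and field names are made up):

import dlt


@dlt.resource(table_name="deals", primary_key="id", write_disposition="merge")
def deals():
    # placeholder data; in generated code this would be the API call
    yield {"id": 1, "title": "first deal"}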

Deliverable:

steinitzu commented 1 year ago

Been doing some hacking on this.

openapi-python-client looks like a good base to build on. The templating works pretty well, and creating the dlt schema should be a similar approach to how the data classes are generated.
The openapi parser needs some tweaks. It seems to fail easily on odd/non-standard things, e.g. pipedrive has boolean type enums, missing type fields and such, which could simply be ignored or inferred.
It would be good if we could still generate the parts that can be parsed and show warnings when something's off. The user can then fix the code as needed; a rough sketch of that follows.
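A rough sketch of that lenient behaviour; generate_resource here is just a placeholder for whatever the real generator entry point turns out to be:

import warnings
from typing import Callable


def generate_all(spec: dict, generate_resource: Callable[[str, dict], None]) -> None:
    # generate what parses, warn (instead of failing) on anything that doesn't
    for path, operations in spec.get("paths", {}).items():
        try:
            generate_resource(path, operations)
        except Exception as exc:  # broad on purpose: odd/non-standard specs are common
            warnings.warn(f"Skipping {path}: {exc}")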

I think pagination is also crucial for this to be useful, ideally we handle most cases without having to edit the generated code.

Some ideas on that and on dependent endpoints:

Pagination and array endpoints

We can try to detect what kind of pagination is used. E.g. offset, cursor, etc

  1. Detect whether the endpoint returns a list of objects.

    Check the response schema for a data array. Sometimes there is an array in the root, sometimes the data is nested within an envelope.
    The resource should yield the data array, not the top level object which may have some metadata.
    Optionally the resource should accept a JSON path or callback argument to point to the data field.

  2. Find the pagination request parameter

    Usually a query parameter. A guess could be a field whose name or description fuzzily matches something like "page", "pagination", "offset", "cursor", "after", etc. If the same field is also found in the response schema, that strengthens the guess.
    Let user override with argument.

  3. Determine what kind of pagination

    For example, if the pagination param is an integer, it's likely offset pagination;
    if it's a string, it's likely a cursor (a detection sketch follows this list).
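A rough sketch of the detection heuristics from points 2 and 3, operating on an endpoint's OpenAPI parameter objects; the keyword list is just a guess:

from typing import Optional

PAGE_HINTS = ("page", "pagination", "offset", "cursor", "after", "next")


def guess_pagination_param(query_params: list) -> Optional[dict]:
    # query_params: the endpoint's OpenAPI parameter objects with in == "query"
    for param in query_params:
        haystack = f"{param.get('name', '')} {param.get('description', '')}".lower()
        if any(hint in haystack for hint in PAGE_HINTS):
            return param
    return None


def guess_pagination_style(param: dict) -> str:
    # integer parameters usually mean offset/page-number pagination,
    # string parameters usually carry a cursor token
    return "offset" if param.get("schema", {}).get("type") == "integer" else "cursor"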

Offset pagination is easy, increment until there are no more results.

Cursor pagination is tricky because we need to extract the cursor from the response data.
There is no standard for how this is returned. A best-attempt could be to look for a field which has the same name as the cursor query param, or something like "next".
Otherwise the resource should take a JSON path argument to override.

Sometimes APIs also return a complete URL to the next page rather than a cursor param, so we should check whether the value looks like a URL.

The resources could also accept a next_page callback argument. The callback receives the arguments of the previous request and the previous response, and should return new arguments (or None / raise a special exception to say "no more pages"); see the paginator sketch below.
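A sketch of the generic pagination loop the generated resources could share. next_page is the callback described above: it gets the previous request params and the response and returns the params for the next request, or None to stop. data_key is a single key rather than a full JSON path just to keep the sketch short; everything here is illustrative:

import requests
from typing import Callable, Iterator, Optional


def paginate(
    url: str,
    params: dict,
    data_key: Optional[str] = None,
    next_page: Optional[Callable[[dict, requests.Response], Optional[dict]]] = None,
) -> Iterator[dict]:
    while True:
        response = requests.get(url, params=params)
        response.raise_for_status()
        payload = response.json()
        # unwrap the data array from an envelope like {"data": [...], "meta": ...}
        yield from (payload[data_key] if data_key else payload)
        if next_page is None:
            return
        params = next_page(params, response)
        if params is None:  # the callback signals "no more pages"
            return


def offset_next_page(prev_params: dict, response: requests.Response) -> Optional[dict]:
    # offset pagination: stop when the page comes back empty, otherwise advance the offset
    # (assumes the items sit under a "data" envelope)
    if not response.json().get("data"):
        return None
    return {**prev_params, "offset": prev_params.get("offset", 0) + prev_params.get("limit", 100)}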

Dependent endpoints

Endpoints that take path parameters received in a response of another endpoint.

For example GET /deals/{id}/participants uses the IDs from results of GET /deals

These kinds of endpoints should be implemented as transformers.

For a nice RESTful API like this we can construct the resource dependency tree by traversing the API paths.
E.g. splitting the path before the path param /deals/{id}/abc -> /deals to know it should depend on the /deals resource.

Can be done recursively to handle arbitrary nesting: /a/{a_id}/b/{b_id}/c/{c_id}/d/{d_id}

By default we assume the path param matches a named field on the parent object, but that's not guaranteed.
For example we could have /deals/{deal_id}/abc while the deals endpoint returns objects like {id: 123, ...}

An id field is probably the most common case, so we could fall back on looking for an id field. But there should also be a way for the user to override with a JSON path.

For cases where the dependency tree can't be inferred we need to handle it somehow. We could generate a transformer anyway (or both a transformer and a standalone resource) and let the user decide how to wire it up; a transformer sketch follows.
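A minimal sketch of the transformer wiring for the /deals -> /deals/{id}/participants example above; the base URL and the "data" envelope are assumptions:

import dlt
import requests
from typing import Iterator

BASE_URL = "https://example.com/api/v1"


@dlt.resource(table_name="deals")
def deals() -> Iterator[dict]:
    yield from requests.get(f"{BASE_URL}/deals").json()["data"]


@dlt.transformer(data_from=deals, table_name="deal_participants")
def deal_participants(deal: dict) -> Iterator[dict]:
    # the path parameter is taken from the parent object; here we assume it is `id`,
    # with a JSON path override left to the user when the field name differs
    yield from requests.get(f"{BASE_URL}/deals/{deal['id']}/participants").json()["data"]

Running deal_participants in a pipeline would then pull deals first and feed each deal into the dependent endpoint.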

Testing materials: https://github.com/APIs-guru/openapi-directory

steinitzu commented 1 year ago

Plan for handling authentication:

@dlt.resource
def get_deals(
    credentials: Union[MyApiKeyCredentials, MyOAuth2Credentials] = dlt.secrets.value,
    ...

The resource receives credentials via an ad-hoc config spec generated from the security schemes in the openapi spec.

A Union is used when the endpoint supports multiple authentication types; it can be discriminated by the "type", "in" and "scheme" attributes.

The credentials object is passed along with the other request arguments to the client. The API client calls to_http_params to get the auth cookies/headers/query params, which are merged into the request params. E.g. (note that `in` is a reserved keyword in Python, so the generated attribute is spelled `in_`):

from base64 import b64encode
from typing import Literal


class MyApiKeyCredentials(ConfigSpec):
    type: Literal["apiKey"] = "apiKey"
    in_: Literal["header"] = "header"  # openapi "in"; renamed because `in` is a Python keyword
    name: Literal["X-SOME-API-KEY-HEADER"] = "X-SOME-API-KEY-HEADER"
    api_key: str  # From secrets.yaml

    def to_http_params(self):
        return dict(cookies={}, headers={self.name: self.api_key}, params={})


class MyQueryApiKeyCredentials(ConfigSpec):
    type: Literal["apiKey"] = "apiKey"
    in_: Literal["query"] = "query"
    name: Literal["api_token"] = "api_token"
    api_key: str  # From secrets.yaml

    def to_http_params(self):
        # Get cookies/query params/headers to merge into request params
        return dict(cookies={}, headers={}, params={self.name: self.api_key})


class MyBasicAuthCredentials(ConfigSpec):
    type: Literal["http"] = "http"
    scheme: Literal["basic"] = "basic"
    username: str  # from secrets.yaml
    password: str

    def to_http_params(self):
        token = b64encode(f"{self.username}:{self.password}".encode()).decode()
        return dict(cookies={}, headers={"Authorization": f"Basic {token}"}, params={})

Note that OAuth2 always ends up as a bearer token in the request, but we should probably have a config for each flow the API supports.
For flows that support a refresh token and client credentials it would be nice to accept either an access token or a refresh token/client ID/secret and have a helper to facilitate the refresh; a sketch follows.
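A sketch of what such a helper could look like, extending the same hypothetical ConfigSpec base as above and refreshing via the standard refresh_token grant; the token_url and field names are assumptions:

import requests
from typing import Literal, Optional


class MyOAuth2Credentials(ConfigSpec):  # ConfigSpec: same hypothetical base as above
    type: Literal["oauth2"] = "oauth2"
    token_url: str = "https://example.com/oauth/token"  # assumed token endpoint
    access_token: Optional[str] = None   # from secrets.yaml, if already issued
    refresh_token: Optional[str] = None  # or exchange these three for an access token
    client_id: Optional[str] = None
    client_secret: Optional[str] = None

    def _ensure_access_token(self) -> str:
        # refresh lazily when only a refresh token + client credentials are configured
        if self.access_token is None:
            response = requests.post(
                self.token_url,
                data={
                    "grant_type": "refresh_token",
                    "refresh_token": self.refresh_token,
                    "client_id": self.client_id,
                    "client_secret": self.client_secret,
                },
            )
            response.raise_for_status()
            self.access_token = response.json()["access_token"]
        return self.access_token

    def to_http_params(self):
        # OAuth2 always ends up as a bearer token on the request
        return dict(
            cookies={},
            headers={"Authorization": f"Bearer {self._ensure_access_token()}"},
            params={},
        )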

rudolfix commented 1 year ago

@steinitzu this all makes sense. let's try to build a first version in which those heuristics do not need to work (I assume they will be the hardest part). I'll give it a run on a few APIs.

rudolfix commented 1 year ago

Let's try to use the Pokemon API and squeeze something useful out of it :)

this is my take on improvements

  1. handle resources/transformers with required arguments. right now the generated code fails (i.e. for those that require an id to be passed)
  2. you may create resources dynamically to avoid that problem and return unbound resources for late binding (see the sketch after the schema below)
  3. we should (heuristically) infer the data types returned by a given resource and name the table (table_name) accordingly. for example pokemon_list and pokemon_read return a list of Pokemon and a single Pokemon respectively, so the table name should be pokemon
  4. resource generation:
    • let's generate resources only for endpoints that return a list of objects.
    • for endpoints that return a single object we should (heuristically, i.e. using the path) match it with a list resource and generate a transformer that is connected to that resource.
    • for any other endpoint that returns single elements let's generate an unconnected transformer (if it takes arguments) or a resource (if it does not)
    • transformers should not be initially selected
  5. yield lists, not lists wrapped in dictionaries (another heuristic, but let's make that work for pokemon first)
  6. fun part: we allow the user to select which resources they want generated by using https://github.com/tmbo/questionary
  7. fun part 2: we should order the resources putting the most relevant ones first. the relevant resources are those that return types that reference the most other types. I bet that will be the Pokemon type in the case of the Pokemon API. look at this:
    Pokemon:
      type: object
      properties:
        id:
          type: integer
          format: int32
        name:
          type: string
        base_experience:
          type: integer
          format: int32
        height:
          type: integer
          format: int32
        is_default:
          type: boolean
        order:
          type: integer
          format: int32
        weight:
          type: integer
          format: int32
        abilities:
          type: array
          items:
            $ref: '#/components/schemas/PokemonAbility'
        forms:
          type: array
          items:
            $ref: '#/components/schemas/PokemonForm'
        game_indices:
          type: array
          items:
            $ref: '#/components/schemas/VersionGameIndex'
        held_items:
          type: array
          items:
            $ref: '#/components/schemas/PokemonHeldItem'
        location_area_encounters:
          type: string
        moves:
          type: array
          items:
            $ref: '#/components/schemas/PokemonMove'
        sprites:
          $ref: '#/components/schemas/PokemonSprites'
        species:
          $ref: '#/components/schemas/NamedAPIResource'
        stats:
          type: array
          items:
            $ref: '#/components/schemas/PokemonStat'
        types:
          type: array
          items:
            $ref: '#/components/schemas/PokemonType'
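A hand-written sketch of points 2-4 for the Pokemon API: resources built dynamically from plain functions, a shared pokemon table name, and the single-object endpoint attached as a transformer. The helper names are made up; only the pokeapi.co paths are real:

import dlt
import requests
from typing import Iterator

BASE_URL = "https://pokeapi.co/api/v2"


def _pokemon_list() -> Iterator[dict]:
    # yield the list itself, not the {"count": ..., "results": [...]} envelope (point 5)
    yield from requests.get(f"{BASE_URL}/pokemon", params={"limit": 100}).json()["results"]


def _pokemon_read(item: dict) -> Iterator[dict]:
    # follow each list item's detail url to get the full Pokemon object
    yield requests.get(item["url"]).json()


# both endpoints return Pokemon data, so both write to the `pokemon` table (point 3);
# building them at runtime keeps them unbound until the user selects them (point 2)
pokemon_list = dlt.resource(_pokemon_list, name="pokemon_list", table_name="pokemon")
pokemon_read = dlt.transformer(data_from=pokemon_list, table_name="pokemon")(_pokemon_read)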

rudolfix commented 1 year ago

Make the demo nice

In order of importance

  1. allow selecting the endpoints with questionary (this is actually the core of the demo - right now we have so much code generated...)
  2. can we somehow guess the client name? maybe from the url? if not, we may add an additional argument to the cli or hardcode it. https://raw.githubusercontent.com/cliffano/pokeapi-clients/main/specification/pokeapi.yml gives us pokeapi

(OK, I actually see that the generator takes it from the info tag, so I will just modify the spec for poke)

info:
  title: Planets and Webhooks Demo API

  3. create config.toml and set base_url to

    servers:
    - url: 'https://pokeapi.co/'
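A small sketch of how the generated source could take base_url from config, resolved via dlt.config.value; the source and resource names are illustrative:

import dlt
import requests


@dlt.source
def pokeapi_source(base_url: str = dlt.config.value):
    # base_url resolves from config.toml, e.g. a base_url entry under [sources.pokeapi_source]
    @dlt.resource(table_name="pokemon")
    def pokemon_list():
        yield from requests.get(f"{base_url}/api/v2/pokemon", params={"limit": 100}).json()["results"]

    return pokemon_list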

(Future) Make it useful and hackable

rudolfix commented 4 months ago

@sh-rp @steinitzu I think we can say this is done. took just 1.5 years :>