dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

generate `dlt` resources from the openAPI specification #120

Closed · rudolfix closed this issue 4 months ago

rudolfix commented 1 year ago

Motivation

Most APIs have their openAPI definitions publicly accessible, and it is a common approach to generate Python clients from them automatically. We want to do the same and generate the dlt.resources corresponding to the API endpoints. The intention is to generate initial code for a dlt user that needs to access a new API. As the first test cases pick Pipedrive and Hubspot (which we know generate very nice datasets with dlt). For more openAPI specs look in rapidapi.

As mentioned, Python clients are already generated this way. Here's one that is template based: https://github.com/openapi-generators/openapi-python-client. We may be able to use the same approach.

Requirements

    • [ ] for all the GET endpoints generate corresponding dlt.resources. convert the path and query parameters into input arguments, generate typing and defaults (I believe you can copy this from the python client generating code); see the sketch after this list
    • [ ] try to figure out the authentication method from the openAPI spec. provide a helper method for each authentication type, pass the required authentication elements (i.e. bearer token, api key and secret, or whatever it is) to the dlt.resource methods and mark them with dlt.secrets.value so they are automatically injected.
    • [ ] most APIs define the returned documents with JSON schemas. convert the types into TTableSchema so we have correct types for table columns (let's discuss nested tables later).
    • [ ] when a list of objects is returned we should be able (optionally, via a flag when generating the code) to still yield items one by one.
    • [ ] the base URL for the API should be taken from the openapi servers property when available. when there are multiple servers, create a mapping and accept an argument/config prop that is either a server name (e.g. production) or a url like https://example.com/api
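A hypothetical sketch of what the generator could emit for a single GET endpoint, covering the points above; the endpoint, parameter names, header name and base URL are all made up for illustration:

import dlt
import requests
from typing import Iterator, Optional


@dlt.resource(table_name="deals", write_disposition="replace")
def deals(
    api_key: str = dlt.secrets.value,  # injected automatically, per the spec's auth scheme
    status: Optional[str] = None,      # query parameter from the spec, with its default
    limit: int = 100,                  # query parameter from the spec, with its default
) -> Iterator[dict]:
    response = requests.get(
        "https://example.com/api/v1/deals",  # base URL taken from the servers property
        headers={"X-API-KEY": api_key},      # the auth helper would normally build this
        params={"status": status, "limit": limit},
    )
    response.raise_for_status()
    # assuming the endpoint returns a JSON array at the root,
    # yield items one by one so dlt normalizes them into rows
    yield from response.json()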

Additional Heuristics

    • [ ] most APIs let the user filter for lists of object ids and then provide an endpoint to get those objects' details. ideally we would create dlt.transformers that allow pipelining data from the lists to those "enriching" endpoints. let's find a way to do it at least in some of the cases.
    • [ ] it would be cool to figure out which of the fields in the returned data are unique identifiers and add primary_key hints (see the sketch after this list)
    • [ ] it would be cool to figure out how data is paginated when a list of items is requested and generate paginator code and dlt.resources that can take N items or all items across all pages. there are only a few pagination types. alternatively we could pass options like: this is pagination of type x and this is the next_page element, etc.
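For the primary_key heuristic, the hint could be emitted directly on the resource decorator, which also enables merge loading. A minimal sketch (the endpoint and field names are made up):

import dlt


@dlt.resource(table_name="deals", primary_key="id", write_disposition="merge")
def deals():
    # placeholder data; in generated code this would be the API call
    yield {"id": 1, "title": "first deal"}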

Deliverable:

steinitzu commented 1 year ago

Been doing some hacking on this.

openapi-python-client looks like a good base to build on. The templating works pretty well, and creating the dlt schema should be a similar approach to how the data classes are generated.
The openapi parser needs some tweaks. It seems to fail easily on odd/non-standard things, e.g. pipedrive has boolean type enums, missing type fields and such, which could simply be ignored or inferred.
It would be good if we could still generate the parts that can be parsed and show warnings when something's off. The user can then fix the code as needed; a rough sketch of that follows.
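A rough sketch of that lenient behaviour; generate_resource here is just a placeholder for whatever the real generator entry point turns out to be:

import warnings
from typing import Callable


def generate_all(spec: dict, generate_resource: Callable[[str, dict], None]) -> None:
    # generate what parses, warn (instead of failing) on anything that doesn't
    for path, operations in spec.get("paths", {}).items():
        try:
            generate_resource(path, operations)
        except Exception as exc:  # broad on purpose: odd/non-standard specs are common
            warnings.warn(f"Skipping {path}: {exc}")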

I think pagination is also crucial for this to be useful, ideally we handle most cases without having to edit the generated code.

Some ideas on that and on dependent endpoints:

Pagination and array endpoints

We can try to detect what kind of pagination is used. E.g. offset, cursor, etc

  1. Detect whether the endpoint returns a list of objects.

    Check the response schema for a data array. Sometimes there is an array in the root, sometimes the data is nested within an envelope.
    The resource should yield the data array, not the top level object which may have some metadata.
    Optionally the resource should accept a JSON path or callback argument to point to the data field.

  2. Find the pagination request parameter

    Usually a query parameter. A guess could be a field whose name or description fuzzily matches something like "page", "pagination", "offset", "cursor", "after", etc. If the same field is also found in the response schema, that strengthens the guess.
    Let user override with argument.

  3. Determine what kind of pagination

    For example, if the pagination param is an integer, it's likely offset pagination;
    if it's a string, it's likely a cursor (a detection sketch follows this list).
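A rough sketch of the detection heuristics from points 2 and 3, operating on an endpoint's OpenAPI parameter objects; the keyword list is just a guess:

from typing import Optional

PAGE_HINTS = ("page", "pagination", "offset", "cursor", "after", "next")


def guess_pagination_param(query_params: list) -> Optional[dict]:
    # query_params: the endpoint's OpenAPI parameter objects with in == "query"
    for param in query_params:
        haystack = f"{param.get('name', '')} {param.get('description', '')}".lower()
        if any(hint in haystack for hint in PAGE_HINTS):
            return param
    return None


def guess_pagination_style(param: dict) -> str:
    # integer parameters usually mean offset/page-number pagination,
    # string parameters usually carry a cursor token
    return "offset" if param.get("schema", {}).get("type") == "integer" else "cursor"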

Offset pagination is easy, increment until there are no more results.

Cursor pagination is tricky because we need to extract the cursor from the response data.
There is no standard for how this is returned. A best-attempt could be to look for a field which has the same name as the cursor query param, or something like "next".
Otherwise the resource should take a JSON path argument to override.

Sometimes APIs also return a complete URL to the next page rather than a cursor param, so we should check whether the value looks like a URL.

The resources could also accept a next_page callback argument. The callback receives the arguments of the previous request and the previous response, and should return new arguments (or None / raise a special exception to say "no more pages"); see the paginator sketch below.
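A sketch of the generic pagination loop the generated resources could share. next_page is the callback described above: it gets the previous request params and the response and returns the params for the next request, or None to stop. data_key is a single key rather than a full JSON path just to keep the sketch short; everything here is illustrative:

import requests
from typing import Callable, Iterator, Optional


def paginate(
    url: str,
    params: dict,
    data_key: Optional[str] = None,
    next_page: Optional[Callable[[dict, requests.Response], Optional[dict]]] = None,
) -> Iterator[dict]:
    while True:
        response = requests.get(url, params=params)
        response.raise_for_status()
        payload = response.json()
        # unwrap the data array from an envelope like {"data": [...], "meta": ...}
        yield from (payload[data_key] if data_key else payload)
        if next_page is None:
            return
        params = next_page(params, response)
        if params is None:  # the callback signals "no more pages"
            return


def offset_next_page(prev_params: dict, response: requests.Response) -> Optional[dict]:
    # offset pagination: stop when the page comes back empty, otherwise advance the offset
    # (assumes the items sit under a "data" envelope)
    if not response.json().get("data"):
        return None
    return {**prev_params, "offset": prev_params.get("offset", 0) + prev_params.get("limit", 100)}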

Dependent endpoints

Endpoints that take path parameters received in a response of another endpoint.

For example GET /deals/{id}/participants uses the IDs from results of GET /deals

These kinds of endpoints should be implemented as transformers.

For a nice RESTful API like this we can construct the resource dependency tree by traversing the API paths.
E.g. splitting the path before the path param /deals/{id}/abc -> /deals to know it should depend on the /deals resource.

Can be done recursively to handle arbitrary nesting: /a/{a_id}/b/{b_id}/c/{c_id}/d/{d_id}

By default we assume the path param matches a named field on the parent object, but that's not guaranteed.
For example we could have /deals/{deal_id}/abc while the deals endpoint returns objects like {id: 123, ...}

An id field is probably the most common case, so we could fall back on looking for an id field. But there should also be a way for the user to override with a JSON path.

For cases where the dependency tree can't be inferred we need to handle it somehow. We could generate a transformer anyway (or both a transformer and a standalone resource) and let the user decide how to wire it up; a transformer sketch follows.
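A minimal sketch of the transformer wiring for the /deals -> /deals/{id}/participants example above; the base URL and the "data" envelope are assumptions:

import dlt
import requests
from typing import Iterator

BASE_URL = "https://example.com/api/v1"


@dlt.resource(table_name="deals")
def deals() -> Iterator[dict]:
    yield from requests.get(f"{BASE_URL}/deals").json()["data"]


@dlt.transformer(data_from=deals, table_name="deal_participants")
def deal_participants(deal: dict) -> Iterator[dict]:
    # the path parameter is taken from the parent object; here we assume it is `id`,
    # with a JSON path override left to the user when the field name differs
    yield from requests.get(f"{BASE_URL}/deals/{deal['id']}/participants").json()["data"]

Running deal_participants in a pipeline would then pull deals first and feed each deal into the dependent endpoint.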

Testing materials: https://github.com/APIs-guru/openapi-directory

steinitzu commented 1 year ago

Plan for handling authentication:

@dlt.resource
def get_deals(
    credentials: Union[MyApiKeyCredentials, MyOAuth2Credentials] = dlt.secrets.value,
    ...

The resource receives credentials via an ad-hoc config spec generated from the security schemes in the openapi spec.

A Union is used when the endpoint supports multiple authentication types; it can be discriminated by the "type", "in" and "scheme" attributes.

The credentials object is passed along with the other request arguments to the client. The API client calls to_http_params to get the auth cookies/headers/query params, which are merged into the request params. E.g. (note that `in` is a reserved keyword in Python, so the generated attribute is spelled `in_`):

from base64 import b64encode
from typing import Literal


class MyApiKeyCredentials(ConfigSpec):
    type: Literal["apiKey"] = "apiKey"
    in_: Literal["header"] = "header"  # openapi "in"; renamed because `in` is a Python keyword
    name: Literal["X-SOME-API-KEY-HEADER"] = "X-SOME-API-KEY-HEADER"
    api_key: str  # From secrets.yaml

    def to_http_params(self):
        return dict(cookies={}, headers={self.name: self.api_key}, params={})


class MyQueryApiKeyCredentials(ConfigSpec):
    type: Literal["apiKey"] = "apiKey"
    in_: Literal["query"] = "query"
    name: Literal["api_token"] = "api_token"
    api_key: str  # From secrets.yaml

    def to_http_params(self):
        # Get cookies/query params/headers to merge into request params
        return dict(cookies={}, headers={}, params={self.name: self.api_key})


class MyBasicAuthCredentials(ConfigSpec):
    type: Literal["http"] = "http"
    scheme: Literal["basic"] = "basic"
    username: str  # from secrets.yaml
    password: str

    def to_http_params(self):
        token = b64encode(f"{self.username}:{self.password}".encode()).decode()
        return dict(cookies={}, headers={"Authorization": f"Basic {token}"}, params={})

Note that OAuth2 always ends up as a bearer token in the request, but we should probably have a config for each flow the API supports.
For flows that support a refresh token and client credentials it would be nice to accept either an access token or a refresh token/client ID/secret and have a helper to facilitate the refresh; a sketch follows.
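A sketch of what such a helper could look like, extending the same hypothetical ConfigSpec base as above and refreshing via the standard refresh_token grant; the token_url and field names are assumptions:

import requests
from typing import Literal, Optional


class MyOAuth2Credentials(ConfigSpec):  # ConfigSpec: same hypothetical base as above
    type: Literal["oauth2"] = "oauth2"
    token_url: str = "https://example.com/oauth/token"  # assumed token endpoint
    access_token: Optional[str] = None   # from secrets.yaml, if already issued
    refresh_token: Optional[str] = None  # or exchange these three for an access token
    client_id: Optional[str] = None
    client_secret: Optional[str] = None

    def _ensure_access_token(self) -> str:
        # refresh lazily when only a refresh token + client credentials are configured
        if self.access_token is None:
            response = requests.post(
                self.token_url,
                data={
                    "grant_type": "refresh_token",
                    "refresh_token": self.refresh_token,
                    "client_id": self.client_id,
                    "client_secret": self.client_secret,
                },
            )
            response.raise_for_status()
            self.access_token = response.json()["access_token"]
        return self.access_token

    def to_http_params(self):
        # OAuth2 always ends up as a bearer token on the request
        return dict(
            cookies={},
            headers={"Authorization": f"Bearer {self._ensure_access_token()}"},
            params={},
        )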

rudolfix commented 1 year ago

@steinitzu this all makes sense. let's try to build a first version in which those heuristics do not need to work (I assume they will be the hardest part). I'll give it a run on a few APIs.

rudolfix commented 1 year ago

Let's try to use the Pokemon API and squeeze something useful out of it :)

this is my take on improvements

  1. handle resources/transformers with required arguments. right now the generated code fails (i.e. for those that require an id to be passed)
  2. you may create resources dynamically to avoid that problem and return unbound resources for late binding (see the sketch after the schema below)
  3. we should (heuristically) infer the data types returned by a given resource and name the table (table_name) accordingly. for example pokemon_list and pokemon_read return a list of Pokemon and a single Pokemon respectively, so the table name should be pokemon
  4. resource generation:
    • let's generate resources only for endpoints that return a list of objects.
    • for endpoints that return a single object we should (heuristically, i.e. using the path) match it with a list resource and generate a transformer that is connected to that resource.
    • for any other endpoint that returns single elements let's generate an unconnected transformer (if it takes arguments) or a resource (if it does not)
    • transformers should not be initially selected
  5. yield lists, not lists wrapped in dictionaries (another heuristic, but let's make that work for pokemon first)
  6. fun part: we allow the user to select which resources they want generated by using https://github.com/tmbo/questionary
  7. fun part 2: we should order the resources putting the most relevant ones first. the relevant resources are those that return types that reference the most other types. I bet that will be the Pokemon type in the case of the Pokemon API. look at this:
    Pokemon:
      type: object
      properties:
        id:
          type: integer
          format: int32
        name:
          type: string
        base_experience:
          type: integer
          format: int32
        height:
          type: integer
          format: int32
        is_default:
          type: boolean
        order:
          type: integer
          format: int32
        weight:
          type: integer
          format: int32
        abilities:
          type: array
          items:
            $ref: '#/components/schemas/PokemonAbility'
        forms:
          type: array
          items:
            $ref: '#/components/schemas/PokemonForm'
        game_indices:
          type: array
          items:
            $ref: '#/components/schemas/VersionGameIndex'
        held_items:
          type: array
          items:
            $ref: '#/components/schemas/PokemonHeldItem'
        location_area_encounters:
          type: string
        moves:
          type: array
          items:
            $ref: '#/components/schemas/PokemonMove'
        sprites:
          $ref: '#/components/schemas/PokemonSprites'
        species:
          $ref: '#/components/schemas/NamedAPIResource'
        stats:
          type: array
          items:
            $ref: '#/components/schemas/PokemonStat'
        types:
          type: array
          items:
            $ref: '#/components/schemas/PokemonType'
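A hand-written sketch of points 2-4 for the Pokemon API: resources built dynamically from plain functions, a shared pokemon table name, and the single-object endpoint attached as a transformer. The helper names are made up; only the pokeapi.co paths are real:

import dlt
import requests
from typing import Iterator

BASE_URL = "https://pokeapi.co/api/v2"


def _pokemon_list() -> Iterator[dict]:
    # yield the list itself, not the {"count": ..., "results": [...]} envelope (point 5)
    yield from requests.get(f"{BASE_URL}/pokemon", params={"limit": 100}).json()["results"]


def _pokemon_read(item: dict) -> Iterator[dict]:
    # follow each list item's detail url to get the full Pokemon object
    yield requests.get(item["url"]).json()


# both endpoints return Pokemon data, so both write to the `pokemon` table (point 3);
# building them at runtime keeps them unbound until the user selects them (point 2)
pokemon_list = dlt.resource(_pokemon_list, name="pokemon_list", table_name="pokemon")
pokemon_read = dlt.transformer(data_from=pokemon_list, table_name="pokemon")(_pokemon_read)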

rudolfix commented 1 year ago

Make the demo nice

In order of importance

  1. allow selecting the endpoints with questionary (this is actually the core of the demo - right now we have so much code generated...)
  2. can we somehow guess the client name? maybe from the url? if not, we may add an additional argument to the cli or hardcode it. https://raw.githubusercontent.com/cliffano/pokeapi-clients/main/specification/pokeapi.yml gives us pokeapi

(OK, I actually see that the generator takes it from the info tag, so I will just modify the spec for poke)

info:
  title: Planets and Webhooks Demo API

  3. create config.toml and set base_url to

    servers:
    - url: 'https://pokeapi.co/'
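A small sketch of how the generated source could take base_url from config, resolved via dlt.config.value; the source and resource names are illustrative:

import dlt
import requests


@dlt.source
def pokeapi_source(base_url: str = dlt.config.value):
    # base_url resolves from config.toml, e.g. a base_url entry under [sources.pokeapi_source]
    @dlt.resource(table_name="pokemon")
    def pokemon_list():
        yield from requests.get(f"{base_url}/api/v2/pokemon", params={"limit": 100}).json()["results"]

    return pokemon_list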

(Future) Make it useful and hackable

rudolfix commented 4 months ago

@sh-rp @steinitzu I think we can say this is done. took just 1.5 years :>