Closed: rudolfix closed this issue 4 months ago.
Been doing some hacking on this.
openapi-python-client looks like a good base to build on. The templating works pretty well and creating dlt schema should be a similar approach to how the data classes are generated.
The openapi parser needs some tweaks. It seems to fail easily on odd/non-standard things. E.g. Pipedrive has boolean-typed enums, missing type fields and similar quirks that could simply be ignored or inferred.
Would be good if we could still generate the parts that can be parsed and show warnings when something's off. The user can then fix the code as needed.
I think pagination is also crucial for this to be useful, ideally we handle most cases without having to edit the generated code.
Some ideas on that and dependent endpoints:
We can try to detect what kind of pagination is used. E.g. offset, cursor, etc
Detect whether the endpoint returns a list of objects.
Check the response schema for a data array. Sometimes there is an array in the root, sometimes the data is nested within an envelope.
The resource should yield the data array, not the top level object which may have some metadata.
Optionally the resource should accept a JSON path or callback argument to point to the data field.
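A minimal sketch of how such an override could look; the `data_selector` dot-path argument and resolver below are hypothetical illustrations, not an existing dlt API:

```python
from typing import Any, Optional

def resolve_data_path(response_json: Any, data_selector: Optional[str]) -> list:
    """Walk a dot-separated path (e.g. 'result.items') into the response
    envelope; with no selector, assume the root is already the data array."""
    node = response_json
    if data_selector:
        for part in data_selector.split("."):
            node = node[part]
    if not isinstance(node, list):
        raise ValueError(f"Path {data_selector!r} did not point to a list")
    return node

# Envelope response: data nested under 'result.items', metadata at the root
page = {"result": {"items": [{"id": 1}, {"id": 2}]}, "meta": {"count": 2}}
rows = resolve_data_path(page, "result.items")  # -> [{'id': 1}, {'id': 2}]
```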
Find the pagination request parameter
Usually a query parameter. A good guess is a field whose name or description fuzzily matches something like "page", "pagination", "offset", "cursor", "after", etc. If the same field is also found in the response schema, that strengthens the guess.
Let user override with argument.
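The fuzzy matching could be sketched like this (the hint list and scoring scheme are assumptions, not part of any spec):

```python
from typing import Optional

PAGINATION_HINTS = ("page", "pagination", "offset", "cursor", "after", "next")

def guess_pagination_param(params: list) -> Optional[str]:
    """Score OpenAPI query parameters by pagination-like keywords appearing
    in their name or description and return the best candidate, if any."""
    def score(param: dict) -> int:
        text = (param.get("name", "") + " " + param.get("description", "")).lower()
        return sum(hint in text for hint in PAGINATION_HINTS)

    best = max(params, key=score, default=None)
    return best["name"] if best is not None and score(best) > 0 else None

params = [
    {"name": "limit", "description": "Max items per request"},
    {"name": "start", "description": "Pagination start offset"},
]
guess_pagination_param(params)  # -> "start"
```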
Determine what kind of pagination
For example if the pagination param is an integer, it's likely offset pagination
If it's a string it's likely a cursor
Offset pagination is easy, increment until there are no more results.
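A sketch of that increment-until-empty loop, with a fake fetch function standing in for the HTTP client:

```python
import itertools
from typing import Callable, Iterator

def paginate_offset(fetch_page: Callable[[int, int], list], limit: int = 100) -> Iterator[dict]:
    """Increment the offset by the page size until a request comes back empty."""
    for offset in itertools.count(0, limit):
        page = fetch_page(offset, limit)
        if not page:
            return
        yield from page
        if len(page) < limit:  # a short page means we reached the end
            return

# fake API with 5 items and page size 2
items = [{"id": i} for i in range(5)]
fake_fetch = lambda offset, limit: items[offset:offset + limit]
all_rows = list(paginate_offset(fake_fetch, limit=2))  # yields all 5 items
```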
Cursor pagination is tricky because we need to extract the cursor from the response data.
There is no standard for how this is returned. A best attempt could be to look for a response field with the same name as the cursor query param, or something like "next".
Otherwise the resource should take a JSON path argument to override.
Sometimes APIs also return a complete URL to the next page rather than a cursor param so it should check whether the value looks like a URL.
The resources could also accept a next_page callback arg. The callback receives the arguments of the previous request and the previous response, and should return new arguments (or return None / raise a special exception to say "no more pages").
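The callback protocol above could be sketched like this (the `next` response field and the argument shapes are illustrative assumptions; it also covers the full-URL case mentioned earlier):

```python
from typing import Optional

def looks_like_url(value) -> bool:
    return isinstance(value, str) and value.startswith(("http://", "https://"))

def next_page(prev_request: dict, prev_response: dict) -> Optional[dict]:
    """Return request arguments for the next page, or None to stop."""
    cursor = prev_response.get("next")
    if not cursor:
        return None  # "no more pages"
    if looks_like_url(cursor):
        # some APIs return the complete URL of the next page instead of a cursor
        return {"url": cursor}
    params = {**prev_request.get("params", {}), "cursor": cursor}
    return {**prev_request, "params": params}

next_page({"url": "https://api.example.com/deals", "params": {}}, {"next": "abc123"})
# -> {'url': 'https://api.example.com/deals', 'params': {'cursor': 'abc123'}}
```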
Endpoints that take path parameters received in a response of another endpoint.
For example GET /deals/{id}/participants uses the IDs from the results of GET /deals.
These kinds of endpoints should be implemented as transformers.
For a nice RESTful API like this we can construct the resource dependency tree by traversing the API paths.
E.g. splitting the path before the path param (/deals/{id}/abc -> /deals) tells us it should depend on the /deals resource.
Can be done recursively to handle arbitrary nesting: /a/{a_id}/b/{b_id}/c/{c_id}/d/{d_id}
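A sketch of that splitting step; `parent_path` is a hypothetical helper that cuts the path just before its last path parameter (applying it repeatedly walks up the dependency tree):

```python
import re
from typing import Optional

PARAM = re.compile(r"/\{[^}]+\}")

def parent_path(path: str) -> Optional[str]:
    """Return the path of the resource this endpoint depends on, obtained by
    cutting the path before its last path parameter; None for root resources."""
    params = list(PARAM.finditer(path))
    if not params:
        return None
    return path[: params[-1].start()]

parent_path("/deals/{id}/participants")     # -> "/deals"
parent_path("/a/{a_id}/b/{b_id}/c/{c_id}")  # -> "/a/{a_id}/b/{b_id}/c"
parent_path("/deals")                       # -> None (no dependency)
```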
By default we assume the path param matches a named field on the parent object, but that is not guaranteed.
For example we could have /deals/{deal_id}/abc but the deals endpoint returns objects like {id: 123, ...}. An id field is probably most common, so we could fall back on looking for an id field. But there should also be a way for the user to override with a JSON path.
For cases when the dependency tree can't be inferred we handle it somehow. We could generate a transformer anyway (or both transformer and standalone resource) and let the user decide how to wire it up.
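To illustrate the transformer wiring with plain generators (in dlt the `participants` function would be decorated with `@dlt.transformer(data_from=deals)`; the data and function names here are fake stand-ins for the HTTP calls):

```python
from typing import Iterator

def deals() -> Iterator[dict]:
    """Stand-in for the GET /deals resource."""
    yield from [{"id": 1, "title": "Deal A"}, {"id": 2, "title": "Deal B"}]

def participants(deal: dict) -> Iterator[dict]:
    """Stand-in for a transformer hitting GET /deals/{id}/participants.
    It receives each parent item and uses its id as the path parameter."""
    # a real pipeline would make an HTTP request using deal["id"] here
    yield {"deal_id": deal["id"], "person": f"person-{deal['id']}"}

rows = [row for deal in deals() for row in participants(deal)]
# -> one participants row per deal, keyed by deal_id
```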
Testing materials: https://github.com/APIs-guru/openapi-directory
Plan for handling authentication:
```python
@dlt.resource
def get_deals(
    credentials: Union[MyApiKeyCredentials, MyOAuth2Credentials] = dlt.secrets.value,
    ...
```
The resource receives credentials with an ad-hoc config spec which is generated for the security schemes in the openapi spec. Use a Union when the endpoint has multiple authentication types; they can be discriminated by the "type", "in", "scheme" attributes. The credentials object is sent with the other request arguments to the client. The API client calls to_http_params to get the auth cookies/headers/query params, which get merged with the request params. E.g.
```python
class MyApiKeyCredentials(ConfigSpec):
    type: Literal["apiKey"] = "apiKey"
    in_: Literal["header"] = "header"  # "in" is a reserved word in Python
    name: Literal["X-SOME-API-KEY-HEADER"] = "X-SOME-API-KEY-HEADER"
    api_key: str  # From secrets.yaml

    def to_http_params(self):
        return dict(cookies={}, headers={self.name: self.api_key}, params={})


class MyQueryApiKeyCredentials(ConfigSpec):
    type: Literal["apiKey"] = "apiKey"
    in_: Literal["query"] = "query"
    name: Literal["api_token"] = "api_token"
    api_key: str  # From secrets.yaml

    def to_http_params(self):
        # Get cookies/query params/headers to merge into request params
        return dict(cookies={}, headers={}, params={self.name: self.api_key})


class MyBasicAuthCredentials(ConfigSpec):
    type: Literal["http"] = "http"
    scheme: Literal["basic"] = "basic"
    username: str  # From secrets.yaml
    password: str

    def to_http_params(self):
        token = b64encode(f"{self.username}:{self.password}".encode()).decode()
        return dict(cookies={}, headers={"Authorization": f"Basic {token}"}, params={})
```
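A sketch of the merge step described above, with a simplified stand-in credentials class (the `merge_auth` helper is hypothetical, not an existing dlt API):

```python
import dataclasses

@dataclasses.dataclass
class ApiKeyCredentials:
    """Simplified stand-in for the generated header api-key spec above."""
    api_key: str
    name: str = "X-SOME-API-KEY-HEADER"

    def to_http_params(self) -> dict:
        return dict(cookies={}, headers={self.name: self.api_key}, params={})

def merge_auth(request_kwargs: dict, credentials) -> dict:
    """Merge auth cookies/headers/query params into the outgoing request kwargs."""
    auth = credentials.to_http_params()
    merged = dict(request_kwargs)
    for key in ("cookies", "headers", "params"):
        merged[key] = {**request_kwargs.get(key, {}), **auth[key]}
    return merged

kwargs = merge_auth({"params": {"limit": 50}}, ApiKeyCredentials(api_key="secret"))
# kwargs["headers"] -> {"X-SOME-API-KEY-HEADER": "secret"}
```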
Note OAuth2 is always a bearer token in the request, but we should probably have a config for each flow supported by the API.
For flows that support refresh tokens and client credentials it would be nice to accept either an access token or a refresh token/client ID/secret, and have some helper to facilitate the refresh.
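A hedged sketch of what such a flow config could look like; the class, field names and `refresh_payload` helper are all hypothetical (an actual helper would POST the payload to the provider's token endpoint):

```python
import dataclasses
from typing import Optional

@dataclasses.dataclass
class MyOAuth2Credentials:
    """Accept either a ready access token, or refresh token + client id/secret."""
    access_token: Optional[str] = None
    refresh_token: Optional[str] = None
    client_id: Optional[str] = None
    client_secret: Optional[str] = None

    def refresh_payload(self) -> dict:
        """Form body a helper would POST to the token endpoint to refresh."""
        return {
            "grant_type": "refresh_token",
            "refresh_token": self.refresh_token,
            "client_id": self.client_id,
            "client_secret": self.client_secret,
        }

    def to_http_params(self) -> dict:
        # OAuth2 is always a bearer token in the request
        return dict(cookies={}, headers={"Authorization": f"Bearer {self.access_token}"}, params={})
```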
@steinitzu this all makes sense. let's try to build a first version in which those heuristics do not need to work (I assume this will be the hardest part). I'll give it a run on a few APIs.
Let's try to use the pokemon api and squeeze something useful :)
this is my take on improvements:
- endpoints that return the same entity, whether as a list or as a single item (which requires an id to be passed), should write to the same table (set table_name accordingly). for example pokemon_list and pokemon_read return a list of Pokemon and a single Pokemon respectively; the table name should thus be pokemon
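One way to sketch that naming rule (the suffix list here is an assumption; a real generator would derive the shared entity name from the spec):

```python
import re

# assumed operation-verb suffixes used by this API's operation ids
OPERATION_SUFFIXES = r"_(list|read|detail|retrieve)$"

def table_name_from_operations(*operation_ids: str) -> str:
    """Strip trailing operation verbs so sibling endpoints share one table."""
    stripped = {re.sub(OPERATION_SUFFIXES, "", op) for op in operation_ids}
    if len(stripped) != 1:
        raise ValueError(f"operations do not map to a single entity: {operation_ids}")
    return stripped.pop()

table_name_from_operations("pokemon_list", "pokemon_read")  # -> "pokemon"
```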
```yaml
Pokemon:
  type: object
  properties:
    id:
      type: integer
      format: int32
    name:
      type: string
    base_experience:
      type: integer
      format: int32
    height:
      type: integer
      format: int32
    is_default:
      type: boolean
    order:
      type: integer
      format: int32
    weight:
      type: integer
      format: int32
    abilities:
      type: array
      items:
        $ref: '#/components/schemas/PokemonAbility'
    forms:
      type: array
      items:
        $ref: '#/components/schemas/PokemonForm'
    game_indices:
      type: array
      items:
        $ref: '#/components/schemas/VersionGameIndex'
    held_items:
      type: array
      items:
        $ref: '#/components/schemas/PokemonHeldItem'
    location_area_encounters:
      type: string
    moves:
      type: array
      items:
        $ref: '#/components/schemas/PokemonMove'
    sprites:
      $ref: '#/components/schemas/PokemonSprites'
    species:
      $ref: '#/components/schemas/NamedAPIResource'
    stats:
      type: array
      items:
        $ref: '#/components/schemas/PokemonStat'
    types:
      type: array
      items:
        $ref: '#/components/schemas/PokemonType'
```
In order of importance:
- https://raw.githubusercontent.com/cliffano/pokeapi-clients/main/specification/pokeapi.yml gives us pokeapi (OK I actually see that the generator takes it from the info tag so I will just modify the spec for poke)
```yaml
info:
  title: Planets and Webhooks Demo API
```
```python
if __name__ == "__main__":
    pipeline = dlt.pipeline(
        pipeline_name="pokeapi_pipeline",
        dataset_name="pokeapi_data",
        destination="duckdb",
        full_refresh=False,
    )
    source = _client()
    # let's add and print info!
    info = pipeline.run(source.with_resources("pokemon_list"))
    print(info)
```
```shell
openapi-python-client generate --url https://raw.githubusercontent.com/cliffano/pokeapi-clients/main/specification/pokeapi.yml
```
the files are generated in a folder called "-client", can we have something different, i.e. "openapi_client", as the default?

```yaml
servers:
  - url: 'https://pokeapi.co/'
```
@sh-rp @steinitzu I think we can say this is done. took just 1.5 years :>
Motivation

Most of the APIs have their openAPI definitions accessible, and it is a common approach to generate python clients automatically. We want to do the same and generate the dlt.resources corresponding to the API's endpoints. The intention is to generate initial code for the dlt user that needs to access a new API. As your first test cases pick Pipedrive and Hubspot (which we know to generate very nice datasets with dlt). For more openAPI specs look in rapidapi. As mentioned, python clients are generated this way. Here's one that is template based: https://github.com/openapi-generators/openapi-python-client. We may be able to use the same approach.

Requirements

- generate dlt.resources: convert the path and query parameters into input arguments, generate typing and defaults (I believe that you can copy it from the python client generating code)
- generate credentials arguments for the dlt.resource methods and mark them with dlt.secret.value so they are automatically injected
- convert the response schemas into TTableSchema so we have correct types for table columns (let's discuss the nested tables later)
- take the base url from the servers property when available. When there are multiple servers, create a mapping and accept as argument/config prop either a server name (e.g. production) or a url (https://example.com/api)

Additional Heuristics

- generate dlt.transformers that allow pipelining data from lists to those "enriching" endpoints. let's find a way to do it at least in some of the cases
- generate primary_key hints
- generate dlt.resources that can take N items or all items across all pages. there's only a few pagination types. alternatively we could pass options like: this is pagination of type x and this is the next_page element etc.

Deliverable: integration with the dlt init command.