dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

rest_api: Extend generic API source to allow for incremental path parameters #1880

Open maxestorr opened 3 months ago

maxestorr commented 3 months ago

Source name

rest_api

Describe the data you'd like to see

I am using the generic API source to write my data pipeline declaratively, ingesting data from the eBird historical observations endpoint.

As the documentation linked above shows, it's possible to load data from this endpoint incrementally, but not with a traditional query parameter such as ?data_from=2024-08-01. Instead it uses path parameters, where each day has its own endpoint path, such as https://servername.com/2/data/obs/{{region-code}}/{{year}}/{{month}}/{{day}}.

Currently the generic API source lets you load data incrementally through query parameters, using a config like this:

{
    "path": "posts",
    "data_selector": "results",  # Optional JSONPath to select the list of posts
    "params": {
        "created_since": {
            "type": "incremental",
            "cursor_path": "created_at", # The JSONPath to the field we want to track in each post
            "initial_value": "2024-01-25",
        },
    },
}

But I believe there's no such config that'd work for path parameters.
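To illustrate the shape of the request, here is a minimal sketch in plain Python, outside of dlt, of the per-day paths such a feature would need to produce. The helper name `daily_paths` and the region code `US-NY` are hypothetical, and the path template is simply the pattern quoted above, not a confirmed dlt or eBird API detail:

```python
from datetime import date, timedelta
from typing import Iterator


def daily_paths(region_code: str, start: date, end: date) -> Iterator[str]:
    """Yield one endpoint path per day from start to end, inclusive.

    Hypothetical helper: the template mirrors the pattern in this issue,
    {region-code}/{year}/{month}/{day}, with no zero padding assumed.
    """
    current = start
    while current <= end:
        yield (
            f"2/data/obs/{region_code}/"
            f"{current.year}/{current.month}/{current.day}"
        )
        current += timedelta(days=1)


# Example: three consecutive days spanning a month boundary.
paths = list(daily_paths("US-NY", date(2024, 1, 30), date(2024, 2, 1)))
for path in paths:
    print(path)
```

An incremental path parameter would, in effect, have to remember the last day loaded and resume this sequence from the following day on the next run.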

Are you a dlt user?

Yes, I'm already a dlt user.

Are you ready to contribute this extension?

No.

dlt destination

DuckDB

Additional information

No response

maxestorr commented 3 months ago

I'm about to head out, but I can share the code I'm working with when I return. For wider context, I'm trying to ingest data from this API using dlt's generic API declarative config, with Airflow for orchestration, and I ran into a number of issues (this being one of them) that have prevented me from achieving this.