dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

rest_api: passing value for path parameters not working as expected #1882

Open francescomucio opened 5 months ago

francescomucio commented 5 months ago

dlt version

0.4.12

Source name

rest_api

Describe the problem

Configuring an endpoint like this:

{
    "name": "user",
    "endpoint": {
        "path": "users/{id}",
        "params": {
            "id": 2,
        },
    },
},

returns a URL built like this:

https://reqres.in/api/users/%7Bid%7D?id=2

The expected URL is:

https://reqres.in/api/users/2
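
For illustration, the expected interpolation could be sketched in plain Python as follows. The helper `build_url` is hypothetical and is not part of dlt or the rest_api source; it only shows the behavior described above, assuming that any param whose name matches a `{placeholder}` in the path should be substituted into the path and the remaining params sent as a query string.

```python
from string import Formatter
from urllib.parse import quote, urlencode


def build_url(base_url, path, params):
    """Hypothetical helper: fill {placeholders} in `path` from `params`."""
    # Names of the placeholders found in the path template, e.g. {"id"}.
    path_fields = {name for _, name, _, _ in Formatter().parse(path) if name}
    # Params that match a placeholder go into the path (percent-encoded).
    path_values = {
        key: quote(str(value), safe="")
        for key, value in params.items()
        if key in path_fields
    }
    # Everything else stays a regular query parameter.
    query = {key: value for key, value in params.items() if key not in path_fields}
    url = f"{base_url.rstrip('/')}/{path.format(**path_values).lstrip('/')}"
    return f"{url}?{urlencode(query)}" if query else url


print(build_url("https://reqres.in/api", "users/{id}", {"id": 2}))
# https://reqres.in/api/users/2
```

With an extra, non-path param such as `{"id": 2, "page": 1}`, this sketch would keep `page` in the query string and still place `id` in the path.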

Expected behavior

No response

Steps to reproduce

I am using the reqres.in test API with the following configuration:

import dlt
from rest_api import RESTAPIConfig, rest_api_source, RESTClient, DltResource

def load_reqres_in():
    reqres_in_config: RESTAPIConfig = {
        "client": {
            "base_url": "https://reqres.in/api",
        },
        "resources": [
            {
                "name": "user",
                "endpoint": {
                    "path": "users/{id}",
                    "params": {
                        "id": 2,
                    },
                },
            },
        ],
    }

    pipeline = dlt.pipeline(
        pipeline_name="reqres_in",
        destination="duckdb",
    )

    reqres_in_source = rest_api_source(reqres_in_config)

    load_info = pipeline.run(reqres_in_source)
    print(load_info)

if __name__ == "__main__":
    load_reqres_in()

How you are using the source?

I run this source in production.

Operating system

Linux

Runtime environment

Local

Python version

3.10.9

dlt destination

duckdb

Additional information

No response

burnash commented 5 months ago

Hey @francescomucio, could you share a use case for having the path interpolated from a param value?

...
"endpoint": {
    "path": "users/{id}",
    "params": {
        "id": 2,
    },
},
...

francescomucio commented 4 months ago

Hi @burnash,

I ran into this problem while testing whether an API returns a specific item, but I can see it being useful for automatically generated resources, or for partitioning a load so that only one resource is fetched at a time (along with the calls that follow from it).

For example, using the Datadog API, I can imagine downloading the results of a set of test runs, but not all of them. The workflow would be:

  1. Call https://api.datadoghq.com/api/v1/synthetics/tests/{public_id}/results to get the latest test results IDs

  2. Call https://api.datadoghq.com/api/v1/synthetics/tests/{public_id}/results/{result_id} to get the details of a specific test

This can be overkill if we need to download the results of all the tests; public_id (the ID of the test) can be part of a list of tests that we need to download with dlt.

I hope this makes sense.
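
The two-step workflow above can be sketched as plain URL construction. This is only an illustration: the function names and the sample IDs are hypothetical, no request is sent, and authentication is left out.

```python
BASE_URL = "https://api.datadoghq.com/api/v1/synthetics/tests"


def results_url(public_id):
    """Step 1: URL that lists the latest result IDs for one test."""
    return f"{BASE_URL}/{public_id}/results"


def result_detail_url(public_id, result_id):
    """Step 2: URL for the details of one specific result."""
    return f"{BASE_URL}/{public_id}/results/{result_id}"


def detail_urls(result_ids_by_test):
    """Build the step 2 URLs for a chosen subset of tests.

    `result_ids_by_test` maps each public_id to the result IDs
    that step 1 returned for it.
    """
    return [
        result_detail_url(public_id, result_id)
        for public_id, result_ids in result_ids_by_test.items()
        for result_id in result_ids
    ]


# Hypothetical IDs, used only to show the shape of the output.
for url in detail_urls({"abc-123": ["r1", "r2"]}):
    print(url)
```

The point of the sketch is that `public_id` comes from a list chosen up front, which is why a static value in `params` would be convenient for the first call.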

burnash commented 4 months ago

Thanks for elaborating, @francescomucio. I believe a similar case has just been reported in the community Slack: https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1719291805818969

I'm thinking about how to fit this into the current rest_api config. Let me know if you're open to updating https://github.com/dlt-hub/verified-sources/pull/499, as my idea is a bit different: most likely we'd need to adjust the child resource, not the parent.
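
For context, a rough sketch of how a child resource fills a path placeholder from a parent resource in a rest_api config, using a `resolve` param. This is not the change proposed above. The resource names and the assumption that the parent response exposes a `public_id` field are illustrative, and authentication and data selectors are omitted.

```python
# Sketch of a rest_api config where the child resource resolves
# {public_id} from each record of the parent resource.
datadog_config = {
    "client": {
        "base_url": "https://api.datadoghq.com/api/v1/",
    },
    "resources": [
        {
            # Parent: assumed to yield records containing "public_id".
            "name": "tests",
            "endpoint": {"path": "synthetics/tests"},
        },
        {
            # Child: one request per parent record.
            "name": "test_results",
            "endpoint": {
                "path": "synthetics/tests/{public_id}/results",
                "params": {
                    "public_id": {
                        "type": "resolve",
                        "resource": "tests",
                        "field": "public_id",
                    },
                },
            },
        },
    ],
}

child = datadog_config["resources"][1]["endpoint"]
print(child["path"], "<-", child["params"]["public_id"]["resource"])
```

In this form the path value always comes from the parent resource; the issue asks for the simpler case where the value is a constant given directly in `params`.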