dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

[rest_api source] can't detect pagination for github api and poke api #1915

Open AstrakhantsevaAA opened 1 day ago

AstrakhantsevaAA commented 1 day ago

dlt version

1.1.0

Describe the problem

The rest_api source cannot autodetect pagination for the GitHub API and the Poke API. This affects our tutorial.

If you run this pipeline, you get a list of fallback paginator warnings for both the GitHub API and the Poke API.

This fallback also causes a rate limiting error: the rest_api source keeps requesting the GitHub API until the error occurs.

Expected behavior

According to our tutorial, the rest_api source should automatically detect such simple types of pagination.

Steps to reproduce

  1. run
    dlt init rest_api duckdb
  2. run
    python rest_api_pipeline.py

Operating system

macOS

Runtime environment

Local

Python version

3.11

dlt data source

No response

dlt destination

No response

Other deployment details

No response

Additional information

No response

burnash commented 1 day ago

Thank you for the issue, @AstrakhantsevaAA. I believe the rest_api detects the paginator successfully. If I'm not mistaken, the message relates to "child" resources (single page), where no pagination is present. In this case the paginator falls back to SinglePagePaginator. Do you see any data loaded when you run the pipeline?
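If the warning does come from single-page child resources, one way to avoid it is to declare the paginator explicitly instead of relying on detection. A minimal sketch, assuming the rest_api config accepts the `"single_page"` paginator alias on an endpoint; the resource names and paths are illustrative and not copied from the tutorial:

```python
# Illustrative rest_api config built as a plain dict (no dlt import needed here).
config = {
    "client": {"base_url": "https://pokeapi.co/api/v2/"},
    "resources": [
        # Paginated list endpoint: no paginator given, so detection applies.
        {"name": "pokemon_list", "endpoint": {"path": "pokemon"}},
        # Child resource: one object per request, so pagination is disabled
        # explicitly rather than left to the fallback.
        {
            "name": "pokemon",
            "endpoint": {
                "path": "pokemon/{name}",
                "params": {
                    "name": {
                        "type": "resolve",
                        "resource": "pokemon_list",
                        "field": "name",
                    }
                },
                "paginator": "single_page",
            },
        },
    ],
}
```

A dict like this would then be passed to `rest_api_source` in the generated pipeline script. Whether this also silences the warning in 1.1.0 is untested here.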

AstrakhantsevaAA commented 1 day ago

@burnash yeah, I think you are right; that isn't clear from the warning message. Anyway, this part of the tutorial should be adjusted: by default we can't run this pipeline because of rate limits. I think we can reduce the amount of data for the issues endpoint:

"initial_value": pendulum.today().subtract(days=7).to_iso8601_string(),
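For context, a rough sketch of where that value would sit, assuming the issues resource uses an incremental `since` parameter keyed on `updated_at`. The standard-library `datetime` stands in for `pendulum` here only to keep the snippet dependency-free, so the exact timestamp format differs slightly:

```python
from datetime import datetime, timedelta, timezone

# Midnight today in UTC, standing in for pendulum.today().
today = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)

# Assumed shape of the incremental parameter; only initial_value changes.
since = {
    "type": "incremental",
    "cursor_path": "updated_at",
    # Start 7 days back so the first run requests far fewer pages.
    "initial_value": (today - timedelta(days=7)).isoformat(),
}
```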

And these warnings scare our new users :D Can we log the warning once at the beginning instead of for each request?
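One possible way to get "log once" behaviour, sketched with the standard `logging` module only. This is not how dlt implements its logging; the logger name, the warning text, and the `OncePerMessageFilter` and `ListHandler` classes are hypothetical:

```python
import logging


class OncePerMessageFilter(logging.Filter):
    """Let each distinct (logger, level, message) combination through once."""

    def __init__(self):
        super().__init__()
        self._seen = set()

    def filter(self, record):
        key = (record.name, record.levelno, record.getMessage())
        if key in self._seen:
            return False  # drop repeats
        self._seen.add(key)
        return True


class ListHandler(logging.Handler):
    """Collect emitted messages in a list so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("paginator_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)
logger.propagate = False
handler = ListHandler()
handler.addFilter(OncePerMessageFilter())
logger.addHandler(handler)

# Simulate the same warning being raised for every request.
for _ in range(5):
    logger.warning("Fallback paginator used: single_page")

print(len(handler.messages))  # 1
```

The filter is attached to the handler rather than changing the call sites, so the per-request code path stays untouched.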