dlt-hub / dlt-init-openapi

MIT License

Vimeo (categories) pipeline + pokemon test #80

Closed burnash closed 6 months ago

burnash commented 6 months ago

Hackathon Feedback

Questions & Answers

  1. Is it clear why we have created this, why it is useful, and what it is about?

    Yes

  2. Is it clear how the generator works? Did you manage to generate anything in the first 10 minutes after selecting a spec? What is missing from the setup instructions or the output of the generator?

    Yes, I was able to generate a pipeline, and the setup instructions were very clear. What’s missing, IMO, is a way to “regenerate” an existing pipeline: at one point while experimenting I decided to add more endpoints, and AFAIK the only way to do that was to delete the pipeline folder and regenerate the pipeline from scratch.

  3. Is the resulting dlt rest_api source legible? Should it be structured differently or annotated with comments better?

    To me it’s legible because I know rest_api pretty well. I like how the folder is structured. I also liked the placeholder params. One thing I miss is example values for those params (see my raw feedback for details).

  4. Could you run the pipeline after generation? Did it produce some data?

    I tried the pokemon and Vimeo pipelines. Both fail on the first run. It looks like pokemon was out of sync with the actual API, and Vimeo had some issues in the generated rest_api dict (see the full description in the raw notes).

  5. If something failed, was the reason for the failure clear? What error message would have been better?

    The generator didn’t fail; the resulting pipelines did, so this is only relevant to rest_api:

    I understood the errors because I’m familiar with rest_api. However, some messages are potentially not very clear:

    • For pokemon, it crashed with: “In processing pipe pokemon_read: extraction of resource pokemon_read in generator paginate_dependent_resource caused an exception: Transformer expects a field 'id' to be present in the incoming data from resource pokemon_list in order to bind it to path param id. Available parent fields are name, url”
      • Here we call the “child” or “dependent” resource a transformer; to a fresh rest_api user the term “transformer” could be unfamiliar. (We do mention this term in the docs, though.)
      • Also “resolve” vs “bind” could be confusing.
    • Second error on pokemon: dlt.destinations.exceptions.DatabaseTerminalException: Constraint Error: NOT NULL constraint failed: pokemon.id
      • That’s because of the wrong primary key.
        • Again, easy for someone with prior dlt experience, not so easy for everyone else.
    • For the Vimeo pipeline, what was missing is a way to see the response body for the Bad Request error.
  6. Was anything incorrectly converted from the spec to the rest_api definition although it is clear how it should have been generated? If so, which section and what should have been produced?

    For Vimeo pipeline:

    1. Paginator settings were generated incorrectly (see the raw notes for the details):

      {
        "paginator": {
            "type": "page_number",
            "page_param": "page",
            "total_path": "",
            "maximum_page": 20,
        },
      }

      While the response was:

       {
          "total": 10,
          "page": 1,
          "per_page": 25,
          "paging": {
              "next": null,
              "previous": null,
              "first": "/categories?page=1",
              "last": "/categories?page=1"
          },
          "data": []
      }
      

      The total path is present in the response but was generated explicitly as an empty string.

    2. The “child” resource had an incorrect path

  7. Are there any settings, options, or commands you are missing from the tool?

    No

Raw Notes

  1. It was not easy to pick a relevant API, since I needed one with data and some relevant functionality (e.g. child/parent resources). But the list of example APIs was very helpful.
  2. I ended up selecting Vimeo: https://github.com/dlt-hub/openapi-specs/blob/main/open_api_specs/vimeo.yaml
  3. Small note: I was missing a hint in the CLI to press Enter upon selection (in the console, not in the README; or maybe I didn’t notice it).
  4. Pokemon crashed. Config:
from typing import List

import dlt
from dlt.extract.source import DltResource

from rest_api import rest_api_source
from rest_api.typing import RESTAPIConfig

@dlt.source(name="pokemon_source", max_table_nesting=2)
def pokemon_source(
    base_url: str = dlt.config.value,
) -> List[DltResource]:

    # source configuration
    source_config: RESTAPIConfig = {
        "client": {
            "base_url": base_url,
            "paginator": {
                "type": "offset",
                "limit": 20,
                "offset_param": "offset",
                "limit_param": "limit",
                "total_path": "count",
            },
        },
        "resources": [
            {
                "name": "pokemon_list",
                "table_name": "pokemon",
                "primary_key": "id",
                "write_disposition": "merge",
                "endpoint": {
                    "data_selector": "results",
                    "path": "/api/v2/pokemon/",
                },
            },
            {
                "name": "pokemon_read",
                "table_name": "pokemon",
                "primary_key": "id",
                "write_disposition": "merge",
                "endpoint": {
                    "data_selector": "$",
                    "path": "/api/v2/pokemon/{id}/",
                    "params": {
                        "id": {
                            "type": "resolve",
                            "resource": "pokemon_list",
                            "field": "id",
                        },
                    },
                },
            },
            {
                "name": "pokemon_color_list",
                "table_name": "pokemon_color",
                "primary_key": "id",
                "write_disposition": "merge",
                "endpoint": {
                    "data_selector": "results",
                    "path": "/api/v2/pokemon-color/",
                },
            },
        ],
    }

    return rest_api_source(source_config)

Stacktrace:

(dlt-openapi-py3.11) bash-3.2$ PROGRESS=enlighten python pipeline.py
2024-05-17 20:21:02,872|[INFO                 ]|60844|4619372032|dlt|client.py|_send_request:128|Making GET request to https://pokeapi.co/api/v2/pokemon-color/ with params={'offset': 0, 'limit': 20}, json=None
2024-05-17 20:21:03,616|[INFO                 ]|60844|4619372032|dlt|client.py|extract_response:240|Extracted data of type list from path results with length 10
2024-05-17 20:21:03,620|[INFO                 ]|60844|4619372032|dlt|client.py|paginate:228|Paginator OffsetPaginator at 108b88a50: current offset: 20 offset_param: offset limit: 20 total_path: count maximum_value: None does not have more pages
2024-05-17 20:21:03,620|[INFO                 ]|60844|4619372032|dlt|client.py|_send_request:128|Making GET request to https://pokeapi.co/api/v2/pokemon/ with params={'offset': 0, 'limit': 20}, json=None
2024-05-17 20:21:04,331|[INFO                 ]|60844|4619372032|dlt|client.py|extract_response:240|Extracted data of type list from path results with length 20

===============================================================================================Extract rest_api_resources================================================================================================
Resources   0%|                                                                                                                                                                                   | 0/2 [00:01<?, 0.00/s]
pokemon_color 10 [00:01, 14.00/s]
Traceback (most recent call last):
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/extract/pipe_iterator.py", line 275, in _get_source_item
    pipe_item = next(gen)
                ^^^^^^^^^
  File "/Users/burnash/projects/dlthub/dlt-openapi/pokemon-pipeline/rest_api/__init__.py", line 311, in paginate_dependent_resource
    formatted_path, parent_record = process_parent_data_item(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/projects/dlthub/dlt-openapi/pokemon-pipeline/rest_api/config_setup.py", line 421, in process_parent_data_item
    raise ValueError(
ValueError: Transformer expects a field 'id' to be present in the incoming data from resource pokemon_list in order to bind it to path param id. Available parent fields are name, url

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 431, in extract
    self._extract_source(extract_step, source, max_parallel_items, workers)
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 1105, in _extract_source
    load_id = extract.extract(source, max_parallel_items, workers)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/extract/extract.py", line 397, in extract
    self._extract_single_source(
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/extract/extract.py", line 326, in _extract_single_source
    for pipe_item in pipes:
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/extract/pipe_iterator.py", line 159, in __next__
    pipe_item = self._get_source_item()
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/extract/pipe_iterator.py", line 306, in _get_source_item
    raise ResourceExtractionError(pipe.name, gen, str(ex), "generator") from ex
dlt.extract.exceptions.ResourceExtractionError: In processing pipe pokemon_read: extraction of resource pokemon_read in generator paginate_dependent_resource caused an exception: Transformer expects a field 'id' to be present in the incoming data from resource pokemon_list in order to bind it to path param id. Available parent fields are name, url

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/burnash/projects/dlthub/dlt-openapi/pokemon-pipeline/pipeline.py", line 15, in <module>
    info = pipeline.run(source)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 267, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 673, in run
    self.extract(
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 176, in _wrap
    rv = f(self, *args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 162, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 267, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 446, in extract
    raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1715970062.860288 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe pokemon_read: extraction of resource pokemon_read in generator paginate_dependent_resource caused an exception: Transformer expects a field 'id' to be present in the incoming data from resource pokemon_list in order to bind it to path param id. Available parent fields are name, url

I manually changed the resolve param from “id” to “name” and extraction worked.
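Sketched out, the fix amounts to the following (assuming the same generated config as above; PokeAPI’s list endpoint returns items with only `name` and `url` fields, so there is no `id` to resolve):

```python
# The dependent resource, with the resolve field switched from "id" to
# "name". Note the primary key should be switched to "name" too, which
# I initially forgot (see the NOT NULL error further down).
pokemon_read = {
    "name": "pokemon_read",
    "table_name": "pokemon",
    "primary_key": "name",  # was "id"
    "write_disposition": "merge",
    "endpoint": {
        "data_selector": "$",
        "path": "/api/v2/pokemon/{id}/",
        "params": {
            "id": {
                "type": "resolve",
                "resource": "pokemon_list",
                "field": "name",  # was "id"
            },
        },
    },
}
```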

I think what’s lacking is some info about the amount of data to be extracted: e.g. in the case of pokemon there were 2k+ objects, so a lot of requests.

Having enlighten as progress is nice, but it always shows 0%.

After loading, pipeline crashed with

Traceback (most recent call last):
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/destinations/sql_client.py", line 242, in _wrap_gen
    return (yield from f(self, *args, **kwargs))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/destinations/impl/duckdb/sql_client.py", line 129, in execute_query
    raise outer
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/destinations/impl/duckdb/sql_client.py", line 124, in execute_query
    self._conn.execute(query, db_args)
duckdb.duckdb.ConstraintException: Constraint Error: NOT NULL constraint failed: pokemon.id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/load/load.py", line 170, in w_spool_job
    job = client.start_file_load(
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/destinations/impl/duckdb/duck.py", line 165, in start_file_load
    job = super().start_file_load(table, file_path, load_id)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/destinations/insert_job_client.py", line 124, in start_file_load
    job = InsertValuesLoadJob(table["name"], file_path, self.sql_client)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/destinations/insert_job_client.py", line 24, in __init__
    self._sql_client.execute_fragments(fragments)
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/destinations/sql_client.py", line 119, in execute_fragments
    return self.execute_sql("".join(fragments), *args, **kwargs)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/destinations/impl/duckdb/sql_client.py", line 108, in execute_sql
    with self.execute_query(sql, *args, **kwargs) as curr:
  File "/usr/local/Cellar/python@3.11/3.11.2_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/destinations/sql_client.py", line 244, in _wrap_gen
    raise self._make_database_exception(ex)
dlt.destinations.exceptions.DatabaseTerminalException: Constraint Error: NOT NULL constraint failed: pokemon.id

====================================================================================================================================================================================================================

The terminal hung in enlighten. I think I forgot to change the primary key to name as well.

I decided to switch to Vimeo API. It was easy to get API key for testing.

I selected two endpoints: /categories and /categories/{category}

The pipeline was generated without any problems.

I decided to first test the auth with curl and asked ChatGPT to generate a curl command for the root endpoint. This was successful.

I opened a pipeline file and was pleased to find

"endpoint": {
    "data_selector": "$",
    "path": "/categories",
    "params": {
        # "direction": "FILL_ME_IN", # TODO: fill in query parameter
        # "per_page": "FILL_ME_IN", # TODO: fill in query parameter
        # "sort": "FILL_ME_IN", # TODO: fill in query parameter
    },
},

I don’t know what values are available for each parameter, but I found the descriptions in the OpenAPI file.
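For illustration, a filled-in version of those placeholders might look like this; the specific values here are my guesses, not taken from the spec, so the OpenAPI file (or the Vimeo docs) remains the authoritative source:

```python
# Hypothetical example values for the commented-out placeholder params;
# the allowed values must be checked against the OpenAPI spec.
params = {
    "direction": "asc",  # guess: sort direction
    "per_page": 25,      # guess: page size, matches the response shown later
    "sort": "name",      # guess: sort field
}
```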

There was no information about authentication, so when I ran the pipeline it crashed with 401 (Unauthorized):

  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 446, in extract
    raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1715972078.628279 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe get_categories: extraction of resource get_categories in generator paginate_resource caused an exception: 401 Client Error: Unauthorized for url: https://api.vimeo.com/categories?page=0

Since I know how rest_api works, I edited the pipeline files to enable bearer auth for the token I got.
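The edit I made can be sketched as follows (assuming the rest_api `bearer` auth shorthand; in a real pipeline the token should come from secrets rather than be hardcoded):

```python
# Client section with bearer auth added by hand; the generated config
# had no "auth" key at all.
client_config = {
    "base_url": "https://api.vimeo.com",
    "auth": {
        "type": "bearer",
        "token": "YOUR_VIMEO_TOKEN",  # e.g. read from dlt secrets in practice
    },
}
```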

The generator’s detector picked the wrong paginator settings:

{
  "paginator": {
      "type": "page_number",
      "page_param": "page",
      "total_path": "",
      "maximum_page": 20,
  },
}

Note that total_path is an empty string.

While the response has links to the next and previous pages, plus a total key:

 {
    "total": 10,
    "page": 1,
    "per_page": 25,
    "paging": {
        "next": null,
        "previous": null,
        "first": "/categories?page=1",
        "last": "/categories?page=1"
    },
    "data": []
}

After I added the authentication, the pipeline failed with:

Traceback (most recent call last):
  File "/Users/burnash/projects/dlthub/dlt-openapi/vimeo-pipeline/pipeline.py", line 15, in <module>
    info = pipeline.run(source)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 267, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 673, in run
    self.extract(
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 176, in _wrap
    rv = f(self, *args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 162, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 267, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 446, in extract
    raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1715972437.485331 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe get_categories: extraction of resource get_categories in generator paginate_resource caused an exception: 400 Client Error: Bad Request for url: https://api.vimeo.com/categories?page=0

It does not reveal the body of the error, but with curl I was able to see it:

curl -H "Authorization: bearer ... " "https://api.vimeo.com/categories?page=0"
{
    "error": "Page can not be less than one"
}

Sidenote: the paginator started from 0 (the default). I think starting from 0 is an extremely rare case, so we’d need to change the default here https://github.com/dlt-hub/dlt/blob/devel/dlt/sources/helpers/rest_client/paginators.py#L222 to 1.

I changed the paginator to “json_response”; the 400 was gone but a new error appeared:
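The paginator edit, sketched out (the Vimeo response keeps the next-page URL under `paging.next`, so a link-following paginator avoids the page-counting problem entirely):

```python
# Link-based paginator: follow the "paging.next" URL from the response
# body instead of incrementing a page number.
paginator = {
    "type": "json_response",
    "next_url_path": "paging.next",
}
```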

Traceback (most recent call last):
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 431, in extract
    self._extract_source(extract_step, source, max_parallel_items, workers)
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 1105, in _extract_source
    load_id = extract.extract(source, max_parallel_items, workers)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/extract/extract.py", line 397, in extract
    self._extract_single_source(
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/extract/extract.py", line 326, in _extract_single_source
    for pipe_item in pipes:
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/extract/pipe_iterator.py", line 159, in __next__
    pipe_item = self._get_source_item()
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/extract/pipe_iterator.py", line 306, in _get_source_item
    raise ResourceExtractionError(pipe.name, gen, str(ex), "generator") from ex
dlt.extract.exceptions.ResourceExtractionError: In processing pipe get_category: extraction of resource get_category in generator paginate_dependent_resource caused an exception: Transformer expects a field 'uri' to be present in the incoming data from resource get_categories in order to bind it to path param category. Available parent fields are total, page, per_page, paging, data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/burnash/projects/dlthub/dlt-openapi/vimeo-pipeline/pipeline.py", line 15, in <module>
    info = pipeline.run(source)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 267, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 673, in run
    self.extract(
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 176, in _wrap
    rv = f(self, *args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 162, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 267, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.11/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 446, in extract
    raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1715972655.4589498 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe get_category: extraction of resource get_category in generator paginate_dependent_resource caused an exception: Transformer expects a field 'uri' to be present in the incoming data from resource get_categories in order to bind it to path param category. Available parent fields are total, page, per_page, paging, data

Since I know rest_api, I figured that the problem was the explicit data selector on categories, which was wrong:

{
    "name": "get_categories",
    "table_name": "category",
    "primary_key": "uri",
    "write_disposition": "merge",
    "endpoint": {
        "data_selector": "$", 
        "path": "/categories",
        "params": {
            # "direction": "FILL_ME_IN", # TODO: fill in query parameter
            # "per_page": "FILL_ME_IN", # TODO: fill in query parameter
            # "sort": "FILL_ME_IN", # TODO: fill in query parameter
        },
    },
},
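The list payload lives under the `data` key of the response envelope (`total`/`page`/`per_page`/`paging`/`data`), so `"$"` handed the whole envelope to the transformer instead of the items. Either commenting the selector out (letting autodetection pick it up) or setting it explicitly fixes this; a sketch of the explicit variant:

```python
# Same resource with the data selector pointing at the item list
# instead of the whole response body.
get_categories = {
    "name": "get_categories",
    "table_name": "category",
    "primary_key": "uri",
    "write_disposition": "merge",
    "endpoint": {
        "data_selector": "data",  # was "$"
        "path": "/categories",
    },
}
```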

I commented it out to allow autodetection and did another run. The list endpoint data was now fetched, but the problem moved to the child endpoint:


Traceback (most recent call last):
  File "/Users/burnash/projects/dlthub/dlt-openapi/vimeo-pipeline/pipeline.py", line 15, in <module>
    info = pipeline.run(source)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.12/lib/python3.12/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.12/lib/python3.12/site-packages/dlt/pipeline/pipeline.py", line 267, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.12/lib/python3.12/site-packages/dlt/pipeline/pipeline.py", line 673, in run
    self.extract(
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.12/lib/python3.12/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.12/lib/python3.12/site-packages/dlt/pipeline/pipeline.py", line 176, in _wrap
    rv = f(self, *args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.12/lib/python3.12/site-packages/dlt/pipeline/pipeline.py", line 162, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.12/lib/python3.12/site-packages/dlt/pipeline/pipeline.py", line 267, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/burnash/Library/Caches/pypoetry/virtualenvs/dlt-openapi-g9gJlfBD-py3.12/lib/python3.12/site-packages/dlt/pipeline/pipeline.py", line 446, in extract
    raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1716304188.623159 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe get_category: extraction of resource get_category in generator paginate_dependent_resource caused an exception: 404 Client Error: Not Found for url: https://api.vimeo.com/categories//categories/adsandcommercials

The URL looks weird: categories is duplicated in https://api.vimeo.com/categories//categories/adsandcommercials

Note: it’d be great to be able to see an example item’s data when debugging. I did just that by inserting a pdb breakpoint into the transformer code in rest_api. When I printed a sample item I found that the resolved param is wrong. In the generated config:

{
    "name": "get_category",
    "table_name": "category",
    "primary_key": "uri",
    "write_disposition": "merge",
    "endpoint": {
        "data_selector": "$",
        "path": "/categories/{category}",
        "params": {
            "category": {
                "type": "resolve",
                "resource": "get_categories",
                "field": "uri",
            },
        },
    },
},

It references the uri field, and this field already has the /categories/ prefix:

{'uri': '/categories/adsandcommercials', 'name': 'Ads and Commercials', 'link': ..}

After I changed it to

{
    "name": "get_category",
    "table_name": "category",
    "primary_key": "uri",
    "write_disposition": "merge",
    "endpoint": {
        "data_selector": "$",
        "path": "{category}",
        "params": {
            "category": {
                "type": "resolve",
                "resource": "get_categories",
                "field": "uri",
            },
        },
    },
},

The pipeline ran successfully. 🚀
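Putting the manual fixes together, here is a hedged sketch of the Vimeo config that finally ran for me; the auth block, paginator, data selector, and child path are the parts I edited by hand, everything else is as generated:

```python
# Consolidated Vimeo rest_api config after all manual edits.
vimeo_config = {
    "client": {
        "base_url": "https://api.vimeo.com",
        "auth": {"type": "bearer", "token": "YOUR_VIMEO_TOKEN"},  # added by hand
        "paginator": {
            "type": "json_response",        # was "page_number" with an empty total_path
            "next_url_path": "paging.next",
        },
    },
    "resources": [
        {
            "name": "get_categories",
            "table_name": "category",
            "primary_key": "uri",
            "write_disposition": "merge",
            "endpoint": {
                "data_selector": "data",    # was "$"
                "path": "/categories",
            },
        },
        {
            "name": "get_category",
            "table_name": "category",
            "primary_key": "uri",
            "write_disposition": "merge",
            "endpoint": {
                "data_selector": "$",
                "path": "{category}",       # "uri" already contains "/categories/..."
                "params": {
                    "category": {
                        "type": "resolve",
                        "resource": "get_categories",
                        "field": "uri",
                    },
                },
            },
        },
    ],
}
```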

sh-rp commented 6 months ago

@burnash I have created a ticket to fix the paginator and jsonpath detection for vimeo: https://github.com/orgs/dlt-hub/projects/12/views/1?pane=issue&itemId=63884181. I did not see any other actionables from your notes, let me know if I missed something that is part of the generator.