dlt-hub / dlt-init-openapi


Hackaton: Sultan #47

Closed sultaniman closed 6 months ago

sultaniman commented 6 months ago

This PR contains three generated REST API sources: pollenrapporten, fakerestapi, and marvel.

For the first two APIs the generator worked pretty much out of the box, needing only minor naming and sub-resource selection adjustments.

The Marvel API spec required much more work to get working. Below are the errors and issues I ran into while implementing it.

Related PR with mentioned specs https://github.com/dlt-hub/dlt-init-openapi/pull/42

TODO

Created issues

https://github.com/dlt-hub/dlt/issues/1388

Notes

With the Marvel API spec the generator produced more endpoints than were selected, and resource names for different endpoints were duplicated: https://github.com/dlt-hub/dlt-openapi/assets/354868/2c4e4eab-19b4-4229-a146-0fdfabb088bc


When the auth strategy is APIKey, the logs show something like:

Making GET request to http://gateway.marvel.com/v1/public/characters with params={'ts': 1716293455, 'hash': '19b8d956c8ed7795530ca7b28ce99cdd', 'offset': 0, 'limit': 20}

Is the api_key query parameter intentionally excluded from the logged params (@burnash is probably interested in this)?

Sample config:

source_config: RESTAPIConfig = {
    "client": {
        "base_url": base_url,
        "paginator": {
            "type": "offset",
            "limit": 20,
            "offset_param": "offset",
            "limit_param": "limit",
            "total_path": "",
            "maximum_offset": 20,
        },
        "auth": APIKeyAuth(
            api_key=public_key,
            name="apikey",
            location="query",
        ),
    },
    "resource_defaults": {
        "endpoint": {
            "params": auth_params,
        }
    },
}
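The `auth_params` value used in the config above is not shown in this PR. As a minimal sketch, assuming it carries the `ts` and `hash` parameters visible in the log line, and that the hash is the MD5 digest of timestamp + private key + public key as the Marvel developer documentation describes, it could be built roughly like this. The helper name `build_marvel_auth_params` and the key values are hypothetical, not part of dlt or the generated source.

```python
import hashlib
import time
from typing import Dict, Optional, Union


def build_marvel_auth_params(
    public_key: str,
    private_key: str,
    ts: Optional[int] = None,
) -> Dict[str, Union[int, str]]:
    """Return the extra query parameters Marvel expects next to `apikey`.

    Hypothetical helper, not part of dlt or the generated source.
    """
    # Use the current Unix time unless the caller pins a timestamp.
    ts = int(time.time()) if ts is None else ts
    # Marvel documents the hash as md5(ts + privateKey + publicKey).
    payload = f"{ts}{private_key}{public_key}".encode("utf-8")
    return {"ts": ts, "hash": hashlib.md5(payload).hexdigest()}


# Placeholder keys for illustration only.
auth_params = build_marvel_auth_params("public-key", "private-key")
```

Because the timestamp is fixed when the dictionary is built, every request in the run would reuse the same `ts` and `hash` pair; whether that is acceptable to the API is not established here.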

Resource defaults are not respected

  1. For sub-resources it fetches 5-10 records, then fails with the exception below.
  2. The parent resource shows similar behavior to item 1.
Traceback (most recent call last):
  File "/Users/sultan/Projects/DLT/dlt-openapi/hackathon/marvel/pipeline.py", line 15, in <module>
    info = pipeline.run(source)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/sultan/Projects/DLT/dlt-openapi/.venv/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sultan/Projects/DLT/dlt-openapi/.venv/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 267, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sultan/Projects/DLT/dlt-openapi/.venv/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 673, in run
    self.extract(
  File "/Users/sultan/Projects/DLT/dlt-openapi/.venv/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sultan/Projects/DLT/dlt-openapi/.venv/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 176, in _wrap
    rv = f(self, *args, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sultan/Projects/DLT/dlt-openapi/.venv/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 162, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sultan/Projects/DLT/dlt-openapi/.venv/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 267, in _wrap
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sultan/Projects/DLT/dlt-openapi/.venv/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 446, in extract
    raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1716294336.313966 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe get_character_individual: extraction of resource get_character_individual in generator paginate_dependent_resource caused an exception: 409 Client Error: Conflict for url: http://gateway.marvel.com/v1/public/characters/1011334?apikey=KEY

Incorrect pagination detection in the generator: for the marvel source's get_comics_collection resource it selects startYear, although the endpoint uses regular offset-limit pagination (imo we need a pagination priority list or something like it).

- description: Return only issues in series whose start year matches the input.
  in: query
  name: startYear
  required: false
  schema:
    format: int32
    type: integer
...SKIPPED...
- description: Limit the result set to the specified number of resources.
  in: query
  name: limit
  required: false
  schema:
    format: int32
    type: integer
- description: Skip the specified number of resources in the result set.
  in: query
  name: offset
  required: false
  schema:
    format: int32
    type: integer
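The "pagination priority list" idea could be sketched as follows. This is only an illustration of the suggestion, not how the generator currently detects paginators; `PAGINATOR_PRIORITY`, `detect_paginator`, and the candidate parameter names are all assumptions. Rules are tried in order and parameter names must match exactly, so a filter such as `startYear` is never picked up as a pagination parameter.

```python
from typing import Any, Dict, Iterable, Optional

# Hypothetical priority list: earlier rules win when several could match.
# Each rule maps a paginator type to the config keys it needs and the
# query parameter names accepted for each key.
PAGINATOR_PRIORITY = [
    ("offset", {"offset_param": ("offset", "skip"), "limit_param": ("limit", "count")}),
    ("page_number", {"page_param": ("page", "pageNumber")}),
    ("cursor", {"cursor_param": ("cursor", "after")}),
]


def detect_paginator(param_names: Iterable[str]) -> Optional[Dict[str, Any]]:
    """Return a config for the first rule whose parameters are all present."""
    available = set(param_names)
    for paginator_type, required in PAGINATOR_PRIORITY:
        config: Dict[str, Any] = {"type": paginator_type}
        for key, candidates in required.items():
            match = next((name for name in candidates if name in available), None)
            if match is None:
                break  # this rule is incomplete, try the next one
            config[key] = match
        else:
            return config
    return None


print(detect_paginator(["startYear", "limit", "offset"]))
```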

Missing query params in the params dictionary: /v1/public/comics accepts a startYear parameter, which is present in the spec but was not added by the generator.

Generated parameters

{
    "name": "get_comics_collection",
    "table_name": "comics",
    "endpoint": {
        "data_selector": "$",
        "path": "/v1/public/comics",
        "params": {
            # "format": "FILL_ME_IN", # TODO: fill in query parameter
            # "formatType": "FILL_ME_IN", # TODO: fill in query parameter
            # "noVariants": "FILL_ME_IN", # TODO: fill in query parameter
            # "dateDescriptor": "FILL_ME_IN", # TODO: fill in query parameter
            # "dateRange": "FILL_ME_IN", # TODO: fill in query parameter
            # "title": "FILL_ME_IN", # TODO: fill in query parameter
            # "titleStartsWith": "FILL_ME_IN", # TODO: fill in query parameter
            # "issueNumber": "FILL_ME_IN", # TODO: fill in query parameter
            # "diamondCode": "FILL_ME_IN", # TODO: fill in query parameter
            # "digitalId": "FILL_ME_IN", # TODO: fill in query parameter
            # "upc": "FILL_ME_IN", # TODO: fill in query parameter
            # "isbn": "FILL_ME_IN", # TODO: fill in query parameter
            # "ean": "FILL_ME_IN", # TODO: fill in query parameter
            # "issn": "FILL_ME_IN", # TODO: fill in query parameter
            # "hasDigitalIssue": "FILL_ME_IN", # TODO: fill in query parameter
            # "modifiedSince": "FILL_ME_IN", # TODO: fill in query parameter
            # "creators": "FILL_ME_IN", # TODO: fill in query parameter
            # "characters": "FILL_ME_IN", # TODO: fill in query parameter
            # "series": "FILL_ME_IN", # TODO: fill in query parameter
            # "events": "FILL_ME_IN", # TODO: fill in query parameter
            # "stories": "FILL_ME_IN", # TODO: fill in query parameter
            # "sharedAppearances": "FILL_ME_IN", # TODO: fill in query parameter
            # "collaborators": "FILL_ME_IN", # TODO: fill in query parameter
            # "orderBy": "FILL_ME_IN", # TODO: fill in query parameter
            # "offset": "FILL_ME_IN", # TODO: fill in query parameter
        },
    },
}

Incrementals that use only part of a date, such as the year, have no datetime formatting support. For example, the marvel source has a startYear parameter that could take its value from the modified field in the response/spec, but that field is a full datetime value such as 2019-08-21T17:11:27-0400.

- description: Return only issues in series whose start year matches the input.
  in: query
  name: startYear
  required: false
  schema:
    format: int32
    type: integer
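Until such formatting support exists, the conversion itself is small. The sketch below uses only the standard library and assumes the modified value always has the shape shown above; `start_year_from_modified` is a hypothetical helper, not a dlt API.

```python
from datetime import datetime


def start_year_from_modified(modified: str) -> int:
    """Extract the year from a timestamp shaped like 2019-08-21T17:11:27-0400."""
    # %z accepts a UTC offset without a colon, such as -0400.
    return datetime.strptime(modified, "%Y-%m-%dT%H:%M:%S%z").year


print(start_year_from_modified("2019-08-21T17:11:27-0400"))  # 2019
```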

Can we also detect paginators that are common across endpoints and extract them into resource_defaults? Maybe this one is for @sh-rp.
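One possible shape for that extraction, sketched on plain dictionaries: count identical paginator configs across resources and move the most frequent one into the defaults. `hoist_common_paginator` is a hypothetical name, and the sketch assumes every `endpoint` is a dictionary and every paginator config is JSON-serializable.

```python
import json
from collections import Counter
from typing import Any, Dict, List, Tuple


def hoist_common_paginator(
    resources: List[Dict[str, Any]],
) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Move the most frequent paginator config into resource defaults.

    Returns (resource_defaults, resources); the input list is not modified.
    """
    # Serialize each paginator so that equal configs can be counted.
    counts = Counter(
        json.dumps(r["endpoint"]["paginator"], sort_keys=True)
        for r in resources
        if "paginator" in r.get("endpoint", {})
    )
    if not counts:
        return {}, resources
    common_key, hits = counts.most_common(1)[0]
    if hits < 2:
        # Nothing is shared, so there is nothing worth extracting.
        return {}, resources
    common = json.loads(common_key)
    slimmed = []
    for r in resources:
        endpoint = r.get("endpoint", {})
        if endpoint.get("paginator") == common:
            # Drop the per-resource copy; the default now covers it.
            endpoint = {k: v for k, v in endpoint.items() if k != "paginator"}
            r = {**r, "endpoint": endpoint}
        slimmed.append(r)
    return {"endpoint": {"paginator": common}}, slimmed
```

Resources whose paginator differs from the common one keep their own config, which would then act as an override of the default.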


Documentation

The documentation could be more detailed on adding custom pagination, sub-resource configuration, and custom authentication. For sub-resources we need more explanation of JSONPath selection with data_selector and the flavor we use, perhaps with links to a reference. For custom authentication, the marvel source is a good example: it is basically the APIKey strategy, but additional query parameters (a timestamp and a hash of the keys) must be passed as well.

sh-rp commented 6 months ago

@sultaniman thanks for your feedback. Notes and tickets: