dlt-hub / verified-sources

Contribute to dlt verified sources 🔥
https://dlthub.com/docs/walkthroughs/add-a-verified-source
Apache License 2.0

rest_api: filter, exclude, transform API responses #495

Closed francescomucio closed 3 months ago

francescomucio commented 5 months ago

Tell us what you do here

As per the associated issue, rest_api: Allow the REST API config object to exclude rows, columns, and transform data:

I have added the additional properties to typing.py and added the code to handle them.

Related Issues

This solves this issue

Additional Context

burnash commented 4 months ago

@francescomucio thank you for your contribution and for proposing the enhancement to the rest_api configuration. These are valuable suggestions and they can certainly improve the flexibility of resource configuration.

However, I see a few potential drawbacks with the proposed interface:

  1. Flexibility in defining the precedence of filters and transforms: the current approach of specifying row_filter, exclude_columns, and transform separately might lead to confusion regarding the order of operations. It's crucial to have a clear order of applying these operations to ensure predictable results.

  2. Consistency with existing dlt resource methods: one possible improvement is to align with the existing DltResource methods add_map() and add_filter(). This could make the configuration more intuitive for users who are already familiar with dlt. (Most likely the implementation should also be based on add_map() and add_filter() where possible.)

To address these issues, I suggest an alternative format that puts all operations in a single operations list. This makes the exact order of operations explicit. Taking the example from #494:

{
    "name": "my_nicely_named_resource",
    "endpoint": {
        "path": "endpoint_name"
    },
    "operations": [
        {"filter": lambda x: x["id"] == 3},
        {"map": "delete_fields", "fields": ["id", "another_column"]},
        {
            "map": "rename_fields",  
            "fields": {
                "user_id": "my_user_id",
                "timestamp": "my_timestamp"
            }
        },
        {"map": my_function}
    ]
}
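To make the ordered-list semantics concrete, here is a minimal, dependency-free sketch of how such an operations list could be applied to a batch of rows. The helper name apply_operations and its dispatch logic are purely illustrative, not part of dlt:

```python
def apply_operations(rows, operations):
    """Apply filter/map steps to rows in the declared order (illustrative helper)."""
    for op in operations:
        if "filter" in op:
            # keep only rows for which the predicate returns True
            rows = [r for r in rows if op["filter"](r)]
        elif "map" in op:
            fn = op["map"]
            if fn == "delete_fields":
                drop = set(op["fields"])
                rows = [{k: v for k, v in r.items() if k not in drop} for r in rows]
            elif fn == "rename_fields":
                mapping = op["fields"]
                rows = [{mapping.get(k, k): v for k, v in r.items()} for r in rows]
            else:
                # a plain callable: apply it to every row
                rows = [fn(r) for r in rows]
    return rows


rows = [
    {"id": 3, "user_id": 7, "timestamp": "2021-01-01", "another_column": "x"},
    {"id": 5, "user_id": 8, "timestamp": "2021-02-01", "another_column": "y"},
]
operations = [
    {"filter": lambda x: x["id"] == 3},
    {"map": "delete_fields", "fields": ["id", "another_column"]},
    {"map": "rename_fields", "fields": {"user_id": "my_user_id", "timestamp": "my_timestamp"}},
]
print(apply_operations(rows, operations))
# → [{'my_user_id': 7, 'my_timestamp': '2021-01-01'}]
```

Because the steps run in list order, swapping the filter and the rename would change the result (the filter would then have to reference my_user_id), which is exactly the precedence clarity the list format buys.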

Some other pre-defined possible filters:

{"filter": "range", "field": "date", "from": "2021-01-01", "to": "2021-12-31"}

or

{"filter": "in_set", "field": "category", "values": ["tech", "finance", "health"]}

and so on.
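One way such declarative filter specs could be expanded into row predicates is sketched below; make_filter and its behavior (lexicographic comparison of ISO date strings for "range") are assumptions for illustration, not an existing dlt API:

```python
def make_filter(spec):
    """Build a row predicate from a declarative filter spec (illustrative)."""
    if spec["filter"] == "range":
        lo, hi = spec["from"], spec["to"]
        # ISO-8601 date strings compare correctly as plain strings
        return lambda row: lo <= row[spec["field"]] <= hi
    if spec["filter"] == "in_set":
        allowed = set(spec["values"])
        return lambda row: row[spec["field"]] in allowed
    raise ValueError(f"unknown filter: {spec['filter']}")


in_2021 = make_filter(
    {"filter": "range", "field": "date", "from": "2021-01-01", "to": "2021-12-31"}
)
is_known_category = make_filter(
    {"filter": "in_set", "field": "category", "values": ["tech", "finance", "health"]}
)

print(in_2021({"date": "2021-06-15"}))            # True
print(is_known_category({"category": "sports"}))  # False
```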

Again, thank you for the suggestion and valuable input. What do you think about this? Looking forward to your feedback.

francescomucio commented 4 months ago

I like the idea: it makes it possible to transform/map the data before or after a filter, and to apply both operations multiple times.

To start I would keep it simple, with just map or filter:

    "operations": [
        {"filter": lambda x: x["id"] == 3},
        {"map": my_function}
    ]

and then extend it with additional functionality, each piece with its own keyword (nothing that cannot be handled by passing a callable, just sugar for developers):

    "operations": [
        {"delete_fields": ["id", "another_column"]},
        {"rename_fields": {
                "user_id": "my_user_id",
                "timestamp": "my_timestamp"
            }
        }
    ]
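The keyword sugar above is indeed expressible as plain callables. A rough sketch, where the factory names delete_fields and rename_fields are hypothetical and each one returns an ordinary map function:

```python
def delete_fields(fields):
    """Return a map callable that drops the given keys from each row."""
    drop = set(fields)
    return lambda row: {k: v for k, v in row.items() if k not in drop}


def rename_fields(mapping):
    """Return a map callable that renames keys per the mapping, keeping the rest."""
    return lambda row: {mapping.get(k, k): v for k, v in row.items()}


row = {"id": 1, "user_id": 7, "timestamp": "t0", "another_column": "x"}
step1 = delete_fields(["id", "another_column"])
step2 = rename_fields({"user_id": "my_user_id", "timestamp": "my_timestamp"})
print(step2(step1(row)))
# → {'my_user_id': 7, 'my_timestamp': 't0'}
```

Since each factory yields a plain callable, the sugared forms could desugar to ordinary map steps internally, keeping the execution path uniform.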

What do you think?

rudolfix commented 4 months ago

Overall this looks good

  1. We call those functions transforms, not operations. Maybe we could rename it to transform?
  2. The structure as a list of functions is OK
  3. I agree with Francesco: let's start with Python functions. If we want to implement a few standard transformations, we keep them here: dlt.sources.helpers.transform. I'm totally for adding functions to rename or delete fields.
  4. If we do (3), we can skip the discussion of how to declare functions with arguments in dicts/YAML, which we would then have to reuse everywhere.
burnash commented 3 months ago

I'm not very firm on "operations", but I was looking for a keyword that could cover both "map" and "filter". In my opinion "transform" could be a bit too specific, so it's hard to fit "filter" under it. But I may be wrong. So the alternatives to "operations" could be:

  1. "transform" / "transformations"
  2. "processing_steps" to better align with underlying API.

Other than that, I also agree with @francescomucio that we can start with Python functions.

francescomucio commented 3 months ago

I went with processing_steps. I have also added an initial test case for a filter; I will add one for a map and maybe something a bit more complex. A test for removing columns could also be nice :)

@burnash @willi-mueller please take a look at it