dlt-hub / verified-sources

Contribute to dlt verified sources 🔥
https://dlthub.com/docs/walkthroughs/add-a-verified-source
Apache License 2.0

rest_api: filter, exclude, transform API responses #495

Closed francescomucio closed 3 months ago

francescomucio commented 5 months ago

Tell us what you do here

As per the associated issue, rest_api: Allow the REST API config object to exclude rows, columns, and transform data:

I have added the additional properties to typing.py and added the code to handle them.

Related Issues

This solves this issue

Additional Context

burnash commented 4 months ago

@francescomucio thank you for your contribution and for proposing the enhancement to the rest_api configuration. These are valuable suggestions and they can certainly improve the flexibility of resource configuration.

However, I see a few potential drawbacks with the proposed interface:

  1. Flexibility in defining the precedence of filters and transforms: the current approach of specifying row_filter, exclude_columns, and transform separately might lead to confusion regarding the order of operations. It's crucial to have a clear order of applying these operations to ensure predictable results.

  2. Consistency with existing dlt resource methods: one possible improvement is to align with the existing DltResource methods add_map() and add_filter(). This could make the configuration more intuitive for users who are already familiar with dlt. (Most likely the implementation should also be based on add_map() and add_filter() where possible.)

To address these issues, I suggest an alternative format that puts all operations in a single operations list. This makes the exact order of operations explicit. Taking the example from #494:

{
    "name": "my_nicely_named_resource",
    "endpoint": {
        "path": "endpoint_name"
    },
    "operations": [
        {"filter": lambda x: x["id"] == 3},
        {"map": "delete_fields", "fields": ["id", "another_column"]},
        {
            "map": "rename_fields",  
            "fields": {
                "user_id": "my_user_id",
                "timestamp": "my_timestamp"
            }
        },
        {"map": my_function}
    ]
}
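To make the ordered-list semantics concrete, here is a minimal, dependency-free sketch of how such an operations list could be applied to a batch of rows. The helper name apply_operations and its dispatch logic are purely illustrative, not part of dlt:

```python
def apply_operations(rows, operations):
    """Apply filter/map steps to rows in the declared order (illustrative helper)."""
    for op in operations:
        if "filter" in op:
            # keep only rows for which the predicate returns True
            rows = [r for r in rows if op["filter"](r)]
        elif "map" in op:
            fn = op["map"]
            if fn == "delete_fields":
                drop = set(op["fields"])
                rows = [{k: v for k, v in r.items() if k not in drop} for r in rows]
            elif fn == "rename_fields":
                mapping = op["fields"]
                rows = [{mapping.get(k, k): v for k, v in r.items()} for r in rows]
            else:
                # a plain callable: apply it to every row
                rows = [fn(r) for r in rows]
    return rows


rows = [
    {"id": 3, "user_id": 7, "timestamp": "2021-01-01", "another_column": "x"},
    {"id": 5, "user_id": 8, "timestamp": "2021-02-01", "another_column": "y"},
]
operations = [
    {"filter": lambda x: x["id"] == 3},
    {"map": "delete_fields", "fields": ["id", "another_column"]},
    {"map": "rename_fields", "fields": {"user_id": "my_user_id", "timestamp": "my_timestamp"}},
]
print(apply_operations(rows, operations))
# → [{'my_user_id': 7, 'my_timestamp': '2021-01-01'}]
```

Because the steps run in list order, swapping the filter and the rename would change the result (the filter would then have to reference my_user_id), which is exactly the precedence clarity the list format buys.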

Some other pre-defined possible filters:

{"filter": "range", "field": "date", "from": "2021-01-01", "to": "2021-12-31"}

or

{"filter": "in_set", "field": "category", "values": ["tech", "finance", "health"]}

and so on.
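One way such declarative filter specs could be expanded into row predicates is sketched below; make_filter and its behavior (lexicographic comparison of ISO date strings for "range") are assumptions for illustration, not an existing dlt API:

```python
def make_filter(spec):
    """Build a row predicate from a declarative filter spec (illustrative)."""
    if spec["filter"] == "range":
        lo, hi = spec["from"], spec["to"]
        # ISO-8601 date strings compare correctly as plain strings
        return lambda row: lo <= row[spec["field"]] <= hi
    if spec["filter"] == "in_set":
        allowed = set(spec["values"])
        return lambda row: row[spec["field"]] in allowed
    raise ValueError(f"unknown filter: {spec['filter']}")


in_2021 = make_filter(
    {"filter": "range", "field": "date", "from": "2021-01-01", "to": "2021-12-31"}
)
is_known_category = make_filter(
    {"filter": "in_set", "field": "category", "values": ["tech", "finance", "health"]}
)

print(in_2021({"date": "2021-06-15"}))            # True
print(is_known_category({"category": "sports"}))  # False
```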

Again, thank you for the suggestion and valuable input. What do you think about this? Looking forward to your feedback.

francescomucio commented 4 months ago

I like the idea: it makes it possible to transform/map the data before or after a filter, and to apply both operations multiple times.

To start I would keep it simple, with just map or filter:

    "operations": [
        {"filter": lambda x: x["id"] == 3},
        {"map": my_function}
    ]

and then extend it with additional functionality, each piece with its own keyword (nothing that cannot be handled by passing a callable, just sugar for developers):

    "operations": [
        {"delete_fields": ["id", "another_column"]},
        {"rename_fields": {
                "user_id": "my_user_id",
                "timestamp": "my_timestamp"
            }
        }
    ]
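The keyword sugar above is indeed expressible as plain callables. A rough sketch, where the factory names delete_fields and rename_fields are hypothetical and each one returns an ordinary map function:

```python
def delete_fields(fields):
    """Return a map callable that drops the given keys from each row."""
    drop = set(fields)
    return lambda row: {k: v for k, v in row.items() if k not in drop}


def rename_fields(mapping):
    """Return a map callable that renames keys per the mapping, keeping the rest."""
    return lambda row: {mapping.get(k, k): v for k, v in row.items()}


row = {"id": 1, "user_id": 7, "timestamp": "t0", "another_column": "x"}
step1 = delete_fields(["id", "another_column"])
step2 = rename_fields({"user_id": "my_user_id", "timestamp": "my_timestamp"})
print(step2(step1(row)))
# → {'my_user_id': 7, 'my_timestamp': 't0'}
```

Since each factory yields a plain callable, the sugared forms could desugar to ordinary map steps internally, keeping the execution path uniform.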

What do you think?

rudolfix commented 4 months ago

Overall this looks good

  1. We call those functions transforms, not operations. Maybe we could rename it to transform?
  2. The structure as a list of functions is OK
  3. I agree with Francesco: let's start with Python functions. If we want to implement a few standard transformations, we keep them here: dlt.sources.helpers.transform. I'm totally for adding functions to rename or delete fields.
  4. If we do (3), we can skip the discussion of how to declare functions with arguments in dicts/YAML, which we would then have to reuse everywhere.
burnash commented 3 months ago

I'm not very firm on "operations", but I was looking for a keyword that could cover both "map" and "filter". In my opinion "transform" could be a bit too specific, so it's hard to fit "filter" under it. But I may be wrong. So the alternatives to "operations" could be:

  1. "transform" / "transformations"
  2. "processing_steps" to better align with underlying API.

Other than that, I also agree with @francescomucio that we can start with Python functions.

francescomucio commented 3 months ago

I went with processing_steps. I have also added an initial test case for a filter; I will add one for a map and maybe something a bit more complex. A test for removing columns could also be nice :)

@burnash @willi-mueller please take a look at it