Closed: francescomucio closed this 3 months ago
@francescomucio thank you for your contribution and for proposing the enhancement to the rest_api configuration. These are valuable suggestions and they can certainly improve the flexibility of resource configuration.
However, I see a few potential drawbacks with the proposed interface:
1. Flexibility in defining the precedence of filters and transforms: the current approach of specifying `row_filter`, `exclude_columns`, and `transform` separately might lead to confusion regarding the order of operations. It's crucial to have a clear order of applying these operations to ensure predictable results.
2. Consistency with existing dlt resource methods: one possible improvement is to align with the existing `DltResource` methods `add_map()` and `add_filter()`. This could make the configuration more intuitive for users who are already familiar with dlt. (Most likely the implementation should also be based on `add_map()` and `add_filter()` where possible.)
To address these issues, I suggest an alternative format that puts all operations in one `operations` list. This makes the configuration clear and flexible by showing the exact order of operations. Taking the example from #494:
```python
{
    "name": "my_nicely_named_resource",
    "endpoint": {
        "path": "endpoint_name"
    },
    "operations": [
        {"filter": lambda x: x["id"] == 3},
        {"map": "delete_fields", "fields": ["id", "another_column"]},
        {
            "map": "rename_fields",
            "fields": {
                "user_id": "my_user_id",
                "timestamp": "my_timestamp"
            }
        },
        {"map": my_function}
    ]
}
```
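For illustration, here is a minimal pure-Python sketch of how such an `operations` list could be interpreted. The `apply_operations` helper is hypothetical (not part of dlt), and it only handles the callable variants of `filter` and `map`; the named steps like `"delete_fields"` would dispatch to pre-defined functions in the same loop.

```python
def apply_operations(records, operations):
    """Apply filter/map steps to an iterable of dict records, in order."""
    for record in records:
        keep = True
        for step in operations:
            if "filter" in step and callable(step["filter"]):
                # A failing filter drops the record; later steps are skipped.
                if not step["filter"](record):
                    keep = False
                    break
            elif "map" in step and callable(step["map"]):
                # A map replaces the record with the transformed one.
                record = step["map"](record)
        if keep:
            yield record

rows = [{"id": 3, "name": "a"}, {"id": 7, "name": "b"}]
ops = [
    {"filter": lambda x: x["id"] == 3},
    {"map": lambda x: {**x, "name": x["name"].upper()}},
]
result = list(apply_operations(rows, ops))  # [{"id": 3, "name": "A"}]
```

Because the steps are applied strictly in list order, the precedence question raised above is answered by the configuration itself.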
Some other possible pre-defined filters:

```python
{"filter": "range", "field": "date", "from": "2021-01-01", "to": "2021-12-31"}
```

or

```python
{"filter": "in_set", "field": "category", "values": ["tech", "finance", "health"]}
```

and so on.
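These named filters could be implemented as small factories that return predicates; a sketch under that assumption (the names `make_range_filter` and `make_in_set_filter` are hypothetical):

```python
def make_range_filter(field, from_, to):
    """Return a predicate keeping rows whose `field` lies in [from_, to].

    ISO date strings compare correctly as plain strings.
    """
    return lambda row: from_ <= row[field] <= to

def make_in_set_filter(field, values):
    """Return a predicate keeping rows whose `field` is in `values`."""
    allowed = set(values)
    return lambda row: row[field] in allowed

in_2021 = make_range_filter("date", "2021-01-01", "2021-12-31")
in_2021({"date": "2021-06-15"})  # True

wanted = make_in_set_filter("category", ["tech", "finance", "health"])
wanted({"category": "sports"})  # False
```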
Again, thank you for the suggestion and valuable input. What do you think about this? Looking forward to your feedback.
I like the idea; it's nice that data can be transformed/mapped before or after a filter, and that both operations can be applied multiple times.
To start I would keep it simple, with just `map` and `filter`:

```python
"operations": [
    {"filter": lambda x: x["id"] == 3},
    {"map": my_function}
]
```
and then extend it with additional functionality, each with its own keyword (nothing that couldn't be handled by passing a callable; this is just sugar for developers):

```python
"operations": [
    {"delete_fields": ["id", "another_column"]},
    {"rename_fields": {
        "user_id": "my_user_id",
        "timestamp": "my_timestamp"
    }}
]
```
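This sugar could desugar to plain map callables. A minimal sketch, assuming each keyword maps to a factory returning a row-transforming function (the helper names are hypothetical):

```python
def delete_fields(fields):
    """Map step: drop the listed keys from each row."""
    drop = set(fields)
    return lambda row: {k: v for k, v in row.items() if k not in drop}

def rename_fields(mapping):
    """Map step: rename keys according to `mapping`, keeping other keys."""
    return lambda row: {mapping.get(k, k): v for k, v in row.items()}

row = {"id": 1, "user_id": 7, "timestamp": "2021-01-01", "another_column": "x"}
row = delete_fields(["id", "another_column"])(row)
row = rename_fields({"user_id": "my_user_id", "timestamp": "my_timestamp"})(row)
# row is now {"my_user_id": 7, "my_timestamp": "2021-01-01"}
```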
What do you think?
Overall this looks good. Maybe we could rename `operations` to `transforms` (cf. `dlt.sources.helpers.transform`)? I'm totally for adding functions to rename or delete fields.

I'm not very firm on "operations", but I was looking for a keyword that could work for both "map" and "filter". In my opinion "transform" could be a bit too specific, so it's hard to fit "filter" under it. But I may be wrong, and there may be better alternatives to "operations".
Other than that, I also agree with @francescomucio that we can start with Python functions.
I went with `processing_steps`. I have also added an initial test case for a filter; I will add one for a map and maybe something a bit more complex. A test that removes columns could also be nice :)
@burnash @willi-mueller please take a look at it
Tell us what you do here
As per the associated issue, rest_api: Allow the REST API config object to exclude rows, columns, and transform data:

- `row_filter`, in the form of a function or lambda
- `exclude_columns`, as jsonpaths, that need to be removed from the output
- `transform`, in the form of a function or lambda

I have added the additional properties to typing.py and added the code to handle them.
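For context, a simplified sketch of the three proposed properties applied to a single row. This is illustrative only: `exclude_columns` here matches top-level keys rather than full jsonpaths, the `process_row` helper is hypothetical, and the application order shown (filter, then exclude, then transform) is just one possible choice; as noted above, the real order must be defined explicitly.

```python
def process_row(row, row_filter=None, exclude_columns=None, transform=None):
    """Apply row_filter -> exclude_columns -> transform to one dict row.

    Returns None when the row is filtered out.
    Simplification: exclude_columns matches top-level keys, not jsonpaths.
    """
    if row_filter is not None and not row_filter(row):
        return None
    if exclude_columns:
        drop = set(exclude_columns)
        row = {k: v for k, v in row.items() if k not in drop}
    if transform is not None:
        row = transform(row)
    return row

out = process_row(
    {"id": 3, "name": "a", "secret": "x"},
    row_filter=lambda r: r["id"] == 3,
    exclude_columns=["secret"],
    transform=lambda r: {**r, "name": r["name"].upper()},
)
# out is {"id": 3, "name": "A"}
```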
Related Issues
This solves this issue
Additional Context