dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

Feature request: support for Yaml column renaming #585

Open adrianbr opened 1 year ago

adrianbr commented 1 year ago

A user has already expressed interest in this feature for renaming columns that come from APIs as hashes.

This would give users an easy way to reconfigure field names after dlt has already listed them, instead of having to figure out the names upfront in Python.

This overlaps with the previously discussed name mapping via the schema, which would shorten long column names for databases with identifier-length limits, such as Postgres.
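To make the request concrete, a schema fragment for such renaming might look like the sketch below. This is purely hypothetical: the `new_name` key, the table id `tbl1a2b3c`, and the field id `fldX9aB` are all illustrative and not part of any released dlt schema format.

```yaml
# Hypothetical YAML schema fragment for renaming (not a released dlt format).
tables:
  tbl1a2b3c:          # opaque source id, e.g. from Airtable
    new_name: orders
    columns:
      fldX9aB:        # opaque source field id
        new_name: customer_name
```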

dapomeranz commented 1 year ago

As a novice dlt user, this is a feature I was expecting, and this is how I thought it might be implemented. I would strongly consider adding this for table names as well. Many services use opaque unique identifiers for their constructs, while the human-readable name is a renamable property — Airtable, Google Sheets, etc.

I think there are other possible implementations; any of them would be good.

rudolfix commented 1 year ago

@dapomeranz @adrianbr we'll go for Python for now. Renaming directly in YAML requires full lineage data to be kept: in essence, dlt would need to generate a unique id for every named entity in the schema (tables and columns) based on the names in the source (i.e. API endpoints and field names in JSON). If we are able to map each single data item to an entity in the schema via a separate automatic id, the names are free to be changed. Until now they have functioned as ids.

We have plenty of tickets requesting the behavior above. This will be quite a big implementation step.

What I plan for now:

  1. Renaming of tables will be released on Monday (`resource.table_name = "xxx"`).
  2. Renaming of columns: what I'm missing is a really neat interface for this, e.g. `resource.rename_columns(list of mappings)`.

Do you think it makes sense to do (2) in Python? Any ideas for a better interface?
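One possible shape for such a Python-side interface is a plain rename function applied to each yielded data item. The sketch below is an assumption about how this could work, not a released dlt API; the Airtable-style field ids (`fldX9aB`, `fldQ7cD`) and the target names are invented for illustration.

```python
# Minimal sketch of per-item column renaming (hypothetical interface).
# Keys not present in the mapping pass through unchanged.
COLUMN_RENAMES = {
    "fldX9aB": "customer_name",  # illustrative opaque field id -> friendly name
    "fldQ7cD": "order_total",
}

def rename_columns(item: dict) -> dict:
    """Return a copy of a data item with its columns renamed."""
    return {COLUMN_RENAMES.get(key, key): value for key, value in item.items()}

row = {"fldX9aB": "Alice", "fldQ7cD": 42, "id": 1}
print(rename_columns(row))  # {'customer_name': 'Alice', 'order_total': 42, 'id': 1}
```

In dlt, a function like this could plausibly be attached to a resource as a per-item transform, so the rename happens before the schema ever sees the opaque ids.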

dapomeranz commented 1 year ago
  1. If it is eventually going to be possible in the schema, then maybe it makes sense to delay this effort and wait. I can't speak very well to priorities. I do think this is an important feature, but if it were easier to chain a dbt transformation immediately after running a pipeline, this feature wouldn't be as necessary.
  2. If we do want to implement it now, I imagine a pipeline parameter could just be a dictionary for table names. Then, while processing the pipeline: if the table name exists in the dictionary, use the value from the dictionary instead of the source name. It is a simple implementation but should be effective for the problem at hand.
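The dictionary lookup proposed in (2) can be sketched in a few lines. Everything here is hypothetical: the parameter name, the helper, and the table ids are invented to show the fall-through behavior, not to describe dlt internals.

```python
# Sketch of the proposed table-name mapping: names found in the dict are
# replaced, names not in the dict fall through unchanged.
TABLE_RENAMES = {
    "tbl1a2b3c": "customers",  # illustrative opaque source table ids
    "tbl9z8y7x": "orders",
}

def resolve_table_name(source_name: str) -> str:
    """Return the user-chosen name if one was configured, else the source name."""
    return TABLE_RENAMES.get(source_name, source_name)

print(resolve_table_name("tbl1a2b3c"))   # customers
print(resolve_table_name("tbl_other"))   # tbl_other (no mapping, passes through)
```

The fall-through default is what makes this safe: an incomplete mapping degrades to current behavior instead of breaking the pipeline.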