Closed MaxNevermind closed 1 week ago
@wgtmac Can you check this out? I need your opinion on whether you see any conceptual problems with such a feature. If not, I can polish the implementation. Right now there is a single test for the new feature, and tests for more complex cases with a mix of options are missing. I will work on them once you green-light the feature.
What is your use case?
We have a large base dataset which is split into multiple versions at the end of the flow. Most columns overlap across those final datasets, but some columns are dropped and some are renamed. A similar result can be achieved with multiple HMS views on top of a base HMS table, but that is not always possible. For example, if different users are supposed to have access to only a single version of the dataset, that is hard to achieve with HMS views without giving those users access to the underlying base table.
Do you need reordering fields in the future?
In our case we don't care about column ordering.
How does renaming work with other features, including join, mask, and encrypt?
mask - it is not supported, as it seems meaningless to rename a column that is dropped. I want to throw an exception if that is detected, though I'm not entirely sure about that: does it make sense to allow nullification + renaming? 🤷
join - I planned to support it, so the column could be joined and then renamed if needed
encrypt - I planned to support it, so the column can be encrypted and renamed if needed
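The fail-fast behaviour considered for the mask case can be sketched as below. This is only an illustration under assumptions: `RenameMaskCheck` and its `validate` method are hypothetical names, not part of parquet-java, and the set of nullified columns is assumed to be available as plain column names.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not parquet-java API): rejects a rename whose source
// column is also scheduled to be nullified by masking.
final class RenameMaskCheck {
    static void validate(Map<String, String> renameColumns, Set<String> nullifiedColumns) {
        for (String source : renameColumns.keySet()) {
            if (nullifiedColumns.contains(source)) {
                throw new IllegalArgumentException(
                        "Cannot rename a column that is nullified: " + source);
            }
        }
    }
}
```

Calling `RenameMaskCheck.validate(Map.of("a", "b"), Set.of("a"))` would throw, while a rename of a column that is not nullified passes silently.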
Thanks for the explanation! I asked for column ordering because renaming and reordering may happen together in the case of schema evolution. We can support them in the future when needed.
does it make sense to allow nullification + renaming
I am also skeptical of this case. But it seems to be valid if one wants to nullify a renamed column which has sensitive data?
I asked for column ordering because renaming and reordering may happen together in the case of schema evolution. We can support them in the future when needed.
In the case of column reordering, a new column order needs to be provided, correct? If I were implementing it, I guess I would add a RewriteOptions option like Map<SrcColName, Map.Entry<DistColName, DestColIdx>> columnsReordered, which would merge the capabilities of both renaming and reordering. I think the current implementation can be extended to something like that in the future. For example, if columnsReordered is used, we should prohibit usage of the renameColumns option, wdyt?
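To make the proposed shape concrete, here is a rough sketch of how such a combined rename-and-reorder map could be applied to a list of column names. Everything here is an assumption for illustration: `ReorderSketch` and `apply` are hypothetical, the type aliases from the comment are replaced with `String` and `Integer`, and columns missing from the map are assumed to keep their name and fill the remaining positions in their original order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not parquet-java API): source column name ->
// (destination column name, destination index).
final class ReorderSketch {
    static List<String> apply(List<String> columns,
                              Map<String, Map.Entry<String, Integer>> columnsReordered) {
        String[] result = new String[columns.size()];
        List<String> unmapped = new ArrayList<>();
        for (String column : columns) {
            Map.Entry<String, Integer> target = columnsReordered.get(column);
            if (target == null) {
                unmapped.add(column); // placed later, keeps its original name
                continue;
            }
            int idx = target.getValue();
            if (idx < 0 || idx >= result.length || result[idx] != null) {
                throw new IllegalArgumentException(
                        "Invalid or duplicate destination index for " + column);
            }
            result[idx] = target.getKey();
        }
        // Fill the remaining free slots with unmapped columns in original order.
        int next = 0;
        for (String column : unmapped) {
            while (result[next] != null) {
                next++;
            }
            result[next] = column;
        }
        return Arrays.asList(result);
    }
}
```

With columns `a, b, c` and a single entry mapping `c` to `("z", 0)`, this sketch yields `z, a, b`.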
I am also skeptical of this case. But it seems to be valid if one wants to nullify a renamed column which has sensitive data?
I've just checked the MaskMode enum; as I understand it, a hashing feature is planned. Renaming a hashed column would definitely be beneficial to support. Okay, I think we should try to support renaming of masked columns.
@wgtmac In general, do you think this feature is worth adding, with no hard no-go traits? If yes, I will work on extending the functionality to the overlapping cases with masking and will start polishing it.
Yes, I think this is a fair use case, provided that the code logic does not change a lot.
@wgtmac I think I'm done with the implementation, can you take a look?
Rationale for this change
This feature extension is based on a real use case where an input Parquet dataset needs to be transformed into a new one using a set of basic schema transformations. ParquetRewriter already supports some transformations: pruning, masking, encrypting, changing a codec. This PR adds one of the missing transformations: renaming.
What changes are included in this PR?
Added renameColumns to the RewriteOptions class, which is the options builder for ParquetRewriter
Added renameColumns to ParquetRewriter
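As a rough illustration of what a rename map does at the schema level, the sketch below applies a source-to-destination name map to a list of column names and rejects collisions. It is not the code from this PR: `RenameSketch` and `rename` are hypothetical names, columns are modelled as plain strings, and the collision check is an assumption about reasonable behaviour.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch (not the PR's implementation): renames columns by name
// and fails if two output columns would end up with the same name.
final class RenameSketch {
    static List<String> rename(List<String> columns, Map<String, String> renameColumns) {
        List<String> renamed = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String column : columns) {
            // Columns absent from the map keep their original name.
            String name = renameColumns.getOrDefault(column, column);
            if (!seen.add(name)) {
                throw new IllegalArgumentException("Duplicate column after rename: " + name);
            }
            renamed.add(name);
        }
        return renamed;
    }
}
```

For example, `RenameSketch.rename(List.of("id", "name", "ts"), Map.of("name", "full_name"))` returns `id, full_name, ts`, while renaming `name` to `id` in the same schema throws.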
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.