GH-3035: ParquetRewriter: Add a column renaming feature

MaxNevermind commented 1 month ago

Rationale for this change

This feature extension is based on a real use-case when a input parquet dataset need to transformed to a new one using a set of basic schema transformations. ParquetRewriter already supports some of transformations: pruning, masking, encrypting, changing a codec. This PR add one of missing transformations - renaming.

What changes are included in this PR?

Add a new option renameColumns to RewriteOptions class which is options builder for ParquetRewriter
Add support of to renameColumns to ParquetRewriter

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

MaxNevermind commented 1 month ago

@wgtmac Can you check this out? I need your opinion - If you see any problems with such a feature conceptually. Then I can polish the implementation. Right now there is a single test for a new feature and the tests for more complex cases with a mix of options are missing. Will work on it after you green light the feature.

MaxNevermind commented 4 weeks ago

What is your use case?

We have a large base dataset which is split into multiple versions at the end of the flow. Those final datasets have majority of columns overlapping but some columns are dropped & some columns are renamed. Similar thing can be achieved by multiple HMS views on the top of base HMS table but it is not always possible, for example if different users are supposed to have access just to a single version of a dataset, it is hard to achieve that with HMS views without giving access to users to that underlying base table.

Do you need reordering fields in the future?

In our case we don't care about column ordering.

How does renaming work with other features, including join, mask, and encrypt?

mask - it is not supported as it seems meaningless to rename a column that is dropped, I want to throw exception if that is detected, not entirely sure about that though, does it make sense to allow nullification + renaming? 🤷
join - I planned to support it, so the column could be joined and then renamed if needed encrypt - I planned to support it, so the column can be encrypted and renamed if needed

wgtmac commented 4 weeks ago

Thanks for the explanation! I asked for column ordering because renaming and reordering may happen together in the case of schema evolution. We can support them in the future when needed.

does it make sense to allow nullification + renaming

I am also skeptical of this case. But it seems to be valid if one wants to nullify a renamed column which has sensitive data?

MaxNevermind commented 3 weeks ago

I asked for column ordering because renaming and reordering may happen together in the case of schema evolution. We can support them in the future when needed.

In case of column reordering there is a need to provide a new order of column, correct? If I would be implementing it I guess I would create a RewriteOptions's option like Map<SrcColName, Map.Entry<DistColName, DestColIdx>> columnsReordered which would have merged capabilities of both renaming and reordering. I think current implementation can be extended to something like that in the future, for example if columnsReordered is used we should prohibit renameColumns option usage, wdyt?

I am also skeptical of this case. But it seems to be valid if one wants to nullify a renamed column which has sensitive data?

I've just checked MaskMode enum, it has planned hashing feature as I understand. Renaming hashed column would definitely should be beneficial to support. Okay, I think we should try to support renaming of masked columns.

MaxNevermind commented 3 weeks ago

@wgtmac In general do you think this feature looks worth adding, no hard no-go traits in this feature? If yes, I will work on extending functionality to those overlapping cases with masking and will start polishing it.

wgtmac commented 3 weeks ago

Yes, I think this is a fair use case, provided that the code logic does not change a lot.

MaxNevermind commented 2 weeks ago

@wgtmac I think I'm done with implementation, can you take a look?

apache / parquet-java