apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

AWS: provide option to hide old fields in Glue table #7584

Open LucasRoesler opened 1 year ago

LucasRoesler commented 1 year ago

Feature Request / Improvement

In https://github.com/apache/iceberg/pull/3888 the Glue schema generation was adjusted so that all old fields are included in the schema. The original reasoning was:

so that people know what were the columns that were already used in the past and avoid adding the same name column.

In my organization, many users query these tables via Athena and are not the data engineers who own the schema. They have no idea about the old schema, they are not editing the schema, and their default use case is querying the current data. They report it as confusing that the schema shows a field that does not exist and that produces errors if they attempt to use it.

Neither Athena nor Glue seems to have any support for displaying these old fields as non-active or deprecated, or for hiding them. Therefore, it would be nice to have a configuration option to disable including non-current fields in the schema.

Query engine

Athena

dertodestod commented 1 year ago

I also don't quite understand the current behavior in Athena/Glue when a column is dropped. I can see that a new schema is created in the metadata file without the column, and that in Glue the column moves to the end of the table and gets an "iceberg.field.current": "false" setting. However, the column still shows up for consumers in the Athena web console (but not when doing a DESCRIBE of the table), so this has led to some confusion in our business.

I couldn't check if the column appears via JDBC (because of some errors) but I guess the column won't be listed because I see in Athena that a DESCRIBE query is used to retrieve that information. Can someone confirm that?

I personally think that Athena should not show the deleted columns (neither in the web console nor via JDBC). Is there perhaps a way to keep track of the dropped column(s) without showing them in Athena? If not, it would be great if one could be created.
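
For consumers that read the Glue schema programmatically, one possible client-side workaround is to filter on the "iceberg.field.current" parameter mentioned above. The following is only a minimal sketch, not something Iceberg or Athena provides: the helper name `current_columns` and the example table are hypothetical, and it assumes the column list has the shape of the "Table" entry in a Glue GetTable response, with the parameter stored as the string "true" or "false".

```python
from typing import Any


def current_columns(table: dict[str, Any]) -> list[dict[str, Any]]:
    """Return only the columns that are not flagged as non-current.

    `table` is assumed to look like the "Table" entry of a Glue GetTable
    response. Columns that carry no "iceberg.field.current" parameter are
    treated as current.
    """
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    return [
        col
        for col in columns
        if (col.get("Parameters") or {})
        .get("iceberg.field.current", "true")
        .lower()
        != "false"
    ]


# Hypothetical table for illustration; names and types are made up.
example_table = {
    "Name": "events",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "id", "Type": "bigint",
             "Parameters": {"iceberg.field.current": "true"}},
            {"Name": "payload", "Type": "string",
             "Parameters": {"iceberg.field.current": "true"}},
            # A dropped column, as Glue would still list it.
            {"Name": "legacy_flag", "Type": "boolean",
             "Parameters": {"iceberg.field.current": "false"}},
        ]
    },
}

print([col["Name"] for col in current_columns(example_table)])
# prints ['id', 'payload']
```

With boto3, the input could come from `boto3.client("glue").get_table(DatabaseName=..., Name=...)["Table"]`. This only helps tools that build their own column lists; it does not change what the Athena web console displays.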

wojciechjak commented 1 year ago

Also the same issue when renaming columns.

pdehaansbp commented 1 year ago

Same issue. Curious to read what @jackye1995 and @yyanyy think about it.

tcassou commented 8 months ago

Hello! Our organization is facing the same problem. In particular, the Glue API returns columns that cannot be resolved in the source data, causing queries to fail. We've been using dynamically created Presto views, and they break every time a column is dropped.

Technically, schema versioning is meant to solve this challenge:

so that people know what were the columns that were already used in the past and avoid adding the same name column.

The latest schema of a table should be aligned with the data, and previous versions will keep track of historical modifications. Could we think of publishing new schema versions in Glue instead of this workaround that introduced bugs/defects? Or at the very least making this newly introduced behavior optional?

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

tcassou commented 2 weeks ago

Hi there! This is still an issue, and the only workaround we found is to build a custom Iceberg jar without the faulty commit, which is not really sustainable of course. Any chance this could get prioritized, or even just acknowledged to start with?

Thanks a lot!