dbt-labs / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)
https://dbt-athena.github.io
Apache License 2.0
225 stars 97 forks source link

Attempts to "clean up" s3 in case of Iceberg tables, which is redundant #285

Closed brabster closed 1 year ago

brabster commented 1 year ago

More new behaviour since 1.3.5

For CTAS dbt-athena attempts to delete objects at location in the case of iceberg table. It did not do that before, so this is another breaking change thus requiring new S3 permissions (listing and deleting) if you're using iceberg (which I am, and I don't have the necessary permissions).

This activity seems to be redundant given AWS docs

Because Iceberg tables are considered managed tables in Athena, dropping an Iceberg table also removes all the data in the table.

Problematic line appears to be here. I'll PR a fix to remove this call in case of iceberg.

nicor88 commented 1 year ago

@brabster why would you want to remove this Can't you get the right permissions in place?

nicor88 commented 1 year ago

Because Iceberg tables are considered managed tables in Athena, dropping an Iceberg table also removes all the data in the table.

the idea is that we DO not drop the table via drop, but via glue api plus s3 location. The issue is that drop sometimes do not clean the path correctly when you have many objects. We had already few iteration on this so far, and I won't recommend to change this behavior ;)

brabster commented 1 year ago

(I do appreciate some of the challenges with OSS maintenance, I've done a bit of that myself. I'll take a few minutes to try and explain my context and why I'm having problems, but I think it's really more a question for how dbt-athena decides what and how to implement, and how breaking changes are defined and handled. I'll try and find a bit more time to drop my broader thoughts into the governance for the maintainers)

Can't you get the right permissions in place?

I think this might be the crux of the problem - dbt-athena makes the assumption that any user will have access not just to Athena, but also fairly comprehensive access to Glue and S3. I'm sure that's the case for many or even most teams, but it is possible to use Athena in a much more restricted mode, using the CalledVia key to permit access to S3 only when called via Athena, as documented by AWS here.

The idea of restricting IAM permissions as far as possible is indicated by the presence of that documentation specifically in the context of Athena, and the general AWS "best practices" guidance for IAM issued here, specifically:

When you set permissions with IAM policies, grant only the permissions required to perform a task.

I am a consultant working with highly sensitive data for a client, and aside from is being a best practice, I was explicitly asked to minimise the permissions required for teams - with a specific callout for no direct access to S3. Up to dbt-athena 1.3.5, this was not a problem. Almost every release since, the set of required permissions has been extended (I'm sure you're aware of the battery of issues I've raised this week - this is as a result of me trying to get update our dependency on dbt-athena to latest and every time I fix one thing, I just surface the next broken thing.

To reiterate - everything works fine on 1.3.5. These are all breaking changes, if your consumer is operating in a least privilege environment.

Why don't I just update the permissions?

The issue is that drop sometimes do not clean the path correctly when you have many objects. We had already

Have you raised this issue with AWS? If the advertised functionality doesn't work as expected, I'm sure AWS would appreciate getting that info and whatever evidence you have of the problem (as would everyone who uses Athena Iceberg and expects it to work properly like me!)

nicor88 commented 1 year ago

Thanks for your detailed answer.

When we build and maintain new features for the adapter we don't have all the usage clear in mind, therefore giving such context in each issue helps us to understand what lead to it. Often we give our first answer based on common patterns/issues raised in the past - see this case: I assumed that you could simply add a missing IAM action.

Your setup is a rather special one, thanks for making us aware. I apply the "at least privilege rule" in all of my projects, but as often I have admin rights, I can add more policies, this is not the case for you - and for sure other users.

We are aware of changes in the IAM policies, and often we prefer to let the final user add new policies in order to fix a specific issue, generally, this approach had lead us to make most of the users happy, and I'm aware that we won't be able to to make everybody happy.

Have you raised this issue with AWS? If the advertised functionality doesn't work as expected, I'm sure AWS would appreciate getting that info and whatever evidence you have of the problem (as would everyone who uses Athena Iceberg and expects it to work properly like me!)

This implementation is a fruit of a suggestion made by AWS support. I guess that you are aware of the long fixing time of some bugs, hence the proposed fix.

Also, I would like to remind you that this project is a community-based effort, we do maintain in free time and compromise with work/private life. Said so, when we do a change we put "maintainability first", which means we avoid "backward compatibility changes", and convoluted code, as simply we don't have enough resources (time/people) to do so - happy to add you as code maintainer if you are interested.

In this specific case, I believe that we could consider an extra global/model specific flag native_iceberg_drop, if set to True the native athena drop statement will be executed and cleanup of s3 will happen again using athena iceberg capability - by default such flag will be set to False.

WDYT?

cc @Jrmyy

brabster commented 1 year ago

Thanks @nicor88 for taking the time, that sounds like a perfectly good solution - I'll be happy to submit a PR to make that change as per your spec. I am also mid-writing a bit of a timeline and suggestions in the Governance ticket as there's a bit of a multi-issue story - will finish that up and then do that PR :bow:

brabster commented 1 year ago

@nicor88 I have started work on a PR, but I'm wondering whether a better way of approaching the whole problem I've been trying to explain would be to have a single flag like "sql_only" that switches behaviour to avoid other API calls instead of a potentially several flags that toggle bits to be done in SQL rather than API calls.

I think that could be understood as a feature to support the restricted access case which should help avoid confusion. That flag should also be easier to understand and use as well as easier to implement than multiple small independent fixes (although support can be implemented in small PRs not one mega-PR) - I get the feeling the piecemeal approach will end up being more difficult to maintain?

Jrmyy commented 1 year ago

Hello

I overall agree with the discussion and I feel indeed that we can both support an "enhanced" version of the adapter, leveraging S3, Glue and Athena in order to have the maximum capacities of the adapter.

I also think that having a sql_only or athena_only could be a better way to handle things.

This config could be in the profiles file.

I also warn that this might be a noticeable amount of work because we would have to check everything to be sure they are compliant with both approaches. My 2 cents is really to have a shared discussion of what we want to achieve and what complexity it will bring to support several ways to do things. I don't want to be opinionated or impose anything but on the other hand I don't want to over-complexify the adapter if we don't have an audience for that.

What do you think about having an open discussion about it on #db-athena in DBT Slack ? Maybe it will help having more insights.

brabster commented 1 year ago

@Jrmyy I didn't know there was a Slack channel! Could pop that in the README. Would love to have more of a dialogue about this, it's not a trivial thing I'll see you over on Slack :+1: