dbt-labs / dbt-athena

The athena adapter plugin for dbt (https://getdbt.com)

https://dbt-athena.github.io

Apache License 2.0


feat: Implement Python Models Using EMR Serverless #700

Closed: sankeerthnagapuri closed this 3 months ago

sankeerthnagapuri commented 3 months ago

Description

Athena Spark has certain limitations that can be addressed by using EMR Serverless applications.
Whitelisting specific VPCs/subnets for EMR Serverless is more secure than whitelisting all of AWS Athena.
Additionally, EMR Serverless exposes a comprehensive set of Spark configurations, offering greater flexibility and capability than Athena Spark.
Some models are more cost-effective on Athena Spark, while others are more cost-effective on EMR Serverless.
Adding both options to the adapter therefore lets developers pick the right submission method for each model.

Models used to test - Optional

Added functional tests and a sample to the README.

Checklist

[x] You followed the contributing section
[x] You kept your Pull Request small and focused on a single feature or bug fix.
[x] You added unit testing when necessary
[x] You added functional testing when necessary

sankeerthnagapuri commented 3 months ago

I removed the Flake8 config to be able to commit the code. Please let me know the best way to get around the Flake8 errors (I see some that are not related to this PR as well) so I can add the Flake8 config back. Thanks.

nicor88 commented 3 months ago

As the name of the repository says, this is the dbt-athena adapter, mostly built for Athena's Trino SQL capabilities, except for Python models, which run on Athena Spark.

I do believe that what you propose is really relevant, but I'm not in favor of incorporating these changes because:

We don't have a CI setup that allows us to properly test your changes. Sure, we can fix that, but we don't have enough AWS credits to run EMR Serverless in our CI.
We would be mixing Athena with EMR, which are two different technologies, and we would rather focus on Athena here.

In the future AWS might take over and might want to create an adapter that allows using the right engine for the right job. I can envision a dbt-aws adapter where the user specifies the AWS engine and the dialect to use in order to run the models against the right technology.

That said, I would like to ask the other maintainers @Jrmyy @jessedobbelaere @mattiamatrix @svdimchenko for their opinion before closing this PR.

jessedobbelaere commented 3 months ago

@nicor88 That was also my first thought. Adding EMR would bloat the dbt-athena adapter beyond its primary responsibility. For example, dbt-glue also runs Spark jobs, but on Glue (Interactive Sessions), while dbt-athena runs them with Spark workgroups on Athena. Spark has so many ways to run on AWS... EMR (Serverless) is also a lot more expensive, so the integration tests would indeed burn our credits too.

I saw that Redshift has a connector for Spark on EMR, but there has been no progress on it in the dbt-redshift adapter.

I'm not sure what the best path forward is: should @sankeerthnagapuri create a separate adapter exclusively for spark-emr?

nicor88 commented 3 months ago

Having a specific adapter for EMR / EMR Serverless might be the best option.

I would like to ask @iconara to chime in on this issue and give his point of view.

iconara commented 3 months ago

Hi all, I can only echo @nicor88's and @jessedobbelaere's comments. This is a neat idea, but this is not the right place for it. Could you reach out to me on Slack (I'm @tolv on https://getdbt.slack.com) and tell me more about how you use EMR and dbt today? We are always working on new features and support for more tools. I would especially be interested in knowing how you use Python models with dbt, and why you chose EMR-S over Glue.

nicor88 commented 3 months ago

Based on the comments above from @iconara and @jessedobbelaere, I'm closing this PR.