elementary-data / elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
https://www.elementary-data.com/
Apache License 2.0

AWS Athena integration reengineering #1698

Open svdimchenko opened 2 weeks ago

svdimchenko commented 2 weeks ago

Is your feature request related to a problem? Please describe.

Currently I'm using AWS Athena as the query engine for my dbt transformations. The problem with integrating elementary is the following:

Describe the solution you'd like

I can offer several possible solutions to the issue:

  1. Implement partitioning for elementary tables and use the partition fields in the monitoring models. Unfortunately, we cannot use the created_at field with the Hive table format, so we would need to add a created_at_date field and use it for partition pruning.

  2. Add the ability to load dbt artifacts into a separate backend, for instance AWS RDS. Currently, elementary loads data from the dbt context, and there is no way to work with dbt's JSON files directly: run_results.json, manifest.json, etc. Here is DataHub's example of how JSON files can be ingested into an external database.
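To make option 2 more concrete, here is a minimal sketch of ingesting run_results.json into a relational table. It is only an illustration under stated assumptions: SQLite stands in for an external backend such as AWS RDS, the table name `dbt_run_results` and the function `load_run_results` are hypothetical, the sample payload is made up, and only a few of the fields dbt writes (`unique_id`, `status`, `execution_time`, plus `invocation_id` and `generated_at` from the metadata) are used. It is not how elementary or DataHub implement this.

```python
import json
import sqlite3

# Trimmed, made-up sample in the shape of dbt's run_results.json.
# A real file would be read from the dbt target directory instead.
SAMPLE_RUN_RESULTS = json.dumps({
    "metadata": {
        "invocation_id": "demo-invocation",
        "generated_at": "2024-01-01T00:00:00Z",
    },
    "results": [
        {"unique_id": "model.demo.orders", "status": "success", "execution_time": 1.5},
        {"unique_id": "model.demo.customers", "status": "error", "execution_time": 0.2},
    ],
})


def load_run_results(conn, raw_json):
    """Insert one row per node result into a hypothetical dbt_run_results table."""
    doc = json.loads(raw_json)
    meta = doc.get("metadata", {})
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dbt_run_results ("
        "invocation_id TEXT, generated_at TEXT, "
        "unique_id TEXT, status TEXT, execution_time REAL)"
    )
    rows = [
        (
            meta.get("invocation_id"),
            meta.get("generated_at"),
            result["unique_id"],
            result["status"],
            result.get("execution_time"),
        )
        for result in doc.get("results", [])
    ]
    conn.executemany("INSERT INTO dbt_run_results VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# SQLite in memory is an assumption used only to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
inserted = load_run_results(conn, SAMPLE_RUN_RESULTS)
failed = conn.execute(
    "SELECT unique_id FROM dbt_run_results WHERE status = 'error'"
).fetchall()
```

Because the loader only needs the JSON file, it could run as a step after `dbt run` without touching the Athena tables at all.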

Describe alternatives you've considered

As a quick workaround, I can keep the elementary tables in Hive format and set up an S3 bucket lifecycle policy to remove outdated elementary data. However, this approach requires careful S3 lifecycle tuning for each specific elementary table, which can be tricky.
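To illustrate why the per-table tuning gets tricky, here is a rough sketch that builds one lifecycle rule per table prefix. The prefixes, retention periods, and the helper `build_lifecycle_rules` are assumptions made up for illustration; the actual S3 locations depend on how the Athena schema and dbt profile are configured. The resulting dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`, but the sketch does not call AWS.

```python
def build_lifecycle_rules(retention_days_by_prefix):
    """Build an S3 lifecycle configuration with one expiration rule per prefix."""
    rules = []
    for prefix, days in sorted(retention_days_by_prefix.items()):
        rules.append({
            # Rule IDs are derived from the prefix so each table gets its own rule.
            "ID": "expire-" + prefix.strip("/").replace("/", "-"),
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Expiration": {"Days": days},
        })
    return {"Rules": rules}


# Hypothetical prefixes and retention periods; every table needs its own entry.
RETENTION = {
    "elementary/dbt_run_results/": 30,
    "elementary/elementary_test_results/": 90,
}

config = build_lifecycle_rules(RETENTION)
```

Each elementary table stored under its own prefix needs a matching entry, and the mapping has to be kept in sync whenever tables are added or relocated.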

Would you be willing to contribute this feature?

Once we clarify the most appropriate approach to the Athena integration, I can of course contribute.

ofek1weiss commented 1 day ago

Hey @svdimchenko, a workaround that might work for you is to move the artifact uploading into a separate job. This can be done as follows:

This ensures that the metadata is not all uploaded after every job (avoiding the parallel uploading) while still keeping it up to date.

Let me know if this helps 🙏