dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.74k stars 1.61k forks

Impala support #1676

Closed boristyukin closed 4 years ago

boristyukin commented 5 years ago

Support Apache Impala. Apache Impala is a widely distributed engine, used by thousands of enterprises around the world.

boristyukin commented 5 years ago

How difficult would it be to implement? I know enough Python to be dangerous, but I am not a daily Python developer; I only use it occasionally.

drewbanin commented 5 years ago

Hey @boristyukin - have you seen the docs on building a new adapter? If you're looking for inspiration, you can find some similar adapter plugins here:

There are broadly two things you need to do when building a new adapter:

I definitely recommend checking out the links above - they should give you a good feel for what's involved here! Please don't hesitate to let me know if you have any questions :)

boristyukin commented 5 years ago

thanks @drewbanin! looking now...

ghaskell44 commented 5 years ago

Hi, @drewbanin I work with @boristyukin and have been looking at dbt for a bit with respect to perhaps implementing an adapter for Impala. I've reviewed the doc and looked at some of the source and have a fundamental question. Given the current design and implementation of dbt, is schema a required attribute for dbt supported databases? (e.g. "schema.table_name")

The folks who designed Impala opted not to implement it this way, and instead only support database.table_name. There's no notion of schema in the sense that you might have with postgres, for instance.

Before I head down a rabbit hole, I thought I'd check with you to see if this would be possible. I tried hacking up some macros for Impala that basically ignored schema in favor of relation.identifier, and while a basic "select xxxx from yyyy" type model works fine, anything more advanced than that starts throwing errors because a schema is expected.

Any thoughts?

Thanks!

drewbanin commented 5 years ago

Hey @ghaskell44! That's really cool, I think Impala is a great target for a dbt database plugin. dbt does generally assume that the databases it works with will have a notion of a database + schema, but I think there's a way to work around that.

We built a plugin for SparkSQL which also does not have a proper notion of "schemas". On Spark, schema is just an alias for database.

We worked around this by making both the database and schema properties required in the Credentials contract, but using some clever logic to use the supplied schema value as the database (if a database config was not provided). The solution on Impala might look a little different, but you should just be able to supply a phony value for the schema I think.
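A minimal sketch of that fallback, for illustration only. The class name ImpalaLikeCredentials is hypothetical and this is a plain dataclass, not dbt's actual Credentials contract; it only shows the idea of reusing the supplied schema as the database when no database is configured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImpalaLikeCredentials:
    """Hypothetical stand-in for an adapter's credentials class.

    This is NOT dbt's real Credentials contract; it only illustrates
    the "use schema as database" fallback described above.
    """

    schema: str
    database: Optional[str] = None

    def __post_init__(self) -> None:
        # If the profile did not supply a database, reuse the schema value
        # so that both properties are populated.
        if self.database is None:
            self.database = self.schema


# A profile that only sets `schema` ends up with a matching `database`.
creds = ImpalaLikeCredentials(schema="analytics")
print(creds.database, creds.schema)
```

An explicitly supplied database is left untouched in this sketch; a real plugin might instead reject a database that differs from the schema.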

You may also want to set the include_policy for the schema to False. This should cause dbt to render out Relations with <database>.<identifier> instead of <database>.<schema>.<identifier>.
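To make the effect of the include policy concrete, here is a rough sketch. IncludePolicy and Relation below are simplified, hypothetical stand-ins rather than dbt's own classes; the assumption is only that a relation skips any part whose policy flag is False when it renders its name.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class IncludePolicy:
    """Hypothetical flags controlling which name parts are rendered."""

    database: bool = True
    schema: bool = True
    identifier: bool = True


@dataclass(frozen=True)
class Relation:
    """Simplified stand-in for a relation; not dbt's real Relation class."""

    database: Optional[str]
    schema: Optional[str]
    identifier: str
    include_policy: IncludePolicy = field(default_factory=IncludePolicy)

    def render(self) -> str:
        # Pair each name part with its policy flag, then keep only the
        # parts that are both enabled and non-empty.
        parts = [
            (self.include_policy.database, self.database),
            (self.include_policy.schema, self.schema),
            (self.include_policy.identifier, self.identifier),
        ]
        return ".".join(value for include, value in parts if include and value)


# With schema excluded, the phony schema value never reaches the SQL.
relation = Relation("sales_db", "phony", "orders", IncludePolicy(schema=False))
print(relation.render())
```

Setting database=False instead would render <schema>.<identifier>, which is the other way to collapse the two levels into one.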

In general, feel free to peruse the Spark plugin and let me know if you have any questions! I think it should account for many of the implementation challenges that you'll see on Impala.

ghaskell44 commented 5 years ago

Thanks, @drewbanin! That worked great. I went ahead and set database to False in the include_policy just like Spark since Impala treats database and schema the same, then just used schema in the macros. I think the main thing I was missing was the include_policy but I also added the "clever logic" bit and removed dbname from the profile. My simple models that were failing before are now working correctly.

Thanks again for the pointers. If I get something working that looks full-featured, I'll put it up on GitHub.

drewbanin commented 4 years ago

closing this one - out of scope for core

@ghaskell44 were you able to get something working? If so, would love to link out to it in the documentation!

boristyukin commented 4 years ago

we had some roadblocks unfortunately and went a different route with a different tool that already supports Impala. sorry

ynouri commented 3 years ago

Hi @boristyukin , Impala user here too, and interested in adopting dbt. May I ask what tool you ended up going for?

boristyukin commented 3 years ago

hey @ynouri, we ended up building a custom processor in NiFi that can pick up SELECT statements from files and persist them into tables. It worked quite well for our needs. Unfortunately, we could not make Impala work with dbt.

antoniivanov commented 3 years ago

Hi @boristyukin, yet another Impala user here. We are considering writing an Impala adapter for dbt, as we are looking to adopt dbt. I was wondering if you recall what challenges your team hit with it. Thanks!

boristyukin commented 3 years ago

sorry @tozka, I do not remember exactly what challenges we had, but after spending a week or so we gave up, because we also had NiFi and we built something custom that worked great for us. I should say we are not daily Python developers, so you might have better luck. We built custom NiFi processors that would pick up queries defined as SELECT statements in files and persist them into tables. Obviously it was not as feature-rich as dbt, but it got the job done :) and we added some Impala-specific steps like an optional rebuild of stats, fast switching of production tables using LOAD INLINE, etc.

tovganesh commented 2 years ago

Hi, we now have a working version of the dbt-impala adapter at: https://github.com/cloudera/dbt-impala Please try it out and let us know your feedback.