DataJunction / dj

A metrics platform.
http://datajunction.io
MIT License

Passing authorization parameters: Trino #1064

Open bazooka720 opened 1 week ago

bazooka720 commented 1 week ago

We use Trino (we interact with it via its REST API) and are trying to understand how we can pass parameters such as username/password (or other auth params), plus the other parameters Trino supports, when sending a query to execute. Below is a link to the examples, but it wasn't clear how this would work. (Trino through Ranger serves as our RBAC.) https://github.com/bazooka720/dj/tree/main/datajunction-query#executing-queries

samredai commented 1 week ago

Hi @bazooka720! The query service is a pluggable piece of the architecture (talked about a little in the docs here), meaning you can implement the query service however you'd like, including any sort of authorization. The core API then connects to it through a dependency-injected query service client that matches the expected spec (see here). So a typical enterprise setup today would use the pure OSS version of the core DataJunction service with an internal custom query service and a query service client that plugs into the core service.

I'd really like to see a robust OSS offering of a query service that's fully integrated to work with Trino that people can just configure, deploy, and use out of the box, but I don't think we're there yet. Let me know if you want to collaborate on something like that!

bazooka720 commented 1 week ago

Thanks @samredai. Just to confirm, you don't currently support Trino? (I saw Trino in the picture but wasn't sure.) Is Ibis a potential option for integrating with Trino?

samredai commented 1 week ago

To be completely accurate, the query service implementation that exists in the repo today does support SQLAlchemy and allows you to pass in any SQLAlchemy URI along with any additional parameters for the SQLAlchemy engine initialization. Therefore you could totally use the SQLAlchemy connector that comes with the Trino Python client and make that work. I've made a note to document this more clearly and add it to the repo.
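For anyone landing here later, this is roughly what that looks like using the SQLAlchemy dialect that ships with the Trino Python client. The host, catalog, and credentials are placeholders, and the exact connect_args you need will depend on your Trino setup:

```python
# Illustrative sketch: point a SQLAlchemy engine at Trino via the
# trino-python-client dialect. Host, catalog, and credentials are placeholders.
from sqlalchemy import create_engine, text
from trino.auth import BasicAuthentication

engine = create_engine(
    "trino://analyst@trino.example.com:443/hive/default",
    connect_args={
        "http_scheme": "https",
        "auth": BasicAuthentication("analyst", "secret"),
    },
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).fetchall())
```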

What I was really trying to get at is that there is still work to be done to make it a richer experience. For example, you mentioned you're using RBAC via Ranger, which implies you'll need user info to flow through to the queries submitted to the query service. We support plugging authentication into the main service, but the OSS query service does not have any support for various authorization schemes. The OSS query service assumes all queries will be submitted to Trino using a generic set of credentials provided when initializing the SQLAlchemy engine instance.

That's an interesting suggestion to use Ibis; I think I see where you're going there. We could then support SSO for Trino. It would still require some changes to the project to make that work, but I think they would mostly be small changes.
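As a rough sketch of the Ibis idea (connection details below are placeholders and the exact signature of Ibis's Trino backend may vary by version), each request could open a connection as the real end user so that Ranger authorizes the query for that user:

```python
# Hypothetical per-user connection through Ibis's Trino backend.
# All connection parameters are placeholders; verify the backend's
# signature against the Ibis version you deploy.
import ibis

con = ibis.trino.connect(
    user="analyst",
    password="secret",
    host="trino.example.com",
    port=443,
    database="hive",
    schema="default",
)

# Run SQL (e.g. SQL generated by DJ) and pull the result into a DataFrame.
df = con.sql("SELECT 1 AS ok").execute()
print(df)
```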

bazooka720 commented 1 week ago

I see now. I'm new to DJ and really love the approach (albeit I have so many questions and am still trying to understand how it all comes together). Regarding authorization, yes, we would ideally need a more dynamic approach, since our current interaction with Trino itself is via its REST API and every query sent is authorized. Is this feasible? One alternative for us might be to deploy via isolated serverless?

samredai commented 1 week ago

Isolated serverless would probably be slightly easier than implementing and maintaining your own query service, but it doesn't solve the main issue, which is that we don't include the user in the request to the query service (unless you intend to automatically inject that into the header). This should be a small change since we already have current_user in scope when we make that call. Then yes, presumably you could just instantiate a client for that particular user in the serverless function and the rest of the authorization flow would take effect!

It's also worth mentioning that you'd need a gateway that matches the spec for the query service API, but that's only three simple endpoints: submit a query, get a query by ID, and get column metadata for a specific table.
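Purely as a hypothetical skeleton of such a gateway (the paths, payload fields, and user header below are illustrative guesses, not the actual query service spec, so check the OSS query service for the real contract):

```python
# Sketch of a gateway exposing the three endpoints described above, with the
# requesting user read from a header so per-user authorization can flow to Trino.
# Endpoint paths, field names, and the header name are assumptions.
import uuid
from typing import Optional

from fastapi import FastAPI, Header
from pydantic import BaseModel

app = FastAPI()

class QuerySubmission(BaseModel):
    submitted_query: str  # assumed field name

@app.post("/queries/")
def submit_query(payload: QuerySubmission, x_dj_user: Optional[str] = Header(default=None)):
    # Open a Trino connection as x_dj_user and submit payload.submitted_query here.
    return {"id": str(uuid.uuid4()), "state": "RUNNING"}

@app.get("/queries/{query_id}/")
def get_query(query_id: str):
    # Look up the query's state and results, e.g. via Trino's REST API.
    return {"id": query_id, "state": "FINISHED", "results": []}

@app.get("/table/{table}/columns/")
def get_columns(table: str):
    # Return column metadata, e.g. from Trino's information_schema.
    return {"name": table, "columns": [{"name": "id", "type": "BIGINT"}]}
```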

bazooka720 commented 1 week ago

We'll try it and get back to you.

shangyian commented 1 week ago

Hi @bazooka720! If you're interested in implementing an internal query service API that matches the spec expected by the core DJ API, you can follow the spec for these two API endpoints in the OSS query service.

How you do either of those tasks is up to you, and the Trino authentication + querying piece can certainly be implemented with Ibis. In fact, if you're up for contributing to the OSS query service and want to use Ibis, that would be great. Unfortunately we don't have full support for Trino at the moment.

shangyian commented 1 week ago

Another thing I would add is that if you are running this in conjunction with the core DJ API, you should deploy the API with a SQL transpilation library so that the queries generated by DJ can always be transpiled correctly to the Trino dialect. This is done by running the server with this .env config:

SQL_TRANSPILATION_LIBRARY=sqlglot
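For illustration only (this is not DJ's internal code), the transpilation step amounts to something like the following with sqlglot, rewriting SQL from one dialect into Trino's:

```python
# Standalone example of what a sqlglot-based transpilation step does:
# parse SQL in one dialect and emit the equivalent Trino SQL.
import sqlglot

spark_sql = "SELECT DATE_ADD(order_date, 7) AS due_date FROM orders"
print(sqlglot.transpile(spark_sql, read="spark", write="trino")[0])
```
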
bazooka720 commented 1 week ago

Sounds good. Will have a team member try this very soon.

bazooka720 commented 1 week ago

One question I had (I posted it in the discussions section): how do you differ from cube.dev in terms of semantic layer capabilities? Can it connect (or is it planned) to BI platforms like Power BI, Tableau, etc.?

shangyian commented 1 week ago

The primary difference between our approach and cube.dev is that the latter is very cube-centric, so a lot of the dimensional modeling is done on a per-cube basis, and is hard to reuse outside of that cube (in a different cube). However, the approach DJ takes is that we store the entire dimensional graph of relationships between entities, so users should only need to model their data once, and it can be shareable across the business.

Another difference is that for cube.dev, the expectation is that any SQL transformations needed for data modeling have been done upstream. However, in our approach, we provide an optional SQL transformation layer (through the transform node), in case you need to do lightweight data transformations here. This can be useful for deeper lineage into metrics.

We do plan on implementing the ability to connect to other BI platforms. @betodealmeida previously worked on a connection to Superset, but we need some more work to revive that connection.

bazooka720 commented 1 week ago

Thanks. So regarding Trino: can Trino also be used for materialization through the Iceberg catalogs (Hive) that we have? (Trino Iceberg MVs through Hive have caching capability too.) Regarding materialization, what is managed automatically (on a schedule), i.e. which objects get materialized? Any chance we can have different refresh schedules for different cubes?

bazooka720 commented 1 week ago

For generating SQL for Trino, are there any changes we need to make (if we want DJ to generate the SQL and then execute it for a given input of metrics, filters, and time grains)?

shangyian commented 1 week ago

For generating SQL for Trino, the only thing you would need is to set up the SQL transpilation library as I mentioned in https://github.com/DataJunction/dj/issues/1064#issuecomment-2177744956

The core DJ API has support for configuring materialization, where you can set things on a schedule, and there are different materialization types (iceberg tables for transforms, iceberg + druid for cubes). However, you would again need to implement the appropriate endpoints on the query service in order for this to work.

bazooka720 commented 1 week ago

Right. We'd need to configure the endpoints to interact with the Trino REST API. We don't use Druid. What options exist as an alternative to Iceberg + Druid?

samredai commented 1 week ago

@bazooka720 if you're interested, I think it would be great if you could join the community sync that's held every other Wednesday at 9am PT. I've created a Google group to make it easier to find and get invited: https://groups.google.com/g/datajunction-community

bazooka720 commented 1 week ago

Absolutely, will join. Just one thing on this regarding materialized views and Trino: as of now, DJ doesn't support automated materialized views of all cube measures and source nodes, using Trino as the engine, into Iceberg-based tables?

shangyian commented 1 week ago

Nope, at the moment there's no automated materialized view creation. But it would be great to chat with you @bazooka720, since I think a lot of these things will be easier to discuss over a call.