aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0

Unable to connect to a Glue catalog in a different AWS account from the current AWS account #174

Open sandeepmullangi2 opened 1 year ago

sandeepmullangi2 commented 1 year ago

Describe the bug

I am trying to run a dbt model from my current AWS account and I am getting `AnalysisException: Table or view not found: xxx.yyyy` because my source is in a different AWS account's Glue catalog. Although I have permissions to read the other account's catalog, I am still unable to read a table from that catalog.

Steps To Reproduce

My source looks like this:

version: 2

sources:
  - name: datalake
    schema: secret
    tables:
      # please maintain alphabetical order
      - name: secret1

My model looks like the following:

{{
  config(
    materialized="incremental",
    file_format='parquet',
    incremental_strategy='insert_overwrite',
    table_properties={'write.target-file-size-bytes': '268435456'}
  )
}}

with base as 
(
    select * from {{ source('datalake', 'secret1') }}
    where a=123
)

SELECT * 
FROM base 

My target connection looks as follows; I am explicitly passing the catalog ID to the Spark conf:

dev:
    type: glue
    role_arn: arn:aws:iam::xxxxxx:role/ccoeGlueEDADBTExecution
    region: us-east-2
    workers: 10
    glue_version: "3.0"
    worker_type: G.2X
    idle_timeout: 10
    schema: "{{ env_var('SCHEMA', 'dbt_wsb') }}"
    conf: --conf hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory 
             --conf  hive.metastore.glue.catalogid=xxxx --conf aws.region=us-east-2
    session_provisioning_timeout_in_seconds: 120
    location: "{{ env_var('LOCATION', 's3://dbt-wsb/models') }}"

When I run the dbt model, I get the following error:

AnalysisException: Table or view not found: `secret.secret1`; line 20 pos 18;
'CreateTable `smullangi_utils`.`datalake_model`, ErrorIfExists
+- 'Project [*]
   +- 'SubqueryAlias base
      +- 'Project [*]
         +- 'Filter ('advertisement_id = 123)
            +- 'UnresolvedRelation [secret.secret1], [], false

Expected behavior

The model should be created since we passed the catalog ID explicitly. We need to know the exact way to pass the catalog ID for the source so that dbt resolves it and creates the table.


System information

The output of dbt --version:

Core:
  - installed: 1.3.4
  - latest:    1.5.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - glue:  1.3.12 - Update available!
  - spark: 1.3.1  - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: macOS 13.3.1 (22E261)

The output of python --version: Python 3.9.7

Additional context

We have run a Glue job from the AWS console using Python, and it works from our account: we are able to read the catalog of a different AWS account from our current account. This tells us there is no problem with the provided access/policies; the problem occurs only when we run with the dbt-glue adapter locally. I followed all approaches mentioned here https://repost.aws/knowledge-center/query-glue-data-catalog-cross-account but it is still not working (I passed all the parameters in the conf of profiles.yml). Also, we are able to run the dbt model using the dbt-glue adapter when the source table is in the current catalog.
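Roughly, that Glue job is a sketch like the following (table and column names are placeholders), with the other account's catalog ID set through a Spark conf on the job:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# The job is submitted with
#   --conf spark.hadoop.hive.metastore.glue.catalogid=<source-account-id>
# so tables in that account's Glue catalog resolve by name in Spark SQL.
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

df = spark.sql("select * from secret.secret1 where a = 123")
df.show()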

sandeepmullangi2 commented 1 year ago

Bumping up for visibility

moomindani commented 1 year ago

To allow cross-account catalog access in Spark SQL, you need to provide hive.metastore.glue.catalogid in the Glue Interactive Session and configure the permission in the Glue catalog policy.

If you want to configure hive.metastore.glue.catalogid through the --conf parameter, it needs to be spark.hadoop.hive.metastore.glue.catalogid. The other configs like hive.metastore.client.factory.class and aws.region are not required, but the region needs to be specified if the catalog is in another region.
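Applied to the profile above, the conf entry would look roughly like this (the account ID is a placeholder):

    conf: --conf spark.hadoop.hive.metastore.glue.catalogid=XXXXXXXXXXX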

I have tested --conf spark.hadoop.hive.metastore.glue.catalogid=XXXXXXXXXXX in both a job and a notebook, and both succeeded in reading from the cross-account catalog.

AmineIzanami commented 1 year ago

@sandeepmullangi2 as mentioned by @moomindani, you need to use --conf spark.hadoop.hive.metastore.glue.catalogid=XXXXXXXXXXX in your dbt profile. That way you can have multiple outputs, one for each of the accounts that you have access to, for example as sketched below. Is that fixing your issue?
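A rough sketch of profiles.yml with one output per account (account IDs, role names, and bucket names are placeholders, not taken from your setup):

outputs:
  account_a:
    type: glue
    role_arn: arn:aws:iam::111111111111:role/dbt-glue-execution
    region: us-east-2
    workers: 10
    worker_type: G.2X
    schema: dbt_wsb
    conf: --conf spark.hadoop.hive.metastore.glue.catalogid=111111111111
    location: s3://dbt-bucket-a/models
  account_b:
    type: glue
    role_arn: arn:aws:iam::222222222222:role/dbt-glue-execution
    region: us-east-2
    workers: 10
    worker_type: G.2X
    schema: dbt_wsb
    conf: --conf spark.hadoop.hive.metastore.glue.catalogid=222222222222
    location: s3://dbt-bucket-b/models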

sandeepmullangi2 commented 1 year ago

@moomindani @AmineIzanami This did not fix our issue; maybe our use case is a bit different. Our use case is to read data from a source catalog and write it to a destination catalog. When I used spark.hadoop.hive.metastore.glue.catalogid=XXXXXXXXXXX, I was able to read the table from the source, but not to write to the destination catalog: the write went back to the source catalog, and I got a permission error since I don't have write access to the source catalog. So this adapter does not allow source and target catalogs to be passed separately. Since this was causing issues, we started using the dbt-athena adapter.

AmineIzanami commented 1 year ago

@sandeepmullangi2 Indeed, the fix that we mentioned is cross-account but single-tenant, which means the dbt session is per account, so both reading and writing are local to that AWS account. In your case, have you thought of creating resource links in your AWS account with Lake Formation so that the target Glue tables are local to your source AWS account? A rough sketch of creating a resource link is below.
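For reference, a resource link can be created roughly like this with boto3 (the database names and account ID are placeholders):

import boto3

glue = boto3.client("glue", region_name="us-east-2")

# Create a resource link named "datalake_link" in the local Glue catalog that
# points to the "secret" database in the other account's catalog.
glue.create_database(
    DatabaseInput={
        "Name": "datalake_link",
        "TargetDatabase": {
            "CatalogId": "111111111111",  # placeholder account ID
            "DatabaseName": "secret",
        },
    }
)

Lake Formation permissions still need to be granted on both the resource link and the target database for the role that runs the dbt session.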