aws-samples / dbt-glue

This repository contains the dbt-glue adapter
Apache License 2.0

Error running dbt run --profiles-dir profile #63

Closed ashwaninbs closed 1 year ago

ashwaninbs commented 1 year ago

I am trying to follow this article https://aws.amazon.com/blogs/big-data/build-your-data-pipeline-in-your-aws-modern-data-platform-using-aws-lake-formation-aws-glue-and-dbt-core/

I ran dbt debug --profiles-dir profile successfully.

However, when I ran dbt run --profiles-dir profile, I got the following error. Can someone point me to the fix, please?

Concurrency: 1 threads (target='dev')
14:43:18  
14:43:18  1 of 4 START table model dbt_nyc_metrics.silver_nyctaxi_avg_metrics ............ [RUN]
14:43:22  Glue adapter: Glue returned `error` for statement None for code SqlWrapper2.execute('''create table dbt_nyc_metrics.silver_nyctaxi_avg_metrics__dbt_tmp

    LOCATION 's3://aws-dbt-glue-datalake-xxxxxxxxxxxxxx-us-east-1/dbt_nyc_metrics/silver_nyctaxi_avg_metrics__dbt_tmp/'

    as
      WITH source_avg as ( 
    SELECT avg((CAST(dropoff_datetime as LONG) - CAST(pickup_datetime as LONG))/60) as avg_duration 
    , avg(passenger_count) as avg_passenger_count 
    , avg(trip_distance) as avg_trip_distance 
    , avg(total_amount) as avg_total_amount
    , year
    , month 
    , type
    FROM nyctaxi.records 
    WHERE year = "2016"
    AND dropoff_datetime is not null 
    GROUP BY 5, 6, 7
) 
SELECT *
FROM source_avg'''), AnalysisException: Table or view not found: nyctaxi.records; line 16 pos 9;
'CreateTable `dbt_nyc_metrics`.`silver_nyctaxi_avg_metrics__dbt_tmp`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists
+- 'Project [*]
   +- 'SubqueryAlias source_avg
      +- 'Aggregate [unresolvedordinal(5), unresolvedordinal(6), unresolvedordinal(7)], ['avg(((cast('dropoff_datetime as bigint) - cast('pickup_datetime as bigint)) / 60)) AS avg_duration#0, 'avg('passenger_count) AS avg_passenger_count#1, 'avg('trip_distance) AS avg_trip_distance#2, 'avg('total_amount) AS avg_total_amount#3, 'year, 'month, 'type]
         +- 'Filter (('year = 2016) AND isnotnull('dropoff_datetime))
            +- 'UnresolvedRelation [nyctaxi, records], [], false

14:43:22  1 of 4 ERROR creating table model dbt_nyc_metrics.silver_nyctaxi_avg_metrics ... [ERROR in 3.74s]
14:43:22  2 of 4 SKIP relation dbt_nyc_metrics.gold_nyctaxi_cost_metrics ................. [SKIP]
14:43:22  3 of 4 SKIP relation dbt_nyc_metrics.gold_nyctaxi_distance_metrics ............. [SKIP]
14:43:22  4 of 4 SKIP relation dbt_nyc_metrics.gold_nyctaxi_passengers_metrics ........... [SKIP]
14:43:23  
14:43:23  Finished running 4 table models in 66.80s.
14:43:24  
14:43:24  Completed with 1 error and 0 warnings:
14:43:24  
14:43:24  Database Error in model silver_nyctaxi_avg_metrics (models/silver_metrics/silver_nyctaxi_avg_metrics.sql)
14:43:24    Glue cursor returned `error` for statement None for code SqlWrapper2.execute('''create table dbt_nyc_metrics.silver_nyctaxi_avg_metrics__dbt_tmp
14:43:24        
14:43:24        
14:43:24        
14:43:24        LOCATION 's3://aws-dbt-glue-datalake-xxxxxxxxxxxx-us-east-1/dbt_nyc_metrics/silver_nyctaxi_avg_metrics__dbt_tmp/'
14:43:24        
14:43:24        as
14:43:24          WITH source_avg as ( 
14:43:24        SELECT avg((CAST(dropoff_datetime as LONG) - CAST(pickup_datetime as LONG))/60) as avg_duration 
14:43:24        , avg(passenger_count) as avg_passenger_count 
14:43:24        , avg(trip_distance) as avg_trip_distance 
14:43:24        , avg(total_amount) as avg_total_amount
14:43:24        , year
14:43:24        , month 
14:43:24        , type
14:43:24        FROM nyctaxi.records 
14:43:24        WHERE year = "2016"
14:43:24        AND dropoff_datetime is not null 
14:43:24        GROUP BY 5, 6, 7
14:43:24    ) 
14:43:24    SELECT *
14:43:24    FROM source_avg'''), AnalysisException: Table or view not found: nyctaxi.records; line 16 pos 9;
14:43:24    'CreateTable `dbt_nyc_metrics`.`silver_nyctaxi_avg_metrics__dbt_tmp`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, ErrorIfExists
14:43:24    +- 'Project [*]
14:43:24       +- 'SubqueryAlias source_avg
14:43:24          +- 'Aggregate [unresolvedordinal(5), unresolvedordinal(6), unresolvedordinal(7)], ['avg(((cast('dropoff_datetime as bigint) - cast('pickup_datetime as bigint)) / 60)) AS avg_duration#0, 'avg('passenger_count) AS avg_passenger_count#1, 'avg('trip_distance) AS avg_trip_distance#2, 'avg('total_amount) AS avg_total_amount#3, 'year, 'month, 'type]
14:43:24             +- 'Filter (('year = 2016) AND isnotnull('dropoff_datetime))
14:43:24                +- 'UnresolvedRelation [nyctaxi, records], [], false
14:43:24    
14:43:24    compiled SQL at target/run/dbtgluenyctaxidemo/models/silver_metrics/silver_nyctaxi_avg_metrics.sql
14:43:24  
14:43:24  Done. PASS=0 WARN=0 ERROR=1 SKIP=3 TOTAL=4
menuetb commented 1 year ago

Hi,

Have you deployed the CloudFormation template provided in the blog post? It should deploy the nyctaxi database, which seems to be missing in your environment (this database contains the source data used in the project).

Be careful to deploy the CloudFormation template and your dbt project in the us-east-1 region. nyctaxi is a Glue database, and cross-region usage is not possible.
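
If you want to check whether the database is there before re-running dbt, a quick look with the AWS CLI might be enough (the nyctaxi name comes from the blog post; adjust the region if needed):

$ aws glue get-database --name nyctaxi --region us-east-1
$ aws glue get-tables --database-name nyctaxi --region us-east-1   # the project expects nyctaxi.records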

ashwaninbs commented 1 year ago

Yes. It worked after I deployed the CloudFormation template.

However, I still had an error in step 1. I had to grant Lake Formation permissions on the nyctaxi and dbt_nyc_metrics databases to the role used in the solution.
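
For reference, a database-level grant with the AWS CLI could look roughly like this (the role ARN is a placeholder for the Glue interactive session role used in the solution; repeat the command for dbt_nyc_metrics):

$ aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/GlueInteractiveSessionRole \
    --permissions "ALL" \
    --resource '{"Database": {"Name": "nyctaxi"}}'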

In addition, as you will know, users cannot run queries in Athena unless they have been granted Lake Formation permissions on the silver and gold tables.
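
Along the same lines, a sketch of granting an Athena user SELECT on all tables in the target database (the principal is a placeholder; TableWildcard covers every table, including the silver and gold ones):

$ aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:user/athena-user \
    --permissions "SELECT" \
    --resource '{"Table": {"DatabaseName": "dbt_nyc_metrics", "TableWildcard": {}}}'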

However, in the end it works.

ashwaninbs commented 1 year ago

It would be helpful if you could add some instructions on how to run the solution on a Mac.

menuetb commented 1 year ago

On a Mac, you need to:

  1. Install Python: https://docs.python-guide.org/starting/install3/osx/
  2. Install Git: https://github.com/git-guides/install-git
  3. Install the AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
  4. Configure your AWS account (see the sketch after this list): https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
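
For step 4, a minimal sketch of setting up and verifying credentials (assuming the us-east-1 region used by the blog post):

$ aws configure                  # prompts for access key, secret key, default region (us-east-1) and output format
$ aws sts get-caller-identity    # confirms the credentials resolve to the expected account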

Then you can just install the adapter like this:

$ python3 -m venv dbt_venv
$ source dbt_venv/bin/activate
$ python3 -m pip install --upgrade pip
$ pip3 install --no-cache-dir --upgrade boto3
$ pip3 install --no-cache-dir dbt-core
$ pip3 install --no-cache-dir dbt-glue

Then you can follow the blog post starting from the chapter "Clone the project".
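
If it helps, here is a minimal profiles.yml sketch for the adapter. The field names follow the dbt-glue documentation; the profile name must match the profile entry in your dbt_project.yml, and the role ARN, account ID, and S3 location are placeholders to replace with values from your environment:

dbtgluenyctaxidemo:
  target: dev
  outputs:
    dev:
      type: glue
      role_arn: arn:aws:iam::123456789012:role/GlueInteractiveSessionRole   # placeholder role
      region: us-east-1
      glue_version: "3.0"
      workers: 2
      worker_type: G.1X
      schema: dbt_nyc_metrics
      session_provisioning_timeout_in_seconds: 120
      location: s3://your-dbt-glue-datalake-bucket/                          # placeholder bucket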