dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0

[ADAP-1085] [Bug] When using iceberg format, dbt docs generate is unable to populate the columns information #968

Open shishircc opened 10 months ago

shishircc commented 10 months ago

Is this a new bug in dbt-spark?

Current Behavior

When using the Iceberg table format, dbt docs generate creates an empty catalog.json and therefore provides no column information in the documentation

Expected Behavior

dbt docs generate should produce a properly populated catalog.json
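
For reference, a populated catalog.json should carry per-relation metadata and columns, roughly like the abridged sketch below (based on the dbt catalog v1 schema; the model and schema names come from this project's logs, while the event_id column is a hypothetical example):

  {
    "nodes": {
      "model.c360.stg_clickstream": {
        "unique_id": "model.c360.stg_clickstream",
        "metadata": {"type": "table", "schema": "c360bronze", "name": "stg_clickstream", "comment": null, "owner": null},
        "columns": {
          "event_id": {"type": "string", "index": 1, "name": "event_id", "comment": null}
        },
        "stats": {}
      }
    }
  }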

Steps To Reproduce

  1. Configure EMR to work with Iceberg and the Glue catalog
  2. Set up the Spark Thrift server
  3. Run the dbt project on EMR through the Thrift server (a minimal model sketch follows this list)
  4. Run dbt docs generate
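
For step 3, the relevant dbt-spark setting is file_format='iceberg' on the models. A minimal sketch of such a model (the file name and select columns are hypothetical; only the config line matters for reproducing this):

  -- models/stg_example.sql (hypothetical reproduction model)
  {{ config(materialized='table', file_format='iceberg') }}

  select 1 as id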

Relevant log output

0m02:50:24.631713 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa619c86bb0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa617c572b0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa617c57a60>]}

============================== 02:50:24.638185 | c74f1dd7-ae7c-4cbe-aa70-cdad60a70d35 ==============================
[0m02:50:24.638185 [info ] [MainThread]: Running with dbt=1.7.4
[0m02:50:24.639539 [debug] [MainThread]: running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'write_json': 'True', 'log_cache_events': 'False', 'partial_parse': 'True', 'cache_selected_only': 'False', 'warn_error': 'None', 'debug': 'False', 'fail_fast': 'False', 'log_path': '/home/ec2-user/environment/dbtproject/dags/dbt_blueprint/c360-datalake/logs', 'version_check': 'True', 'profiles_dir': '/home/ec2-user/.dbt', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'log_format': 'default', 'invocation_command': 'dbt docs generate --vars {"day": "31","hour": "0","month": "12","raw_bucket":"c360-raw-data-*****-us-east-1","ts": "2023-12-31T00:00:00+00:00","year": "2023"}', 'introspect': 'True', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'target_path': 'None', 'static_parser': 'True', 'send_anonymous_usage_stats': 'True'}
[0m02:50:24.958174 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'project_id', 'label': 'c74f1dd7-ae7c-4cbe-aa70-cdad60a70d35', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa617bece50>]}
[0m02:50:25.203736 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'adapter_info', 'label': 'c74f1dd7-ae7c-4cbe-aa70-cdad60a70d35', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa6177c2af0>]}
[0m02:50:25.204633 [info ] [MainThread]: Registered adapter: spark=1.7.0
[0m02:50:25.223511 [debug] [MainThread]: checksum: 577537e0073da8fb99e9f3abffc643b153c4ab719d0d0e1e2dce7637653d4e74, vars: {'day': '31',
 'hour': '0',
 'month': '12',
 'raw_bucket': 'c360-raw-data-********-us-east-1',
 'ts': '2023-12-31T00:00:00+00:00',
 'year': '2023'}, profile: , target: , version: 1.7.4
[0m02:50:25.260540 [debug] [MainThread]: Partial parsing enabled: 0 files deleted, 0 files added, 0 files changed.
[0m02:50:25.261176 [debug] [MainThread]: Partial parsing enabled, no changes found, skipping parsing
[0m02:50:25.269396 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'load_project', 'label': 'c74f1dd7-ae7c-4cbe-aa70-cdad60a70d35', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa6175a6f10>]}
[0m02:50:25.272216 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': 'c74f1dd7-ae7c-4cbe-aa70-cdad60a70d35', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa6176cb2b0>]}
[0m02:50:25.272952 [info ] [MainThread]: Found 7 models, 6 sources, 0 exposures, 0 metrics, 439 macros, 0 groups, 0 semantic models
[0m02:50:25.273827 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'c74f1dd7-ae7c-4cbe-aa70-cdad60a70d35', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa6176cb2e0>]}
[0m02:50:25.276742 [info ] [MainThread]: 
[0m02:50:25.278117 [debug] [MainThread]: Acquiring new spark connection 'master'
[0m02:50:25.280561 [debug] [ThreadPool]: Acquiring new spark connection 'list_None_c360bronze'
[0m02:50:25.295222 [debug] [ThreadPool]: Spark adapter: NotImplemented: add_begin_query
[0m02:50:25.295972 [debug] [ThreadPool]: Using spark connection "list_None_c360bronze"
[0m02:50:25.296507 [debug] [ThreadPool]: On list_None_c360bronze: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "list_None_c360bronze"} */
show table extended in c360bronze like '*'

[0m02:50:25.296985 [debug] [ThreadPool]: Opening a new connection, currently in state init
[0m02:50:25.447602 [debug] [ThreadPool]: Spark adapter: Poll response: TGetOperationStatusResp(status=TStatus(statusCode=0, infoMessages=None, sqlState=None, errorCode=None, errorMessage=None), operationState=5, sqlState=None, errorCode=0, errorMessage='org.apache.hive.service.cli.HiveSQLException: Error running query: [_LEGACY_ERROR_TEMP_1200] org.apache.spark.sql.AnalysisException: SHOW TABLE EXTENDED is not supported for v2 tables.;\nShowTableExtended *, [namespace#9839, tableName#9840, isTemporary#9841, information#9842]\n+- ResolvedNamespace org.apache.iceberg.spark.SparkCatalog@6b21d3da, [c360bronze]\n\n\tat org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43)\n\tat 
org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)\n\tat org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:226)\n\t... 16 more\n', taskStatus=None, operationStarted=None, operationCompleted=None, hasResultSet=None, progressUpdateResponse=None)
[0m02:50:25.448538 [debug] [ThreadPool]: Spark adapter: Poll status: 5
[0m02:50:25.449121 [debug] [ThreadPool]: Spark adapter: Error while running:
/* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "list_None_c360bronze"} */
show table extended in c360bronze like '*'

[0m02:50:25.449863 [debug] [ThreadPool]: Spark adapter: Database Error
  org.apache.hive.service.cli.HiveSQLException: Error running query: [_LEGACY_ERROR_TEMP_1200] org.apache.spark.sql.AnalysisException: SHOW TABLE EXTENDED is not supported for v2 tables.;
  ShowTableExtended *, [namespace#9839, tableName#9840, isTemporary#9841, information#9842]
  +- ResolvedNamespace org.apache.iceberg.spark.SparkCatalog@6b21d3da, [c360bronze]

    at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:261)

  Caused by: org.apache.spark.sql.AnalysisException: SHOW TABLE EXTENDED is not supported for v2 tables.;
  ShowTableExtended *, [namespace#9839, tableName#9840, isTemporary#9841, information#9842]
  +- ResolvedNamespace org.apache.iceberg.spark.SparkCatalog@6b21d3da, [c360bronze]

    at org.apache.spark.sql.errors.QueryCompilationErrors$.commandUnsupportedInV2TableError(QueryCompilationErrors.scala:2040)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1(CheckAnalysis.scala:224)
    ... 16 more

[0m02:50:25.450777 [debug] [ThreadPool]: Spark adapter: Error while running:
macro list_relations_without_caching
[0m02:50:25.451525 [debug] [ThreadPool]: Spark adapter: Runtime Error
  Database Error
    org.apache.hive.service.cli.HiveSQLException: Error running query: [_LEGACY_ERROR_TEMP_1200] org.apache.spark.sql.AnalysisException: SHOW TABLE EXTENDED is not supported for v2 tables.;
    ShowTableExtended *, [namespace#9839, tableName#9840, isTemporary#9841, information#9842]
    +- ResolvedNamespace org.apache.iceberg.spark.SparkCatalog@6b21d3da, [c360bronze]

        at java.lang.Thread.run(Thread.java:750)
    Caused by: org.apache.spark.sql.AnalysisException: SHOW TABLE EXTENDED is not supported for v2 tables.;
    ShowTableExtended *, [namespace#9839, tableName#9840, isTemporary#9841, information#9842]
    +- ResolvedNamespace org.apache.iceberg.spark.SparkCatalog@6b21d3da, [c360bronze]

        at org.apache.spark.sql.errors.QueryCompilationErrors$.commandUnsupportedInV2TableError(QueryCompilationErrors.scala:2040)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1(CheckAnalysis.scala:224)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$1$adapted(CheckAnalysis.scala:163)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:338)
        at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:226)
        ... 16 more

[0m02:50:25.457505 [debug] [ThreadPool]: Using spark connection "list_None_c360bronze"
[0m02:50:25.458084 [debug] [ThreadPool]: On list_None_c360bronze: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "list_None_c360bronze"} */
show tables in c360bronze like '*'

[0m02:50:25.697421 [debug] [ThreadPool]: Spark adapter: Poll status: 2, query complete
[0m02:50:25.698186 [debug] [ThreadPool]: SQL status: OK in 0.0 seconds
[0m02:50:25.708726 [debug] [ThreadPool]: Using spark connection "list_None_c360bronze"
[0m02:50:25.709422 [debug] [ThreadPool]: On list_None_c360bronze: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "list_None_c360bronze"} */
describe extended c360bronze.stg_clickstream

[0m02:50:25.895379 [debug] [ThreadPool]: Spark adapter: Poll status: 2, query complete
[0m02:50:25.896238 [debug] [ThreadPool]: SQL status: OK in 0.0 seconds
[0m02:50:25.905061 [debug] [ThreadPool]: Using spark connection "list_None_c360bronze"
[0m02:50:25.905728 [debug] [ThreadPool]: On list_None_c360bronze: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "list_None_c360bronze"} */
describe extended c360bronze.stg_clickstream2

[0m02:50:26.128223 [debug] [ThreadPool]: Spark adapter: Poll status: 2, query complete
[0m02:50:26.128935 [debug] [ThreadPool]: SQL status: OK in 0.0 seconds
[0m02:50:26.139356 [debug] [ThreadPool]: Using spark connection "list_None_c360bronze"
[0m02:50:26.140353 [debug] [ThreadPool]: On list_None_c360bronze: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "list_None_c360bronze"} */
describe extended c360bronze.stg_salesdb__cart_items

[0m02:50:26.370865 [debug] [ThreadPool]: Spark adapter: Poll status: 2, query complete
[0m02:50:26.371741 [debug] [ThreadPool]: SQL status: OK in 0.0 seconds
[0m02:50:26.381025 [debug] [ThreadPool]: Using spark connection "list_None_c360bronze"
[0m02:50:26.381722 [debug] [ThreadPool]: On list_None_c360bronze: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "list_None_c360bronze"} */
describe extended c360bronze.stg_salesdb__customer

[0m02:50:26.572163 [debug] [ThreadPool]: Spark adapter: Poll status: 2, query complete
[0m02:50:26.573853 [debug] [ThreadPool]: SQL status: OK in 0.0 seconds
[0m02:50:26.584021 [debug] [ThreadPool]: Using spark connection "list_None_c360bronze"
[0m02:50:26.584680 [debug] [ThreadPool]: On list_None_c360bronze: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "list_None_c360bronze"} */
describe extended c360bronze.stg_salesdb__order_items

[0m02:50:26.783624 [debug] [ThreadPool]: Spark adapter: Poll status: 2, query complete
[0m02:50:26.784357 [debug] [ThreadPool]: SQL status: OK in 0.0 seconds
[0m02:50:26.795421 [debug] [ThreadPool]: Using spark connection "list_None_c360bronze"
[0m02:50:26.796071 [debug] [ThreadPool]: On list_None_c360bronze: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "list_None_c360bronze"} */
describe extended c360bronze.stg_salesdb__product

[0m02:50:26.987528 [debug] [ThreadPool]: Spark adapter: Poll status: 2, query complete
[0m02:50:26.988230 [debug] [ThreadPool]: SQL status: OK in 0.0 seconds
[0m02:50:26.996066 [debug] [ThreadPool]: Using spark connection "list_None_c360bronze"
[0m02:50:26.996669 [debug] [ThreadPool]: On list_None_c360bronze: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "list_None_c360bronze"} */
describe extended c360bronze.stg_salesdb__product_rating

[0m02:50:27.228290 [debug] [ThreadPool]: Spark adapter: Poll status: 2, query complete
[0m02:50:27.229005 [debug] [ThreadPool]: SQL status: OK in 0.0 seconds
[0m02:50:27.237495 [debug] [ThreadPool]: Using spark connection "list_None_c360bronze"
[0m02:50:27.238161 [debug] [ThreadPool]: On list_None_c360bronze: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "list_None_c360bronze"} */
describe extended c360bronze.stg_supportdb__support_chat

[0m02:50:27.439212 [debug] [ThreadPool]: Spark adapter: Poll status: 2, query complete
[0m02:50:27.439901 [debug] [ThreadPool]: SQL status: OK in 0.0 seconds
[0m02:50:27.445604 [debug] [ThreadPool]: On list_None_c360bronze: ROLLBACK
[0m02:50:27.446268 [debug] [ThreadPool]: Spark adapter: NotImplemented: rollback
[0m02:50:27.446799 [debug] [ThreadPool]: On list_None_c360bronze: Close
[0m02:50:27.570900 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'c74f1dd7-ae7c-4cbe-aa70-cdad60a70d35', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa617832d60>]}
[0m02:50:27.572181 [info ] [MainThread]: Concurrency: 1 threads (target='dev')
[0m02:50:27.573155 [info ] [MainThread]: 
[0m02:50:27.576111 [debug] [Thread-1  ]: Began running node model.c360.stg_clickstream
[0m02:50:27.578024 [debug] [Thread-1  ]: Re-using an available connection from the pool (formerly list_None_c360bronze, now model.c360.stg_clickstream)
[0m02:50:27.578766 [debug] [Thread-1  ]: Began compiling node model.c360.stg_clickstream
[0m02:50:27.603421 [debug] [Thread-1  ]: Writing injected SQL for node "model.c360.stg_clickstream"
[0m02:50:27.604594 [debug] [Thread-1  ]: Timing info for model.c360.stg_clickstream (compile): 02:50:27.579148 => 02:50:27.604210
[0m02:50:27.605277 [debug] [Thread-1  ]: Began executing node model.c360.stg_clickstream
[0m02:50:27.606030 [debug] [Thread-1  ]: Timing info for model.c360.stg_clickstream (execute): 02:50:27.605629 => 02:50:27.605653
[0m02:50:27.607610 [debug] [Thread-1  ]: Finished running node model.c360.stg_clickstream
[0m02:50:27.608426 [debug] [Thread-1  ]: Began running node model.c360.stg_salesdb__cart_items
[0m02:50:27.609975 [debug] [Thread-1  ]: Re-using an available connection from the pool (formerly model.c360.stg_clickstream, now model.c360.stg_salesdb__cart_items)
[0m02:50:27.610700 [debug] [Thread-1  ]: Began compiling node model.c360.stg_salesdb__cart_items
[0m02:50:27.619178 [debug] [Thread-1  ]: Writing injected SQL for node "model.c360.stg_salesdb__cart_items"
[0m02:50:27.620232 [debug] [Thread-1  ]: Timing info for model.c360.stg_salesdb__cart_items (compile): 02:50:27.611076 => 02:50:27.619876
[0m02:50:27.620860 [debug] [Thread-1  ]: Began executing node model.c360.stg_salesdb__cart_items
[0m02:50:27.621648 [debug] [Thread-1  ]: Timing info for model.c360.stg_salesdb__cart_items (execute): 02:50:27.621216 => 02:50:27.621229
[0m02:50:27.624942 [debug] [Thread-1  ]: Finished running node model.c360.stg_salesdb__cart_items
[0m02:50:27.625894 [debug] [Thread-1  ]: Began running node model.c360.stg_salesdb__customer
[0m02:50:27.627310 [debug] [Thread-1  ]: Re-using an available connection from the pool (formerly model.c360.stg_salesdb__cart_items, now model.c360.stg_salesdb__customer)
[0m02:50:27.628163 [debug] [Thread-1  ]: Began compiling node model.c360.stg_salesdb__customer
[0m02:50:27.635786 [debug] [Thread-1  ]: Writing injected SQL for node "model.c360.stg_salesdb__customer"
[0m02:50:27.636779 [debug] [Thread-1  ]: Timing info for model.c360.stg_salesdb__customer (compile): 02:50:27.628650 => 02:50:27.636440
[0m02:50:27.637526 [debug] [Thread-1  ]: Began executing node model.c360.stg_salesdb__customer
[0m02:50:27.638335 [debug] [Thread-1  ]: Timing info for model.c360.stg_salesdb__customer (execute): 02:50:27.637949 => 02:50:27.637961
[0m02:50:27.639622 [debug] [Thread-1  ]: Finished running node model.c360.stg_salesdb__customer
[0m02:50:27.640276 [debug] [Thread-1  ]: Began running node model.c360.stg_salesdb__order_items
[0m02:50:27.641442 [debug] [Thread-1  ]: Re-using an available connection from the pool (formerly model.c360.stg_salesdb__customer, now model.c360.stg_salesdb__order_items)
[0m02:50:27.642196 [debug] [Thread-1  ]: Began compiling node model.c360.stg_salesdb__order_items
[0m02:50:27.650906 [debug] [Thread-1  ]: Writing injected SQL for node "model.c360.stg_salesdb__order_items"
[0m02:50:27.652276 [debug] [Thread-1  ]: Timing info for model.c360.stg_salesdb__order_items (compile): 02:50:27.642683 => 02:50:27.651808
[0m02:50:27.653043 [debug] [Thread-1  ]: Began executing node model.c360.stg_salesdb__order_items
[0m02:50:27.653692 [debug] [Thread-1  ]: Timing info for model.c360.stg_salesdb__order_items (execute): 02:50:27.653397 => 02:50:27.653410
[0m02:50:27.655082 [debug] [Thread-1  ]: Finished running node model.c360.stg_salesdb__order_items
[0m02:50:27.655742 [debug] [Thread-1  ]: Began running node model.c360.stg_salesdb__product
[0m02:50:27.656697 [debug] [Thread-1  ]: Re-using an available connection from the pool (formerly model.c360.stg_salesdb__order_items, now model.c360.stg_salesdb__product)
[0m02:50:27.657630 [debug] [Thread-1  ]: Began compiling node model.c360.stg_salesdb__product
[0m02:50:27.744326 [debug] [Thread-1  ]: Writing injected SQL for node "model.c360.stg_salesdb__product"
[0m02:50:27.745610 [debug] [Thread-1  ]: Timing info for model.c360.stg_salesdb__product (compile): 02:50:27.658152 => 02:50:27.745075
[0m02:50:27.746830 [debug] [Thread-1  ]: Began executing node model.c360.stg_salesdb__product
[0m02:50:27.747641 [debug] [Thread-1  ]: Timing info for model.c360.stg_salesdb__product (execute): 02:50:27.747323 => 02:50:27.747337
[0m02:50:27.749546 [debug] [Thread-1  ]: Finished running node model.c360.stg_salesdb__product
[0m02:50:27.750300 [debug] [Thread-1  ]: Began running node model.c360.stg_salesdb__product_rating
[0m02:50:27.751857 [debug] [Thread-1  ]: Re-using an available connection from the pool (formerly model.c360.stg_salesdb__product, now model.c360.stg_salesdb__product_rating)
[0m02:50:27.752630 [debug] [Thread-1  ]: Began compiling node model.c360.stg_salesdb__product_rating
[0m02:50:27.760353 [debug] [Thread-1  ]: Writing injected SQL for node "model.c360.stg_salesdb__product_rating"
[0m02:50:27.761355 [debug] [Thread-1  ]: Timing info for model.c360.stg_salesdb__product_rating (compile): 02:50:27.753148 => 02:50:27.761013
[0m02:50:27.762149 [debug] [Thread-1  ]: Began executing node model.c360.stg_salesdb__product_rating
[0m02:50:27.762897 [debug] [Thread-1  ]: Timing info for model.c360.stg_salesdb__product_rating (execute): 02:50:27.762503 => 02:50:27.762526
[0m02:50:27.764336 [debug] [Thread-1  ]: Finished running node model.c360.stg_salesdb__product_rating
[0m02:50:27.765582 [debug] [Thread-1  ]: Began running node model.c360.stg_supportdb__support_chat
[0m02:50:27.768105 [debug] [Thread-1  ]: Re-using an available connection from the pool (formerly model.c360.stg_salesdb__product_rating, now model.c360.stg_supportdb__support_chat)
[0m02:50:27.768876 [debug] [Thread-1  ]: Began compiling node model.c360.stg_supportdb__support_chat
[0m02:50:27.776378 [debug] [Thread-1  ]: Writing injected SQL for node "model.c360.stg_supportdb__support_chat"
[0m02:50:27.777509 [debug] [Thread-1  ]: Timing info for model.c360.stg_supportdb__support_chat (compile): 02:50:27.769323 => 02:50:27.777059
[0m02:50:27.778705 [debug] [Thread-1  ]: Began executing node model.c360.stg_supportdb__support_chat
[0m02:50:27.779701 [debug] [Thread-1  ]: Timing info for model.c360.stg_supportdb__support_chat (execute): 02:50:27.779171 => 02:50:27.779367
[0m02:50:27.781293 [debug] [Thread-1  ]: Finished running node model.c360.stg_supportdb__support_chat
[0m02:50:27.782634 [debug] [MainThread]: Connection 'master' was properly closed.
[0m02:50:27.783134 [debug] [MainThread]: Connection 'model.c360.stg_supportdb__support_chat' was properly closed.
[0m02:50:27.785129 [debug] [MainThread]: Command end result
[0m02:50:27.800011 [debug] [MainThread]: Acquiring new spark connection 'generate_catalog'
[0m02:50:27.800584 [info ] [MainThread]: Building catalog
[0m02:50:27.804828 [debug] [ThreadPool]: Acquiring new spark connection 'spark_catalog.c360raw'
[0m02:50:27.805619 [debug] [ThreadPool]: On "spark_catalog.c360raw": cache miss for schema ".spark_catalog.c360raw", this is inefficient
[0m02:50:27.811565 [debug] [ThreadPool]: Spark adapter: NotImplemented: add_begin_query
[0m02:50:27.812110 [debug] [ThreadPool]: Using spark connection "spark_catalog.c360raw"
[0m02:50:27.812600 [debug] [ThreadPool]: On spark_catalog.c360raw: /* {"app": "dbt", "dbt_version": "1.7.4", "profile_name": "c360", "target_name": "dev", "connection_name": "spark_catalog.c360raw"} */
show table extended in spark_catalog.c360raw like '*'

[0m02:50:27.813343 [debug] [ThreadPool]: Opening a new connection, currently in state init
[0m02:50:30.996262 [debug] [ThreadPool]: Spark adapter: Poll status: 2, query complete
[0m02:50:30.997081 [debug] [ThreadPool]: SQL status: OK in 3.0 seconds
[0m02:50:31.017468 [debug] [ThreadPool]: While listing relations in database=, schema=spark_catalog.c360raw, found: cart_items, customer, order_items, product, product_rating, simulation, support_chat
[0m02:50:31.018414 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360raw.cart_items
[0m02:50:31.019253 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360raw.customer
[0m02:50:31.020171 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360raw.order_items
[0m02:50:31.020980 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360raw.product
[0m02:50:31.021837 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360raw.product_rating
[0m02:50:31.022754 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360raw.simulation
[0m02:50:31.023464 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360raw.support_chat
[0m02:50:31.030727 [debug] [ThreadPool]: On spark_catalog.c360raw: ROLLBACK
[0m02:50:31.032244 [debug] [ThreadPool]: Spark adapter: NotImplemented: rollback
[0m02:50:31.034169 [debug] [ThreadPool]: On spark_catalog.c360raw: Close
[0m02:50:31.153712 [debug] [ThreadPool]: Re-using an available connection from the pool (formerly spark_catalog.c360raw, now c360bronze)
[0m02:50:31.154930 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360bronze.stg_clickstream
[0m02:50:31.156753 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360bronze.stg_clickstream2
[0m02:50:31.159402 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360bronze.stg_salesdb__cart_items
[0m02:50:31.160427 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360bronze.stg_salesdb__customer
[0m02:50:31.161208 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360bronze.stg_salesdb__order_items
[0m02:50:31.161775 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360bronze.stg_salesdb__product
[0m02:50:31.162315 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360bronze.stg_salesdb__product_rating
[0m02:50:31.162877 [debug] [ThreadPool]: Spark adapter: Getting table schema for relation c360bronze.stg_supportdb__support_chat
[0m02:50:31.191842 [info ] [MainThread]: Catalog written to /home/ec2-user/environment/dbtproject/dags/dbt_blueprint/c360-datalake/target/catalog.json
[0m02:50:31.195884 [debug] [MainThread]: Resource report: {"command_name": "generate", "command_success": true, "command_wall_clock_time": 6.6277905, "process_user_time": 3.394287, "process_kernel_time": 0.149442, "process_mem_max_rss": "104176", "process_out_blocks": "4960", "process_in_blocks": "0"}
[0m02:50:31.198437 [debug] [MainThread]: Command `dbt docs generate` succeeded at 02:50:31.198171 after 6.63 seconds
[0m02:50:31.199040 [debug] [MainThread]: Connection 'generate_catalog' was properly closed.
[0m02:50:31.199666 [debug] [MainThread]: Connection 'c360bronze' was properly closed.
[0m02:50:31.200187 [debug] [MainThread]: Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa619c86bb0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa6179c4370>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa6177c2af0>]}
[0m02:50:31.201812 [debug] [MainThread]: Flushing usage events
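
The failure mode is visible near the top of the log: the catalog listing query, show table extended, fails against the Iceberg (v2) namespace, and only the per-table fallback succeeds. The same statements (copied from the log) can be replayed manually through the Thrift server to confirm:

  -- Fails with: "SHOW TABLE EXTENDED is not supported for v2 tables."
  show table extended in c360bronze like '*';

  -- The adapter's fallback path, which succeeds:
  show tables in c360bronze like '*';
  describe extended c360bronze.stg_clickstream;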

Environment

- OS: Amazon Linux
- Python: 3.10
- dbt-core: 1.7.4
- dbt-spark: 1.7.0

Additional Context

This is the catalog.json generated by dbt docs generate; note that "nodes" and "sources" are empty:

{"metadata": {"dbt_schema_version": "https://schemas.getdbt.com/dbt/catalog/v1.json", "dbt_version": "1.7.4", "generated_at": "2024-01-02T02:39:55.359210Z", "invocation_id": "4f9b9ed4-e962-49bf-8329-df43b335419a", "env": {}}, "nodes": {}, "sources": {}, "errors": null}

shishircc commented 10 months ago

emr_dag_automation_blueprint.py.txt: the attached DAG can be used to create the EMR cluster with the above configuration
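
For reference, enabling Iceberg with the Glue catalog on EMR is typically done through configuration classifications along these lines (a hypothetical sketch, since the attached DAG is not reproduced here; exact properties may differ by EMR release):

  [
    {
      "Classification": "iceberg-defaults",
      "Properties": {"iceberg.enabled": "true"}
    },
    {
      "Classification": "spark-defaults",
      "Properties": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog"
      }
    }
  ]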

Here is the requirements.txt:

--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.6.3/constraints-3.10.txt"

apache-airflow==2.6.3
apache-airflow-providers-salesforce
apache-airflow-providers-apache-spark
apache-airflow-providers-amazon
apache-airflow-providers-postgres
apache-airflow-providers-mongo
apache-airflow-providers-ssh
apache-airflow-providers-common-sql
astronomer-cosmos
boto3
simplejson
pymongo
pymssql
smart-open
psycopg2==2.9.5
simple-salesforce