elementary-data / elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
https://www.elementary-data.com/
Apache License 2.0

[ELE-69] [ELE-14] Apache Spark Integration #483

Open izzye84 opened 1 year ago

izzye84 commented 1 year ago

Is your feature request related to a problem? Please describe. Currently, Databricks is the only supported Spark adapter.

Describe the solution you'd like Add support for the Apache Spark adapter

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context This would enable Spark projects that are not using Databricks (e.g., EMR on EC2/EKS).

Would you be willing to contribute this feature? Probably not
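
For context, a non-Databricks Spark setup typically connects dbt through the `thrift` method of the dbt-spark adapter. A minimal `profiles.yml` sketch (the host, port, and schema below are placeholder assumptions, not values from this issue):

```yaml
# Sketch only: connecting dbt-spark to a non-Databricks cluster (e.g. EMR)
# via the Spark Thrift server. Host/port/schema are hypothetical.
spark_profile:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: emr-master.example.com   # hypothetical EMR master node
      port: 10001                    # assumed Thrift server port
      schema: analytics              # hypothetical target schema
```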

ELE-14

ELE-69

elongl commented 1 year ago

Hi @izzye84! Thanks a lot for opening this issue. Since Databricks itself uses Spark under the hood, we have good reason to believe the Spark adapter is already supported; we just never tested it because we don't have any other setup such as EMR.

What do you think about trying to use Elementary on your environment? The Elementary Tutorial is a good place to start. I can aid in the process as needed either here or on our support channel.

izzye84 commented 1 year ago

Thanks for the follow-up @elongl! I will give the tutorial a try with my EMR environment now that I know the functionality might already work—I will let you know how it goes!

elongl commented 1 year ago

Hi @izzye84! Did you get a chance to look at it? Would love to help.

izzye84 commented 1 year ago

Hey @elongl,

I am running into some errors when I run dbt run --select elementary—not all of the tables/views are building successfully (although most are). I am open to your help if you have availability.

elongl commented 1 year ago

Hi @izzye84, of course, I would love to help. Would you be able to share the errors? If you'd prefer something more interactive, we can also talk this over on our community Slack.

MatatatDespe commented 1 year ago

> Hey @elongl,
>
> I am running into some errors when I run dbt run --select elementary—not all of the tables/views are building successfully (although most are). I am open to your help if you have availability.

That's because on Spark the `current_timestamp` macro from dbt_utils is not working correctly. You need to install spark_utils, pin dbt_utils to 0.8.6, and change the cross_db_utils macros like this:

{% macro current_timestamp() -%}
--    {% set macro = dbt.current_timestamp_backcompat or dbt_utils.current_timestamp %}
    {% set macro =  dbt_utils.current_timestamp %}
    {% if not macro %}
        {{ exceptions.raise_compiler_error("Did not find a `current_timestamp` macro.") }}
    {% endif %}
    {{ return(macro()) }}
{%- endmacro %}

{% macro current_timestamp_in_utc() -%}
--    {% set macro = dbt.current_timestamp_in_utc_backcompat or dbt_utils.current_timestamp_in_utc %}
    {% set macro = dbt_utils.current_timestamp_in_utc %}
    {% if not macro %}
        {{ exceptions.raise_compiler_error("Did not find a `current_timestamp_in_utc` macro.") }}
    {% endif %}
    {{ return(macro()) }}
{%- endmacro %}

MatatatDespe commented 1 year ago

I have the following issue after executing a test:

14:47:49.760573 [debug] [MainThread]: Spark adapter: Poll response: TGetOperationStatusResp(status=TStatus(statusCode=0, infoMessages=None, sqlState=None, errorCode=None, errorMessage=None), operationState=5, sqlState=None, errorCode=0, errorMessage="org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.sql.AnalysisException: Cannot write incompatible data to table 'schema_mjs_elementary.elementary_test_results':
- Cannot safely cast 'detected_at': string to timestamp
    at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:325)
    at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2

And this is the data:

insert into schema_mjs_elementary.elementary_test_results (id,data_issue_id,test_execution_id,test_unique_id,model_unique_id,invocation_id,detected_at,database_name,schema_name,table_name,column_name,test_type,test_sub_type,test_results_description,owners,tags,test_results_query,other,test_name,test_params,severity,status,failures,test_short_name,test_alias,result_rows) values ('5f33389e-616e-4f7d-9ff3-b75ee92a47b9.test.bi_dbt_spark_prj.not_null_bi_common_cities_city_id_pk.ab383051c3',NULL,'5f33389e-616e-4f7d-9ff3-b75ee92a47b9.test.bi_dbt_spark_prj.not_null_bi_common_cities_city_id_pk.ab383051c3','test.bi_dbt_spark_prj.not_null_bi_common_cities_city_idpk.ab383051c3','model.bidbt_spark_prj.bi_common_cities','5f33389e-616e-4f7d-9ff3-b75ee92a47b9','2023-02-01 15:17:49' (
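
The failing insert passes `detected_at` as a string literal, which Spark refuses to cast implicitly into a timestamp column. A minimal sketch of the workaround (column list shortened and id value hypothetical, for illustration only) is to cast explicitly:

```sql
-- Sketch only: an explicit cast avoids "Cannot safely cast 'detected_at':
-- string to timestamp". Column list shortened for illustration.
insert into schema_mjs_elementary.elementary_test_results (id, detected_at)
values ('hypothetical-id', cast('2023-02-01 15:17:49' as timestamp));
```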

Maayan-s commented 1 year ago

Hi @MatatatDespe, would you be willing to contribute the fix that worked for you, so future Spark users can benefit from it as well and it stays maintained across version upgrades?

MatatatDespe commented 1 year ago

> Hi @MatatatDespe, Would you maybe be willing to contribute the fix that worked for you? So future Spark users can enjoy it as well, and you could maintain it when upgrading versions.

Actually, I've just put the answer in a comment above. Is it clear enough?

Maayan-s commented 1 year ago

Actually no, as far as I can see, the code there is the code we already have. Right?

MatatatDespe commented 1 year ago

Actually, I've commented out this line: `-- {% set macro = dbt.current_timestamp_backcompat or dbt_utils.current_timestamp %}` and only left this one: `{% set macro = dbt_utils.current_timestamp %}`
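
For reference, the usual way to make this portable, rather than hard-coding one dbt_utils version, is dbt's adapter dispatch pattern, so Spark resolves to its own implementation. A sketch only, assuming an `elementary`-style package namespace (macro names here are illustrative, not the package's actual internals):

```jinja
{# Sketch only: dispatch lets Spark resolve to spark__current_timestamp #}
{% macro current_timestamp() -%}
    {{ return(adapter.dispatch('current_timestamp', 'elementary')()) }}
{%- endmacro %}

{# Fallback for adapters without a specific implementation #}
{% macro default__current_timestamp() -%}
    {{ return(dbt_utils.current_timestamp()) }}
{%- endmacro %}

{# Spark-specific implementation: emit the native SQL function #}
{% macro spark__current_timestamp() -%}
    current_timestamp()
{%- endmacro %}
```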

Maayan-s commented 1 year ago

Thanks @MatatatDespe, I opened this PR to add Spark-supported timestamps: https://github.com/elementary-data/dbt-data-reliability/pull/253

rmahidhara-tmus commented 1 year ago

Here is an issue I am currently facing with the Spark adapter:

When I try to run dbt run --select elementary, all of the models except job_run_results and metrics_anomaly_score succeed. The error seems to be related to the SQL syntax.

For example, in job_run_results model:

Caused by: org.apache.spark.sql.AnalysisException: cannot resolve 'second' given input columns

The error comes from the executed SQL snippet below:

datediff(
     second, min(cast(run_started_at as timestamp)), max(cast(run_completed_at as timestamp))
) as job_run_execution_time

This call uses a datepart-style signature that differs from Spark's native datediff function, which is likely why the query fails.
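
Indeed, Spark's native `datediff(endDate, startDate)` takes two dates and returns whole days; it has no datepart argument like `second`. A sketch of an equivalent seconds-based expression for Spark (the source relation name is hypothetical):

```sql
-- Sketch only: compute execution time in seconds on Spark, where
-- datediff(second, start, end) is not valid syntax.
select
    unix_timestamp(max(cast(run_completed_at as timestamp)))
      - unix_timestamp(min(cast(run_started_at as timestamp)))
        as job_run_execution_time
from dbt_run_results  -- hypothetical source relation
```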

MatatatDespe commented 1 year ago

Have you installed spark_utils in your project?

rmahidhara-tmus commented 1 year ago

> Have you installed on your project the spark_utils?

Yes, I have. Here are my package dependencies:

packages:
  - package: dbt-labs/codegen
    version: 0.9.0
  - package: dbt-labs/dbt_utils
    version: 1.0.0
  - package: calogica/dbt_expectations
    version: [">=0.8.0", "<0.9.0"]
  - package: dbt-labs/spark_utils
    version: 0.3.0
  - package: elementary-data/elementary
    version: 0.6.12

Maayan-s commented 1 year ago

Hi @rmahidhara-tmus, thanks for reporting! I found the problem thanks to the details you shared and added a fix to the PR that is already open here: https://github.com/elementary-data/dbt-data-reliability/pull/253

You can try working with the branch or wait for the release. Just to clarify: this is not official Spark support, so I can't guarantee there won't be additional problems.