elementary-data / elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
https://www.elementary-data.com/
Apache License 2.0
1.87k stars 158 forks source link

Generated result query in report contains an invalid table #1483

Open cshlin opened 5 months ago

cshlin commented 5 months ago

Describe the bug When trying to copy the result query to view the failed rows, I get a query that contains an invalid tmp table

To Reproduce Steps to reproduce the behavior:

  1. On the generated EDR html report, click on Results
  2. Expand a test failure, and in the Result section, click the Copy button beside the Result Query
  3. Receive the following query (see the FROM clause on line 50 in the attached .txt file)

edr.txt

Expected behavior A query that provides me with the failed rows

Screenshots

EDR

Environment (please complete the following information):

Additional context Add any other context about the problem here.

haritamar commented 3 months ago

Hi @cshlin ! Thanks for posting this issue and sorry for the delayed response. You're right - there is a temp table here that is available during the test run, but is not available afterwards. We need to save instead a query that will not require this temp table.

Any chance you'd like to contribute a fix to this? I'd be happy to provide guidance with a possible approach to fix this.

cshlin commented 3 months ago

Absolutely, point me in the right direction and I can implement CharlesOn May 19, 2024, at 15:45, Itamar Hartstein @.***> wrote: Hi @cshlin ! Thanks for posting this issue and sorry for the delayed response. You're right - there is a temp table here that is available during the test run, but is not available afterwards. We need to save instead a query that will not require this temp table. Any chance you'd like to contribute a fix to this? I'd be happy to provide guidance with a possible approach to fix this.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>

haritamar commented 3 months ago

Thanks @cshlin !

It's admittedly not a easy fix - but I'll write what I think can be a way to solve it. Essentially I think we need to save a different query in the test_result_query than what is being actually run - a version of the query that does not rely on temp tables. (This is needed, because the metrics of all tests are only saved to the permanent data_monitoring_metrics table in the on_run_end hook of dbt, which is after the tests are actually being run)

These areas likely need to be changed:

  1. get_read_anomaly_scores_query should get a parameter that allows it to compute the anomaly scores directly, and not through a temp table that is previously generated.
  2. We should also ensure that in this mode, get_anomaly_scores_query pulls only from data_monitoring_metrics and not from a union of it with a temporary table.
  3. The store_anomaly_test_results should get an additional saved_anomaly_scores_sql parameter which will contain a query that does not rely on a temp table, and the test_results_query should be computed by using it.

The full flow where the anomaly query is generated and the result is saved exists in each of the anomaly test implementations. For example in test_table_anomalies.sql it would be these rows:

{% set flattened_test = elementary.flatten_test(context["model"]) %}
{% set anomaly_scores_sql = elementary.get_read_anomaly_scores_query() %}
{% do elementary.store_metrics_table_in_cache() %}
{% do elementary.store_anomaly_test_results(flattened_test, anomaly_scores_sql) %}

Please let me know if you plan to look into it, and feel free to ask additional questions.