elementary-data / elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
https://www.elementary-data.com/
Apache License 2.0

ignore_small_changes on BigQuery #1546

Open Emmanuel2RPA opened 4 months ago

Emmanuel2RPA commented 4 months ago

Hello, this is a copy of my message on Slack.

Platform: BigQuery

I have a problem with the ignore_small_changes params. When I set spike_failure_percent_threshold to 2%, I get a very strange result for the threshold range in my volume anomaly test: it's literally a 2-fold increase over the training_avg value. After a little research, it seems this is caused by an operator precedence problem. Let me explain: the max value for the volume anomaly test is calculated in get_anomaly_scores_query.sql (tell me if I'm saying something wrong) with the following code:

    {%- set max_val -%}
      {{ test_configuration.anomaly_sensitivity }} * training_stddev + training_avg
    {%- endset -%}
    {% if test_configuration.ignore_small_changes.spike_failure_percent_threshold %}
      {%- set spike_avg_threshold -%}
        ((1 + {{ test_configuration.ignore_small_changes.spike_failure_percent_threshold }}/100.0) * training_avg)
      {%- endset -%}
      {%- set max_val -%}
        {{ elementary.arithmetic_max(spike_avg_threshold, max_val) }}
      {%- endset -%}
    {% endif %}

and the elementary.arithmetic_max function is used. This function is relatively simple and returns, as its name suggests, the larger of the two values spike_avg_threshold and max_val (i.e. f(x, y) = 0.5 * (x + y + |x − y|)).
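For reference, judging from the rendered SQL below, the macro appears to paste its two arguments in as raw SQL strings, roughly like this (a minimal sketch, not necessarily the exact implementation in elementary):

    {# Sketch only: the real elementary.arithmetic_max macro may differ. #}
    {# x and y are substituted as raw SQL strings, so any unparenthesized #}
    {# expression inside them is exposed to SQL operator precedence.      #}
    {% macro arithmetic_max(x, y) %}
      0.5 * ({{ x }} + {{ y }} + abs({{ x }} - {{ y }}))
    {% endmacro %}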

If we now look at the final calculation in the SQL query rendered on BigQuery, we obtain the following field:

    (0.5 * (((1 + 100/100.0) * training_avg) + 2 * training_stddev + training_avg + abs(((1 + 100/100.0) * training_avg) - 2 * training_stddev + training_avg))) end as max_metric_value,

Here spike_avg_threshold = ((1 + 100/100.0) * training_avg) = x and max_val = 2 * training_stddev + training_avg = y. We can now easily see that the calculation goes wrong because of the "2 * training_stddev + training_avg" part: since max_val is substituted without parentheses, the subtraction inside abs() applies only to 2 * training_stddev, and training_avg is added back afterwards due to operator precedence. The expression therefore returns a result other than the expected one, namely the larger of spike_avg_threshold and max_val. For example, if I set spike_failure_percent_threshold: 2, I get an acceptance margin multiplied by 2 (100%) and not just 2%.
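To make that concrete, here is a small standalone BigQuery query with made-up numbers (purely illustrative, not taken from the issue: training_avg = 1000, training_stddev = 50, anomaly_sensitivity = 2, spike_failure_percent_threshold = 2, so x = 1020 and y = 1100):

    -- Illustrative values only:
    -- training_avg = 1000, training_stddev = 50 => x = 1020, y = 1100
    select
      -- broken rendering: abs() sees x - 2*stddev + avg instead of x - y
      0.5 * (1020 + 2 * 50 + 1000 + abs(1020 - 2 * 50 + 1000)) as broken_max,   -- 2020
      -- intended rendering, with y parenthesized: returns max(x, y)
      0.5 * ((1020) + (1100) + abs((1020) - (1100))) as intended_max            -- 1100

The broken value simplifies to x + training_avg (roughly 2 * training_avg here), which matches the 2-fold blow-up of the acceptance margin described above.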

Logically, it should "be enough" to add parentheses around this problematic part to fix the error, as sketched below, but I wanted to explain it to you to make sure I hadn't missed anything or made a mistake.
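For illustration, the parenthesization could be applied where max_val is composed, something like this (a sketch of the suggested fix, not the actual patch):

    {# Sketch of the suggested fix: parenthesize the composed expression #}
    {# so it survives string substitution into arithmetic_max intact.    #}
    {%- set max_val -%}
      ({{ test_configuration.anomaly_sensitivity }} * training_stddev + training_avg)
    {%- endset -%}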

I'm available if my explanation isn't clear. Good day to you.

haritamar commented 4 months ago

Hi @Emmanuel2RPA! Thanks for opening this and for the detailed explanation. I believe this bug was actually fixed here, though we haven't released it in an official version just yet; we'll do so soon. There were indeed missing parentheses.