dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.92k stars 1.63k forks

sanity tests for continuous integration #564

Closed HarlanH closed 2 years ago

HarlanH commented 7 years ago

(Note, not schema/data tests.)

We've got dbt set up with CircleCI, and right now pushes to our master branch create a new Docker image and push to production with no tests. This feels... risky. One thing we can do is to run dbt compile and check for errors, but it'd be great to get some simple testing of the SQL too. One option would be to compile and run the tables with EXPLAIN PLAN instead of CREATE TABLE (and not to run post-hooks). That should be pretty fast, and SQL syntax (and some semantic) errors would be caught too.
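To make that concrete (hypothetical model and table names, Postgres/Redshift-style syntax), the CI step could take the compiled SQL and plan it instead of building it:

-- what `dbt run` would execute for a table materialization (simplified)
create table analytics.my_model as (
    select id, amount from raw.orders
);

-- proposed CI check: plan the same compiled query without building anything
explain
select id, amount from raw.orders;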

Would this make sense? Or something else?

drewbanin commented 7 years ago

hey @HarlanH - in the early days of dbt, we had dbt run --dry which did something similar to what you're describing here. It ended up not being that useful the way we implemented it, but I think it might be appropriate to give it another look.

I like your idea of using explain. Alternatively, we could possibly materialize everything as ephemeral and run the select statements wrapped with limit 0. TBD how query planners would feel about that. I think BigQuery might have a built-in dry-run flag too.
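A rough sketch of that wrapping (table and column names are hypothetical), which should work on most warehouses:

-- let the warehouse parse, resolve, and plan the model's query,
-- but return no rows
select *
from (
    select id, amount from raw.orders
) as dbt_dry_run
limit 0;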

So, this is definitely on our radar!

adrianisk commented 6 years ago

We're in pretty much the same boat right now - running 'dbt compile' in CircleCI doesn't quite check enough, and running the schema/data tests doesn't seem appropriate, since an issue with the data shouldn't block a build. We've had multiple requests from our data science team to be able to do a 'dry run' of dbt while developing, and I'd like to add more stringent checks after the build.

chrisdhanson commented 5 years ago

Has there been any movement on this ticket?

drewbanin commented 5 years ago

hey @chrisdhanson - no movement on this in particular, but we are in the process of reshaping dbt's materializations to support custom workflows like this one.

At this point, it's not super clear to me what the desired behavior is here! Can you say a little more about how you'd like this feature to work? At Fishtown Analytics, we're in the habit of using GitHub hooks to do a full run/test of the project on every Pull Request. That's a little different than the workflow described in the comments above, but it works for us. Keen to hear your thoughts!

chrisdhanson commented 5 years ago

@drewbanin What kinds of tests are you running?

We're pretty early in our implementation, but we'll be building models for an analytics database, so being able to run test queries is going to be important. I can see that the built-in data tests have options for testing queries, but they seem to be mainly for internal consistency checks rather than verifying that the queries in the model SQL files are themselves valid. Additionally, to run tests on generated tables/views you have to generate them first, and if you push changes to production that fail to build, your tests won't be much good.

For example, the default behavior for the data tests is to fail if rows are returned, but that means you can't run a test query with limit 0 if the error condition isn't a SQL Exception. There's probably a way to jimmy these tests into working that way, but truthfully I haven't explored it enough yet.

drewbanin commented 5 years ago

@chrisdhanson We do an entire run of the project into a sandbox schema (like pr_1234), then run our tests on the models in that schema to assert that our assumptions hold with the new code. When the PR is closed, we drop the schema.

We use Sinter, and in practice, it looks something like this:

[screenshot: Sinter run for the pull request]

And in GitHub:

[screenshot: the corresponding GitHub status checks]

I think the comments above are about how to do this faster, since a full rebuild can take a long time for a mature dbt project.

As a disclaimer: Fishtown Analytics makes Sinter. I think it's a great tool that solves a lot of these problems, but there's no reason you couldn't replicate this workflow yourself using whatever tools you prefer. We're just listening for GitHub webhooks and running some dbt commands for the most part.

Hope this helps!

kforeman commented 5 years ago

One feature we'd love is the ability to use Snowflake's TABLESAMPLE when doing a dbt run for CI checks. That way we could do real data tests but with just a fraction of our data to reduce both the time and expense associated with doing a full rebuild.
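For example (hypothetical source table, Snowflake syntax), CI-target models could read from a small Bernoulli sample of each source rather than the full table:

-- build CI models from roughly 1% of the source rows
select *
from raw.orders tablesample bernoulli (1);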

sjwhitworth commented 5 years ago

We'd love to see the ability to support BigQuery's dry run capability. This way of running a query validates the SQL and returns an estimate of the bytes it would process, without actually executing anything.

At Monzo, we use dry runs extensively for validating that there are no syntax errors in queries. They're free, which is nice! 💰 This would help us with fast CI runs.

I've got a small patch locally that seems to work. Obviously, this is entirely BigQuery specific, and horribly breaks other things like incremental tables.

adapters/bigquery/connections.py

    def raw_execute(self, sql, name=None, fetch=False, dry_run=False):
        conn = self.get(name)
        client = conn.handle

        logger.debug('On %s: %s', name, sql)

        job_config = google.cloud.bigquery.QueryJobConfig()
        job_config.use_legacy_sql = False
        job_config.dry_run = dry_run

        query_job = client.query(sql, job_config)

        # there's no results for dry run queries, so we bail early
        if dry_run:
            return query_job, None

        # this blocks until the query has completed
        with self.exception_handler(sql, conn.name):
            iterator = query_job.result()

        return query_job, iterator

    def execute(self, sql, name=None, auto_begin=False, fetch=None):
        # dry-run mode is opted into via an environment variable (requires `import os`)
        dry_run = os.getenv("DRY_RUN") == "true"
        if dry_run:
            fetch = False

        # auto_begin is ignored on bigquery, and only included for consistency
        _, iterator = self.raw_execute(sql, name=name, fetch=fetch, dry_run=dry_run)

        if fetch:
            res = self.get_table_from_response(iterator)
        else:
            res = dbt.clients.agate_helper.empty_table()

        # If we get here, the query succeeded
        status = 'OK'
        return status, res
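With a patch like this, a CI job could opt into dry-run mode just by exporting the environment variable that execute checks, e.g. DRY_RUN=true dbt run (hypothetical usage; the variable name comes from the snippet above).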
bastienboutonnet commented 5 years ago

Since @sjwhitworth's presentation at the dbt meetup and a few recent "oops I forgot to test" PRs, I've decided I want pretty strong CI.

I'd be more than happy to help with implementing Snowflake's TABLESAMPLE, so that our users can test that their SQL would run, a CI container could execute a dry run more rapidly, etc.

The only caveat I foresee is that running dbt's tests against a subsample may upset some of those tests. I haven't looked much into the materialisations of the tests. In Snowflake, SAMPLE seems to be part of the query itself, so my gut feeling is we'd have to add something like the is_incremental flag or make it an option in the materialisations.

@drewbanin do you have any suggestions so I can get started in the right direction maybe?

drewbanin commented 5 years ago

hey @bastienboutonnet - as I understand it, @sjwhitworth's approach does two things:

  1. executes queries in dry-run mode on BQ
  2. automatically selects a subset of data using a limit on BQ

The dry-run flag is BigQuery specific: it works like an explain plan on Postgres/Redshift, validating a query and returning an estimated cost without actually running the query. I don't believe any sort of analog exists in Snowflake, and I found this comment in the Snowflake Lodge support site:

...Also, and perhaps more importantly, our execution plans can dynamically change based on the effectiveness of our pruning (micro-partition elimination) phase, so we do not know the actual execution plan until we have already started the query processing.

The whole post is worth reading, but the takeaway for me is that Snowflake doesn't currently provide any mechanisms for validating queries without actually running them.

The tablesample example you sent over is interesting too, but I think about approaches like that a little bit differently than the dry-run workflow. If sampling like this were enabled, dbt would still build objects in the target database, it would just happen faster than usual. I agree with your assessment though - this approach could definitely mess up tests on foreign keys (and possibly other types of custom tests too).

One currently viable approach could be to sample in the models that select from source data, e.g.:

select *
from {{ source('ecom', 'orders') }}
{% if target.name == 'dev' %}
sample (100 rows)
{% endif %}

If you wanted to get really clever, you could make a macro called sampled() which in turn calls source(), applying a sample as appropriate. There are probably good ways to bake that right into dbt (either as a source config, or via something like https://github.com/fishtown-analytics/dbt/issues/1096). Happy to explore those in a separate issue if you're interested in discussing!
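As a minimal sketch (assuming source() resolves normally from inside a macro; the names and the 100-row sample size are placeholders), that hypothetical sampled() macro might look like:

{% macro sampled(source_name, table_name) %}
    {{ source(source_name, table_name) }}
    {% if target.name == 'dev' %} sample (100 rows) {% endif %}
{% endmacro %}

A model would then write select * from {{ sampled('ecom', 'orders') }} instead of calling source() directly.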

kforeman commented 5 years ago

@liveandletbri and @danieldicker implemented a SAMPLE(0) macro for our Snowflake tables, which was then invoked in a new temporary schema.

It worked in that it validated the schema, but the execution time was much longer than we anticipated, even with that appended to each table, so we haven't been using it yet.

Currently we're evaluating either using CREATE TABLE ... LIKE to create a new, empty database/schema to run the dry run against, or waiting for a Snowflake EXPLAIN (which is apparently coming, but we don't know when).
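Roughly (hypothetical schema and table names, Snowflake syntax):

-- clone the structure of a landing-zone table, with no data, into a CI schema
create schema if not exists ci_raw;
create table ci_raw.orders like raw.orders;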

bastienboutonnet commented 5 years ago

@drewbanin I really, really like your proposal. It's good to clarify that a solution comparable to BQ's dry run is currently not implementable one-to-one in Snowflake, but I like the idea of speeding things up when testing model executions, for example when working locally. When testing assumptions with tests, there doesn't seem to be a viable way to avoid running the tests on the full model. However, there is a way to make the sampling repeatable, so if it's always placed on the source, tests might not actually get messed up. This demands testing though, which I'd be happy to investigate no matter what.

@kforeman I was indeed thinking of doing something along those lines. But your point that "the execution time was much longer than we anticipated" maybe suggests the benefits of this are low. Can you elaborate a bit? What were you expecting?

What would be the advantage of CREATE TABLE ... LIKE in this situation, @kforeman?

liveandletbri commented 5 years ago

But your point that "the execution time was much longer than we anticipated" maybe suggests the benefits of this are low. Can you elaborate a bit? What were you expecting?

@bastienboutonnet I can speak to the performance issues. We created a macro that could optionally insert SAMPLE(0) and placed it next to the name of every table in our query. For example:

WITH my_cte AS (
  SELECT column_1
  FROM foo.bar SAMPLE(0)
)

SELECT a.column_1, b.column_2
FROM my_cte as a 
CROSS JOIN foo.another_table as b SAMPLE(0) 

The thought was that putting in the limitations at the mention of each table, rather than at the end of the query in the form of LIMIT 0, would speed the query up significantly since no data would flow through. Joins could be performed using zero rows, filters would be applied on zero rows, etc.

The problem is Snowflake still performs a table scan on the table before applying the SAMPLE(0) filter, so large tables could still add a significant amount of time to the queries. Some queries that take 10+ minutes with the full dataset might still run for 2-5 minutes with SAMPLE(0) applied.

That all being said, the large majority of queries are greatly improved by adding SAMPLE(0). It's only an issue on tables with hundreds of millions or billions of rows.
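For reference, a hedged sketch of what such a macro might look like (not the exact implementation described above; the target name is a placeholder):

{% macro sample_zero() %}
    {% if target.name == 'ci' %} sample (0) {% endif %}
{% endmacro %}

Each table reference then becomes something like from foo.bar {{ sample_zero() }}.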

kforeman commented 5 years ago

What would be the advantage of CREATE TABLE ... LIKE in this situation, @kforeman?

Sorry, should have been clearer. We're considering using CREATE TABLE ... LIKE to produce empty versions of the upstream tables that live outside of dbt (our landing zone in Snowflake, where data is ingested from the production PSQL databases). That way we'd have empty source tables and wouldn't have to worry about the slow table scans that @liveandletbri mentioned. We'd still use the regular table creation for dbt models.

bastienboutonnet commented 5 years ago

So I found a really interesting thing: when you bring a table into Tableau, it runs a super fast, performant query to discover the columns present in the table. It achieves this with the following:

select * from schema.table where 1=0

This translates into a very dry-run-like query: nothing gets scanned, nothing gets materialised; the query is essentially all initialisation. So I reckon that would be pretty powerful.

Sticking this onto the source, as @drewbanin suggested, could be a really powerful and easy way to bring dry-run-like behaviour to Snowflake, I think. And it could come in several flavours: if you want to check that your SQL works, you'd use the where 1=0 statement, and if you wanted some data, you could apply a sample. @drewbanin you suggested I make a new issue for this approach? Shall I do that, so we have the discussion linked to that particular approach in there, and then I can get cracking with a PR?
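A rough sketch of those flavours attached at the source (hypothetical target names):

select *
from {{ source('ecom', 'orders') }}
{% if target.name == 'ci' %}
where 1 = 0            -- validate the SQL only; nothing is scanned
{% elif target.name == 'dev' %}
sample (100 rows)      -- small but real data for local runs
{% endif %}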

drewbanin commented 5 years ago

@bastienboutonnet yeah! This one is interesting -- it's something you can already do in user-space, but it's certainly a workflow that dbt could make easier to implement. I have a couple of ideas here - happy to write them down in a new issue for sure! Maybe something like "More flexible control over selecting from sources", or similar?

bastienboutonnet commented 5 years ago

@drewbanin Yeah, it's doable in user space, but it would require every query to carry some checks, a bit like the incremental macro, right? I was thinking more of a solution around your SQL-wrapping issue (https://github.com/fishtown-analytics/dbt/issues/1096). But happy to make it a different one. Whatever you think is best.

github-actions[bot] commented 2 years ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.