Closed: HarlanH closed this issue 2 years ago.
hey @HarlanH - in the early days of dbt, we had `dbt run --dry`, which did something similar to what you're describing here. It ended up not being that useful the way we implemented it, but I think it might be appropriate to give it another look.

I like your idea of using `explain`. Alternatively, we could possibly materialize everything as ephemeral and run the select statements wrapped with `limit 0`. TBD how query planners feel about that. I think BigQuery might have a built-in dry-run flag too.
So, this is definitely on our radar!
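For concreteness, here is a rough sketch of what that `limit 0` wrapping could look like around a compiled model's select (the column and table names below are placeholders, not from this thread):

```sql
-- Hypothetical dry-run check: wrap a compiled model's select so the warehouse
-- has to parse and plan it, without materializing anything.
select *
from (
    select order_id, customer_id, amount   -- placeholder for the model's compiled SQL
    from analytics.stg_orders
) as model_sql
limit 0
```

Whether the warehouse can skip the underlying scan entirely depends on the query planner, which is the open question noted above.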
We're in pretty much the same boat right now - running 'dbt compile' in CircleCI doesn't quite check enough, and running the schema/data tests doesn't seem appropriate as an issue with the data shouldn't block a build. We've had multiple requests from our data science team to be able to do a 'dry run' of dbt while developing, and I'd like to add more stringent checks after the build.
Has there been any movement on this ticket?
hey @chrisdhanson - no movement on this in particular, but we are in the process of reshaping dbt's materializations to support custom workflows like this one.
At this point, it's not super clear to me what the desired behavior is here! Can you say a little more about how you'd like this feature to work? At Fishtown Analytics, we're in the habit of using GitHub hooks to do a full run/test of the project on every Pull Request. That's a little different than the workflow described in the comments above, but it works for us. Keen to hear your thoughts!
@drewbanin What kinds of tests are you running?
We're pretty early in our implementation but we'll be building models for an analytics database so being able to run test queries is going to be important. I can see in the built-in data tests that there are options for testing queries, but they seem to be mainly for internal consistency checks rather than verifying that the queries in the model sql files are themselves valid. Additionally, to run tests on generated tables/views, you have to generate them first and if you push changes to production that fail to build, your tests won't be much good.
For example, the default behavior for the data tests is to fail if rows are returned, but that means you can't run a test query with limit 0 if the error condition isn't a SQL Exception. There's probably a way to jimmy these tests into working that way, but truthfully I haven't explored it enough yet.
@chrisdhanson We do an entire run of the project into a sandbox schema (like `pr_1234`), then run our tests on the models in that schema to assert that our assumptions hold with the new code. When the PR is closed, we drop the schema.
We use Sinter, and in practice, it looks something like this: [screenshot]

And in GitHub: [screenshot]
I think the comments above are in regards to how to do this faster, as a full rebuild can take a long time for a mature dbt project.
As a disclaimer: Fishtown Analytics makes Sinter. I think it's a great tool that solves a lot of these problems, but there's no reason you couldn't replicate this workflow yourself using whatever tools you prefer. We're just listening for GitHub webhooks and running some dbt commands for the most part.
Hope this helps!
One feature we'd love is the ability to use Snowflake's `TABLESAMPLE` when doing a `dbt run` for CI checks. That way we could do real data tests, but with just a fraction of our data, to reduce both the time and expense associated with doing a full rebuild.
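As a rough illustration of the Snowflake syntax involved (the table name and sample sizes are placeholders, and dbt does not inject these clauses itself):

```sql
-- Hypothetical examples of the sampling a CI run could apply to a source scan:
select * from raw.ecom.orders tablesample bernoulli (1);   -- roughly 1% of rows
select * from raw.ecom.orders sample (1000 rows);          -- a fixed number of rows
```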
We'd love to see the ability to support BigQuery's dry run capability. This way of running a query validates the SQL and returns an estimate of the bytes it would process, without actually executing the query.

At Monzo, we use them extensively for validating that there are no syntax errors in queries. They're free, which is nice! 💰 This would help us with fast CI runs.
I've got a small patch locally that seems to work. Obviously, this is entirely BigQuery specific, and horribly breaks other things like incremental tables.
adapters/bigquery/connections.py
```python
def raw_execute(self, sql, name=None, fetch=False, dry_run=False):
    conn = self.get(name)
    client = conn.handle

    logger.debug('On %s: %s', name, sql)

    job_config = google.cloud.bigquery.QueryJobConfig()
    job_config.use_legacy_sql = False
    job_config.dry_run = dry_run

    query_job = client.query(sql, job_config)

    # there's no results for dry run queries, so we bail early
    if dry_run:
        return query_job, None

    # this blocks until the query has completed
    with self.exception_handler(sql, conn.name):
        iterator = query_job.result()

    return query_job, iterator

def execute(self, sql, name=None, auto_begin=False, fetch=None):
    dry_run = os.getenv("DRY_RUN") == "true"
    if dry_run:
        fetch = False

    # auto_begin is ignored on bigquery, and only included for consistency
    _, iterator = self.raw_execute(sql, name=name, fetch=fetch, dry_run=dry_run)

    if fetch:
        res = self.get_table_from_response(iterator)
    else:
        res = dbt.clients.agate_helper.empty_table()

    # If we get here, the query succeeded
    status = 'OK'
    return status, res
```
Since @sjwhitworth's presentation at the dbt meetup and a few recent "oops I forgot to test" PRs, I've decided I want to get pretty strong CI.
I'd be more than happy to help with implementing Snowflake's `TABLESAMPLE`, so that our users can test that their SQL would run, a CI container could execute a dry run more rapidly, etc.

The only caveat I foresee is that running dbt's tests on a subsample may upset some of the tests. I haven't looked much into the materialisations of the tests. In Snowflake it seems that `SAMPLE` is part of the query itself, so my gut feeling is we'd have to add something at the end like the `is_incremental` flag, or make it an option in the materialisations.
@drewbanin do you have any suggestions so I can get started in the right direction maybe?
hey @bastienboutonnet - as I understand it, @sjwhitworth's approach does two things:

1. `dry-run` mode on BQ
2. `limit` on BQ

The `dry-run` flag is BigQuery specific: it works like an `explain` plan on Postgres/Redshift, validating a query and returning an estimated cost without actually running the query. I don't believe any sort of analog exists in Snowflake, and I found this comment on the Snowflake Lodge support site:
> ...Also, and perhaps more importantly, our execution plans can dynamically change based on the effectiveness of our pruning (micro-partition elimination) phase, so we do not know the actual execution plan until we have already started the query processing.
The whole post is worth reading, but the takeaway for me is that Snowflake doesn't currently provide any mechanisms for validating queries without actually running them.
The `tablesample` example you sent over is interesting too, but I think about approaches like that a little bit differently than the `dry-run` workflow. If sampling like this were enabled, dbt would still build objects in the target database, it would just happen faster than usual. I agree with your assessment though - this approach could definitely mess up tests on foreign keys (and possibly other types of custom tests too).
One currently viable approach could be to sample in the models that select from source data, e.g.:

```sql
select *
from {{ source('ecom', 'orders') }}

{% if target.name == 'dev' %}
    sample (100 rows)
{% endif %}
```
If you wanted to get really clever, you could make a macro called `sampled()` which in turn calls `source()`, applying a sample as appropriate. There are probably good ways to bake that right into dbt (either as a source config, or via something like https://github.com/fishtown-analytics/dbt/issues/1096). Happy to explore those in a separate issue if you're interested in discussing!
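A minimal sketch of what such a macro might look like (the macro name, default sample size, and `dev` target check are illustrative assumptions, not an existing dbt feature):

```sql
-- macros/sampled.sql (hypothetical)
{% macro sampled(source_name, table_name, n_rows=100) %}
    {{ source(source_name, table_name) }}
    {% if target.name == 'dev' %}
        sample ({{ n_rows }} rows)  -- only sample outside of prod
    {% endif %}
{% endmacro %}
```

A model would then select `from {{ sampled('ecom', 'orders') }}` and only get sampled data in dev.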
@liveandletbri and @danieldicker implemented a `SAMPLE(0)` macro for our Snowflake tables, which then got invoked in a new temporary schema.

It worked in that it validated the schema, but the execution time was much longer than we anticipated, even when appending that to each table, so we haven't been using it yet.

Currently evaluating using `CREATE TABLE ... LIKE` to create a new, empty database/schema to run the dry run off of, or waiting for a Snowflake `EXPLAIN` (which is apparently coming, but we don't know when).
@drewbanin I really, really like your proposal. It's good to clarify that a solution comparable to BQ's `dry-run` is currently not implementable one-to-one in Snowflake, but I like the idea of speeding things up when testing model executions, for example when working locally. When checking assumptions with tests, there doesn't seem to be a viable way to avoid running them on a full model. However, there is a way to make the sampling repeatable, so if it were always placed on the source, the tests might not actually get messed up. This demands testing though, which I'd be happy to investigate in any case.
@kforeman I was indeed thinking of doing something around that. But your point about "the execution time was much longer than we anticipated" maybe suggests the benefits of this are low. Can you elaborate a bit? What were you expecting?
What would be the advantage of `create table like` in this situation @kforeman?
> But your point about "the execution time was much longer than we anticipated" maybe suggests the benefits of this are low. Can you elaborate a bit? What were you expecting?
@bastienboutonnet I can speak to the performance issues. We created a macro that could optionally insert `SAMPLE(0)` and placed it next to the name of every table in our query. For example:
```sql
WITH my_cte AS (
    SELECT column_1
    FROM foo.bar SAMPLE(0)
)

SELECT a.column_1, b.column_2
FROM my_cte AS a
CROSS JOIN foo.another_table AS b SAMPLE(0)
```
The thought was that putting in the limitations at the mention of each table, rather than at the end of the query in the form of `LIMIT 0`, would speed the query up significantly, since no data would flow through. Joins could be performed using zero rows, filters would be applied on zero rows, etc.

The problem is Snowflake still performs a table scan on the table before applying the `SAMPLE(0)` filter, so large tables could still add a significant amount of time to the queries. Some queries that take 10+ minutes with the full dataset might still run for 2-5 minutes with `SAMPLE(0)` applied.

That all being said, the large majority of queries are greatly improved by adding `SAMPLE(0)`. It's only an issue on tables with hundreds of millions or billions of rows.
> What would be the advantage of `create table like` in this situation @kforeman?
Sorry, should have been clearer. We're considering using `CREATE TABLE ... LIKE` to produce empty versions of the upstream tables (our landing zone in Snowflake, where data is ingested from the production PSQL databases) outside of dbt. So we'd have empty source tables and wouldn't have to worry about the slow table scans that @liveandletbri mentioned. We would still use the regular table creation for dbt models.
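For illustration, the kind of statement that approach implies (database, schema, and table names here are placeholders):

```sql
-- Hypothetical: build an empty clone of a landing-zone table for dry runs.
create schema if not exists landing_dry_run;

-- same column names and types as the real table, but zero rows to scan
create table landing_dry_run.orders like landing.orders;
```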
So I found a really interesting thing: when you bring a table into Tableau, it runs a super fast and performant query to find out which columns are present in the table. It achieves this with the following:

```sql
select * from schema.table where 1=0
```

This translates into a very `dry-run`-like query: nothing gets scanned, nothing gets materialised, the query is essentially all initialisation. So I reckon that would be pretty powerful.
Sticking this on the source, as @drewbanin suggested, could be a really powerful and easy way to bring `dry-run`-like behaviour to Snowflake, I think. And it can come in several flavours: if you want to check that your SQL works, you'd use the `where 1=0` statement, and if you wanted some data, you could call a sample. @drewbanin, you suggested I make a new issue for this approach? Shall I do that, so we have any discussion linked to that particular approach in there, and then I can get cracking with a PR?
@bastienboutonnet yeah! This one is interesting -- it's something you can already do in user-space, but it's certainly a workflow that dbt could make easier to implement. I have a couple of ideas here - happy to write them down in a new issue for sure! Maybe something like "More flexible control over selecting from sources", or similar?
@drewbanin Yeah, it's doable in user space, but it would require every query to have some checks, a bit like the incremental macro, right? I was thinking more of a solution around your SQL-wrapping issue (https://github.com/fishtown-analytics/dbt/issues/1096). But happy to make it a different one. Whatever you think is best.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
(Note, not schema/data tests.)
We've got dbt set up with CircleCI, and right now pushes to our master branch create a new Docker image and push to production with no tests. This feels... risky. One thing we can do is to run `dbt compile` and check for errors, but it'd be great to get some simple testing of the SQL too. One option would be to compile and run the tables with `EXPLAIN PLAN` instead of `CREATE TABLE` (and not to run post-hooks). That should be pretty fast, and SQL syntax (and some semantic) errors would be caught too.

Would this make sense? Or something else?