All this sounds very reasonable and an interesting enhancement for the package. I am curious why there hasn't been more activity in this issue.
@dcereijodo would you be interested in submitting a PR for any of the functionality described here? I agree that these are compelling topics -- we'd happily accept contributions for any of them!
@jtcohen6 has in fact already made some great progress on the external table / partition front. This isn't totally ready for primetime yet, but it's pretty close: https://github.com/fishtown-analytics/dbt-external-tables
If you're not interested in contributing code, we'd also welcome any thoughts you have on these topics. Which of these macros would be most helpful to you? How would you like them to work? Do you have any noteworthy constraints that are worth considering when implementing code for this functionality?
The external tables stuff is what caught my attention, actually. I have been working with external tables in RS for a while, and we have developed some helpers in this area that might be worth contributing somehow. However, what we have is very similar to what @jtcohen6 did in https://github.com/fishtown-analytics/dbt-external-tables (which looks great, BTW). We followed a slightly different approach, so I will write up my thoughts in an issue in the dbt-external-tables project.
I think Redshift is totally the right package for vacuum & analyze. They don't apply to Snowflake or BigQuery, or to the other adapters apart from maybe Postgres.
I would love to see how you are doing that with dbt. I have seen examples using a post_hook, but that runs too frequently, and I would also like to maintain my landing tables (dbt sources), which aren't owned by dbt but are sourced by it.
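To make that concrete, the post_hook examples I've seen look something like this (a made-up model, for illustration only):

```sql
-- Hypothetical model: the hook fires after every build of this one model,
-- which is too frequent for cluster-wide maintenance, and it can never reach
-- source tables that dbt doesn't build.
{{ config(
    materialized = 'table',
    post_hook = 'analyze {{ this }}'
) }}

select 1 as placeholder
```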
My current approach is to use AWS Lambda to manage this maintenance, but it feels a bit out-of-band; I'd much rather use a dbt job for it.
My approach is to fetch a list of candidate tables ordered by size ascending, vacuum them only if they are over an unsorted threshold, and analyze them if they were vacuumed or if stats_off is above a threshold. I check the Lambda context, and if it is close to the 15-minute timeout I exit and run again later.
Get the candidate list:

```sql
SELECT
    "schema",
    "table",
    COALESCE("unsorted", 0::float) AS unsorted,
    COALESCE("stats_off", 0::float) AS statsoff
FROM svv_table_info
WHERE 1=1
    AND ( unsorted > ? OR stats_off > ? )
    AND schema IN (?)
ORDER BY size ASC;
```
Optionally vacuum / analyse (yes, Redshift supports the alias "analyse" :)):
```go
// Build the maintenance queue: vacuum tables whose unsorted fraction is over
// the threshold, and analyse those whose stats are stale enough.
var sqls []string
for _, st := range sickTables {
	if st.Unsorted >= config.MinUnsorted {
		sqls = append(sqls, fmt.Sprintf("VACUUM FULL %s.%s TO 99 PERCENT;", st.Schema, st.Table))
	} else {
		log.Printf("Skipping vacuum on %s.%s", st.Schema, st.Table)
	}
	if st.StatsOff >= config.MinStatsOff {
		sqls = append(sqls, fmt.Sprintf("ANALYSE VERBOSE %s.%s;", st.Schema, st.Table))
	} else {
		log.Printf("Skipping analyse on %s.%s", st.Schema, st.Table)
	}
}
```
This is conceptually similar to building up a run queue, as you've done here https://github.com/fishtown-analytics/dbt-external-tables/blob/master/macros/external/stage_external_sources.sql#L17
I wouldn't have the hassle of Lambda's 15-minute timeout if I were using dbt. It would also be nice to see the vacuum / analyze logging in dbt Cloud, similar to the logging for normal model runs.
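Roughly, I imagine the dbt version looking something like the sketch below, which could then be scheduled as a `dbt run-operation` job. To be clear, the macro name, thresholds, and transaction handling are all guesses on my part, not the implementation of anything that exists in a package:

```sql
-- Illustrative sketch only: builds the same kind of statement queue as the
-- Lambda, but inside a dbt macro. Name, defaults, and the commit-before-vacuum
-- workaround (VACUUM can't run inside a transaction block) are assumptions.
{% macro vacuum_analyze_candidates(min_unsorted=10, min_stats_off=10) %}

    {% set candidate_query %}
        select "schema", "table",
               coalesce(unsorted, 0)  as unsorted,
               coalesce(stats_off, 0) as stats_off
        from svv_table_info
        where unsorted > {{ min_unsorted }} or stats_off > {{ min_stats_off }}
        order by size asc
    {% endset %}

    {% set candidates = run_query(candidate_query) %}

    {% for row in candidates.rows %}
        {% set relation = row['schema'] ~ '.' ~ row['table'] %}

        {% if row['unsorted'] >= min_unsorted %}
            {% do run_query('commit; vacuum full ' ~ relation ~ ' to 99 percent') %}
        {% endif %}

        {% if row['stats_off'] >= min_stats_off %}
            {% do run_query('analyze ' ~ relation) %}
        {% endif %}

        {% do log('Maintained ' ~ relation, info=True) %}
    {% endfor %}

{% endmacro %}
```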
The dbt-external-tables package is awesome, and I have a similar use case for Snowplow. I'm impressed that the package supports both Redshift and Snowflake; thanks for contributing it.
Currently I use the Snowplow EMR loader and then some custom AWS Glue jobs to do the post-processing of the downstream Snowplow tables, because we have too much data to use the Redshift loader to load directly onto disk on our relatively small Redshift cluster. I use Spectrum with the Glue catalog over these tables, so it all stays in S3 as partitioned Parquet files. Unfortunately, because I want to keep it in Spectrum, that precludes the use of the snowplow dbt package.
I like Spark, but after wrestling with Glue for a while I have discovered I am not a fan of AWS Glue jobs: the broken job bookmarking, and way too much time spent fighting the glueContext wrapper over the Spark context. I have had the most success just dropping straight into native Spark as soon as possible.
I have thought for a long time that an Athena adapter would solve a lot of this; I would happily delete all the Glue jobs! Athena is way faster than Spectrum, and using SQL via dbt for data processing is simpler, easier to debug, and quicker to deploy than anything in Spark. Or I could just migrate to Snowflake.
Within my Glue jobs I call out to the boto3 API to refresh my table partitions right at the end of the job; perhaps the dbt-external-tables macro could replace that. If the Athena adapter gets underway, refreshing table partitions via Athena would also be useful to support, given that Athena and Spectrum share the Glue metadata catalog. It's a shame Redshift Spectrum can't write data back to external tables; surely they could integrate Redshift to use Athena behind the scenes.
Or can you directly unload to an external table? It sounds strange even saying that, and it would be painful for incremental models.
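For what it's worth, the partition refresh that the boto3 call handles can also be expressed as plain DDL against either engine; the schema, table, and S3 path below are made up:

```sql
-- Redshift Spectrum: register one new partition on an external table
-- (hypothetical schema / table / S3 location).
alter table spectrum.snowplow_events
    add if not exists
    partition (collector_date = '2020-01-01')
    location 's3://my-snowplow-bucket/events/collector_date=2020-01-01/';

-- Athena over the same Glue catalog: scan the table location and register
-- any Hive-style partition directories it finds.
msck repair table snowplow_events;
```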
Hey @davehowell -- check out the [redshift_maintenance operation](https://github.com/fishtown-analytics/redshift#redshift_maintenance_operation-source) that we've already added to this package.
As you noted, we now have the external tables package. I think creating an Athena adapter is outside the scope of this issue!
I'm going to close this issue now -- sorry for any confusion!
I've now written similar macro code to perform administrative operations in Redshift a handful of times:
In all cases, these are one-off operations that I schedule as separate runs in dbt Cloud, using a separate superuser connection/environment.
I would be interested in standardizing and open-sourcing this code; I'm just not sure if: