Open danfran opened 2 years ago
@danfran Thanks for opening the issue, and apologies for getting back to you here (and over in https://github.com/dbt-labs/dbt-spark/issues/215).
The reason we use show table extended in <schema> like '*'
is to build the adapter cache at the start of every run. This is how dbt can know which models already exist, and (crucially) whether they exist as a view or table, so that subsequent calls to get_relation
within materializations (e.g. here and here) can be cache lookups rather than redundant, slow introspective queries.
We found we need to use show table extended in <schema> like '*'
, instead of the much less verbose (and more performant) show tables in <schema>
, because show tables
does not include the relation type (view or table), only whether the object is temporary. On other databases, this kind of metadata is easy and fast to access, but Apache Spark lacks a real information schema.
(A neat side effect of having to use show table extended in <schema> like '*'
is that we can resolve get_columns_in_relation
from the adapter cache, when available, which we cannot do on other adapters. This is purely a bonus, though, and it would be worth it to give up this capability in exchange for faster, more reliable cache construction.)
I know that show table extended in <schema> like '*'
can be very slow, especially when accessing metadata for schemas with thousands of objects. This feels especially frustrating when most of those objects are not relevant to dbt! As a best practice, we highly recommend that you create dbt objects in a dedicated schema, separate from other processes. I understand it isn't always possible to isolate source tables, but this only impacts catalog querying (docs generate
) rather than cache construction (run
etc), since schemas with sources are catalogued but not cached.
Okay, now that we've gotten that background out of the way—I think you're onto something here:
So I can think ideally to replace the method:
show table extended in <schema> like '*'
with something like:
show table extended in <schema> like ('<table1>|<table2>|…')
I think this might just be doable!
There's a base adapter method, _get_cache_schemas
, that the adapter uses to identify which schemas it ought to cache. That list is derived from the enabled, non-ephemeral relations in the project. Normally, the list includes identifier-less relations, because we just need to run a single metadata query for the entire schema. The list of schema-only identifiers is then passed into the list_relations_without_caching
method, i.e. the list_relations_without_caching
macro, i.e. show table extended in <schema> like '*'
.
But what if we instead passed in a schema relation that did have an identifier? That identifier could look like '<table1>|<table2>|…'
, and the list_relations_without_caching
macro could use it to run a more targeted metadata query.
Any methods defined in BaseAdapter
can be overridden within SparkAdapter
, so even though _get_cache_schemas
isn't defined there now, we can reimplement it.
Here's my rough attempt:
def _get_cache_schemas(self, manifest: Manifest) -> Set[BaseRelation]:
"""Get the set of schema relations that the cache logic needs to
populate. This means only executable nodes are included.
"""
# the cache only cares about executable nodes
relations = [
self.Relation.create_from(self.config, node) # keep the identifier
for node in manifest.nodes.values()
if (
node.is_relational and not node.is_ephemeral_model
)
]
# group up relations by common schema
import collections
relmap = collections.defaultdict(list)
for r in relations:
relmap[r.schema].append(r)
# create a single relation for each schema
# set the identifier to a '|' delimited string of relation names, or '*'
schemas = [
self.Relation.create(
schema=schema,
identifier=(
'|'.join(r.identifier for r in rels)
# there's probably some limit to how many we can include by name
if len(rels) < 100 else '*'
)
) for schema, rels in relmap.items()
]
return schemas
Then, the macro becomes:
{% macro spark__list_relations_without_caching(relation) %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
show table extended in {{ relation.schema }} like '{{ relation.identifier or "*" }}'
{% endcall %}
{% do return(load_result('list_relations_without_caching').table) %}
{% endmacro %}
That works for me locally:
show table extended in dev_jerco like 'my_first_dbt_model|my_second_dbt_model'
_get_catalog_schemas
, to speed up docs generate
?dbt-spark
repo, since (I believe) that's where we can make all the code changes we need.add_schema_to_cache
method doesn't seem to be called/used anywhere. It might be a holdover from a long time ago. It's always a good idea to remove totally unused code.Hi @jtcohen6 thank you for the insights, very useful! I am more than happy to try what you suggested as seems working around the issue in a good way. I will also try to apply your other suggested points.
@danfran I am curious to know whether this worked for you.
@ssimeonov sorry for the late answer. Unfortunately my work conditions are changed since then and I was not able to test it in my original environment. However the solution proposed by @jtcohen6 still seems valid. If you are in an AWS environment and you want to test it (included the Athena view / EMR-Hive conflict that makes this fix even more useful) you need just a few tables in order to reproduce the issue and create a valid environment.
@jtcohen6 is there a strictly macro override way (without getting into python) to make the macro do as you suggested? I plonked something like this into the /macros
folder but didn't quite work (did a few variations, including having if execute
around {% for node in
).
{% macro spark__list_relations_without_caching(relation) %}
{% set rels = [] %}
{% for node in graph.nodes.values() | selectattr("resource_type", "equalto", "model") %}
{% do rels.append(node.fqn[1]) %}
{% endfor %}
{% if rels | length > 1 %}
{% set suffix = rels | join('|') %}
{% else %}
{% set suffix = '*' %}
{% endif %}
{% call statement('list_relations_without_caching', fetch_result=True) -%}
show table extended in {{ relation }} like {{ suffix }}
{% endcall %}
{% do return(load_result('list_relations_without_caching').table) %}
{% endmacro %}
I'm guessing surely not because you provided a _get_cache_schemas()
function + macro method :grin:
@jeremyyeo I think this approach would indeed require implementing _get_cache_schemas()
, i.e. a change to the python methods in dbt-spark—so there's no way to test this out with purely user-space code in the meantime.
An alternative approach that either I just remembered, or which just occurred to me, in the threads where we're discussing this internally:
show table extended
(verbose) over show tables
(concise) is because the concise version lacks a field/specifier that enables us to tell between views and tables. (Contrary to the nomenclature, show tables
includes both views + tables.)show views in <database>
+ show tables in <database>
. Then, we can reasonably infer that every object returned by the former is a view, and every object returned by the latter and not in former is a table.get_columns_in_relation
from the cache if available.It might be worth experimenting with both approaches, and seeing which one yields greater benefits. Does the slowness have more to do with the verbosity of show table extended
(contextual information we don't strictly need)? Or more to do with trying to return metadata for thousands of extraneous objects that dbt doesn't care about, but which happen to live alongside relevant objects in the same database/schema?
Does the slowness have more to do with the verbosity of show table extended (contextual information we don't strictly need)?
Yes. SHOW TABLES
runs quickly.
After speaking a bit more about this with @superdupershant:
show tables
should indeed be very fast, but show views
is likely to be significantly slower at scaleshow table extended
requires pulling out more metadata than the simple show
command, filtering on the table name (like 'table_one|table_two|...'
) is more likely to yield a significant speedupThere are other performance boosts that the dbt-databricks
team is planning to take advantage of, enabled by the specificity of their underlying architecture. That feels like an appropriate approach for our two plugins going forward.
This aligns with our experience @jtcohen6.
Forgive my unfamiliarity with DBT architecture: how do capabilities "degrade" based on the quality of metadata available about assets a data production depends on?
I can think of (at least) several levels:
SHOW TABLES
can solve this)The reason for the question is that, while the Spark metastore is quite dumb, Delta tables have queryable history and recognizable versioned identifiers (which could be used as an indication of immutable data state).
Further, one could design a simple open metadata standard (say, a single property containing some JSON, implemented via managed table properties in Spark, for example) for getting the metadata DBT needs for advanced functionality.
@ssimeonov It's a really fair question!
get_columns_in_relation
) if needed to power dynamically templated model SQLref()
+ source()
calls. This doesn't require a database connection at all. However, for some databases (such as "bound" views in Postgres), dbt also queries for inter-object dependencies, so that its runtime cache can correctly handle the implications of drop ... cascade
. That's less relevant on most analytical databases, including Spark/Databricks.dbt source freshness
. This is comparable to docs generate
, which generates the "catalog" described in (2). It also checks freshness naively (select max(timestamp_col)
) — I know that, for Delta tables, it would make a lot of sense to access native metadata to answer the same question.In the future, I could see integrating both types of information (freshness + catalog) into other dbt tasks. Overall, I think dbt only stands to benefit from Delta tables that can provide all types of metadata more readily, quickly, and reliably.
Then the path forward seems to be the implementation of your original idea: to generate the like 'table_one|table_two|...'
from DBT's project information, is that right?
Yes, I think that's right. Let's aim to include this change in a v1.1 beta, and ask folks to test it out in their own projects to ensure it yields the desired speedups.
cc @VersusFacit: This is another cache-related performance question, similar to https://github.com/dbt-labs/dbt-snowflake/issues/83, where benchmarking (and anticipating edge cases) will be the name of the game.
The change proposed in this issue will significantly help in cases where dbt manages some objects, in large databases that also contain thousands of non-dbt objects. If we want to speed up cases where dbt manages thousands of objects, and a given invocation is only selecting a handful to run, we'll need to tackle the larger question of selection-based caching: https://github.com/dbt-labs/dbt-core/issues/4688
Adding another customer onto this issue. Seeing dbt docs generate
taking up to an hour or more on job runs.
Please msg if more info is required.
Hoping this change can get added into 1.1 as part of the base code!
It would be a significant benefit for us as we are using Managed Workflows for Apache Airflow (MWAA) and are unable to modify the core code as referenced here https://github.com/dbt-labs/dbt-spark/issues/228#issuecomment-942132634.
With AWS MWAA we provide the requirements.txt (dbt-core==1.0.0) but have no access to the shell to modify the core code after the fact and show table extended in <schema> like '*'
continues to take longer and longer and with our implementation show table extended is run for every model step.
Our implementation means we don't do a single dbt run --select <project>
as we wanted control over re-running failed models and that control to be handled via restarting the task inside airflow.
See https://www.astronomer.io/blog/airflow-dbt-1/ "A Better, Better Way"
Anyways, looking forward to changes that speed up "show table extended" for spark schema objects.
dbt uses show table extended like '*'
for two purposes today:
docs generate
)Cache population: We're tracking the investigation into the two Spark-specific possibilities noted above in #296. There are two branches with sample/experimental code, if you're able to try them out and let us know what you find. We're also thinking about cache speedups via dbt-core changes, e.g. https://github.com/dbt-labs/dbt-core/issues/4688.
Catalog generation: As a separate question, I think it would make a lot of sense to pass specific table info (show table extended like 'one|two|three|...
) when grabbing metadata from the catalog. The stakes of missing an object are lower, and source tables/schemas are included. This will require code changes different from the ones sketched out above / in the branches linked by #296. I've opened a new issue to track that work: https://github.com/dbt-labs/dbt-spark/issues/300
@jtcohen6
Thank you for the quick adjustments and 2 branch options!
Sorry for the delay, testing took longer than I thought.
I was able to complete testing and below are my findings and some related context around the testing. Additional Context & Final Thoughts to follow below testing/findings.
show table extended in <schema> like '*'
show table extended in <schema> like '*'
dbt docs generate
can suffer from this single call, especially in the schema/project cases above
show table extended in <schema> like '*'
can also be quite lengthy inside schemas that are quite large and/or contain many objects not related to the project/model docs is being generated for/onshow table extended in <schema> lilke '*'
from the spark cluster as well as DBT command line for both DBT Compile & DBT Run
show table extended in <schema> like <relation object|relation object|relation object|....>
or as I will refer to it
show table extended in <schema> like <list>
like <list>
for enumerationlike <list>
enumerationlike <list>
like <list>
"limit" testing - Tested with 1,578 Models
like <list>
was to largeshow table extended in <schema> like <list>
from the spark cluster as well as DBT command line for both DBT Compile and DBT Run
show tables in <schema>
show views in <schema>
followed by describe extended <objects>
for any "refs" inside the model.sql
Show Tables & Show Views in <schema>
is lightning fast
describe extended <object>
is quick as welldescribe extended
1 by 1 for these commands compared to like <list>|'*'
describe extended <object>
doesn't provide whether the column is nullable or notshow table extended in <schema> like <list>|'*'
and describe extended <object>
dbt_project.yml
absolutely has +file_format: delta
set though, maybe that's entirely why this passed my testingshow tables in <schema> & show views in <schema>
from the spark cluster as well as DBT command line for both DBT Compile and DBT Run
I understand this might not be the recommended or normal approach. Below is an example Screenshot of our implementation and notes following
DBT Run --select model.sql
model.sql
is every individual model in the graphshow table extended in <schema> like '*'
for every model (nearly 900+ times currently)I totally understand as well that our implementation might not be normal or recommended and is part, if not entirely, the reason we are experiencing problems. Of course, none of these are your problems but I greatly appreciate the work being put into making caching faster overall
show table extended in <schema> like '*'
show table extended in <schema> like <list>
like '*'
show tables in <schema>
show views in <schema>
followed by describe extended <objects>
for any "refs" inside the model.sql
describe extended table
approach indicated in your last paragraph here https://github.com/dbt-labs/dbt-spark/issues/295#issue-1149338024is_delta
describe extended <object>
to get these results, but could those then be cached?describe extended <object>
againdescribe extended
call must be accepted and that might be a sort of return back to the slower approach that you mention in your last paragraph here https://github.com/dbt-labs/dbt-spark/issues/295#issue-1149338024Again, thank you very much for the two options, and to everyone if you got this far kudos!
@jtcohen6 Thanks for your work on this, it is so cool for me.
I have a suggestion, in the wip-faster-caching-option1, would you consider passing the select models to the function instead of the manifest if the user selects some models.
I set up airflow like @ryanClift-sd, only one model is running in a task. It would be a significant benefit for this usage.
@ryanClift-sd Thanks for the amazing comment + experimentation! I agree, I'm leaning pretty strongly in the direction of approach 2 (show tables
+ show views
). It performs much better at scale, it's more consistent with the ways we're thinking about caching everywhere else, and the trade-offs are much smaller.
This isn't work we've been able to prioritize yet. If you (or anyone else) would be interested in giving a PR a go, I can offer what support I can to see it over the finish line that much sooner. I see you've already got a PR merged on your fork!
There's follow-on core work around:
SchemaSearchMap
to support catalog queries that filter down to the object level, rather than the schema level (https://github.com/dbt-labs/dbt-core/issues/4997, https://github.com/dbt-labs/dbt-spark/issues/300). The stakes of "missing" one object when cataloging are much lower than when caching.@jtcohen6 Thank you for your quick response, sure, I see that the 1.1.0 version was able to provide some performance optimization for the way I use it.
For example, I have a database with 3 schemas: schema_a, schema_b, schema_c. I want to run schema_a.table1
now, my command is dbt run —select schema_a.table1
, before version 1.1.0, _get_cache_schemas()
will execute:
show table extended in schema_a like '*'
show table extended in schema_b like '*'
show table extended in schema_c like '*'
after the optimization of version 1.1.0, it will only execute:
show table extended in schema_a like '*'
I was wondering if it could go one step further and just query table1 information:
show table extended in schema_a like 'table1'
The advantage of this is that when I have a large schema(with thousands of tables in it) I can have a good performance improvement when I only run a small number of models.
Also, Option2 is great, I have deployed it in a production environment and it is working well. But there is one small problem, I found that the show table result does not have a database
column, it is called namespace
. It will cause an NPE error in code, this may be related to the Spark version, I am using version 3.1.2.
Finally, I am also interested in implementing this feature. Could you please evaluate whether the optimization of option 1 is feasible and better? If so, I will work on it; if not, I will work on option 2. Thank you very much for your time.
@jtcohen6 @TalkWIthKeyboard yes, Spark will use database
or namespace
based on the Spark version. dbt
code should expect and support either. The change was made for ANSI SQL compatibility reasons.
Thanks for the context there @ssimeonov. I agree, dbt should be able to pick it out as either database
or namespace
, whichever is available. I sense that this may be related to the larger changes coming in #281, which will require (optionally) rolling back some of the namespacing rules we've had in dbt-spark
to date, again depending on the Spark version.
@TalkWIthKeyboard Given the evidence we've laid out thus far, I think option 2 seems by far preferable. (Crucially, option 1 risks cache misses.) Even for schemas with many many objects, show tables
+ show views
seems fast enough to manage. If that's something you'd be interested in working on, I'd welcome a PR for it!
@jtcohen6 Sure, I’ll polish the code and submit a PR base on option 2. I’ve checked the Spark source code and confirmed that Spark will use database
or namespace
as the field name is a separate thing from #281.
The issue was imported from v3.0.2 because Spark updated the Catalog class in the v3.0.0 (it may be due to ASNI SQL compatibility, which I haven’t confirmed, but it does not affect the current issue). The tricky part is that we can’t get the Spark version and configurations through SQL (Databrick runtime can get config). so I have two options:
SET spark.sql.legacy.keepCommandOutputSchema=true
to the top of the spark__list_tables_without_caching
and spark__list_views_without_caching
marco, this would eliminate the impact of this issue and still use the database
access element, but override the user’s configuration directly doesn’t seem like a good thing and could affect the user’s judgment of the SQL result.Select existing ones from namespace
and database
:
if 'namespace' in tbl:
schema = tbl['namespace']
elif 'database' in tbl:
schema = tbl['database']
else:
logger.debug(f"Error while retrieving information about {schema_relation}: "
f"there is no database nor namespace field in the table")
return []
this solution will always work as long as subsequent changes don’t make the database
and namespace
appear at the same time. It may not look elegant enough, but it is very simple.
I prefer the second, please let me know if you have some good ways to get the Spark version and configurations.
Thanks @TalkWIthKeyboard!! Yes, I'm happy with the simpler if/elif/else solution. I don't imagine they're planning to change this again anytime soon.
Thanks @TalkWIthKeyboard!! Yes, I'm happy with the simpler if/elif/else solution. I don't imagine they're planning to change this again anytime soon.
OK, it is better.
Very interested in this fix -- @jtcohen6 @TalkWIthKeyboard do you think there's anything I could do to help with #433, or do you think it's best to let y'all handle it?
I'm working with a dbt Cloud customer who have an existing, long-lived Spark cluster and are keen to try dbt Core/Cloud with it. They have 10,000+ objects in Spark and are experiencing this issue. They are following along to see its resolution!
Is there any progress with a fix? The solution jtcohen6 suggested really benefits our use-case... Is there any estimations?
Per @smtaylorslc @theromsh @boxysean I'll add that our team has moved away from dbt Cloud because of this problem.
I ran into the same issue, it took ~1hr to run dbt docs generate
. Is there any fix for this?
While waiting for this to be implemented I decided to use @ryanClift-sd 's dbt-spark wip-faster-caching-option1 and stuck with the further problem: it seems that dbt-core
logic is broken and schema list isn't being generated for sources. While _get_cache_schemas()
correctly generate list from manifest, we end up in list_relation()
for the sources
claiming that
we can't build the relations cache because we don't have a manifest so we can't run any operations.
even thought the manifest definitely exists because it is being used by _get_cache_schemas()
.
After adding some logging:
I'm not sure whom this problem should be addressed to.
I assume it's probably related to changes in made after dbt-core/dbt-spark 1.0. When we had implemented this solution we were only using dbt-core 1.0 and dbt-spark 1.0 & never tested anything above that version. We stuck with dbt-core/dbt-spark 1.0 until very recently.
We have since upgraded to dbt-core 1.5 and luckily were able to move to dbt-databricks vs dbt-spark. Our implementation wasn't a self hosted spark instance, or aws glue spark, but databricks proper. At the time of dbt-core 1.0 the additional dbt-databricks package didn't exist.
dbt-databricks solved our issues with the relation retrieval and supported unity catalog which we were starting to leverage and needed to use in our dbt projects
I would like to report another issue on show table extended in schema_a like '*'
.
We use the OSS HMS with Spark3, and there are some Kudu, HBase tables registered in HMS, but there are no correspondingly serde jars under the Spark application classpath, thus the above query would cause failure like
Spark adapter: Database Error
Error operating ExecuteStatement: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.kudu.KuduStorageHandler
at org.apache.hadoop.hive.ql.metadata.Table.getStorageHandler(Table.java:297)
at org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$convertHiveTableToCatalogTable(HiveClientImpl.scala:482)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$3(HiveClientImpl.scala:434)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getTableOption$1(HiveClientImpl.scala:434)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:298)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:278)
at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:432)
at org.apache.spark.sql.hive.client.HiveClient.getTable(HiveClient.scala:95)
at org.apache.spark.sql.hive.client.HiveClient.getTable$(HiveClient.scala:94)
at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:92)
at org.apache.spark.sql.hive.HiveExternalCatalog.getRawTable(HiveExternalCatalog.scala:122)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$getTable$1(HiveExternalCatalog.scala:734)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:101)
at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:734)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableRawMetadata(SessionCatalog.scala:515)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:500)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTempViewOrPermanentTableMetadata(SessionCatalog.scala:710)
at org.apache.spark.sql.execution.command.ShowTablesCommand.$anonfun$run$43(tables.scala:859)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.sql.execution.command.ShowTablesCommand.run(tables.scala:854)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
So, alongside the performance, I think the new approach should consider supporting table exclusion list
We have the same issue, but our problem is not focused on performance but rather on the loss of information. We are using iceberg's upsert functionality and, when this error occurs, dbt assumes that the models do not exist and instead of doing an upsert it executes a drop and create resulting in the loss of the previous information. Has anyone experienced the same problem as us? @jtcohen6 do you have any thoughs on this?
We are also facing the same issue with iceberg.
Ideally it should explicitly fail if the show tables
command fails instead of retrying.
The failure is an intermittent one and get resolved in the rerun. However, there is an explicit loss of data because of the create or replace
command.
We are facing this issue in dbt-spark 1.6.0
What @vinjai and @despewerba are telling its true, the same thing happened on databricks and they fixed it.
The same issue leading to data loss because of failed show tables
: #950
Iceberg does not support efficient listing because it is a V2 datasource in Spark. This will be fixed with the new 3.6 release. Until then, there is not much you can do (except making sure that you don't have too many tables in your schema).
@Fokko could you please provide some additional info (did't find any similar issues at https://issues.apache.org/)? I am wondering if the problem with iceberg the same as the problem with delta tables mentioned in the linked issue above.
Hello guys! I'm very interested in this topic. We have a very similar use, like the one @ryanClift-sd commented last year in https://github.com/dbt-labs/dbt-spark/issues/228#issuecomment-1073752972, where we run every dbt model in an Airflow task, so for every run we need to wait 1-2 minutes in order to discover the catalog. Do you have any news on how this behavior is going to improve? Do you know if this is better with Databricks Unity Catalog?
@daniel-gomez-mng I know for us, moving to the DBT Databricks adapter fixed our underlying spark adapter performance issues whether unity catalog or hive_metastore databricks tables. I understand this might not be an option for everyone though depending upon their environments.
Hi @ryanClift-sd , we are already using the Databricks adapter and we have the same performance issues. Just to give you more info, each dbt run is executing the following queries:
In our case, every show views query take 40-50 seconds. So, if we take into account all models, we are spending around 16 hours per day discovering this metadata. This is very expensive and 'unnecessary'.
We are analyzing this cache options in order to avoid recovering all metadata in all models, but please, if anyone knows how to improve this it would be very helpful.
Is there any update on this issue? For me show views
takes around 80 seconds, and since we are using 1-model-per-task strategy in Airflow we are having performance issues
Hi @AlanCortesK , we improved this behavior by adding this config to our Spark jobs. We use SQL warehouses in Databricks and a Hive Metastore in Glue, I don't know if it also your case
spark.databricks.hive.metastore.glueCatalog.enabled true
spark.hadoop.aws.glue.cache.table.enable true
spark.hadoop.aws.glue.cache.table.size 1000
spark.hadoop.aws.glue.cache.table.ttl-mins 60
spark.hadoop.aws.glue.cache.db.enable true
spark.hadoop.aws.glue.cache.db.size 1000
spark.hadoop.aws.glue.cache.db.ttl-mins 60
Describe the feature
I am currently facing an issue using DBT with Spark on AWS/Glue/EMR environment as discussed already in https://github.com/dbt-labs/dbt-spark/issues/215 (but already raised here https://github.com/dbt-labs/dbt-spark/issues/93).
The current issue is about the adapter's method
list_relations_without_caching
:https://github.com/dbt-labs/dbt/blob/HEAD/core/dbt/include/global_project/macros/adapters/common.sql#L240
which in the Spark Adapter implementation is:
https://github.com/dbt-labs/dbt-spark/blob/a8a85c54d10920af1c5efcbb4d2a51eb7cfcad11/dbt/include/spark/macros/adapters.sql#L133-L139
In this case you can see that the command
show table extended in {{ relation }} like '*'
is executed. It will force Spark to go through all the tables info in the schema (as Spark has not Information Schema Layer) in a sort of "discovery mode" and this approach produces two main issues:1) Bad performance: some environments can have hundreds or even thousands of tables generated not only by DBT but also by other processes in the same schema. In that case this operation can be very costly, especially when you have different DBT processes that run some updates at different times on a few tables.
2) Instability, as I verified in AWS/Glue/EMR environment, where you can have views without "S3 Location" defined, like an Athena/Presto view, that will make crash a DBT process running SparkSql on EMR with errors like:
Describe alternatives you've considered
I do not see the reason why DBT process should care of the "rest of the world" like the Athena views from before or tables created from other processes that are in the same schema.
So I can think ideally to replace the method:
show table extended in <schema> like '*'
with something like:
show table extended in <schema> like ('<table1>|<table2>|…')
where my
<table1>
,<table2>
, etc. are determined automatically when I run a command likedbt run --models my_folder
where
my_folder
contains the files:table1.sql
,table2.sql
, etcbut from the current method interface, only the schema params can be passed.
Two questions here: How can I infer automatically the name of the tables involved when a command like
dbt run --models my_folder
run and how can I pass them eventually to thelist_relations_without_caching
?Additional context
I found it relevant for Spark on AWS environment but can be potentially a similar issue for other implementations.
Who will this benefit?
On DBT's slack channel I talked to another used "affected" by similar issue, but probably whoever is going to use Spark in distributed environment can be affected by this (AWS and non).
Are you interested in contributing this feature?
Sure, both coding and testing.