delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.56k stars 1.7k forks source link

[Feature Request] Subqueries are not supported in the DELETE where predicate #730

Open CaptainDaVinci opened 3 years ago

CaptainDaVinci commented 3 years ago

Facing an error when using subqueries in where predicate while deleting. This code works fine on databricks but when running it on local machine it raises an error.

Sample code

from pyspark.sql import SparkSession
from delta import *

builder = SparkSession.builder \
                    .appName("test") \
                    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
                    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
df.write.format('delta').save('/tmp/temp')
spark.sql('create table temp using delta location "/tmp/temp"')
spark.sql('delete from temp where _1 in (select _1 from temp)')

Error log:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/.virtualenvs/hulk/lib/python3.7/site-packages/pyspark/jars/spark-unsafe_2.12-3.1.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
:: loading settings :: url = jar:file:/Users/.virtualenvs/hulk/lib/python3.7/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/.ivy2/cache
The jars for the packages stored in: /Users/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c463827b-108c-4d2e-a2a4-35f41142bb39;1.0
        confs: [default]
        found io.delta#delta-core_2.12;1.0.0 in central
        found org.antlr#antlr4;4.7 in central
        found org.antlr#antlr4-runtime;4.7 in central
        found org.antlr#antlr-runtime;3.5.2 in central
        found org.antlr#ST4;4.0.8 in central
        found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
        found org.glassfish#javax.json;1.0.4 in central
        found com.ibm.icu#icu4j;58.2 in central
:: resolution report :: resolve 275ms :: artifacts dl 8ms
        :: modules in use:
        com.ibm.icu#icu4j;58.2 from central in [default]
        io.delta#delta-core_2.12;1.0.0 from central in [default]
        org.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]
        org.antlr#ST4;4.0.8 from central in [default]
        org.antlr#antlr-runtime;3.5.2 from central in [default]
        org.antlr#antlr4;4.7 from central in [default]
        org.antlr#antlr4-runtime;4.7 from central in [default]
        org.glassfish#javax.json;1.0.4 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   8   |   0   |   0   |   0   ||   8   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-c463827b-108c-4d2e-a2a4-35f41142bb39
        confs: [default]
        0 artifacts copied, 8 already retrieved (0kB/11ms)
21/08/03 21:16:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.8ANTLR Tool version 4.7 used for code generation does not match the current runtime version 4.8Traceback (most recent call last):
  File "../test.py", line 14, in <module>
    spark.sql('delete from temp where _1 in (select _1 from temp)')
  File "/Users/.virtualenvs/hulk/lib/python3.7/site-packages/pyspark/sql/session.py", line 723, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/Users/.virtualenvs/hulk/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/.virtualenvs/hulk/lib/python3.7/site-packages/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Subqueries are not supported in the DELETE (condition = (spark_catalog.default.temp.`_1` IN (listquery()))).
tdas commented 3 years ago

Yes, this is a limitation. I dont remember the exact details because this was implemented a while ago, but it was an issue with how apache spark query plan support and handle subqueries. There were limitations in Apache Spark prevented us from implementing it efficiently. It could be that this has changed in later version of Spark, and this can be fixed.

If others reading this is also facing this issue, then please vote on this issue to raise its importance. And contributions are absolutely welcome.

CaptainDaVinci commented 3 years ago

@tdas how does it work on Databricks though? We're using a cluster with DBR 7.3 and the delete with subquery predicate works alright.

Edit: Alternatively, are there any docker images of databricks cluster that can be used instead? My basic requirement is to run integration tests in a pipeline which involve delta operations.

thomasheslinatmatches commented 3 years ago

Facing the same issue. Working in Databricks but not locally.

shirleycohen commented 3 years ago

Same issue.

wjxiz1992 commented 2 years ago

Same issue locally.

scottsand-db commented 2 years ago

Thanks for the comments/feedback everyone. For now, our H2 roadmap is quite full, so this is something that we can consider next year. Please keep the comments and feedback coming so we know how to prioritize items in our next roadmap!

cb-sukumarnataraj commented 1 year ago

Any plans to consider this year roadmap?. Looking forward to have this feature. Meanwhile any workaround for this feature?

vsadokhin commented 1 year ago

+1 for fixing/providing it in open-source delta lake

jeremyjordan commented 1 year ago

I'm also interested in this functionality, any chance we could get this prioritized on the roadmap?

anagha-google commented 1 year ago

@dennyglee When will this feature be released to OSS?

shrukul commented 11 months ago

+1, request to consider having this functionality in the year's roadmap.

ChrisOlivierCubed commented 11 months ago

+1 as well on this, Databricks runtime recently released the deletion vectors, to optimize the way that spark handles deletes (normally rewriting the data ex what you wanted deleted) by placing deletes in a separate file and excluding them from queries, later on doing the rewrite when it makes sense to do so. Perhaps considering this approach will overcome the original difficulties in implementing subquery deletion support as well, selfishly, it is for me to start moving processes over to Fabric, but I am sure that it is a feature that many people who come from a SQL world will love. Keep up the good work, it is greatly appreciated

Mourya1319 commented 8 months ago

+1, request to have this functionality. Because, delta lake doesn't generate DELETE FILES when we delete rows in a MERGE INTO query but only when explicitly deleted using DELETE FROM query. I am sure this will help a lot of people who wanted to create a MOR table in delta lake.

aleksandraangelova commented 1 month ago

Raising this as a request for functionality.