Closed: lordk911 closed this issue 3 weeks ago
```
scala> spark.sql("insert overwrite dev.test.kshc_target partition(pcode,pdate) select logid,app_name,app_version,pcode,pdate from test.test_userlog_gux_func_add_hx2 where pcode='13073' and pdate>=20221025 and pdate<=20221030")
24/10/25 12:54:30 ERROR DatasetExtractor: class org.apache.spark.sql.catalyst.plans.logical.OverwriteByExpression is not supported yet. Please contact datahub team for further support.
res9: org.apache.spark.sql.DataFrame = []
```
The error is related to the number of partitions and the total number of files: if I keep expanding the time partition range, the error comes back.
If I keep only one file per partition, it works; a sketch of that workaround follows.
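A minimal sketch of that workaround in spark-shell, reusing the table and column names from the query above (repartitioning by the partition columns is my assumption for forcing a single file per partition):

```scala
import org.apache.spark.sql.functions.col

// Repartition by the partition columns so each (pcode, pdate) value is
// handled by a single task and therefore writes a single file.
spark.table("test.test_userlog_gux_func_add_hx2")
  .where("pcode = '13073' AND pdate >= 20221025 AND pdate <= 20221030")
  .select("logid", "app_name", "app_version", "pcode", "pdate")
  .repartition(col("pcode"), col("pdate"))
  .writeTo("dev.test.kshc_target")
  .overwritePartitions()
```

The trade-off is one task per partition value, which serializes the write but sidesteps the multi-file collision.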
I've tested against Spark 3.4.4 and it works, so it may be a bug in Spark 3.3.3. Closing the issue.
This is a Spark-side issue; see details at SPARK-48484.
I'm sorry to say that:
1. If I keep only one file per partition, it works (both Spark 3.3.3 and 3.4.4, on YARN or local).
2. If Spark runs locally, it works no matter how many files are in a partition, but it fails on YARN (both Spark 3.3.3 and 3.4.4).
bin/spark-shell --deploy-mode client --master local[*]: writing to the remote table is OK.
bin/spark-shell --deploy-mode client --master yarn: only works when each partition has only one file; otherwise it fails with org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException).
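To verify the one-file-per-partition precondition, a quick sketch for counting files under each leaf partition directory from spark-shell (the table location below is a placeholder assumption):

```scala
import org.apache.hadoop.fs.Path

// Count files per (pcode, pdate) partition directory of the target table.
val tableDir = new Path("hdfs:///warehouse/test.db/kshc_target")  // placeholder path
val fs = tableDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
for {
  pcodeDir <- fs.listStatus(tableDir) if pcodeDir.isDirectory
  pdateDir <- fs.listStatus(pcodeDir.getPath) if pdateDir.isDirectory
} {
  val files = fs.listStatus(pdateDir.getPath).count(_.isFile)
  println(s"${pcodeDir.getPath.getName}/${pdateDir.getPath.getName}: $files file(s)")
}
```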
> If Spark runs locally, it works no matter how many files are in a partition, but it fails on YARN (both Spark 3.3.3 and 3.4.4).
Did it fail with a "file already exists" error, or something else?
Yes, it failed with a "file already exists" error.
Also on Spark 3.5.3.
Maybe it is caused by FileWriterFactory using DynamicPartitionDataSingleWriter, which requires the records to be sorted on the partition and/or bucket column(s) before writing:
```scala
/**
 * Dynamic partition writer with single writer, meaning only one writer is opened at any time for
 * writing. The records to be written are required to be sorted on partition and/or bucket
 * column(s) before writing.
 */
class DynamicPartitionDataSingleWriter(
    description: WriteJobDescription,
    taskAttemptContext: TaskAttemptContext,
    committer: FileCommitProtocol,
    customMetrics: Map[String, SQLMetric] = Map.empty)
```
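If that sort requirement really is what gets violated on YARN, a caller-side experiment (a sketch under that assumption, not a confirmed fix) is to sort on the partition columns before writing:

```scala
import org.apache.spark.sql.functions.col

// Sort each task's rows by the partition columns so the single writer only
// ever needs one open file at a time (assumes the unsorted input is the bug).
spark.table("test.test_userlog_gux_func_add_hx2")
  .where("pcode = '13073' AND pdate >= 20221025 AND pdate <= 20221030")
  .select("logid", "app_name", "app_version", "pcode", "pdate")
  .sortWithinPartitions(col("pcode"), col("pdate"))
  .writeTo("dev.test.kshc_target")
  .overwritePartitions()
```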
Describe the bug
When I try to write data to a remote Hive partitioned table, I get a FileAlreadyExistsException:
Spark version: 3.3.3
KSHC:
SQL:
```sql
CREATE TABLE test.kshc_target (c1 STRING, c2 STRING, c3 STRING) USING orc PARTITIONED BY (pcode STRING, pdate int);
```
Spark executor error:
HDFS file status:
But when I pick only one day of data to write, it is OK:
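A sketch of that single-day variant, following the shape of the query shown earlier in the thread (the specific pcode/pdate values are my assumptions):

```scala
// Single-day write that succeeds: same query shape as the failing one,
// with pdate restricted to a single value (values here are assumed).
spark.sql("""
  insert overwrite dev.test.kshc_target partition(pcode, pdate)
  select logid, app_name, app_version, pcode, pdate
  from test.test_userlog_gux_func_add_hx2
  where pcode = '13073' and pdate = 20221025
""")
```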
Affects Version(s)
1.7.1